## 1. Introduction

In response to the challenges of verifying convection-resolving precipitation forecasts, several advanced methods now account for near misses (Zepeda-Arce et al. 2000; Ebert and McBride 2000; Casati et al. 2004; Nachamkin 2004; Theis et al. 2005; Davis et al. 2006; Roberts and Lean 2008). Of the methods developed, neighborhood, or fuzzy (Ebert 2008), approaches have been used in a number of studies (e.g., Söhne et al. 2008; Nachamkin et al. 2009; Zacharov and Rezacova 2009; Schwartz et al. 2010; Weusthoff et al. 2010; Romine et al. 2013; Mittermaier et al. 2013; Stratman et al. 2013; Lockhoff et al. 2014), in part because the concept is relatively straightforward to understand and apply. The premise involves selecting a specific threshold and either filtering or collecting samples at a series of increasing scales centered at each forecast point (Roberts and Lean 2008). As the sample, or neighborhood, scale increases, the fractions of observed and predicted points exceeding the threshold within each sample begin to converge. At large scales the difference between the observed and predicted fractions converges toward the overall forecast bias. Forecast quality is measured by the rate of convergence. Good forecasts that either overlap or are very close to the observations converge more rapidly than forecasts with poor agreement.

The neighborhood method is a consistent means to evaluate probabilistic forecasts since outlooks are often issued as probabilities over a region. Neighborhood verification also illustrates the tradeoff between accuracy and precision. Precise forecasts often suffer from spatial and temporal errors, while broad, more general forecasts covering large areas tend to be more accurate but less descriptive. For a given forecast, choosing how to draw the line between precision and accuracy is a difficult problem requiring information about the variation of sharpness, resolution, and reliability with scale. Ben Bouallègue and Theis (2014) show that the reliability and resolution components of the fractions Brier score as well as forecast sharpness (Mason 2004) provide insight into which scales provide the best compromise. Plotting the fractions on a reliability diagram is also graphically useful. Larger neighborhoods result in reliability traces with slopes closer to the optimal 45° angle, but reductions in sharpness and resolution result in those lines no longer spanning the full range between zero and one (Ben Bouallègue and Theis 2014). An important distinction between the pragmatic approach employed by Theis et al. (2005) and Ben Bouallègue and Theis (2014) and the approach of Roberts and Lean (2008) is that the neighborhood samples are only applied to the forecasts in the pragmatic method. The observations remain valid at a single central point. The Roberts and Lean (2008) approach applies neighborhood samples to both the forecasts and the observations. Duc et al. (2013) used the Roberts and Lean approach directly to derive both reliability and relative operating characteristics (ROC) diagrams in the spatial as well as the temporal dimension. Like Ben Bouallègue and Theis (2014), they found that forecast reliability increased with increasing neighborhood size.

In this work, the neighborhood sampling concept is applied as a diagnostic tool to a set of forecasts and observations that are described in section 2. The goal is to use the distributions of the neighborhood fractions in conjunction with the summary scores to better understand the source of forecast error. Since these data are already being collected for the calculation of the fractions skill score (FSS), the cost in terms of run time is minimal. The relative simplicity of the concept behind the FSS, as well as its ease of application and low cost, raises the question of how far the concept can be extended to collect additional information. More detailed statistics regarding spatial or pattern bias can be obtained using other techniques, but these methods can sometimes be cumbersome to apply and interpret. At resolutions on the order of a few kilometers or less the model error often contains large random components that can muddle attempts to catalogue systematic bias. In some cases the forecasts must be filtered to remove excess detail. Because computing the FSS smooths the indicator fields rather than the forecasts themselves, the statistics retain information about event occurrence frequencies that would be lost if the forecasts were smoothed directly.

To carry out the diagnostics, data from the full set of fractions were mined for useful information. Such direct investigation of the fractions distributions revealed the need for modifications to the sampling methods of Roberts and Lean, which are described in section 3. The concept of a skillful forecast and its relation to the summary scores is discussed in section 4. In section 5, some simple diagnostics are examined that can isolate forecasts exhibiting the best combination of precision and accuracy. Through this diagnostic process, properties of both the forecasts and the observations can be inferred. These include the frequency of occurrence of extreme-coverage events as well as some general aspects of precipitation coverage, such as whether areas are consolidated and large or smaller and evenly distributed.

## 2. Forecast and observation data

The Coupled Ocean–Atmosphere Mesoscale Prediction System (COAMPS)^{1} (Hodur 1997) was used to generate a set of high-resolution forecasts over central Florida for the period of 11–24 August 2010. The domain setup consisted of four one-way nested grids with horizontal spacings of 45, 15, 5, and 1.67 km. The finest 331 × 331 point domain is shown in Fig. 1. Forecasts were initialized daily at 0000, 0600, 1200, and 1800 UTC, using the Naval Research Laboratory’s Atmospheric Variational Data Assimilation System (NAVDAS) (Daley and Barker 2001). The previous 6-h forecast acted as a first guess. Boundary conditions were supplied from the Navy Operational Global Atmospheric Prediction System (NOGAPS) (Hogan et al. 2002) at 3-h intervals using a Davies (1976) scheme. The explicit microphysics was parameterized using a modified version of the single-moment bulk scheme of Rutledge and Hobbs (1983, 1984) described by Schmidt (2001) and Chen et al. (2003). For this study the 6- and 12-h precipitation forecasts initialized at 0000 and 1200 were validated on the 1.67-km domain for a total of 28 realizations. At each lead time the 1-h accumulated precipitation was verified against the 4-km stage-IV 1-h precipitation analyses interpolated to the 1.67-km model domain. The 0.1-mm threshold was chosen for the neighborhood method to represent the rain–no-rain interface. An example template from the 6-h forecast initialized at 1200 UTC 24 August is shown in Fig. 1.

## 3. A closer look at the neighborhood approach

### a. Skill

The fractions skill score (FSS) of Roberts and Lean (2008) is defined as

$$
\mathrm{FSS} = 1 - \frac{\dfrac{1}{N}\sum_{i=1}^{N}\left(P_{\mathrm{fcst},i} - P_{\mathrm{obs},i}\right)^{2}}{\dfrac{1}{N}\left[\sum_{i=1}^{N} P_{\mathrm{fcst},i}^{2} + \sum_{i=1}^{N} P_{\mathrm{obs},i}^{2}\right]}, \qquad (1)
$$

where *P*_{fcst} and *P*_{obs} are the forecast and observed fractions and *N* is the number of neighborhoods at each scale. The sum in the numerator in (1) is the fractions Brier score (FBS; Roberts 2005), which is a measure of the squared difference between the predicted and observed fractions. The FBS for a perfect forecast is 0 but the upper limit depends on the event frequency. Roberts and Lean normalized the FSS with FBS_{worst}, which is the largest possible FBS in the absence of overlap of the nonzero observed and forecast fractions, as defined by the sums in the denominator of (1). This choice of a reference forecast is a bit unusual because it directly depends on the forecast being verified. Typically an independent low-skill forecast like climatology is used (Wilks 2006). However, using FBS_{worst} endows the FSS with two very useful properties. First, it constrains the FSS to values between 0 and 1, and second, it allows the FSS to be symmetric with respect to the fractional bias defined by *P*_{fcst}/*P*_{obs}. With the exception of very small scales (Mittermaier and Roberts 2010), the FSS of a forecast with a fractional bias of *b* will be identical to the FSS from a forecast with the reciprocal fractional bias 1/*b*. These properties allow for consistent, unbiased intercomparisons of multiple forecasts that are independent of climatology, which in many cases is unknown. If desired, comparisons with other low-skill forecasts such as persistence are facilitated by calculating the FSS of those forecasts and comparing it with the FSS from the forecast in question. Another measure of acceptable skill was defined by FSS_{useful} = 0.5 + *f*_{o}/2, where *f*_{o} represents the fraction of observed points exceeding the threshold over the domain. The magnitude of FSS_{useful} is halfway between FSS_{random} = *f*_{o} and FSS_{perfect} = 1.0. The implications of this definition and its relationship to the decomposition of the Brier skill score will be discussed in detail in section 4.
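As an illustration of the score described above, the following minimal sketch computes the FBS, FBS_{worst}, and FSS from paired lists of neighborhood fractions at a single scale, along with FSS_{useful}. The function names and the handling of the event-free case are assumptions made for this example; they are not taken from the code used in this study.

```python
def fractions_skill_score(p_fcst, p_obs):
    """Return (FSS, FBS, FBS_worst) for paired neighborhood fractions."""
    if len(p_fcst) != len(p_obs) or not p_fcst:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(p_fcst)
    # FBS: mean squared difference between forecast and observed fractions.
    fbs = sum((pf - po) ** 2 for pf, po in zip(p_fcst, p_obs)) / n
    # FBS_worst: the FBS obtained if nonzero fractions never overlapped.
    fbs_worst = (sum(pf ** 2 for pf in p_fcst)
                 + sum(po ** 2 for po in p_obs)) / n
    if fbs_worst == 0.0:
        # Assumption: with no events forecast or observed, report NaN.
        return float("nan"), fbs, fbs_worst
    return 1.0 - fbs / fbs_worst, fbs, fbs_worst


def fss_useful(f_o):
    """Skill target halfway between FSS_random = f_o and FSS_perfect = 1."""
    return 0.5 + f_o / 2.0
```

For instance, `fss_useful(0.086)` evaluates to 0.543 for an observed coverage of 8.6%, and two sets of fractions that never overlap, such as `[0.5, 0.0]` and `[0.0, 0.5]`, give an FSS of 0.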

One caveat of note is that the fractional bias is not symmetric in terms of forecast impact. For example, a forecast with a fractional bias of 0.5 will, in the best-case scenario of full overlap, result in bad forecasts for an area covering 50% of the size of the observed event. If the fractional bias is increased to 2.0, the area covered by incorrect forecasts doubles to 100% of the event size.

### b. Sampling method

An often-overlooked aspect of the FSS is the process used to generate the fractions. Roberts and Lean (2008) collect samples at every point in the forecast domain regardless of the neighborhood size. As a result, neighborhoods near the boundary contain missing data. The code that was implemented for this work explicitly calculates the fractions at each scale. At the boundaries, any missing data within each neighborhood are treated as correct negative forecasts since the sums in (1) are only incremented for points greater than the event threshold. Thus, the fractions will likely be smaller than they would have been had data existed outside the boundaries. In an effort to gauge the effects of missing data on the FSS, as well as the distribution of the fractions, an experiment was performed where all neighborhood samples were required to stay within the grid bounds. As a result, fewer neighborhood samples were taken at large scales, and those samples were preferentially located toward the grid center.
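The two sampling choices described above can be sketched as follows. This is a pure-Python illustration under stated assumptions, not the implementation used in this study: the function name, the `interior_only` switch, and the use of a summed-area table are all choices made for the example. Points outside the grid are counted as non-events, which mirrors the treatment of missing data as correct negatives.

```python
def neighborhood_fractions(field, threshold, n, interior_only=True):
    """Fraction of points exceeding `threshold` in each n x n neighborhood.

    field: 2D list of rows; n: odd neighborhood width in grid points.
    interior_only=True keeps only neighborhoods fully inside the grid;
    False samples every grid point and counts out-of-grid points as
    non-events (hypothetical switch, named here for illustration).
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    ny, nx = len(field), len(field[0])
    # Summed-area table of the indicator field (1 where value > threshold).
    sat = [[0] * (nx + 1) for _ in range(ny + 1)]
    for j in range(ny):
        for i in range(nx):
            hit = 1 if field[j][i] > threshold else 0
            sat[j + 1][i + 1] = hit + sat[j][i + 1] + sat[j + 1][i] - sat[j][i]
    half = n // 2
    fractions = []
    for j in range(ny):
        for i in range(nx):
            j0, j1, i0, i1 = j - half, j + half + 1, i - half, i + half + 1
            inside = j0 >= 0 and i0 >= 0 and j1 <= ny and i1 <= nx
            if interior_only and not inside:
                continue
            # Clip to the grid; the clipped-away points add zero events,
            # but the denominator stays n * n, so the fraction is reduced.
            j0, i0, j1, i1 = max(j0, 0), max(i0, 0), min(j1, ny), min(i1, nx)
            count = sat[j1][i1] - sat[j0][i1] - sat[j1][i0] + sat[j0][i0]
            fractions.append(count / (n * n))
    return fractions
```

On a 3 × 3 grid that is entirely above the threshold, the corner neighborhood of width 3 returns 4/9 when boundary samples are included, whereas the interior-only mode returns the single value 1.0. This is the sense in which boundary fractions are smaller than they would have been had data existed outside the grid.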

The effects of sampling on the FSS are generally minor, but can vary depending on forecast quality. The FSS, FBS, and FBS_{worst} for two separate hourly precipitation forecasts are displayed in Fig. 2. A good forecast (Fig. 2a) is characterized by a very low FBS owing to the close proximity or overlap of the forecasts and observations. In this example, the FBS does not change appreciably regardless of the sampling because most neighborhoods contain similar counts of forecasts and observations, which result in very small Brier scores. However, FBS_{worst} is much smaller when the boundary samples are included because all points are summed regardless of overlap. Contributions from neighborhoods with missing data reduce the magnitude when normalized by the number of neighborhoods *N*. The FSS is little changed because the small FBS and large spread between the FBS and FBS_{worst} make the FSS relatively insensitive to variations in the reference forecast. In contrast, the FSS for the bad forecast (Fig. 2b) is more sensitive to the boundary samples because the magnitude of the FBS is larger, and the ratio of the FBS to FBS_{worst} is closer to 1. Fewer points overlap, so both the FBS and FBS_{worst} are reduced by the boundary samples. For the larger neighborhoods, beyond 181 points in Fig. 2b, more forecasts are considered as overlaps, so the FBS begins to lose its sensitivity to the boundary samples while FBS_{worst} continues to be impacted. The result is a noticeable reduction in the FSS at large scales.

Fig. 2. The FSS, FBS, and FBS_{worst} graphed as a function of neighborhood scale for (a) a good forecast (6 h, valid 1800 UTC 24 Aug) and (b) a bad forecast (12 h, valid 0000 UTC 25 Aug). Dotted lines represent scores calculated from neighborhoods sampled at all points, including those at the boundaries, while solid lines represent scores calculated only from neighborhoods lying fully within the domain.

Citation: Monthly Weather Review 143, 11; 10.1175/MWR-D-14-00411.1


Eventually, both sampling methods will result in FSS values that approach the largest possible grid-total asymptotic FSS. For square domains with *N* grid points along each side (not to be confused with the number of neighborhoods in (1)), the asymptotic FSS is reached at a neighborhood width of *N* points when the boundary neighborhoods are not included and at a width of 2*N* − 1 points when they are. For rectangular domains the asymptotic FSS is not quite attained without the boundary neighborhoods. However, it is analytically calculable from Eq. (8) in Roberts and Lean (2008).
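The asymptotic value can be recovered directly from (1) with a short derivation, sketched here with notation introduced for illustration: $f_m$ denotes the domain-wide forecast coverage and $b = f_m/f_o$ the domain-wide fractional bias. Once a neighborhood spans the entire grid, every forecast fraction is proportional to $f_m$ and every observed fraction to $f_o$, with a common factor that cancels between the numerator and denominator of (1), so that

$$
\mathrm{FSS}_{\mathrm{asymptotic}} = 1 - \frac{(f_m - f_o)^2}{f_m^2 + f_o^2} = \frac{2 f_o f_m}{f_o^2 + f_m^2} = \frac{2b}{1 + b^2}.
$$

The last form is unchanged when $b$ is replaced by $1/b$, which is the symmetry with respect to fractional bias noted in section 3a. For example, $b = 0.5$ and $b = 2.0$ both give an asymptotic FSS of 0.8.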

The degree to which the FSS values are affected by the sampling method over a large set of forecasts will depend on the quality of those forecasts. For the 6- and 12-h forecasts verified in this study, Fig. 3 indicates that including the boundaries noticeably reduced the FSS and appeared to make it less sensitive to variations in forecast quality. At 325 points (~543 km), both FSS values were about 10% lower and closer to one another as a result of including the boundaries.

Fig. 3. FSS values for all forecasts during the 2-week period from 11 to 24 Aug 2010 plotted for the 6- (FSS6) and 12-h (FSS12) forecasts. Dotted lines represent scores calculated from neighborhoods sampled at all points, including those at the boundaries, while solid lines represent scores calculated only from neighborhoods lying fully within the domain.


While these effects on the FSS are relatively minor at small scales, they may be an issue at larger scales. The scale at which the impacts become noticeable will depend on domain size. The small domain in this study resulted in greater impacts at relatively small scales. Other studies with larger domains may not be significantly affected at most scales that are considered useful. For this study we elected to perform the calculations using only those neighborhoods that did not exceed the grid bounds. Adding the boundary neighborhoods resulted in significant overcounts of clear and low-fraction neighborhoods, and the aliased samples biased the distributions of the fractions. These distributions were important to the statistics considered in the next two sections.

## 4. The attributes diagram and the concept of useful skill

The concept of FSS_{useful} is sometimes difficult to grasp, especially as forecast sharpness and resolution drop with scale. Reliability or attributes diagrams (Hsu and Murphy 1986) convey sharpness and resolution by portraying the degree of forecast agreement as a function of probability, or in this case, fractional coverage. Attributes diagrams (Fig. 4) are extensions of reliability diagrams that include reference lines related to the algebraic decomposition of the Brier score and the Brier skill score (Wilks 2006). Forecasts falling along the no-resolution line, represented by the horizontal line at the mean observed frequency in Fig. 4, are unable to discriminate whether events are more or less likely to occur than their mean occurrence frequency. Random forecasts fall along this line. Forecasts above a second line, represented by the diagonal midway between the no-resolution and perfect reliability lines, are said to be skillful. This skill is defined from the Brier skill score decomposition as forecasts with positive contributions from the resolution term that are greater than or equal to the negative contributions from the reliability term. These forecasts have relatively low conditional bias and are moderately able to depict the correct probability of an event occurring.
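The position of the no-skill line follows from the standard decomposition of the Brier score (Wilks 2006). As a brief sketch, with notation introduced here for illustration only, let $p_k$ be the forecast fraction representing bin $k$, $\bar{o}_k$ the mean observed fraction in that bin, $n_k$ the number of samples in the bin, $M$ the total number of samples, and $\bar{o}$ the overall mean observed frequency. The decomposition is exact when all forecasts in a bin share the value $p_k$ and approximate otherwise:

$$
\mathrm{BS} = \underbrace{\frac{1}{M}\sum_{k} n_k \left(p_k - \bar{o}_k\right)^2}_{\text{reliability}} - \underbrace{\frac{1}{M}\sum_{k} n_k \left(\bar{o}_k - \bar{o}\right)^2}_{\text{resolution}} + \underbrace{\bar{o}\left(1 - \bar{o}\right)}_{\text{uncertainty}}, \qquad \mathrm{BSS} = \frac{\text{resolution} - \text{reliability}}{\text{uncertainty}}.
$$

Bin $k$ therefore adds to the Brier skill score when

$$
\left(\bar{o}_k - \bar{o}\right)^2 \ge \left(p_k - \bar{o}_k\right)^2 ,
$$

and equality holds along $\bar{o}_k = (p_k + \bar{o})/2$, the diagonal midway between the no-resolution line $\bar{o}_k = \bar{o}$ and the perfect-reliability line $\bar{o}_k = p_k$, as well as along the vertical line $p_k = \bar{o}$.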

Fig. 4. (top) Attributes diagram depicting binned 6-h forecast coverage fractions for the full 2-week period against the corresponding mean observed fraction. Colors and FSS values for each neighborhood scale are indicated in the legend; FSS_{useful} = 0.543. (bottom) Normalized forecast frequencies for each neighborhood scale.


To construct the diagrams in this study, the forecast coverage fractions were binned for 10 categories (0 ≤ *P* < 0.1, …, 0.9 ≤ *P* ≤ 1.0) and plotted against the mean observed fraction for each category. This process was repeated for each neighborhood scale. The result is a graphical representation of the changes in forecast reliability, resolution, and sharpness with increasing neighborhood size.
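The binning step can be sketched as follows for one neighborhood scale. The function name and return layout are assumptions made for this illustration, and empty bins are reported as `None` rather than plotted.

```python
def reliability_curve(p_fcst, p_obs, nbins=10):
    """Mean observed fraction and relative frequency per forecast bin."""
    sums = [0.0] * nbins
    counts = [0] * nbins
    for pf, po in zip(p_fcst, p_obs):
        # Bins are [0, 0.1), ..., [0.9, 1.0]; a fraction of exactly 1.0
        # is placed in the last bin.
        k = min(int(pf * nbins), nbins - 1)
        sums[k] += po
        counts[k] += 1
    total = sum(counts)
    # Mean observed fraction for each forecast bin (None if bin is empty).
    mean_obs = [s / c if c else None for s, c in zip(sums, counts)]
    # Share of all samples in each bin, a simple measure of sharpness.
    rel_freq = [c / total if total else 0.0 for c in counts]
    return mean_obs, rel_freq
```

Calling this once per neighborhood scale, with the paired fractions pooled over all forecasts, yields one reliability trace and one frequency histogram per scale.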

The attributes diagram for all of the 6-h forecasts over the 2-week period (Fig. 4) indicates that the forecasts approached the no-skill line at the 182.03-km scale (109 grid points). At that scale, no forecast fraction exceeded 0.80 while the corresponding maximum average observed fraction was 0.62, reflecting a loss in forecast sharpness. The FSS at that scale was 0.502, which is relatively close to FSS_{useful} = 0.543. Some agreement between these skill measures should be expected, as both are defined similarly as lying midway between no skill and perfect skill. The degree of agreement depends on the distribution of the forecast fractions. The FSS is a single value weighted by the number of forecasts at each fraction, but the lines in the attributes diagram are depicted equally for each bin. Although forecasts from neighborhoods with large fractions performed skillfully at the 121.91-km (73 grid point) scale, their occurrences were too infrequent to significantly influence the FSS. In that regard, the attributes diagrams can be useful for depicting the forecast performance for rare events.

The attributes diagram for the 12-h forecasts (Fig. 5) depicts poorer performance than might be expected from the FSS alone. The 6- and 12-h FSS values are similar (Fig. 3), but the forecast resolution and reliability are much lower at 12 h (Figs. 4 and 5). At both lead times the FSS was primarily influenced by contributions from large numbers of low-fraction neighborhoods. Performance was worse for high-fraction neighborhoods at 12 h, but small occurrence frequencies limited their influence on the FSS. However, the agreement between FSS_{useful} = 0.542 and the no-skill line is visually apparent at 12 h. The 242.15-km (145 point) neighborhood straddles the no-skill line, and the FSS at this scale is 0.542. Since the observed distributions are not displayed on the attributes diagram, the reasons behind the reduced performance at 12 h are not readily apparent. Those issues can be addressed by directly comparing the distributions.

Fig. 5. As in Fig. 4, but for the 12-h forecasts, with FSS_{useful} = 0.542.


## 5. Investigating the fractions

The distributions of the neighborhood fractions convey information about the coverage frequency at each scale that can be used to describe some basic features of the precipitation patterns. Here, the fractions diagram is introduced as a means to display the progression of the neighborhood fractions in a consolidated format. The diagram consists of the normalized unconditional distributions of the observed and forecast fractions at each scale displayed in adjoining panels. These distributions are derived by summing the total number of fractional neighborhoods for each coverage bin at each scale and normalizing by the total number of samples through the period of interest. The fractions diagrams for the 6-h forecasts covering the full 2-week period are displayed in Fig. 6. The forecast distributions in the right panel display the same information as the lower panel of Fig. 4, except in a rearranged format and with the inclusion of all available scales in addition to the five selected in that figure. The corresponding distribution of observed fractions is displayed in the left panel of Fig. 6. At the single-point scale (1.67 km), only two bins are occupied since the fractions can only attain values of 0 or 1. The colors represent the relative frequencies of each outcome. For the period in this study, observed precipitation covered 8.6% of the domain on average, leaving 91.4% clear. The forecast precipitation coverage was 9.7%, slightly higher than the observations, reflecting an overall forecast fractional bias of 1.14. For neighborhood sizes greater than a single point the neighborhood fractions can attain values between 0 and 1, and thus the coverage spans the full range of bins in the diagram. Note that the values in each row sum identically to 1.0.
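One panel of a fractions diagram can be assembled as in the sketch below, which assumes the fractions for each scale have already been pooled over all forecasts in the period. The function names and the dictionary layout (neighborhood width mapped to a list of fractions) are hypothetical choices for this example.

```python
def fraction_distribution(fractions, nbins=10):
    """Normalized frequency of neighborhood fractions in equal-width bins."""
    if not fractions:
        raise ValueError("need at least one neighborhood fraction")
    counts = [0] * nbins
    for p in fractions:
        # A fraction of exactly 1.0 is placed in the last bin.
        counts[min(int(p * nbins), nbins - 1)] += 1
    total = len(fractions)
    return [c / total for c in counts]


def fractions_diagram(fractions_by_scale, nbins=10):
    """Map each neighborhood width to its row of normalized frequencies."""
    return {
        width: fraction_distribution(fractions, nbins)
        for width, fractions in sorted(fractions_by_scale.items())
    }
```

Each row sums to 1.0 by construction. At the single-point scale only the first and last bins are occupied, and the value in the last bin is the domain-mean coverage.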

Fig. 6. Fractions diagrams for all 6-h forecasts for the full 2-week period. The normalized frequencies of occurrence for (left) the observed and (right) forecast neighborhood fractions at each neighborhood size are scaled by the color bar at the lower left. The FSS values for each scale are shown by the color bar on the right. The fractional bias, threat score (TS), equitable threat score (ETS), and the number of forecasts in the statistics (NDTG) are indicated at the lower right.


The fractions diagrams indicate the 6-h forecasts correctly reproduced several basic features of the observations. The dominance of low-fraction neighborhoods reflects the limited precipitation coverage through this period. The distribution of the high-fraction neighborhoods appears to be well reproduced, even at small scales. However, the sparse occurrence of large regions of rainfall on any given day renders the high-fraction neighborhoods difficult to track using the normalized spatial frequencies alone. The problem arises from weighting each forecast equally, which favors clear days over infrequent rainy ones. High-fraction neighborhoods are better visualized using normalized temporal frequencies, which more directly track the number of neighborhoods at each fraction through a given period. The fractions diagram of the normalized rate of occurrence of at least one neighborhood in each fractional bin per forecast is shown in Fig. 7. A value of 1.0 indicates that at least one neighborhood of the specified fraction existed in all of the 28 forecasts in the sample. Multiple experiments were conducted to determine an optimal threshold value to use for the number of neighborhoods in each fractional bin per forecast. Larger values generally acted to reemphasize the clear fractions over the rainy ones because the smaller rainy areas were unable to meet the threshold criteria. Also, larger threshold values resulted in zero occurrences at large scales because the number of neighborhoods was limited by the size of the analysis grid. Fortunately, as the threshold value varied, the observed and forecast fractions distributions varied in unison. The rainy fractions that were most prominent at small thresholds gave way to mostly clear fractions at large thresholds. Since the patterns were most pronounced at low thresholds, a value of 1 was chosen to maximize the counts from the rainy neighborhoods.
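The temporal frequency described above can be sketched for a single neighborhood scale as follows. The function and parameter names are assumptions for this illustration; `min_count` plays the role of the threshold on the number of neighborhoods per bin, with the value of 1 chosen in this study as the default.

```python
def temporal_frequency(fractions_per_forecast, nbins=10, min_count=1):
    """Share of forecasts with at least `min_count` neighborhoods per bin.

    fractions_per_forecast: one list of neighborhood fractions per
    forecast (or per observation field), all at the same scale.
    """
    n_forecasts = len(fractions_per_forecast)
    if n_forecasts == 0:
        raise ValueError("need at least one forecast")
    hits = [0] * nbins
    for fractions in fractions_per_forecast:
        # Count this forecast's neighborhoods in each coverage bin.
        counts = [0] * nbins
        for p in fractions:
            counts[min(int(p * nbins), nbins - 1)] += 1
        # A forecast scores once per bin, however many neighborhoods
        # it has there, provided the count reaches min_count.
        for k in range(nbins):
            if counts[k] >= min_count:
                hits[k] += 1
    return [h / n_forecasts for h in hits]
```

Because each forecast contributes at most once to a bin, a single day with widespread rain carries the same weight as a clear day, which is why the high-fraction neighborhoods stand out more clearly than in the spatial frequencies.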

Fig. 7. As in Fig. 6, except the normalized temporal frequency of at least one neighborhood for each fraction is shown. A value of 1.0 indicates that at least one neighborhood of that fraction occurred in each of the 28 forecasts during the 2-week period.


The most prominent aspect of Fig. 7 is the general agreement between the forecast and observed temporal frequencies, though some subtle differences do exist. At the 37-point scale (61.79 km), large-fraction neighborhoods occurred more frequently in the forecasts, indicating the model tended to generate larger precipitation elements than were observed. Correspondingly, low-fraction and nearly clear neighborhoods occurred more frequently in the observations at that scale. In general, fractional coverages greater than 0.1 were more frequent in the forecast at almost all scales. These trends are consistent with the slight positive bias overall.

Although the distributions of the fractions were similar at 6 h, the low FSS values at small scales indicate that some significant forecast errors occurred. Were these a result of daily phase errors or did a few bad forecasts skew the results? The correlations between the daily frequencies of the observed and predicted fractions (Fig. 8) provide some guidance by revealing whether the fractions fluctuated consistently in time. At the single-point scale the correlations are very high (0.93), indicating that at least in bulk the model was able to resolve whether the total number of precipitating points would be large or small on a given day. However, these correlation values are somewhat inflated by the large number of clear points on most days. Correlations were also relatively high for most of the large-fraction neighborhoods except for the highest coverages, which only occurred a few times during the period. Direct inspection of the fractions time series as well as visual inspection of the forecasts confirmed that COAMPS generally performed well when coverage was widespread.
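A minimal sketch of such a temporal correlation at one neighborhood scale is given below. It assumes that the quantity correlated is the per-forecast count of neighborhoods in each coverage bin and that a Pearson correlation is used; both are assumptions for illustration, as are the function names. Bins whose counts never vary return `None`.

```python
import math


def pearson(x, y):
    """Pearson correlation of two series; None if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return None
    return sxy / math.sqrt(sxx * syy)


def bin_counts(fractions, nbins=10):
    """Number of neighborhood fractions falling in each coverage bin."""
    counts = [0] * nbins
    for p in fractions:
        counts[min(int(p * nbins), nbins - 1)] += 1
    return counts


def temporal_correlation(obs_per_forecast, fcst_per_forecast, nbins=10):
    """Correlate, bin by bin, the observed and forecast count time series."""
    obs_counts = [bin_counts(f, nbins) for f in obs_per_forecast]
    fcst_counts = [bin_counts(f, nbins) for f in fcst_per_forecast]
    return [
        pearson([c[k] for c in obs_counts], [c[k] for c in fcst_counts])
        for k in range(nbins)
    ]
```

A high value in a given bin means the forecasts produced more neighborhoods of that coverage on the days when more were observed, independent of whether those neighborhoods were in the right place.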

Fig. 8. Fractions diagram depicting the temporal correlation between the observed and 6-h forecast fractions through the entire forecast period (scale at bottom). All positive correlations are scaled by the color bar; negative correlations are purple. The FSS values for each scale are shown by the color bar on the right.


A distinct minimum in the temporal correlations is apparent at most scales for neighborhoods with coverages from 0.1 to 0.3. Interestingly, the correlations are worse for some of the larger neighborhoods, which generally perform better in terms of the FSS. These low temporal correlations reflect systematic structural differences between the predicted and observed precipitation patterns. To illustrate this effect, the fractions diagram for the single 6-h forecast in Fig. 1 is displayed in Fig. 9. The high FSS values along with the agreement in the patterns in Fig. 1 indicate that this forecast was relatively good. However, the observed precipitation pattern in Fig. 1a is characterized by numerous small regions of rainfall interspersed between complex shapes with numerous small appendages, while the forecast entities (Fig. 1b) are smoother and more consolidated. More importantly, the observed precipitation is more evenly distributed through the domain, especially in the southern portions. The finer, more consistent granularity in the observations permitted the observed fractions to converge toward the domain mean more rapidly with increasing scale. The effects of the discretization are magnified at medium to large neighborhoods as increasing numbers of samples are sorted into decreasing numbers of bins. Since the bias was close to 1.0 in this case, both distributions eventually reached the same value at the largest scale.

As in Fig. 6, but for the 6-h COAMPS forecast valid at 1800 UTC 24 Aug 2010. For this forecast, FSS_{useful} = 0.681.

In general, the variation of the fractions diagrams with increasing scale is a complicated issue. At very small scales the distribution of the fractions is primarily determined by the grid-total rainfall coverage. As the neighborhood scale increases, the size and shape of the individual rain events increasingly influence the distribution of the fractions. At very large scales, the shape of the distributions again becomes dominated by the total coverage because the neighborhoods subtend large portions of the grid.
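
The scale dependence described above can be explored with a small script. The following is a minimal sketch, not the code used in this study: it assumes a binary exceedance field stored as a list of lists, square neighborhoods that lie entirely inside the grid, and ten equal-width fraction categories. The function names `neighborhood_fractions` and `fraction_histogram` are hypothetical.

```python
def neighborhood_fractions(binary, n):
    """Fraction of exceedance points in every n x n window fully inside the grid."""
    rows, cols = len(binary), len(binary[0])
    # Summed-area table with a zero border so window sums are O(1) lookups.
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            sat[i + 1][j + 1] = (binary[i][j] + sat[i][j + 1]
                                 + sat[i + 1][j] - sat[i][j])
    fractions = []
    for i in range(rows - n + 1):
        for j in range(cols - n + 1):
            total = (sat[i + n][j + n] - sat[i][j + n]
                     - sat[i + n][j] + sat[i][j])
            fractions.append(total / (n * n))
    return fractions


def fraction_histogram(fractions, nbins=10):
    """Count neighborhoods per fraction category (assumed bins of width 1/nbins)."""
    counts = [0] * nbins
    for f in fractions:
        # A fraction of exactly 1.0 is placed in the last category.
        counts[min(int(f * nbins), nbins - 1)] += 1
    return counts


if __name__ == "__main__":
    # Toy 4 x 4 field with one 2 x 2 rain area (illustrative data only).
    field = [[1, 1, 0, 0],
             [1, 1, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
    for n in (1, 2, 4):
        print(n, fraction_histogram(neighborhood_fractions(field, n)))
```

For the toy field, the single-point scale populates only the lowest and highest categories, and the full-grid window collapses to the domain coverage of 0.25, which mirrors the small-scale and large-scale limits discussed in the text.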

For the forecast in Fig. 9, the absolute difference between the observed and predicted fractions was relatively low at most scales despite the differences in the distribution widths. For other cases, the agreement was much lower. Of the fourteen 6-h forecasts in which precipitation covered more than 5% of the domain, 4 had low-FSS scores (defined as those for which the scale at which the FSS reached FSS_{useful} was ≥ 109 grid points) and extensive discrepancies in the midsized fractions. However, poor agreement was not a necessary condition for a low-FSS score. The exceptions were forecasts that were structurally different from the observations either as a result of bias or variations in the precipitation distribution, but still relatively close in terms of precipitation placement. In this sample, 3 of the 14 forecasts with coverages >5% met these criteria. The remaining 7 forecasts had high-FSS scores and were considered to be good.

Although the 12-h mean FSS values were not much below those at 6 h, the fractions diagrams indicate that numerous errors occurred. The observed and predicted fractions distributions are different despite an overall fractional bias of 0.95. The extended tails in the observed spatial (Fig. 10) and temporal (Fig. 11) fractions distributions indicate that widespread precipitation occurred for a small number of cases that the model was unable to fully capture. Also, the number of fractions in the 0.0–0.1 category increases for the two largest neighborhoods in the observations while staying fixed in the forecasts (Fig. 10). The fractions at the largest scale are sensitive to the bias, so these differences indicate the prevalence of large biases in individual forecasts. These biases are large enough to reduce the temporal correlations even at the single-point scale (Fig. 12).

As in Fig. 6, except that the fractions diagrams for COAMPS 12-h forecasts are displayed.

As in Fig. 7, except that the temporal fractions diagrams for COAMPS 12-h forecasts are displayed.

As in Fig. 8, except that the correlations for COAMPS 12-h forecasts are displayed.

At most scales the 12-h temporal correlations are much lower than at 6 h. Correlations for large-fraction events were slightly negative because those events were represented solely by missed forecasts. A minor secondary minimum in correlation was apparent for low to midsized fractions at large scales, though the majority of the large-scale correlations were low. Upon visual inspection, most of the forecasts registered significant errors in structure and position, and none of them showed good agreement in the midsized fractions. However, a number of them still performed well in terms of the FSS. The best-performing forecasts were those where precipitation was more or less evenly distributed through the domain. In those cases, the fractions distributions narrowed rapidly with scale, leading to relatively small FBSs. This explains the reduced resolution and sharpness in the attributes diagram for the 12-h forecasts in Fig. 5. These forecasts were somewhat useful, but only in the broad sense that a single probability could be issued covering most of the region.

Another factor leading to similar mean FSS scores at 6 and 12 h was a number of low-precipitation forecasts that scored very poorly at both lead times. Since the mean FSS was not weighted by coverage, these forecasts tended to homogenize the scores. The weighting issue is not easily resolved because, unlike the threat score, an all-encompassing FSS cannot be calculated for the entire sample. Each forecast must be considered separately. The individual FSS scores could be weighted by the number of predicted or observed points or some combination thereof. Unfortunately, any choice of weighting of this type will result in some conditional bias. Here we simply chose to remove all cases where observed precipitation covered 5% of the domain or less (Fig. 13). By doing so the number of forecasts was reduced to 14 and 16 at 6 and 12 h, respectively. Removing all cases with forecast coverage less than 5% produced similar results. In general, an observation-based template is probably best in order to avoid model bias issues. The resulting FSS scores were more representative of the visual errors as well as the errors that were resolved by the fractions distributions. Smaller bin sizes would increase the sensitivity of the fractions diagrams for the low-coverage forecasts. However, the predictability of these events is likely to be very low as indicated by the low-FSS scores (Fig. 13).
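
The coverage-based screening described above can be sketched in a few lines. This is an illustrative sketch only: the function names and the (coverage, FSS) pairs are hypothetical, and only the 5% observed-coverage threshold is taken from the text.

```python
def split_by_coverage(cases, threshold=0.05):
    """Separate per-forecast FSS values into large and small events.

    `cases` is a list of (observed_coverage, fss) pairs, with coverage given
    as a fraction of the domain. Events at or below the threshold are 'small'.
    """
    large = [fss for coverage, fss in cases if coverage > threshold]
    small = [fss for coverage, fss in cases if coverage <= threshold]
    return large, small


def unweighted_mean(values):
    """Plain average; returns NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")


if __name__ == "__main__":
    # Made-up values for illustration, not results from this study.
    cases = [(0.30, 0.8), (0.02, 0.1), (0.05, 0.2), (0.12, 0.6)]
    large, small = split_by_coverage(cases)
    print("all cases:  ", unweighted_mean([fss for _, fss in cases]))
    print("large only: ", unweighted_mean(large))
    print("small only: ", unweighted_mean(small))
```

Because the mean is unweighted, a handful of low-coverage cases with very low scores can pull the overall average toward a common value at both lead times, which is the homogenizing effect noted above.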

FSS scores for all forecasts during the 2-week period are plotted for the 6- and 12-h forecasts as labeled. Solid FSS6L and FSS12L lines represent large events defined by observed domain precipitation coverage >5%. Dotted FSS6S and FSS12S lines represent small events defined by observed domain precipitation coverage ≤ 5%.

Finally, in a simple attempt to quantify forecast quality on a daily basis, a set of elementary criteria was established to define a good forecast. To be considered “useful” at a given scale a forecast was required to span at least three fractional coverage categories, the difference between the numbers of predicted and observed categories could not exceed 25%, and the FSS had to meet or exceed FSS_{useful}. Although these criteria are very rudimentary and will need to be refined, they illustrate a potential means for identifying forecasts with adequate skill. Applying the criteria to the fractions displayed in Fig. 9, the range of useful skill extended from neighborhoods 37 to 217. Applying them to the remaining forecasts individually revealed that 13 of the 26 6-h forecasts containing precipitation possessed some degree of usefulness. The average scale of usability spanned from 62 to 176 points with sample standard deviations of 27 and 72 points, respectively. None of the forecasts were useful at the single-point scale, but 6 were useful at the 37-point scale. Of the 13 forecasts that did not meet the usefulness criteria, 9 possessed fractional biases of greater than ±50%, indicating significant error. The remaining 4 forecasts either covered very small portions of the domain or, in one case, displayed very large displacement errors.
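
The three criteria can be written as a short test. The sketch below is one possible reading rather than the procedure used in the study: it assumes that "span" means the number of occupied fraction categories in the forecast histogram, and that the 25% limit is measured relative to the number of occupied observed categories. The names `occupied_categories` and `is_useful` are hypothetical.

```python
def occupied_categories(histogram):
    """Number of fraction categories that contain at least one neighborhood."""
    return sum(1 for count in histogram if count > 0)


def is_useful(fcst_hist, obs_hist, fss, fss_useful,
              min_categories=3, max_rel_diff=0.25):
    """Apply the three elementary usefulness criteria at one scale."""
    n_fcst = occupied_categories(fcst_hist)
    n_obs = occupied_categories(obs_hist)
    # Criterion 1: the forecast spans at least three coverage categories.
    if n_fcst < min_categories:
        return False
    # Criterion 2: category counts differ by no more than 25%
    # (assumed here to be relative to the observed count).
    if n_obs == 0 or abs(n_fcst - n_obs) / n_obs > max_rel_diff:
        return False
    # Criterion 3: the FSS meets or exceeds FSS_useful.
    return fss >= fss_useful


if __name__ == "__main__":
    # Hypothetical histograms; 0.681 is the FSS_useful quoted for Fig. 9.
    fcst = [5, 3, 2, 0, 0, 0, 0, 0, 0, 0]
    obs = [4, 3, 2, 1, 0, 0, 0, 0, 0, 0]
    print(is_useful(fcst, obs, fss=0.70, fss_useful=0.681))
```

Evaluating `is_useful` at each neighborhood size and recording the smallest and largest sizes that pass would give a range of useful scales comparable to those reported here.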

At 12 h, 12 of 27 forecasts containing precipitation met the usefulness criteria. Only 3 of them met the criteria at the 37-point scale, though 4 others met them at 73 points (121.91 km). The average range was 100 to 178 points with sample standard deviations of 60 and 58 points, respectively. Of the 15 forecasts not meeting the criteria, 11 had fractional biases beyond ±50%. The other forecasts had a variety of issues ranging from moderate biases to very small coverage. Only two of the single-forecast fractional biases were within 15% of 1.0. At 6 h, 9 of the daily biases were within 15% of 1.0.

## 6. Discussion and conclusions

A good measure of any verification method is the degree of knowledge it provides about the forecasts. The neighborhood method and the accompanying FSS are very useful for providing condensed information about forecast accuracy. However, the FSS alone does not provide information on resolution and sharpness. Attributes diagrams generated from the neighborhood fractions indicate that significant variations in these quantities were not initially detected by the FSS. Part of the problem arose from the desensitization of the FSS by numerous neighborhoods that fell outside the bounds of the domain. These effectively slowed the rate at which the FSS increased with scale as a result of systematic reductions in the FBS and FBS_{worst}. Ensuring that all samples stayed within the grid bounds alleviated this problem. Another, more difficult issue involves the presence of unpredictable events. In this study, thresholding the FSS on domain coverage immediately showed that the forecasts of small phenomena collectively displayed very little skill. These phenomena accounted for about half of the forecasts at all lead times. In the literal sense, the FSS was correctly depicting consistently poor performance of the majority of the forecasts. However, should forecasts of inherently low-predictability events even be considered in this type of verification? If the strict operational characteristics of the model are being evaluated they probably should; however, differences in model physics or forecast lead time will not be as readily apparent if these events are retained.
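
For readers who wish to experiment with the in-bounds sampling, a minimal sketch follows. It assumes the standard Roberts and Lean (2008) definitions, FSS = 1 − FBS/FBS_{worst} and FSS_{useful} = 0.5 + f_o/2 with f_o the observed domain coverage, and it simply discards any window that would extend past the grid edge. The function names are hypothetical, and this is not the code used in this study.

```python
def window_fractions(binary, n):
    """Fractions for all n x n windows that stay entirely within the grid."""
    rows, cols = len(binary), len(binary[0])
    out = []
    for i in range(rows - n + 1):
        for j in range(cols - n + 1):
            total = sum(binary[r][c]
                        for r in range(i, i + n)
                        for c in range(j, j + n))
            out.append(total / (n * n))
    return out


def fss(fcst, obs, n):
    """Fractions skill score at window width n (in grid points)."""
    pf = window_fractions(fcst, n)
    po = window_fractions(obs, n)
    count = len(pf)
    # Fractions Brier score and its worst-case (no overlap) reference.
    fbs = sum((f - o) ** 2 for f, o in zip(pf, po)) / count
    fbs_worst = (sum(f * f for f in pf) + sum(o * o for o in po)) / count
    # Undefined when neither field contains any exceedance.
    return 1.0 - fbs / fbs_worst if fbs_worst > 0 else float("nan")


def fss_useful(obs):
    """Roberts and Lean (2008) target skill: 0.5 + f_o / 2."""
    npoints = sum(len(row) for row in obs)
    f_o = sum(sum(row) for row in obs) / npoints
    return 0.5 + f_o / 2.0


if __name__ == "__main__":
    # Toy fields: one observed rain point and a forecast displaced by one point.
    obs = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    fcst = [[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    for n in (1, 2, 4):
        print(n, fss(fcst, obs, n), fss_useful(obs))
```

The brute-force window sum is adequate for small grids; a summed-area table or a convolution would be a natural replacement for domains of realistic size.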

Given the issues with the FSS, the attributes diagrams provide useful information in a relatively condensed format. They are especially relevant for depicting the elements contributing to forecast skill. The concept of the FSS_{useful} and its relation to the decomposition of the Brier skill score comes into better focus when presented in conjunction with the attributes diagram. Although the scale of the FSS_{useful} generally coincided with those scales that straddled the no-skill line, some forecasts possessed considerable skill at scales below that of FSS_{useful}.

The fractions diagrams pose the question of how much information can be mined from the fractional coverage distributions. At this point much of the interpretation is still subjective, but the diagrams provided insight about the general precipitation patterns and the degree to which the observations and forecasts agreed. Differences in spatial organization were subtly apparent in the structures of the distributions at each scale. The width of the distribution and its rate of collapse toward a single large-scale value provided information about how the precipitation entities were spatially organized. Rapid narrowing with scale indicated small precipitation elements that were evenly distributed, while slow narrowing indicated larger, inhomogeneous entities. The shape of the distributions was also influenced by the size and horizontal extent of the precipitation entities. Large events resulted in extended tails at high fractions, especially at small scales. In this study, these patterns were used to show that the 6-h forecasts produced features that were structurally similar to the observations much of the time when coverage was greater than 5%. By 12 h, the structural similarities were considerably weaker, and the best forecasts were on days when precipitation was evenly distributed. The 12-h forecasts also missed at least two cases of widespread precipitation.

The degree of correspondence between bulk properties of the fractions distributions and the precipitation patterns is as yet unknown. This initial investigation shows some promise, but more rigorous studies need to be conducted. The effects of bias and pattern structure are combined in ways that may not be easily distilled. In many ways, domain size turns out to be an important issue. The relatively small domain used in this study effectively limited the set of events being sampled at any single time. Samples from large continent-sized domains will contain multiple events of varying size, organization, and predictability. Such large samples will dilute the fractions distributions in much the same way that they were diluted over time in this study, as exemplified by the contrast in structure between Figs. 6 and 9. Even predictability thresholds, such as the 5% coverage employed herein, would be difficult to implement on such a large scale. One possible remedy might include conducting studies like this over mesoscale areas, much like the domains used by the National Weather Service River Forecast Centers. The large increase in the amount of data, time, and effort may preclude this option. Another possible option would involve filtered templates (Davis et al. 2006) to remove regions of scattered precipitation and isolate the larger, more active regions. Such an option may be more suitable to meet the need for rapid, consolidated validation information.

## Acknowledgments

This research is supported by a grant from the Naval Surface Warfare Center Dahlgren Division (NSWCDD). Computer resources for the COAMPS simulations and data archival were supported in part by a grant of high performance computing (HPC) time from the Department of Defense Major Shared Resource Center, Stennis Space Center, Mississippi. The work was performed on an IBM iDataPlex computer. The River Forecast Centers stage IV data were collected from the NCAR CODIAC data server provided by NCAR/EOL under sponsorship of the National Science Foundation (http://data.eol.ucar.edu/).

## REFERENCES

Ben Bouallègue, Z. B., and S. E. Theis, 2014: Spatial techniques applied to precipitation ensemble forecasts: From verification results to probabilistic products. *Meteor. Appl.*, **21**, 922–929, doi:10.1002/met.1435.

Casati, B., G. Ross, and D. B. Stephenson, 2004: A new intensity-scale approach for the verification of spatial precipitation forecasts. *Meteor. Appl.*, **11**, 141–154, doi:10.1017/S1350482704001239.

Chen, S., and Coauthors, 2003: COAMPS version 3 model description—General theory and equations. NRL Tech Note NRL/PU/7500-03-448, 148 pp. [Available online at http://www.nrlmry.navy.mil/coamps_docs/base/docs/COAMPS_2003.pdf.]

Daley, R., and E. Barker, 2001: NAVDAS: Formulation and diagnostics. *Mon. Wea. Rev.*, **129**, 869–883, doi:10.1175/1520-0493(2001)129<0869:NFAD>2.0.CO;2.

Davies, H. C., 1976: A lateral boundary formulation for multi-level prediction models. *Quart. J. Roy. Meteor. Soc.*, **102**, 405–418, doi:10.1002/qj.49710243210.

Davis, C., B. Brown, and R. Bullock, 2006: Object-based verification of precipitation forecasts. Part I: Methods and application to mesoscale rain areas. *Mon. Wea. Rev.*, **134**, 1772–1784, doi:10.1175/MWR3145.1.

Duc, L., K. Saito, and H. Seko, 2013: Spatial-temporal fractions verification for high-resolution ensemble forecasts. *Tellus*, **65A**, 18171, doi:10.3402/tellusa.v65i0.18171.

Ebert, E. E., 2008: Fuzzy verification of high resolution gridded forecasts: A review and proposed framework. *Meteor. Appl.*, **15**, 51–64, doi:10.1002/met.25.

Ebert, E. E., and J. L. McBride, 2000: Verification of precipitation in weather systems: Determination of systematic errors. *J. Hydrol.*, **239**, 179–202, doi:10.1016/S0022-1694(00)00343-7.

Hodur, R. M., 1997: The Naval Research Laboratory’s Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS). *Mon. Wea. Rev.*, **125**, 1414–1430, doi:10.1175/1520-0493(1997)125<1414:TNRLSC>2.0.CO;2.

Hogan, T. F., M. S. Peng, J. A. Ridout, and W. M. Clune, 2002: A description of the impact of changes to NOGAPS convection parameterization and the increase in resolution to T239L30. Naval Research Laboratory Memo. Rep. NRL/MR/7530-02-52, 10 pp.

Hsu, W.-R., and A. H. Murphy, 1986: The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. *Int. J. Forecasting*, **2**, 285–293, doi:10.1016/0169-2070(86)90048-8.

Lockhoff, M., O. Zolina, C. Simmer, and J. Schulz, 2014: Evaluation of satellite-retrieved extreme precipitation over Europe using gauge observations. *J. Climate*, **27**, 607–623, doi:10.1175/JCLI-D-13-00194.1.

Mason, S. J., 2004: On using “climatology” as a reference strategy in the Brier and ranked probability skill scores. *Mon. Wea. Rev.*, **132**, 1891–1895, doi:10.1175/1520-0493(2004)132<1891:OUCAAR>2.0.CO;2.

Mittermaier, M., and N. Roberts, 2010: Intercomparison of spatial forecast verification methods: Identifying skillful spatial scales using the fractions skill score. *Wea. Forecasting*, **25**, 343–354, doi:10.1175/2009WAF2222260.1.

Mittermaier, M., N. Roberts, and S. A. Thompson, 2013: A long-term assessment of precipitation forecast skill using the Fractions Skill Score. *Meteor. Appl.*, **20**, 176–186, doi:10.1002/met.296.

Nachamkin, J. E., 2004: Mesoscale verification using meteorological composites. *Mon. Wea. Rev.*, **132**, 941–955, doi:10.1175/1520-0493(2004)132<0941:MVUMC>2.0.CO;2.

Nachamkin, J. E., J. Schmidt, and C. Mitrescu, 2009: Verification of cloud forecasts over the eastern Pacific using passive satellite retrievals. *Mon. Wea. Rev.*, **137**, 3485–3500, doi:10.1175/2009MWR2853.1.

Roberts, N. M., 2005: An investigation of the ability of a storm scale configuration of the Met Office NWP model to predict flood producing rainfall. Met Office Tech. Rep. 455, Joint Centre for Mesoscale Meteorology Rep. 150, 80 pp. [Available online at http://research.metoffice.gov.uk/research/nwp/publications/papers/technical_reports/2005/FRTR455/FRTR455.pdf.]

Roberts, N. M., and H. W. Lean, 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. *Mon. Wea. Rev.*, **136**, 78–97, doi:10.1175/2007MWR2123.1.

Romine, G. S., C. S. Schwartz, C. Snyder, J. L. Anderson, and M. L. Weisman, 2013: Model bias in a continuously cycled assimilation system and its influence on convection-permitting forecasts. *Mon. Wea. Rev.*, **141**, 1263–1284, doi:10.1175/MWR-D-12-00112.1.

Rutledge, S. A., and P. V. Hobbs, 1983: The mesoscale and microscale structure and organization of clouds and precipitation in midlatitude cyclones. VIII: A model for the “seeder-feeder” process in warm-frontal rainbands. *J. Atmos. Sci.*, **40**, 1185–1206, doi:10.1175/1520-0469(1983)040<1185:TMAMSA>2.0.CO;2.

Rutledge, S. A., and P. V. Hobbs, 1984: The mesoscale and microscale structure and organization of clouds and precipitation in midlatitude cyclones. XII: A diagnostic modeling study of precipitation development in narrow cold-frontal rainbands. *J. Atmos. Sci.*, **41**, 2949–2972, doi:10.1175/1520-0469(1984)041<2949:TMAMSA>2.0.CO;2.

Schmidt, J. M., 2001: Moist physics development for the Naval Research Laboratory’s Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS). BACIMO, CD-ROM.

Schwartz, C. S., and Coauthors, 2010: Toward improved convection-allowing ensembles: Model physics sensitivities and optimizing probabilistic guidance with small ensemble membership. *Wea. Forecasting*, **25**, 263–280, doi:10.1175/2009WAF2222267.1.

Söhne, N., J.-P. Chaboureau, and F. Guichard, 2008: Verification of cloud cover forecast with satellite observation over West Africa. *Mon. Wea. Rev.*, **136**, 4421–4434, doi:10.1175/2008MWR2432.1.

Stratman, D. R., M. C. Coniglio, S. E. Koch, and M. Xue, 2013: Use of multiple verification methods to evaluate forecasts of convection from hot- and cold-start convection-allowing models. *Wea. Forecasting*, **28**, 119–138, doi:10.1175/WAF-D-12-00022.1.

Theis, S. E., A. Hense, and U. Damrath, 2005: Probabilistic precipitation forecasts from a deterministic model: A pragmatic approach. *Meteor. Appl.*, **12**, 257–268, doi:10.1017/S1350482705001763.

Weusthoff, T., F. Ament, M. Arpagaus, and M. W. Rotach, 2010: Assessing the benefits of convection-permitting models by neighborhood verification: Examples from MAP D-PHASE. *Mon. Wea. Rev.*, **138**, 3418–3433, doi:10.1175/2010MWR3380.1.

Wilks, D. S., 2006: *Statistical Methods in the Atmospheric Sciences.* 2nd ed. Elsevier, 627 pp.

Zacharov, P., and D. Rezacova, 2009: Using the fractions skill score to assess the relationship between an ensemble QPF spread and skill. *Atmos. Res.*, **94**, 684–693, doi:10.1016/j.atmosres.2009.03.004.

Zepeda-Arce, J., E. Foufoula-Georgiou, and K. K. Droegemeier, 2000: Space-time rainfall organization and its role in validating quantitative precipitation forecasts. *J. Geophys. Res.*, **105**, 10 129–10 146, doi:10.1029/1999JD901087.

^{1} COAMPS is a registered trademark of the Naval Research Laboratory.