## 1. Introduction

When performing model intercomparison of nonprobabilistic skill scores for precipitation, an accurate experimental design is needed. In particular, if the compared models have significantly different grid-box sizes, it is imperative to verify precipitation forecasts on a common grid. On the other hand, if rain gauge data are simply box averaged, the smoothness of the analysis will depend on the average number of rain gauges per grid box (Mesinger 1996). Moreover, a general postprocessing procedure (grid transformation) applied to the forecast precipitation fields might change the skill scores in a statistically significant way. In fact, McBride and Ebert (2000) suggest that spatial averaging or bilinear interpolation may affect the comparative skills of the models.

In view of the above problem, this paper investigates the effect on skill scores of two simple postprocessing procedures applied to the precipitation fields forecast by a limited area model (LAM).

For this purpose, LAM forecasts will be verified on their native grids, whereas postprocessed forecasts will be verified on a different grid, as will be seen in more detail later. The two verification grids considered have a grid-box size of about 10 km; hence, to produce an adequate (i.e., less sensitive to grid-box size) observed rainfall analysis, a Barnes objective analysis scheme is used (Barnes 1964, 1973; Koch et al. 1983).

The QUADRICS Bologna Limited Area Model (QBOLAM), running operationally at the Dipartimento dei Servizi Tecnici Nazionali of Presidenza del Consiglio dei Ministri (Department of National Technical Services of the Italian Cabinet Presidency, DSTN-PCM), is used to study the effect of the different interpolation procedures on skill scores. QBOLAM is derived from the Bologna Limited Area Model (BOLAM; Buzzi et al. 1994), which has shown excellent forecast capabilities in the Comparison of Mesoscale Prediction and Research Experiments (COMPARE; Georgelin et al. 2000; Nagata et al. 2001). The procedures considered are bilinear interpolation and a simple nearest-neighbor average method (Baldwin 2000; Mesinger 1996; Mesinger et al. 1990), also known as remapping or budget interpolation. For brevity, this last procedure will be called remapping in the remainder of the paper.

As will be seen in more detail later, these methods differ in many substantial ways, since, for example, bilinear interpolation treats gridpoint precipitation values as defined at points, while remapping considers gridpoint values as grid-box values.

The skill scores used are the equitable threat score (ETS; Schaefer 1990) and the Hanssen–Kuipers score (HK; Hanssen and Kuipers 1965; McBride and Ebert 2000). This last score is also known as the true skill statistic (TSS; Flueck 1987) or the Peirce skill score (PSS; Peirce 1884; Stephenson 2000). Another useful score is the bias score, or BIA. It is used to compare the relative frequency of precipitation forecasts and observations (for a review see, e.g., Wilks 1995).

In order to determine from a statistical point of view whether the results on different grids are equivalent or not, a hypothesis test is needed. Common hypothesis tests need an a priori definition of a parametric probability density function (PDF) and the assumption of statistical independence for the quantities under examination.

The hypothesis testing method used in this study was originally proposed by Hamill (1999). It is based on a resampling technique called the bootstrap method [Diaconis and Efron (1983); for a general discussion on meteorological applications see Wilks (1995) and references therein]. This is a computer-based method that builds a PDF consistent with the selected null hypothesis. Random sampling is performed from the available data, followed by a significance assessment of the test based on a comparison of the observed statistic with the numerically built statistics. The general method does not need any assumptions concerning the probability density function.

According to Hamill (1999) skill score comparison between two competing models may not be completely informative on the actual forecasting skills, because one model might have inflated scores due to its tendency to overforecast a precipitation occurrence. He proposes the BIA adjustment method when performing a skill score comparison, in order to take into account the precipitation forecast frequency difference that two different models may have. The hypothesis testing method will be applied in this paper with and without BIA adjustment.

The application of this procedure should help to determine whether the obtained scores are the result of an actual forecast capability or are simply due to BIA differences.

The paper is organized as follows. Section 2a describes the precipitation test data used. Section 2b describes the QBOLAM LAM and the two different grids used in this study, while section 2c briefly describes the Barnes objective analysis scheme. A description of the bilinear interpolation and remapping procedures is provided in section 3. Section 4a gives a short review of the nonprobabilistic skill scores used. Section 4b illustrates the BIA adjustment procedure. Section 4c describes the bootstrap resampling technique. Results are presented in section 5 and discussed in section 6. Conclusions and final remarks are in section 7. The appendix contains a detailed analysis of the effect of the BIA adjustment procedure on ETS and HK.

## 2. Verification and forecast data

### a. Rain gauge verification data

The rain gauge data used in the European Project INTERREG II C (Gestione del territorio e prevenzione dalle inondazioni, land management and flood prevention) have been used to study the effects of bilinear interpolation on precipitation forecast statistics and to apply the above-described hypothesis test. These rain gauge data cover the Italian regions of Liguria and Piedmont (northwestern Italy) from 1 October 2000 through 31 May 2001 (243 days). The verification area corresponds approximately to the shaded area in Fig. 1. This part of Italy is characterized by the presence of high mountains, the northwestern Alps (Fig. 2).

The Liguria rain gauge data were provided by the Environmental Monitoring Research Center of the University of Genova (Centro di Ricerca Interuniversitario in Monitoraggio Ambientale, CIMA). The Regional Monitoring Network provided the Piedmont region data. The Piedmont rain gauge network has 294 rain gauges, while the Liguria network has 96. Not all rain gauges were active during the time interval considered: the number of active gauges over the two regions ranged from a maximum of about 190 in October to a minimum of about 70, with an average of about 90.

Time resolution is variable from 5 to 30 min. Rainfall records were accumulated for 24 h. Observation thresholds used for calculation of the nonparametric scores are 0.5, 5.0, 10.0, 20.0, 30.0 and 40.0 mm (24 h)^{−1}.

### b. Description of model QBOLAM and related grids

QBOLAM is a finite-difference, primitive equation, hydrostatic limited area model running operationally on a 128-processor parallel computer (QUADRICS) at DSTN-PCM as an element of the Poseidon sea wave and tide forecasting system. It is a parallel version of the BOLAM model described by Buzzi et al. (1994), which has been developed at the Istituto di Scienze dell'Atmosfera e del Clima–Consiglio Nazionale delle Ricerche (ISAC–CNR, former FISBAT). Analysis and boundary conditions are provided by the European Centre for Medium-Range Weather Forecasts (ECMWF). A 60-h forecast starts daily at 1200 UTC. The first 12 h of each forecast are neglected (spinup time), and only the following 48 h are considered. Outputs are available routinely every 3 h and are considered here only up to 24 h. The horizontal domain [19.9° × 38.5° of rotated (defined below) latitude × longitude] covers the entire Mediterranean Sea, with 40 levels in the vertical (sigma coordinates). Horizontal grid-box spacing is 0.1° for both latitude and longitude on the original computational grid, equivalent to 386 × 210 grid points. For computational and parallelization reasons, simplified parameterizations of convection (Kuo 1974) and radiation (Page 1986) are used.

The computational grid is a rotated Arakawa C grid (in geographical coordinates), where the rotated equator goes across the domain's midlatitude, in order to minimize grid anisotropy (Buzzi et al. 1998). The rotated grid does not have constant spacing (in °) in actual latitude and longitude and its center is located at the point of the geographical coordinates (38.5°N, 12.5°E). The grid size is about 11 km.

The model output on the rotated grid is then interpolated using either bilinear interpolation or remapping to a 17° × 42.8° latitude × longitude grid, with an equal spacing of 0.1° for both coordinates (429 × 171 grid points). The grid center has the coordinates (38.5°N, 14.4°E). This grid is labeled P (postprocessing grid), while the rotated grid is labeled N (native).

The N and P grid domains are shown in Fig. 1 using the Mercator projection. An example of the layout of the two grids is presented in Fig. 3.

### c. Barnes objective analysis scheme

The high-resolution grid spacing can make the precipitation analysis questionable if a simple grid-box average of rain gauge data is used. In fact, about 70% of the grid boxes considered have only a single rain gauge in their interior (not shown). It may be preferable then to perform the verification on a coarser grid so that there are more precipitation observations per grid box and, consequently, the analysis is more robust. An alternative to such an approach is the use of an objective analysis scheme, in order to have a precipitation analysis that is less sensitive to the grid spacing. We use a Barnes (1973) scheme. This is a Gaussian weighted-averaging technique, assigning a weight to an observation as a function of the distance between the observation and the grid point. The implementation described by Koch et al. (1983) is applied to analyze the precipitation observations on either the N grid or the P grid. A first pass is performed to produce a first-guess precipitation analysis, followed by a second pass, which is needed to increase the amount of detail in the first guess. This is done using a numerical convergence parameter *γ* (0 < *γ* < 1) that forces the agreement between the observed precipitation field and the second-pass interpolated field. Moreover, Koch et al. (1983) have shown that only two passes through the data are needed “to achieve the desired scale resolution because of the rapidity with which convergence is reached.”

Another appealing feature of the Barnes technique is that the response function is determined a priori. This allows easy calculation of the weight parameter, given the average data spacing. Furthermore, to produce a smoother analysis it is possible to manually set the average data spacing to a value greater than or equal to the actual average data spacing. For the purposes of this study the average data spacing has been set to 0.2°. This choice is consistent with the constraint that the ratio between grid size and the selected data spacing lie approximately between 0.3 and 0.5 (Barnes 1964, 1973). Grid points that do not have a rain gauge within a radius of 0.15° are neglected, to avoid the excessive smearing introduced by the analysis scheme at grid points far from the actual locations of rain gauges.

The convergence parameter *γ* is set to a value of 0.2 that produces the maximum gain in detail with such design. For a detailed description of the objective analysis methodology the reader is referred to Koch et al. (1983).
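For illustration, the two-pass scheme described above can be sketched as follows. This is a minimal sketch, not the operational implementation: the function name and argument layout are ours, plain degree offsets are used as a distance proxy, and the Gaussian weight parameter follows the Koch et al. (1983) prescription from the (possibly inflated) average data spacing.

```python
import numpy as np

def barnes_two_pass(obs_xy, obs_val, grid_xy, data_spacing=0.2, gamma=0.2,
                    max_radius=0.15):
    """Two-pass Barnes analysis (sketch after Koch et al. 1983).

    obs_xy  : (M, 2) observation coordinates (deg)
    obs_val : (M,)   observed 24-h rainfall (mm)
    grid_xy : (N, 2) analysis grid-point coordinates (deg)
    Grid points with no gauge within `max_radius` are set to NaN.
    """
    # Gaussian weight parameter from the average data spacing
    # (Koch et al. 1983 prescription).
    kappa = 5.052 * (2.0 * data_spacing / np.pi) ** 2

    def weights(targets, kappa_eff):
        # squared distances between each target point and every observation
        d2 = ((targets[:, None, :] - obs_xy[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / kappa_eff), np.sqrt(d2)

    # First pass: Gaussian-weighted average on the grid and at the gauges.
    w_g, dist = weights(grid_xy, kappa)
    first_grid = (w_g @ obs_val) / w_g.sum(axis=1)
    w_o, _ = weights(obs_xy, kappa)
    first_obs = (w_o @ obs_val) / w_o.sum(axis=1)

    # Second pass: restore detail; the convergence parameter gamma
    # (0 < gamma < 1) sharpens the response function.
    w_g2, _ = weights(grid_xy, gamma * kappa)
    analysis = first_grid + (w_g2 @ (obs_val - first_obs)) / w_g2.sum(axis=1)

    # Neglect grid points with no gauge within the cutoff radius.
    analysis[dist.min(axis=1) > max_radius] = np.nan
    return analysis
```

With a uniform observed field the analysis reproduces the observations exactly, while isolated gauges far from any grid point leave the corresponding grid values undefined, mirroring the 0.15° cutoff used in the study.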

## 3. Bilinear interpolation and remapping procedures

### a. Bilinear interpolation

Grid transformations may be performed using bilinear interpolation, which is used operationally (see, e.g., ECMWF 2001). However, this method may not be desirable for precipitation, because it smooths the precipitation field, increasing the minima and reducing the maxima. In addition, bilinear interpolation does not conserve the total precipitation forecast by the model.
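The smoothing effect is easy to see in a minimal sketch of bilinear interpolation on a regular grid (illustrative only; the function name and the uniform-grid assumption are ours): the interpolated value is a convex combination of the four surrounding gridpoint values, so it can never exceed the local maximum.

```python
import numpy as np

def bilinear(field, lat, lon, qlat, qlon):
    """Bilinear interpolation of `field` (on a regular lat/lon grid)
    to a query point; gridpoint values are treated as point values."""
    # indices of the grid cell containing the query point
    i = np.searchsorted(lat, qlat) - 1
    j = np.searchsorted(lon, qlon) - 1
    # fractional position inside the cell
    t = (qlat - lat[i]) / (lat[i + 1] - lat[i])
    u = (qlon - lon[j]) / (lon[j + 1] - lon[j])
    # convex combination of the four corner values
    return ((1 - t) * (1 - u) * field[i, j]
            + (1 - t) * u * field[i, j + 1]
            + t * (1 - u) * field[i + 1, j]
            + t * u * field[i + 1, j + 1])
```

For example, an isolated 40-mm maximum surrounded by zeros is reduced to 10 mm at a query point placed at the center of an adjacent cell, which is the maxima-damping behavior discussed above.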

### b. Remapping

The other method considered is remapping, which is used to integrate forecast precipitation from the native grid to the postprocessing grid (Baldwin 2000; Mesinger 1996; Mesinger et al. 1990).

The remapping procedure, used operationally at the National Centers for Environmental Prediction/Environmental Modeling Center (NCEP/EMC), is applied. It conserves, to a desired degree of accuracy, the total forecast precipitation of the native grid. This integration technique is a discretized version of a remapping procedure that uses an area-weighted average.

The remapping was performed by subdividing the boxes centered on each postprocessing grid point in 5 × 5 subgrid boxes, and assigning to each subgrid point the value of the nearest native grid point. The average of these subgrid point values produces the remapped value of the postprocessing grid point. An example of how remapping works is shown in Fig. 4. The subgrid boxes included in the shaded area are assigned the value of the N grid point labeled 2, whereas the subgrid boxes included in the unshaded area are assigned the value of the N grid point labeled 5.
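The 5 × 5 sub-box construction can be sketched as follows for a single P-grid point. This is a simplified, hypothetical implementation (names are ours) that assumes the native grid is described by regular one-dimensional latitude and longitude arrays; the operational procedure works on the rotated grid.

```python
import numpy as np

def remap_value(native_field, native_lat, native_lon, p_lat, p_lon,
                box=0.1, nsub=5):
    """Remapped value at one P-grid point: the average over nsub x nsub
    sub-box centers, each taking the value of the nearest N-grid point."""
    # centers of the nsub x nsub sub-boxes inside the P-grid box
    offs = (np.arange(nsub) + 0.5) / nsub * box - box / 2.0
    total = 0.0
    for dy in offs:
        for dx in offs:
            # nearest native grid point to this sub-box center
            i = np.argmin(np.abs(native_lat - (p_lat + dy)))
            j = np.argmin(np.abs(native_lon - (p_lon + dx)))
            total += native_field[i, j]
    return total / nsub**2
```

Because each sub-box simply inherits the nearest native grid-box value and the result is an area-style average, a uniform native field is reproduced exactly, consistent with the (approximate) conservation of total precipitation noted above.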

## 4. Statistical methodology used for the sensitivity study

### a. Nonprobabilistic scores

Nonprobabilistic verification measures are widely used to evaluate categorical forecast quality. A categorical dichotomous statement is simply a yes–no statement; in this particular context it means that forecast or observed precipitation is below or above a defined threshold *θ*. The combination of the different possibilities for observations and forecasts defines a contingency table. For each precipitation threshold, the four categories of hits, false alarms, misses, and correct no-rain forecasts (*a*, *b*, *c*, and *d*, as shown in Table 1) are defined. The scores used in this study are the above-mentioned ETS and HK scores (see Schaefer 1990; Hanssen and Kuipers 1965) and the BIA, defined as

$$\mathrm{BIA} = \frac{a + b}{a + c}. \tag{1}$$

A BIA equal to one means that the forecast frequency is the same as the observation frequency, and the model is then referred to as unbiased. A BIA greater than 1 is then indicative of a model that forecasts more events above the threshold than observed (overforecasting), while a BIA less than 1 shows that the model forecasts fewer events than observed (underforecasting).

The ETS is defined as

$$\mathrm{ETS} = \frac{a - a_{r}}{a + b + c - a_{r}}, \tag{2}$$

where *a*_{r} is the number of model hits expected from a random forecast:

$$a_{r} = \frac{(a + b)(a + c)}{n}, \tag{3}$$

with *n* = *a* + *b* + *c* + *d* the total number of forecast–observation pairs.

A perfect forecast will have an ETS equal to 1, while an ETS that is close to 0 or negative will indicate a questionable ability of the model to produce successful forecast statistics.

The HK score is defined as

$$\mathrm{HK} = \frac{ad - bc}{(a + c)(b + d)}. \tag{4}$$

The HK ranges from −1 to 1. This score is independent of the event and nonevent distributions. An HK score equal to 1 is associated with a perfect forecast both for events and nonevents, while a score of −1 means that hits and correct zero-precipitation forecasts are zero. The HK score is equal to 0 for a constant forecast.

Introducing the probability of detection (POD) and the probability of false detection (F; Stephenson 2000),

$$\mathrm{POD} = \frac{a}{a + c}, \qquad F = \frac{b}{b + d}, \tag{5}$$

it is possible to define the HK score as a linear combination of these indexes (Stephenson 2000); that is,

$$\mathrm{HK} = \mathrm{POD} - F.$$

This expression of HK will be used to interpret the effects of the BIA adjustment procedure.
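Under the conventions of Table 1 (*a* hits, *b* false alarms, *c* misses, *d* correct no-rain forecasts), all three scores follow directly from the four counts. The helper below is a sketch with hypothetical naming:

```python
def scores(a, b, c, d):
    """BIA, ETS, and HK from contingency-table counts:
    a hits, b false alarms, c misses, d correct no-rain forecasts."""
    n = a + b + c + d
    bia = (a + b) / (a + c)                 # frequency bias
    a_r = (a + b) * (a + c) / n             # hits expected by chance
    ets = (a - a_r) / (a + b + c - a_r)     # equitable threat score
    pod = a / (a + c)                       # probability of detection
    f = b / (b + d) if (b + d) else 0.0     # probability of false detection
    hk = pod - f                            # Hanssen-Kuipers score
    return bia, ets, hk
```

A perfect forecast (b = c = 0) yields BIA = ETS = HK = 1, while a constant "yes" forecast gives HK = 0, matching the properties listed above.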

### b. The BIA adjustment procedure

Mesinger (1996) and Hamill (1999) pointed out that although ETS, in the form expressed in Eq. (2), is corrected for hits in a random forecast using Eq. (3), it is still influenced by model BIA. The skill of a random forecast is small for intense precipitation; hence, at higher thresholds, a model with a greater BIA should normally exhibit a greater ETS (Mesinger 1996). Nevertheless, an overforecasting model (BIA > 1) could have so many false alarms that its skill scores are negatively affected compared with a model with a lower BIA.

Mason (1989) has shown that a skill score very similar to ETS, the critical success index (CSI; Donaldson et al. 1975), is highly dependent on the threshold probability, that is, the cutoff probability above which the event is forecast to occur and below which it is not. Moreover, he indicates that the HK score is also sensitive to the selected threshold probability.

The threshold probability is a function of the actual precipitation threshold used to compute the contingency tables (Mason 1989).

Usually the contingency tables are filled using the same precipitation threshold for both observations and forecasts. To determine the effects on skill scores of BIA differences between two models (“competitor” and “reference” in the following), the BIA adjustment procedure is introduced. In this study, the two compared models are the forecasts on the native grid and the postprocessed forecasts using either bilinear interpolation or remapping. In the BIA adjustment procedure the contingency tables of the competitor forecast are recalculated, adjusting the forecast precipitation threshold, while keeping the observation threshold unchanged, in order to have the competitor BIA similar to the reference BIA. This procedure is repeated independently for each verification threshold (T. M. Hamill 2000, personal communication).

The reference forecast should be the one with BIA closest to unity for almost all the verification thresholds, that is, the one showing the least tendency to underforecast or overforecast.

The variation of the forecast threshold can be interpreted as a variation of the threshold probability in order to get similar BIA scores.
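The adjustment can be sketched as a brute-force search over candidate forecast thresholds, keeping the observation threshold fixed. This is a hypothetical illustration (function and variable names are ours), not the exact operational procedure; it simply picks the forecast threshold whose BIA is closest to the reference BIA.

```python
import numpy as np

def bias_adjust_threshold(fcst, obs, obs_thr, target_bias):
    """Find the forecast threshold whose BIA (yes-forecasts over
    yes-observations, with the observation threshold held at obs_thr)
    is closest to target_bias.

    fcst, obs : 1-D arrays of matched daily gridpoint values (mm/24 h)
    """
    n_obs_yes = np.sum(obs >= obs_thr)
    best_thr, best_gap = obs_thr, np.inf
    # candidate thresholds: the distinct forecast values themselves
    for thr in np.unique(fcst):
        bia = np.sum(fcst >= thr) / n_obs_yes
        gap = abs(bia - target_bias)
        if gap < best_gap:
            best_thr, best_gap = thr, gap
    return best_thr
```

For an overforecasting competitor the selected forecast threshold ends up above the observation threshold, reducing the yes-forecast count until the BIA matches the reference, exactly as described above.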

The BIA adjustment helps to assess whether the skill score differences between the competitor and reference models are the result of an actual different forecast capability, or are simply explained by the differing BIA scores.

The appendix [Eqs. (A6) and (A8)] shows that ETS and HK vary due to the change in the number of hits and false alarms. This is expected, because the BIA adjustment procedure changes the marginal distributions of the forecasts, by changing the forecast thresholds, while leaving the marginal distributions of the observations unaltered.

### c. The resampling method

For the purpose of model comparison, a confidence interval is necessary to assess whether score differences between two competing models are statistically significant. This is computed using as a hypothesis test the aforementioned Hamill resampling technique (Hamill 1999). A short description of the resampling method follows; for an exhaustive discussion of the details of the method the reader is referred to Hamill (1999).

This test (like other hypothesis test methods) requires that the time series considered have negligible autocorrelations. Accadia et al. (2003) have shown that model forecast errors are negligibly autocorrelated in time for the dataset used in this paper if observations and forecasts are accumulated to 24 h. Moreover, in order to take into account any possible spatial correlation of forecast error when applying the resampling method, the nonparametric scores used in this study are calculated on the whole set of grid points available each day (Hamill 1999).
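Pooling all available grid points into a single daily contingency table, as done here, can be sketched with a small hypothetical helper (names are ours):

```python
import numpy as np

def daily_contingency(fcst, obs, thr):
    """Daily contingency table (a, b, c, d) for one threshold, pooling
    all verification grid points available on that day."""
    f = np.asarray(fcst) >= thr   # forecast yes/no
    o = np.asarray(obs) >= thr    # observed yes/no
    a = int(np.sum(f & o))        # hits
    b = int(np.sum(f & ~o))       # false alarms
    c = int(np.sum(~f & o))       # misses
    d = int(np.sum(~f & ~o))      # correct no-rain forecasts
    return a, b, c, d
```

Summing such tables over the verification period, rather than averaging daily scores, is what makes the resampling procedure below insensitive to individual noisy days.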

Following Hamill (1999), the null hypothesis is that the two competing forecasts have equal skill; for the ETS, for example,

$$H_{0}{:}\quad \mathrm{ETS}_{\mathrm{M1}} - \mathrm{ETS}_{\mathrm{M2}} = 0. \tag{6}$$

The test is also applied to HK, which is used in place of ETS. For the purpose of this study, the competing model forecasts M1 and M2 will be two different outputs from the same model: one on the native grid, the other on the postprocessing grid, using either bilinear interpolation or remapping, as previously described in section 3.

The daily verification statistics consist of *n* daily contingency tables (*n* = 243 days) calculated from data accumulated to 24 h. Each table is defined as a vector of four elements:

$$\mathbf{x}_{i,j} = (a, b, c, d)_{i,j}, \qquad i = 1, 2, \quad j = 1, \ldots, n, \tag{8}$$

where *i* represents here the QBOLAM output either on the N grid (*i* = 1) or interpolated on the P grid (*i* = 2), and *j* is the day index. The contingency table elements are summed over the entire time period for both model forecasts M1 and M2:

$$\mathbf{S}_{\mathrm{M1}} = \sum_{j=1}^{n} \mathbf{x}_{1,j}, \tag{9}$$

$$\mathbf{S}_{\mathrm{M2}} = \sum_{j=1}^{n} \mathbf{x}_{2,j}. \tag{10}$$

Let {*I*_{k}}_{k=1, … , n} be a random indicator family over the entire set of *n* days, where each element can be randomly set equal to 1 or 2. This is equivalent to randomly choosing, for each day, the daily contingency table associated with either model M1 or M2, respectively. Using these random indexes, the resampling method is applied by summing the shuffled contingency table vectors from Eq. (8) over the *n* days:

$$\mathbf{S}^{*}_{\mathrm{M1}} = \sum_{k=1}^{n} \mathbf{x}_{I_{k},k}, \tag{11}$$

$$\mathbf{S}^{*}_{\mathrm{M2}} = \sum_{k=1}^{n} \mathbf{x}_{3-I_{k},k}, \tag{12}$$

where the table selected for each day is determined by the random indicator family {*I*_{k}}. In other words, if *I*_{k} = 2, the two daily contingency tables for day *k* are swapped, while if *I*_{k} = 1, the swapping is not performed. In this way, it is possible to build *N*_{B} different sample sums [Eqs. (11) and (12)] by generating *N*_{B} new random indicator families. The summation over days reduces the sensitivity of the scores to small changes in the contingency table population. The resampled statistic is calculated here *N*_{B} = 10 000 times to build the null distribution. Note that no hypothesis is needed on the resampled PDF.

A significance level of *α* = 0.05 is assumed for all tests performed in this study. The null hypothesis *H*_{0} is tested by a two-tailed test using the percentile method (Wilks 1995).

The (1 − *α*)% confidence interval is determined by checking the position of the score differences, for example (ETS_{M1} − ETS_{M2}), in the resampled distribution of (ETS^{*}_{M1} − ETS^{*}_{M2}) built from the *N*_{B} bootstrap samples. The lower and upper limits of the confidence interval are located at the percentiles *α*/2 and 1 − *α*/2 of the *N*_{B} bootstrap sample differences. In this way, the numbers *t̂*_{L} and *t̂*_{U}, which satisfy the following relations, are computed:

$$\Pr{}^{*}\left[(\mathrm{ETS}^{*}_{\mathrm{M1}} - \mathrm{ETS}^{*}_{\mathrm{M2}}) \le \hat{t}_{L}\right] = \alpha/2,$$

$$\Pr{}^{*}\left[(\mathrm{ETS}^{*}_{\mathrm{M1}} - \mathrm{ETS}^{*}_{\mathrm{M2}}) \ge \hat{t}_{U}\right] = \alpha/2,$$

where Pr*[ ] represents the probabilities computed from the null distribution. The null hypothesis in Eq. (6) is then verified if, in the example, (ETS_{M1} − ETS_{M2}) > *t̂*_{L} and (ETS_{M1} − ETS_{M2}) < *t̂*_{U}.

Score differences outside the interval (*t̂*_{L}, *t̂*_{U}) are statistically significant (for the chosen *α*); that is, the hypothesis *H*_{0} is rejected.
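The whole procedure can be sketched compactly. Function and variable names below are ours; `score` stands for any function mapping a summed (*a*, *b*, *c*, *d*) vector to a scalar skill score, such as the ETS or HK.

```python
import numpy as np

def resampling_test(tables_m1, tables_m2, score, n_boot=10000, alpha=0.05,
                    rng=None):
    """Hamill (1999)-style resampling test on summed contingency tables.

    tables_m1, tables_m2 : (n_days, 4) arrays of daily (a, b, c, d) vectors
    score : function mapping a summed (a, b, c, d) vector to a skill score
    Returns (observed_difference, t_lower, t_upper).
    """
    rng = np.random.default_rng(rng)
    t1 = np.asarray(tables_m1, float)
    t2 = np.asarray(tables_m2, float)
    n_days = t1.shape[0]

    # observed statistic: score difference of the full-period sums
    d_obs = score(t1.sum(axis=0)) - score(t2.sum(axis=0))

    # null distribution: randomly swap the daily tables between the models
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        swap = rng.integers(0, 2, n_days).astype(bool)  # the indicator family
        s1 = np.where(swap[:, None], t2, t1).sum(axis=0)
        s2 = np.where(swap[:, None], t1, t2).sum(axis=0)
        diffs[b] = score(s1) - score(s2)

    t_lo, t_hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return d_obs, t_lo, t_hi
```

An observed difference falling outside (t_lo, t_hi) rejects the null hypothesis at the chosen significance level, as in the two-tailed percentile test described above.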

## 5. Results

The bootstrap hypothesis test has been applied to the P- and N-grid outputs to check whether a grid transformation via either bilinear interpolation or remapping produces statistically significant differences in the nonparametric scores.

The results for the 24-h accumulation time are shown both with and without the BIA adjustment.

### a. Effect of bilinear interpolation

Figure 5a shows clearly that bilinear interpolation significantly reduces the BIA score for precipitation thresholds greater than 5 mm (24 h)^{−1}, while it is slightly (but still significantly) increased for the lowest threshold. The ETS (Fig. 5b) and HK (Fig. 5c) skill scores also show significant differences.

It must be pointed out that the actual forecast skill remains unaltered; the observed change in the verification statistics is a computational artifact introduced by bilinear interpolation onto a different grid.

The smoothing effect of bilinear interpolation on the precipitation field produces a decrease in the original maxima (decreasing significantly the BIA), while the minima are increased. The smoothing also produces a smearing of the field, decreasing the gradients across the rain–no-rain boundaries. This effect inflates significantly the ETS and HK scores. Actual rainfall observations on the edges of the forecast rain–no-rain boundaries, verified on the P grid, can be associated with an increased value of forecast precipitation, rewarding the interpolated forecast with a hit instead of a miss. On the other hand, if precipitation above a certain threshold is not observed, this edge-smearing effect introduced by the interpolation may decrease the forecast precipitation in such a way that a false alarm becomes a correct no-rain forecast.

Figures 5d–f show the results of the hypothesis tests after performing the BIA adjustment. The skill score differences remain significant; hence, it is possible to infer that these differences are actually due to an artificially improved capability in forecasting precipitation when the original forecast is bilinearly interpolated on the P grid. The response of ETS and HK might be surprising at first sight, since the N-grid ETS shows some small improvements after the adjustment, while HK evidently decreases. This different behavior can be interpreted using Eq. (A6) for ETS and Eq. (A8) for HK, in the appendix. The actual changes induced by the BIA adjustment on the N-grid contingency table elements for each threshold are reported in Table 2; Table 3 shows the relative differences. Equation (A6) shows that ETS is sensitive to the change of hits, random hits, and false alarms. As can be seen from Table 2, in the part concerning bilinear interpolation, the decrease of hits for thresholds greater than 0.5 mm (24 h)^{−1} is always smaller than the decrease of false alarms. This produces the observed increase of ETS after BIA adjustment. Equation (A8) shows that the BIA-adjusted HK can be expressed as a function of the relative changes of hits and false alarms. The false alarm relative differences are always greater than the hits relative differences. On the other hand, the false alarm relative difference is multiplied by F, while the hits relative difference is multiplied by POD [Eqs. (5) and (A8)]. POD ranges from about 0.7 down to 0.55, while F goes from about 0.2 to 0.05 (not shown). Hence, despite the imbalance between the relative changes of hits and false alarms, the HK variation is dominated by the relative change of hits.

### b. Effect of remapping

Remapping on the P grid produces a statistically significant variation of the considered scores (Figs. 6a–c). The BIA score shows the same qualitative behavior previously discussed for bilinear interpolation, although the decrease of BIA for thresholds greater than 5 mm (24 h)^{−1} is less pronounced (Fig. 6a). The remapping ETS (Fig. 6b) has lower values for all thresholds, compared with the bilinear interpolation ETS (see Fig. 5b), but score differences remain significant (although small) when compared with the N-grid ETS. The HK score shows a significant improvement only for the two lowest thresholds (Fig. 6c). The remapping, by design, produces a reduced edge smearing on precipitation forecasts, while the precipitation maxima are not changed much by the simple average. This is consistent with the property described by Mesinger (1996) that remapping conserves the total precipitation amount. The same considerations previously made about the impact of bilinear interpolation on contingency table elements remain valid here, but the skill scores are less affected due to the reduced smoothing introduced by remapping.

All these results also show that the remapping artificially changes verification statistics, although in a less striking way.

Application of the BIA adjustment procedure (Figs. 6d–f) shows that the remapped forecast has an unambiguously better ETS only for the lowest two thresholds (Fig. 6e). The other ETS differences are explained simply by the tendency of the forecast on the N grid to overforecast precipitation, thus producing a number of false alarms that actually decreases the ETS score, despite the high BIA. The same considerations about the BIA influence are obviously valid for the previous comparison between the N-grid forecasts and the bilinearly interpolated forecasts. The point is that skill score differences cannot be explained only by the BIA difference, but are actually due to the effect of bilinear interpolation on the precipitation field.

The hypothesis test on HK produces the same qualitative outcomes as the test without the BIA adjustment. As mentioned before, remapping produces a relatively small edge smearing of the precipitation field, which can equally contribute to increase the number of successes (hits and correct no-rain forecasts), especially at the lowest thresholds. Finally, HK and ETS associated with the N-grid show the same qualitative behavior (HK decreases and ETS slightly increases) after BIA adjustment previously discussed for bilinear interpolation using Eqs. (A6) and (A8) and Tables 2 and 3 (see parts about remapping).

The ETS skill score is generally more affected by the grid transformation process. Relatively small changes on hits, misses, and false alarms affect ETS more than HK, especially at higher threshold values, where the number of correct no-rain forecasts is much higher than the other elements of the contingency table. This reduces the sensitivity of HK to false alarms (Stephenson 2000) introduced by the grid transformation.

### c. Verification scores summary

A summary of BIA, ETS, and HK scores for both grid transformation methods is presented in Tables 4, 5, and 6, respectively. The two methods significantly (although artificially) change the verification scores, with bilinear interpolation having the greater impact. Bilinear interpolation does not conserve total precipitation; therefore, this result is not particularly striking. It is more surprising that remapping produces statistically significant score changes. Remapped skill scores, however, are generally closer to those computed on the native grid.

## 6. Discussion

The application of the hypothesis tests shows that the above-mentioned changes of the scores due to bilinear interpolation and remapping on a high-resolution grid are statistically significant. In the situation considered, with a model that has a BIA > 1 for all thresholds above 5 mm (24 h)^{−1}, bilinear interpolation has an inflating effect on the ETS and HK scores. This may seem to be a desirable feature, but the application of bilinear interpolation on a model with an underforecasting tendency could result in a general worsening of the skill scores. Moreover, bilinear interpolation does not conserve the total precipitation, unlike remapping.

The negative impact of bilinear interpolation on the total precipitation has been verified considering the mean daily differences between the P- and N-grid precipitation totals for four different areas, as indicated in Fig. 2. Another useful quantity is the relative mean difference, usually expressed as a percentage: the ratio between the mean total precipitation difference and the mean of the daily simple average of the N- and P-grid precipitation totals for the area under consideration. The two averages are computed over 243 days, for both methods. The results, presented in Table 7, show that the mean difference and the relative mean difference are always smaller for the remapping method; the relative mean differences do not exceed 3% on average. Mean differences are always negative for bilinear interpolation, since it smooths the forecast precipitation peaks, reducing the total precipitation, despite the effect of increasing the minima. Remapping introduces a small positive precipitation bias for areas 2 and 4, that is, the areas that cover much of the Alps. This small positive mean difference can be explained by a small spreading of precipitation introduced by remapping, slightly increasing the minima while leaving the maxima almost unaltered.
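The relative mean difference just defined can be written compactly; in this sketch the array arguments are hypothetical series of daily area-total precipitation on the two grids.

```python
import numpy as np

def relative_mean_difference(tot_p, tot_n):
    """Relative mean difference (%) between daily area-total precipitation
    on the P and N grids: mean(P - N) divided by the mean of the daily
    simple average (P + N) / 2."""
    tot_p = np.asarray(tot_p, float)
    tot_n = np.asarray(tot_n, float)
    mean_diff = np.mean(tot_p - tot_n)
    mean_avg = np.mean((tot_p + tot_n) / 2.0)
    return 100.0 * mean_diff / mean_avg
```

For example, daily P-grid totals that systematically exceed the N-grid totals by 2 mm around a 10-mm average yield a relative mean difference of 20%.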

It is interesting to see how the mean differences between the bilinearly interpolated precipitation field and the field interpolated via remapping are distributed over the verification domain. Figure 7 shows the mean difference between remapping and bilinear interpolation calculated over 243 days. There is a clear anisotropy induced by the Alps, as shown also by the relative differences in Fig. 8, that is, the differences weighted by the simple mean of the average precipitation of both configurations. The greater smoothing effect introduced by bilinear interpolation is also noticeable in some “tripolar” difference patterns, with negative values outside and positive values inside. These patterns are due to the aforementioned decrease of maxima (remapping produces on average greater values) and increase of minima (bilinearly interpolated values are on average greater than remapped values). The greatest forecast differences are observed over the northwestern part of the Alps, on the French and Swiss sides.

This is because this region is mainly affected by Atlantic disturbances. Low-level convergence of moist air, induced by the cyclonic systems and reinforced by the orography, produces higher precipitation over the Alps. It is then likely that QBOLAM BIA scores greater than one are due to an overforecasting tendency over this mountain range. The previous discussion indicates that remapping does not drastically change the qualitative behavior of the model forecast, although it has a significant impact on skill scores, especially at lower threshold values.

A skill score comparison with another mesoscale model on a common verification grid would therefore be fairer using a remapping procedure. Moreover, the grid transformation would affect the total precipitation field less.

## 7. Conclusions

This paper discusses the effect on BIA, equitable threat score, and Hanssen–Kuipers score of two widely used grid transformation methods (bilinear interpolation and remapping) when interpolating precipitation forecasts from one high-resolution grid to another.

It has been shown that, for the experimental setting used (grid-box size of about 10 km), the two methods introduce small variations in the forecast precipitation field that nevertheless change the BIA and the considered skill scores in a statistically significant way. Bilinear interpolation does not conserve total precipitation satisfactorily (the total is systematically decreased), and it may affect skill scores more than remapping. This is not a desirable feature: although here it introduces an artificial improvement (QBOLAM has a BIA greater than 1), it could also produce a significant *decrease* in skill scores for an underforecasting model, owing to oversmoothing of the precipitation field.

The ETS is more sensitive than the HK score to both grid transformation methods. The HK score weights all successes (hits and correct no-rain forecasts), while the ETS responds mainly to hits, misses, and false alarms and is only weakly dependent on correct no-rain forecasts. The correct no-rain forecasts often greatly outnumber the other elements of the contingency table, reducing the sensitivity of HK to the false alarms (Stephenson 2000) introduced by the interpolation process. This is particularly true at higher thresholds for both methods.
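This different sensitivity follows directly from the score definitions. A small sketch using the standard 2 × 2 contingency counts (*a* hits, *b* false alarms, *c* misses, *d* correct no-rain forecasts; the counts below are hypothetical):

```python
def bia(a, b, c, d):
    """Bias score: number of yes forecasts over number of observed events."""
    return (a + b) / (a + c)

def ets(a, b, c, d):
    """Equitable threat score, with a_r the number of hits expected by chance."""
    n = a + b + c + d
    a_r = (a + b) * (a + c) / n
    return (a - a_r) / (a + b + c - a_r)

def hk(a, b, c, d):
    """Hanssen-Kuipers score: hit rate minus conditional false alarm rate."""
    return a / (a + c) - b / (b + d)

# Hypothetical counts: d (correct no-rain forecasts) dominates the table,
# so extra false alarms move the false alarm rate b/(b + d) only slightly,
# leaving HK more stable than ETS under the interpolation-induced changes.
scores = (bia(50, 20, 30, 900), ets(50, 20, 30, 900), hk(50, 20, 30, 900))
```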

Only a particular aspect of the grid transformation issue has been considered here, namely, transformations between high-resolution grids. Precipitation forecasts show greater spatial variability and stronger gradients at higher resolution. Verification (and intercomparison) of mesoscale models on a coarser grid should be less sensitive to small forecast displacement errors. The impact on skill scores and precipitation budgets of different interpolation methods, when the final grid has a grid-box size different from the original one, will be the object of future studies.

## Acknowledgments

We would like to thank the Environmental Monitoring Research Center of the University of Genova, Italy, which provided the Liguria rainfall dataset, and Dr. G. Boni for his help. The Regional Monitoring Network of Piedmont Region, Turin, Italy, provided the Piedmont rainfall dataset; our thanks go also to Mr. S. Bovo and Mr. R. Cremonini. We also thank Mr. Antonio De Venere (DSTN-PCM, Rome, Italy) for his continuous help and encouragement, and Mr. C. Transerici (Istituto di Scienze dell'Atmosfera e del Clima–CNR, Rome, Italy) for his help in writing and testing the remapping code. Comments and suggestions received from Dr. M. E. Baldwin and Dr. T. M. Hamill were very useful. Finally, two anonymous reviewers made several helpful suggestions that improved the quality of the paper.

## REFERENCES

Accadia, C., and Coauthors, 2003: Application of a statistical methodology for limited area model intercomparison using a bootstrap technique. *Il Nuovo Cimento*, **26C**, 61–77.

Baldwin, M. E., cited 2000: QPF verification system documentation. [Available online at http://sgi62.wwb.noaa.gov:8080/testmb/verfsp.doc.html.]

Barnes, S. L., 1964: A technique for maximizing details in numerical weather map analysis. *J. Appl. Meteor.*, **3**, 396–409.

Barnes, S. L., 1973: Mesoscale objective analysis using weighted time-series observations. NOAA Tech. Memo. ERL NSSL-62, National Severe Storms Laboratory, Norman, OK, 60 pp. [NTIS COM-73-10781.]

Buzzi, A., M. Fantini, P. Malguzzi, and F. Nerozzi, 1994: Validation of a limited area model in cases of Mediterranean cyclogenesis: Surface fields and precipitation scores. *Meteor. Atmos. Phys.*, **53**, 53–67.

Buzzi, A., N. Tartaglione, and P. Malguzzi, 1998: Numerical simulation of the 1994 Piedmont flood: Role of orography and moist processes. *Mon. Wea. Rev.*, **126**, 2369–2383.

Diaconis, P., and B. Efron, 1983: Computer-intensive methods in statistics. *Sci. Amer.*, **248**, 116–130.

Donaldson, B. J., R. Dyer, and R. Kraus, 1975: An objective evaluator of techniques for predicting severe weather events. Preprints, *Ninth Conf. on Severe Local Storms*, Norman, OK, Amer. Meteor. Soc., 321–326.

ECMWF, cited 2001: Grid point to grid point interpolation. [Available online at http://www.ecmwf.int/publications/manuals/libraries/interpolation/gridToGridFIS.html.]

Flueck, J. A., 1987: A study of some measures of forecast verification. Preprints, *10th Conf. on Probability and Statistics in Atmospheric Sciences*, Edmonton, AB, Canada, Amer. Meteor. Soc., 69–73.

Georgelin, M., and Coauthors, 2000: The second COMPARE exercise: A model intercomparison using a case of a typical mesoscale orographic flow, the PYREX IOP3. *Quart. J. Roy. Meteor. Soc.*, **126**, 991–1030.

Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. *Wea. Forecasting*, **14**, 155–167.

Hanssen, A. W., and W. J. A. Kuipers, 1965: On the relationship between the frequency of rain and various meteorological parameters. *Meded. Verh.*, **81**, 2–15.

Koch, S. E., M. desJardins, and P. J. Kocin, 1983: An interactive Barnes objective map analysis scheme for use with satellite and conventional data. *J. Climate Appl. Meteor.*, **22**, 1487–1503.

Kuo, H. L., 1974: Further studies of the parameterization of the influence of cumulus convection on large scale flow. *J. Atmos. Sci.*, **31**, 1232–1240.

Mason, I., 1989: Dependence of the critical success index on sample climate and threshold probability. *Aust. Meteor. Mag.*, **37**, 75–81.

McBride, J. L., and E. E. Ebert, 2000: Verification of quantitative precipitation forecasts from operational numerical weather prediction models over Australia. *Wea. Forecasting*, **15**, 103–121.

Mesinger, F., 1996: Improvements in quantitative precipitation forecasting with the Eta regional model at the National Centers for Environmental Prediction: The 48-km upgrade. *Bull. Amer. Meteor. Soc.*, **77**, 2637–2649.

Mesinger, F., T. L. Black, D. W. Plummer, and J. H. Ward, 1990: Eta model precipitation forecasts for a period including Tropical Storm Allison. *Wea. Forecasting*, **5**, 483–493.

Nagata, M., and Coauthors, 2001: Third COMPARE workshop: A model intercomparison experiment of tropical cyclone intensity and track prediction. *Bull. Amer. Meteor. Soc.*, **82**, 2007–2020.

Page, J. K., 1986: *Prediction of Solar Radiation on Inclined Surfaces*. Solar Energy R&D in the European Community: Series F, Vol. 3, D. Reidel, 459 pp.

Peirce, C. S., 1884: The numerical measure of the success of predictions. *Science*, **5**, 453–454.

Schaefer, J. T., 1990: The critical success index as an indicator of warning skill. *Wea. Forecasting*, **5**, 570–575.

Stephenson, D. B., 2000: Use of the “odds ratio” for diagnosing forecast skill. *Wea. Forecasting*, **15**, 221–232.

Wilks, D. S., 1995: *Statistical Methods in the Atmospheric Sciences*. Academic Press, 467 pp.

## APPENDIX

### Effect of BIA Adjustment on ETS and HK Scores

Let (*a*, *b*, *c*, *d*) and (*a*′, *b*′, *c*′, *d*′), respectively, be the vectors of the contingency table elements before and after the application of the BIA adjustment procedure. It is possible to rewrite them as

$$a' = a + \Delta a, \qquad b' = b + \Delta b, \qquad c' = c + \Delta c, \qquad d' = d + \Delta d,$$

where Δ*a*, Δ*b*, Δ*c*, and Δ*d* represent the element variations. These quantities are linked together as follows. Introducing a forecast threshold *θ*_{f} different from the observed threshold *θ*, only the numbers of yes and no forecasts change, according to the difference between these two thresholds. If *θ*_{f} is greater than *θ*, then the number of yes forecasts increases and the number of no forecasts decreases by the same quantity; on the other hand, if *θ*_{f} is less than *θ*, then the number of yes forecasts decreases and the number of no forecasts increases by the same quantity. Since the observations are left unchanged, every converted forecast either exchanges a miss for a hit or a correct no-rain forecast for a false alarm (or vice versa), so that

$$\Delta a = -\Delta c, \qquad \Delta b = -\Delta d,$$

and the sample size *n* = *a* + *b* + *c* + *d* is conserved. Then it is easy to verify that the number of hits expected by chance after the adjustment, *a*^{′}_{r} = (*a*′ + *b*′)(*a*′ + *c*′)/*n*, becomes

$$a'_r = a_r \left( 1 + \frac{a R_a + b R_b}{a + b} \right),$$

where *R*_{a} = Δ*a*/*a* and *R*_{b} = Δ*b*/*b*, that is, the relative differences of the hits and false alarms.
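The bookkeeping above can be checked numerically. A minimal sketch with hypothetical counts, where the threshold shift is represented only by its effect on the yes/no forecast counts:

```python
def adjust(table, dyes_obs, dyes_noobs):
    """Convert no forecasts to yes forecasts: dyes_obs of them where rain
    was observed (misses become hits) and dyes_noobs where it was not
    (correct no-rain forecasts become false alarms)."""
    a, b, c, d = table
    return (a + dyes_obs, b + dyes_noobs, c - dyes_obs, d - dyes_noobs)

a, b, c, d = 50, 20, 30, 900            # hypothetical contingency counts
a2, b2, c2, d2 = adjust((a, b, c, d), 5, 3)

# The constraints of the appendix hold by construction
assert (a2 - a) == -(c2 - c)                  # Delta a = -Delta c
assert (b2 - b) == -(d2 - d)                  # Delta b = -Delta d
assert a2 + b2 + c2 + d2 == a + b + c + d     # sample size is conserved

R_a = (a2 - a) / a    # relative hit difference
R_b = (b2 - b) / b    # relative false alarm difference
```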

Table 1. Contingency table of possible events for a selected threshold.

Table 2. N-grid category differences after BIA adjustment to the P-grid BIA. Results are shown for the bilinear interpolation and remapping methods, respectively. P METH = P-grid forecast interpolation method; Threshold = precipitation threshold [mm (24 h)^{−1}]; Δ*a* = N-grid hit difference after BIA adjustment; Δ*b* = N-grid false alarm difference after BIA adjustment; Δ*c* = N-grid miss difference after BIA adjustment; Δ*d* = N-grid correct no-rain forecast difference after BIA adjustment; and Δ*a*_{r} = N-grid random forecast hit difference after BIA adjustment.

Table 3. N-grid category relative differences after BIA adjustment to the P-grid BIA. Results are shown for the bilinear interpolation and remapping methods, respectively. P METH = P-grid forecast interpolation method; Threshold = precipitation threshold [mm (24 h)^{−1}]; Δ*a*/*a* = relative hit difference (%); Δ*b*/*b* = relative false alarm difference (%); Δ*c*/*c* = relative miss difference (%); Δ*d*/*d* = relative correct no-rain forecast difference (%); and Δ*a*_{r}/*a*_{r} = relative random forecast hit difference (%).

Table 4. QBOLAM-forecast BIA scores on the native grid and on the postprocessing grid, using either bilinear interpolation or remapping.

Table 5. As in Table 4 but for ETS.

Table 6. As in Table 4 but for HK.

Table 7. Mean total precipitation differences [mm (24 h)^{−1}] and relative mean differences (%) for the four areas shown in Fig. 2, calculated over 243 days. The differences are computed for each area by subtracting the N-grid daily total precipitation from the P-grid daily total precipitation, interpolated with either bilinear interpolation or remapping. The relative mean difference is the mean difference normalized by the mean of the daily simple averages of the N-grid and P-grid precipitation totals.

^{1} The conditional false alarm rate (*F*) must not be confused with the false alarm ratio (FAR; Mason 1989). The former is the frequency of yes forecasts when the event does not occur, while the latter is the ratio between the number of false alarms and the total number of yes forecasts.
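The footnote's distinction is easy to make concrete with the contingency counts (*a* hits, *b* false alarms, *c* misses, *d* correct no-rain forecasts; the counts below are hypothetical):

```python
def false_alarm_rate(a, b, c, d):
    """F: fraction of observed no-rain events that were forecast as rain."""
    return b / (b + d)

def false_alarm_ratio(a, b, c, d):
    """FAR: fraction of yes forecasts that turned out to be false alarms."""
    return b / (a + b)

a, b, c, d = 50, 20, 30, 900          # hypothetical contingency counts
F = false_alarm_rate(a, b, c, d)      # 20 / 920
FAR = false_alarm_ratio(a, b, c, d)   # 20 / 70
```

With many correct no-rain forecasts, *F* stays small even when FAR is large, which is why the two quantities must not be conflated.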