## Abstract

A new verification method is proposed to test the consistency of ensemble high-resolution precipitation fields forecasted by calibrated downscaling models. The method is based on a generalization of the verification rank histogram and tests the exceedance probability of a fixed precipitation threshold calculated from the observed or ensemble fields. A graphical tool that accounts for random assignments of the rank is proposed to provide guidance in histogram interpretation and to avoid a possible misunderstanding of model deficiencies. The verification method is applied on three numerical experiments carried out in controlled conditions using the space–time rainfall (STRAIN) downscaling model with the aims of investigating (i) the effect of sampling variability on parameter estimation from the observed fields and (ii) model performance when calibration relations between the parameter and a coarse meteorological observable are used to interpret events arising from one or more physical conditions. Results show that (i) ensemble members generated using the parameters estimated on the observed event are overdispersed; (ii) the adoption of a single calibration relation can lead to the generation of consistent ensemble members; and (iii) when a single calibration relation is not able to explain observed event variability, storm-specific calibration relations should be adopted to return consistent forecasts.

## 1. Introduction

The meteorological and hydrological scientific communities have aimed to develop hydrometeorological forecasting systems for streamflow predictions using ensemble techniques to account for all the sources of uncertainty (Schaake et al. 2007). Advanced ensemble hydrometeorological forecasting systems include the combined use of meteorological and hydrological models as well as of statistical downscaling models, land surface models, and data assimilation systems. The use of such tools allows the simulation of weather forcing ensembles used as inputs to hydrological models for the generation of streamflow ensembles.

Hydrometeorological forecasting systems are required to be accurately verified through the prediction steps. At the moment, however, few studies have developed specific verification techniques for forecasting systems. In most cases, the verification of ensemble hydrometeorological forecasts has been limited to the qualitative comparison between the observed and the ensemble hydrographs using simple scalar measures for few events. Such approaches are not able to provide an accurate and statistically based verification. By limiting the verification to streamflow hydrographs, the uncertainty of internal steps cannot be evaluated. Further, when a scarce number of events are used, it is not possible to infer information about system performance in other conditions. Finally, the lack of a rigorous statistical framework prevents assessing if the ensemble forecasts and observations are equally likely from the statistical point of view. Notable exceptions are the studies of Georgakakos et al. (2004) and Carpenter and Georgakakos (2004), who proposed a rigorous statistical characterization of the uncertainty of ensemble streamflows simulated by distributed hydrological models. Nevertheless, their studies were focused on the assessment of the uncertainty of the hydrological response for different input parameters rather than on the evaluation of a forecasting system performance.

These considerations underline the need for rigorous and specific verification methods for sources of uncertainty in a hydrometeorological forecasting scheme. In this paper, we develop a verification methodology for one of its elements, the statistically based precipitation downscaling model. These models can be utilized in hydrometeorological prediction systems since they are able to bridge the gap between the coarse scales resolved by numerical weather prediction (NWP) models and the finer ones required by hydrological modeling. Starting from precipitation predicted by NWP models at a large space–time scale, downscaling models are able to produce a set of equiprobable spatiotemporal precipitation fields at high resolution with the same statistical properties. This set of downscaled rainfall fields will be referred throughout the paper as an ensemble forecast. We highlight that the term ensemble forecast has been widely used in atmospheric science to indicate a set of numerical forecasts generated by adopting several kinds of perturbation methods. Among these, we mention the adoption of different initial conditions, different model physics or physical perturbations, different models, and/or the use of differing fixed fields and constants (Hamill 2001). In the case of the statistical downscaling model here adopted, the perturbations are obtained by a Monte Carlo approach that allows simulation of possible organizations of small-scale precipitation mechanisms.

The operative use of downscaling models in hydrometeorological prediction systems is achieved by means of calibration relations linking their few parameters with one or more meteorological observables at coarse scale, provided by NWP models. For instance, the convective available potential energy (CAPE) or the precipitation volume at the large scale have been adopted in previous works by Over and Gupta (1994, 1996), Perica and Foufoula-Georgiou (1996), Deidda (2000), and Deidda et al. (2004) to calibrate parameters of different downscaling models. The quality of the large-scale NWP forecast is, then, one of the factors affecting the uncertainty of downscaling models.. In general, at the coarse-scale domain selected in downscaling model applications, we presumably expect high-quality NWP predictions. Nevertheless, we highlight that a quantification of NWP forecast uncertainty at different spatiotemporal aggregations would be useful for the purpose of downscaling model application, but very few efforts have been devoted to similar analyses. In this paper, we consider only uncertainty inherent to the downscaling process and we do not assume any uncertainty of the coarse-scale NWP forecast, which is considered here a deterministic quantity.

Ensemble techniques have been widely adopted in numerical meteorological modeling in last years (Toth and Kalnay 1993, 1997; Molteni et al. 1996; Houtekamer et al. 1996). Consequently, a consistent research effort has been carried out to develop specific verification methods of ensemble models as reviewed by Wilks (2006). Since ensemble forecasts allow the production of probabilistic predictions, a first group of verification techniques includes all the methods usually used to verify probabilistic forecasts. These techniques may be scalar metrics, such as the Brier score and its decomposition (Brier 1950; Murphy 1973) and the ranked probability score (Epstein 1969; Murphy 1971), or nonscalar metrics, such as the reliability diagram (Wilks 2006) and relative operating characteristics (ROC) (Swets 1973; Mason 1982). Other verification tools have been developed for ensemble models outputs to test the consistency hypothesis (Anderson 1997), that is, the degree to which the observed state is a plausible member of the forecast ensemble. For a single scalar output, the most popular technique is the verification rank histogram, proposed independently by Anderson (1996), Hamill and Colucci (1997), and Talagrand et al. (1997). More recently, the concept of the rank histogram has been extended into the minimum spanning tree, which tests the consistency hypothesis of multidimensional model outputs (Smith and Hansen 2004; Wilks 2004).

The verification technique proposed in this paper is based on a generalization of the verification rank histogram (VRH) and is aimed at evaluating the consistency condition for a single statistical variable calculated from each spatiotemporal precipitation field. Such a variable is the exceedance probability of a fixed precipitation threshold *i**. The selection of this predictand was motivated by the following reasons: The purpose of downscaling models is not to forecast a precipitation value in a specific location and time, but to provide the statistical characterization of precipitation at high resolution within the coarse-scale domain. As a consequence, we cannot evaluate, for each spatial location and time interval, the exact correspondence between forecasted and observed precipitation values, but we should verify how the model simulates the statistical properties of the precipitation field. The precipitation exceedance probability has been assumed, for this purpose, as the predictand variable. In addition, this variable can be calculated for different values of the threshold *i**, and the application of the verification procedure may be repeated to test the downscaling model capability of simulating precipitation statistical properties in different conditions.

The proposed verification method is developed and applied using a multifractal downscaling model of precipitation proposed by Deidda et al. (1999) and refined in Deidda (2000). Nevertheless, the study results are general enough to also be considered valid for other downscaling models. Three numerical experiments are carried out on synthetic spatiotemporal precipitation fields that are then verified through the proposed procedure. The experiments are used to investigate two key aspects of the downscaling model to demonstrate how consistent members may be generated: (i) the effect of sampling variability on parameter estimation from the observed precipitation fields and (ii) downscaling model performance when calibration relations are used to interpret the spread of parameter estimates.

In section 2, the main features of downscaling schemes and multifractal theory are provided, highlighting the aspects useful to understanding the verification procedure. In section 3 the verification procedure is described in detail, while section 4 presents the results of the three numerical experiments. Finally, in section 5 we discuss the study conclusions.

## 2. Precipitation downscaling model

The aim of this section is to briefly describe some theoretical aspects and applications of precipitation downscaling models to clarify how the verification procedure developed in this paper can be applied on the ensemble outputs provided by such models.

Let us first summarize a general downscaling scheme. Suppose that a precipitation measure *μ*(*D*) or its probability distribution *P _{D}*(

*μ*) in a domain

*D*in ℜ

^{3}(where two dimensions are in space and one in time) is observed or simulated by an NWP model. The goal of a downscaling scheme is to determine the probability distribution

*P*(

_{δD}*μ*) of the measure

*μ*(

*δD*) over a finer region

*δD*by analyzing and then reproducing statistical properties of the measure

*μ*observed over different scales between

*D*and

*δD*. In a hydrometeorological forecasting system, the domain

*D*is a coarse spatiotemporal region (see large cube in Fig. 1) at which NWP models provide forecasts with low uncertainty, while the domain

*δD*is a fine spatiotemporal region (smaller cubes of Fig. 1) required by a hydrological model.

Characterization of precipitation statistical properties at different scales has been carried out through multifractal theory (e.g., Lovejoy and Mandelbrot 1985; Schertzer and Lovejoy 1985; Gupta and Waymire 1993; Over and Gupta 1996; Perica and Foufoula-Georgiou 1996; Deidda 2000). This approach requires the presence of scale-invariance laws defined as

where 〈 〉 denotes an ensemble average operator and *q* is a real number. If Eq. (1) is verified over a range of scales, the measure *μ* is said to be scale invariant and if the exponent *ζ*(*q*) is a nonlinear function of *q*, the measure is multifractal. If space–time precipitation fields display scale-invariant and multifractal properties, they can be modeled by means of a stochastic multiplicative cascade dependent on a few parameters. To investigate precipitation scale invariance, we have to assume some kind of relation between statistically coherent scales in space and time (i.e., space–time self-similarity or self-affinity) and the possible presence of spatial heterogeneities induced, for example, by orographic constraints.

Let us first discuss the case of rainfall fields displaying homogeneous properties in space. Self-similarity (or scale isotropy) represents the simplest circumstance where the scaling law [Eq. (1)] holds true in multidimensional regions. In such a situation, a linear relationship *λ* = *Uτ* is assumed to hold between coherent space scales *λ* and time scale *τ*, where the same statistical properties may be observed (Deidda 2000; Deidda et al. 2004). The scale-independent parameter *U* allows transferring the statistical properties observed at space scales *λ* to coherent time scales *τ* = *λ*/*U*. We can therefore analyze rainfall variability in isotropic and homogeneous three-dimensional regions, where space–time scale invariance can be investigated by introducing the following measure:

where *r*(*x*, *y*, *t*) is the rainfall rate in (*x*, *y*) location at time *t*, and indexes *i*, *j,* and *k* identify the spatial and temporal position of each subregion *λ* × *λ* × *τ* in the grid partition.

Thus, assuming that subregions *λ* × *λ* × *τ* correspond to *δD* in Eq. (1), we can write the following power law of proportionality between the partition function *S _{q}*(

*λ*) and the scale

*λ:*

where 〈 〉 denotes an ensemble average over all the boxes *λ* × *λ* × *τ*, indexed by *i*, *j*, and *k* in the *λ* partition. Multifractal exponents *ζ*(*q*) can be estimated by plotting *S _{q}*(

*λ*) versus

*λ*in a log–log space.

In the more general case of self-affine measures, scale-invariance laws can be investigated under anistropic space–time transformations: *λ* → *λ*/*b _{s}*,

*τ*→

*τ*/

*b*, where different branching numbers in space (

_{t}*b*) and time (

_{s}*b*) are adopted. This approach, known as generalized scale invariance (GSI) (Lovejoy and Schertzer 1985; Schertzer and Lovejoy 1985), characterizes the degree of anisotropy by the scaling anisotropy exponent

_{t}*H*relating branching numbers as

*b*

_{t}=

*b*

^{(1−H)}

_{s}. Scale invariance can thus be investigated in such self-affine measures by introducing in Eq. (2) a scale parameter

*U*=

_{λ}*U*(

_{L}*λ*/

*L*)

*, implying that the linear relationship between coherent space and time scales does not hold:*

^{H}*τ*=

*λ*/

*U*∝

_{λ}*λ*

^{(1−}

^{H}^{)}. This general approach also contains the self-similar case for

*H*= 0 (implying

*b*≡

_{s}*b*and

_{t}*U*constant). Using simpler words, a self-similar transformation implies that the same rule is applied to contract both space and time domains to perform scale-invariance analyses (e.g., when a space scale is divided in two parts, the time scale is also divided in two parts). In contrast, with the self-affine transformation, a different rule is applied in space and time (e.g., when space scales are divided by two, the time scales are divided by three). For more details on these aspects, the reader is referred to Deidda et al. (2004).

We highlight that in the case of spatial homogeneity, both the simpler self-similar or the more general self-affine transformations assume that probability distribution of rainfall rates is the same in each subregion *λ* × *λ* × *τ*, regardless of the grid-cell position in space and/or in time. Thus, although the grid partitioning is slightly different for the self-similar and the self-affine cases, the verification procedure described later can be applied merging all the rainfall values observed or generated on all the *λ* × *λ* × *τ* grid-cells.

On the other hand, if rainfall fields display spatial heterogeneity, the probability distribution of rainfall rates in subregions *λ* × *λ* × *τ* may depend on the spatial location. Thus, in principle, the verification procedure should be applied separately by merging together rainfall rates observed or generated at different times in each spatial verification location. In the case that spatial heterogeneity is only due to differences in the spatial rainfall mean, observed fields may be homogenized by means of a modulating function (Badas et al. 2006). This allows the verification procedure to be applied by merging rainfall rates in all the *λ* × *λ* × *τ* grid-cells as in the homogeneous case.

Whatever the scale transformation rule holds for the analyzed rainfall field, the scale-invariance analysis provides a set of multifractal exponents *ζ*(*q*) that can be used to estimate parameters of the adopted downscaling model. In this paper, we apply the space–time rainfall (STRAIN) downscaling model (Deidda et al. 1999; Deidda 2000) to test the verification procedure here proposed in a homogeneous and scale-isotropic framework, but the method may be also applied in case of heterogeneity and self-affinity, according to the transformations previously described.

The model is based on a log-Poisson generator *η* = *e ^{A}β^{y}*, where

*β*is a parameter,

*y*is a Poisson random variable with mean

*c*, and

*A*=

*c*(1 −

*β*) is a renormalization constant. The model provides theoretical value for the multifractal exponent

*ζ*(

*q*) that allows simulating the multifractal properties in real-world precipitation events. The expression

*ζ*(

*q*) in a

*d*-dimensional domain (

*d*= 3 in a spatiotemporal domain) is given by

Equation (4) is used to estimate the parameters *c* and *β* from sample multifractal exponents *ζ*(*q*) coming from Eq. (3) for each observed event. To make downscaling models operatively usable in hydrometeorological systems, many authors have proposed simple calibration relationships between model parameters and one or more meteorological observable at a coarse scale that can be presumably predicted with low uncertainty by NWP models. In recent studies using radar data, the STRAIN model was calibrated and applied to reproduce observed scale-invariant properties (Deidda 2000; Deidda et al. 2004). These studies revealed that *β* can be assumed constant as *e*^{−1}, while *c* was found to decrease as the mean precipitation rate at the coarse scale increases. This behavior was interpreted by the following relationship:

where *R* is the precipitation rate at the coarse scale *L* × *L* × *T* and *c*_{∞}, *a*, and *γ* are the parameters of the nonlinear equation. Based on the mean precipitation rate *R* at the large scale *L* × *L* × *T* obtained from an NWP model output, this relation can be used to estimate *c*. After parameter estimation, the STRAIN model can generate an ensemble of precipitation fields at high resolution *λ* × *λ* × *τ*, which represents the equiprobable small-scale scenarios corresponding to the same coarse-scale condition, as depicted in Fig. 1.

## 3. The verification method

In this section, we illustrate the proposed ensemble verification procedure for precipitation downscaling models. We first provide an overview of the verification rank histogram technique followed by the determination of the precipitation exceedance probability of downscaled precipitation fields. We then describe how the VRH is built for the exceedance probability. Finally, we discuss the presence of randomly assigned ranks that artificially affect histogram shape and propose a graphical method for their interpretation.

### a. Verification rank histogram

Ensemble forecasts can be generated by model runs for different conditions and/or using Monte Carlo methods to capture modeling uncertainties. In ideal conditions, ensemble forecasts should represent a random sample of the probability distribution of the predictand while the actual state is one possible member of this distribution (this is referred to as the consistency hypothesis). In practice, simulation conditions are not randomly sampled and the forecast model is not perfect. Therefore, an important aspect of ensemble forecast verification is to evaluate the degree to which the actual state is a plausible member of the ensemble forecasts.

The verification procedure here presented for the ensemble precipitation fields generated by downscaling models is based on the VRH, a graphical tool used for a single scalar or univariate predictand. Each downscaled precipitation field is a multivariate variable with high dimensionality and a full verification of such ensemble members is challenging. The use of a simpler procedure for univariate variables like the VRH makes the verification more feasible from the computational, understanding, and interpretive points of view. In addition, to limit the loss of information consequent to the use of a verification method for a single scalar variable, we have adopted the exceedance probability of a fixed precipitation threshold as predictand. This implies that we may select different values of the threshold and repeat the verification procedure to test the downscaling model in different conditions of interest for hydrometeorological applications.

The underlying idea of VRH is rather simple. Let *S* be the univariate variable to be forecasted. For each event, the ensemble model provides *N*_{ens} forecasts *S*_{1}, *S*_{2}, . . . , *S*_{Nens} to predict the corresponding observation *S*_{obs}. If forecasts and observations are drawn from the same distribution, the rank of *S*_{obs} within the sorted vector **S** containing *S*_{1}, . . . , *S*_{Nens} and *S*_{obs} assumes equally likely the values of 1, 2, . . . , *N*_{ens} + 1. Therefore, if *N*_{ev} events are analyzed and ranked, the histogram built with these ranks should be uniform. Any departure from uniformity should only be due to sampling variability.

If the VRH is not uniform, the assumptions underlying the ensemble forecasts have not been met. In particular, positive or negative biases produce overpopulation of the lowest or highest ranks; an excess of dispersion (overdispersion) implies overpopulation of the middle ranks, while a lack of variability (underdispersion) determines U-shaped histograms (Hamill 2001). Moreover, particular care should be made for the interpretation of VRH, since a uniform rank histogram is a necessary but not sufficient condition for consistency. Indeed, the same histogram shape can be obtained by combining the effects of different ensemble model deficiencies, making the diagnosis of ensemble forecast characteristics difficult. More details on this aspect can be found in Hamill (2001).

The VRH provides a measure of reliability or conditional bias of the forecast, but it does not furnish a full evaluation of forecast performance. For example, it is not able to give information about the refinement or sharpness of ensemble forecasts, that is, the degree to which the forecasts are sorted into classes, irrespective of the value of the observation (Wilks 2006).

Applications of the VRH to meteorological model outputs are provided by Hamill and Colucci (1997, 1998), who tested the reliability of single scalar outputs predicted by the Eta–Regional Spectral Model (RSM) short-range model.

### b. Construction of verification rank histogram to test precipitation exceedance probability

Before describing the generalization of the VRH, it is worthwhile to summarize the meaning of some terms used in the verification of precipitation downscaling models. The event to be forecasted is the high-resolution rainfall field for which we know the total accumulated volume in a coarse spatiotemporal domain *L* × *L* × *T*. The downscaling model provides an ensemble forecast of *N*_{ens} spatiotemporal precipitation fields at the small scale *λ* × *λ* × *τ*. Each ensemble member represents a possible statistical realization of the rainfall field and includes a total of *M* precipitation values at high resolution (the small cubes depicted in Fig. 2a). For example, in the case of a space–time downscaling model based on a binary cascade (like the STRAIN model), every father generates 2^{3} sons at each downscaling level, meaning that for *N*_{lev} downscaling levels, *M* is equal to (2^{3})^{Nlev}. Thus, given an observation of the *M* rainfall values in each small cube of Fig. 2a, the verification should be carried out by statistical comparison with the ensemble forecast of *N*_{ens} high-resolution downscaled rainfall fields.

Let us assume isotropy in space and time, so that the exceedance probability *S*(*i**) = Pr{*I* > *i**} of a fixed threshold *i** can be calculated from the entire set of the *M* high-resolution precipitation values in each *λ* × *λ* × *τ* grid cell (Fig. 2a). *S*(*i**) can be derived by the empirical survival function (ESF) (Evans et al. 2000) of the rainfall rates *i _{j}* in each grid cell

*j*, (

*j*= 1, . . . ,

*M*), without reference to their position in the cube. The ESF can be estimated using the ranks of the order statistics

*i*

_{(j)}(parentheses indicate the sorted sample) as the complement of the empirical cumulative frequency

*F*[

*i*

_{(j)}]:

where *F*[*i*_{(j)}] is estimated through the Hazen plotting position formula (Wilks 2006, p. 41). According to this formula, the interval (0, 1) is divided in *M* equal intervals and *F*[*i*_{(j)}] is the midpoint of the *j*th interval.

As shown in Fig. 2b, the exceedance probability *S*(*i**) corresponding to a generic precipitation threshold *i** is computed by linear interpolation of the two closest values of *S*[*i*_{(j)}], when *i*_{(1)} ≤ *i** ≤ *i*_{(M)}, while it is set to 1 or 0 when *i** is smaller than *i*_{(1)} or greater than *i*_{(M)}. In the case of heterogeneous rainfall fields that cannot be homogenized by a modulating function, the ESF should be built only with rainfall rates in the verification location.

The construction of a VRH for the exceedance probabilities of a fixed threshold *i** is straightforward. For each event to be forecasted, (*N*_{ens} + 1) ESFs can be built (i.e., *N*_{ens} from the ensemble members and one from the observed or verification event). The correspondent (*N*_{ens} + 1) exceedance probabilities of the selected threshold *i**, *S*_{1}(*i**), *S*_{2}(*i**), . . . , *S*_{Nens}(*i**), and *S*_{obs}(*i**) can be calculated and sorted in increasing order in a vector indicated with **S**. Finally the position of *S*_{obs}(*i**) in the vector **S** is tabulated. This procedure is repeated for all the *N*_{ev} events obtaining *N*_{ev} ranks. In previous applications, the VRH is constructed by plotting the integer ranks from 1 to (*N*_{ens} + 1). Here, we modify this approach by introducing a normalized rank *r* defined as the empirical cumulative frequency of *S*_{obs}(*i**) in the sample **S**, estimated using Hazen plotting position formula

where *p* is the position of *S*_{obs}(*i**) within the sample **S**. The use of this plotting position formula assures that the probability values assigned to the normalized rank are uniformly distributed in the interval (0, 1) regardless of *N*_{ens}. Although this modification does not lead to a conceptual difference with the standard procedure, the normalized rank allows immediate comparison among VRHs with different *N*_{ens}. In addition, it permits keeping track of the portion of the VRH that has been randomly assigned according to the graphical method proposed in the following subsection.

From a fixed precipitation threshold *i**, it is possible to build a VRH testing the exceedance probability of that threshold. The procedure can be repeated to test several precipitation thresholds of interest in the hydrometeorological forecasting system.

### c. Rank assignment and histogram interpretation

For high values of the threshold *i**, it is possible that the exceedance probabilities for the observed event and for one or more ensemble members are equal to zero. In such a case, the position *p* of *S*_{obs}(*i**) in the vector **S** is randomly assigned as suggested by Hamill and Colucci (1998). When this occurs for several events, the VRH will be populated by many random values and its shape will be artificially affected, disguising the presence of model forecast deficiencies, if present. Therefore, in this subsection, we first discuss when the position *p* of *S*_{obs}(*i**) can be unequivocally or randomly assigned and then we propose a graphical method providing guidance for histogram interpretation.

The unequivocal or random assignment for the position *p* is illustrated in Fig. 3, where each panel compares the ESF of the observed precipitation field (in black) with the ensemble ESFs of synthetic fields (in gray) predicted by downscaling models with different forecasts skills. An important issue that affects the determination of *p* is the value of *i**, which can be smaller or greater than the maximum observed precipitation value. For this reason the figure is divided into two groups where different precipitation thresholds *i**_{a} and *i**_{b} are used:

In Figs. 3a–c, the precipitation threshold

*i**_{a}is smaller than the maximum observed precipitation value, so that*S*_{obs}is always greater than 0 and its position*p*within the vector**S**is unequivocally determined in all three cases. In particular, in Fig. 3a,*S*_{obs}is included between*S*_{min}and*S*_{max}, which are the minimum and maximum exceedance probabilities of the ensemble members [i.e., 1 <*p*< (*N*_{ens}+ 1)]. In Figs. 3b,c,*S*_{obs}occupies the first and the last position in the sorted vector**S**[i.e.,*p*= 1 and*p*= (*N*_{ens}+ 1)].In Figs. 3d–f, the precipitation threshold

*i**_{b}is greater than the maximum observed precipitation value, so that*S*_{obs}is always equal to 0. In this case, if one or more exceedance probabilities of the ensemble members is also equal to 0, the position*p*in**S**is randomly assigned following a similar approach as Hamill and Colucci (1998). To better illustrate this, let us indicate with*N*_{0}the number of ensemble ESFs for which the exceedance probability is 0. In Fig. 3d,*N*_{0}=*N*_{ens}and**S**is a sequence of zero values. The position*p*of*S*_{obs}in vector**S**randomly takes one of the integer values 1, 2, . . . , (*N*_{ens}+ 1). In Fig. 3e, 1 ≤*N*_{0}<*N*_{ens}, so that**S**contains a sequence of (*N*_{0}+ 1) elements equal to zero and (*N*_{ens}−*N*_{0}) values greater than 0. The position*p*of*S*_{obs}is again randomly assigned, but within the limited number of integers 1, 2, . . . ,*N*_{0}+ 1. Finally, in Fig. 3f,*N*_{0}= 0 and*S*_{obs}is unequivocally placed in the first position of**S**, being the only element equal to zero.

Note that the position may also be randomly assigned in the cases depicted in Figs. 3a–c when *S*_{obs} > 0 and other ensemble exceedance probabilities have the exact value as *S*_{obs} (Hamill and Colucci 1998), but this rarely occurs and thus does not affect the shape of the resulting histograms. For this reason, we do not consider random assignments occurring when *S*_{obs} > 0.

We remark that for small values of *i**, the ranks are usually unequivocally determined as in Figs. 3a–c. As the threshold *i** increases, the chance of encountering *S*_{obs} = 0 and some *S _{j}* = 0 in the ensemble members increases (Figs. 3d,e). Thus, depending on

*N*

_{0}values, a smaller or larger portion on the left side of the VRH will be randomly populated, making the detection of model forecast deficiencies more difficult. Therefore, it is convenient to store the occurrence of unequivocal and random assignments for the whole set of

*N*

_{ev}events used to verify the model.

For this purpose, we can associate with the VRH the empirical cumulative density function (ECDF) of a variable *r̃*_{k} calculated for each forecasted precipitation event *k* = 1, . . . , *N*_{ev} and defined as

The variable *r̃*_{k} is set to 0 when the rank is determined in an unequivocal way: this happens either if *S*_{obs}(*i**) > 0, irrespective of the values of the ensemble members (Figs. 3a,b,c), or if *S*_{obs}(*i**) = 0 and all the exceedance probabilities of the ensemble members are nonzero, thus *N*_{0} = 0 (Fig. 3f).

On the contrary, *r̃*_{k} assumes a positive value when *S*_{obs}(*i**) = 0 and there is at least one ensemble member with a zero exceedance probability of precipitation (1 ≤ *N*_{0} ≤ *N*_{ens}), so that a random assignment occurs (Figs. 3d,e). In this case, *r̃*_{k} is defined as the cumulative frequency of the (*N*_{0} + 1) zero value in vector **S**, since, accordingly to Eq. (7), the normalized rank *r* will be randomly determined within the interval (0, *r̃*_{k}). As *r̃*_{k} increases, the interval (0, *r̃*_{k}) becomes wider until it reaches the whole range (0, 1).

The ECDF of the sample *r̃*_{1}, *r̃*_{2}, . . . , *r̃*_{Nev} provides a graphical summary of how the normalized rank has been assigned on the entire set of *N*_{ev} precipitation events. Figure 4 shows how the ECDFs of *r̃*_{k} may change as the precipitation threshold *i** increases. For lower values of *i**, no random assignment occurs, thus all the *r̃*_{k} are equal to 0 and the ECDF represents an impulse concentrated on 0 (case A). As *i** increases, part of the ranks may be randomly assigned and the ECDF contains some of the *r̃*_{k} values equal to 0 and the others greater than 0 (cases B and C). In particular, in ECDF B, the normalized ranks are randomly assigned in intervals (0, *r̃*_{k}) with *r̃*_{k} < 1, while in ECDF C there are also some *r̃*_{k} = 1, meaning that the corresponding ranks are randomly assigned in the whole interval (0, 1). If *i** further increases, all the *N _{eυ}* ranks may be randomly determined and the corresponding

*r̃*

_{k}result is always greater than 0 (cases D and E). ECDF D refers to the situation where part of the ranks are randomly assigned in intervals (0,

*r̃*

_{k}) with

*r̃*

_{k}< 1 and part in the whole interval (0, 1), while ECDF E represents the extreme case where all the ranks are randomly determined in the interval (0, 1). Although the VRH is populated only by uniformly distributed random values, this last case indicates that precipitation values at the finescale greater than or equal to the threshold

*i** do not represent a critical situation (i.e., a zero exceedance probability) for the available observations and ensemble members.

The use of ECDFs of *r̃*_{k} permits us to evaluate how much the shape of the histogram depends on real model forecast characteristics and how much it is artificially affected by random assignment. VRHs and ECDFs of *r̃*_{k} obtained for the different thresholds should be interpreted together to better detect forecast deficiencies, if any. This aspect is illustrated in the following section, where the verification procedure is applied and tested on synthetic precipitation fields generated by the STRAIN multifractal model.

## 4. Results and discussion

The verification procedure is applied on three series of experiments based on space–time rainfall events generated by the STRAIN model under controlled conditions, with the aim to understand and interpret the verification results, to evaluate its capability in detecting downscaling model deficiencies when different calibration approaches are adopted, and to demonstrate how consistent ensemble members can be generated. Numerical experiments are carried out in the following way: first, we generate a set of *N*_{ev} precipitation events applying the STRAIN model with selected values of the *c* and *β* parameters. Although synthetically generated, the *N*_{ev} events are assumed to be “observed events” and are subsequently hindcasted according to different calibration approaches for the selection of STRAIN model parameters. Finally, hindcast performances are evaluated by means of the proposed verification procedure.

In experiment 1 (constant parameters), the observed events are generated with the constant parameters *c*_{0} and *β*_{0} with the aim of analyzing the influence of intrinsic model sampling variability in parameter estimation. In experiment 2 (single calibration relation), observed events are generated from a calibration relation between STRAIN parameters and precipitation rate at the coarse scale, like those presented in previous works and previously mentioned in the paper. Finally, experiment 3 (multiple calibration relations) is built by mixing observed events coming from two different calibration relations, with the objective of mimicking events originating from different meteorological conditions.

### a. Experiment 1: Constant parameters

Experiment 1 allows us to show how the ensemble downscaled fields can be affected by overdispersion if intrinsic model sampling variability is not taken into account in parameter estimation.

A set of *N*_{ev} = 400 high-resolution precipitation events is generated through a Monte Carlo approach by downscaling a coarse precipitation rate of 1 mm h^{−1} over five downscaling levels with fixed values of STRAIN model parameters *c*_{0} = 0.7 and *β*_{0} = *e*^{−1}. Thus, each event is drawn from the same distribution and the downscaling levels vary, for example, from a coarse scale with *L* = 128 km and *T* = 320 min to a finescale with *λ* = 4 km and *τ* = 10 min (Deidda et al. 2004). These 400 events are assumed to be the set of observed events. Subsequently, we do not assume any knowledge of the method used to generate these observed events and we use them first to calibrate STRAIN parameters for the hindcasting phase and then as verification for the ensemble hindcasting members.

The STRAIN parameters are estimated for each observed event in the following way: first, we compute partition functions *S _{q}*(

*λ*) with Eq. (3) for different

*λ*scales and

*q*moments. Second, sample multifractal exponents

*ζ*(

*q*) are estimated by the slope of the linear regression between log

*S*(

_{q}*λ*) and log

*λ*, for each moment

*q*. Finally, the

*c*parameter is estimated by fitting Eq. (4) to sample multifractal exponents

*ζ*(

*q*), while the

*β*parameter is kept constant at

*e*

^{−1}(Deidda 2000; Deidda et al. 2004).

Let *c*^{est}_{k} be the estimate of the *c* parameter on the *k*th observed event. Because of sampling variability, the 400 different *c*^{est}_{k} estimates result in a Gaussian-like distribution around a mean value *c*_{mean} close to *c*_{0} (i.e., a quasi-unbiased estimator).

To show the importance of accounting for intrinsic model sampling variability, two approaches called “event-based” and “mean-based” calibration modes for determining the downscaling parameters are compared. The event-based calibration mode is derived from the notion that *c* should be estimated from the same observed event, since at first sight, this appears the most obvious approach to simulate each event. As a result, for each event *k*, the parameter *c*^{est}_{k} is used to generate the ensemble members. Note that this calibration mode can only be used for hindcasts. If a forecast is required, *c*^{est}_{k} is unknown and the parameter should be determined from past events. In contrast, the mean-based calibration mode is based on the average behavior of the entire set of events using the same parameter *c*_{mean} and thus it is suitable for forecasts.

In both cases, *N*_{ens} = 100 ensemble members are simulated to hindcast each observed event (total of 40 000 synthetic fields for each approach) and VRHs are constructed for thresholds of 10, 15, 20, 25, 30, and 35 mm h^{−1} (selected to span the potential range of ECDFs of *r̃*_{k} behavior). Results are shown in Figs. 5 and 6 for the event-based and mean-based calibration modes, respectively. Each panel contains the VRH for a precipitation threshold, plotted using 10 bins to group the 400 ranks, and the respective ECDF of *r̃*_{k}. To distinguish between true deviations from uniformity and sampling variations, the 5%, 25%, 50%, 75%, and 95% quantiles of a uniform distribution are plotted using horizontal lines.

The following results can be summarized for the event-based calibration mode (Fig. 5). For the lowest threshold (*i** = 10 mm h^{−1}), the ECDF of *r̃*_{k} is concentrated on zero, implying that ranks have been unequivocally determined, while the histogram is more populated in the middle ranks (i.e., overdispersed forecasts). As the precipitation threshold increases (*i** = 15, 20 mm h^{−1}), the number of nonzero values increases leading to random assignments in the interval (0, *r̃*_{k}), where 0 < *r̃*_{k} ≤ 1. The small ranks in the histograms are artificially more populated, but overdispersion is still visible. When *i** further increases, (*i** = 25, 30 mm h^{−1}), both the number and magnitude of nonzero *r̃*_{k} values increase so that several ranks are randomly assigned and the interval length (0, *r̃*_{k}) becomes wider. This implies that even the high ranks are randomly populated. As an extreme case, when the *i** is higher than observed and synthetic precipitation (*i** = 35 mm h^{−1}), the standardized ranks are all randomly assigned in the interval (0, 1) leading to a uniform histogram.

In Fig. 6, results are shown for the verification of ensemble members generated with the mean-based calibration mode. In this case, the VRHs are uniform despite the expected sampling variability, whatever the value of the precipitation threshold *i**. This means that the consistency condition is respected.

The effects of overdispersion and uniformity in the histograms resulting, respectively, from the event-based and mean-based calibration modes can be explained as follows: the consistency condition requires that for every predicted event, observations and forecast ensemble behave like random draws from the same distribution. This requirement is satisfied in the mean-based mode because ensemble members are generated using the parameter *c*_{mean}, which is close to *c*_{0} (used to generate the observed events). In this situation, the ESFs of the observed events are equally likely as the ensemble hindcasts, leading to uniform VRHs for the exceedance probabilities.

When ensemble members are generated using the parameter *c*^{est}_{k} (event-based calibration mode), the consistency condition is not respected because ensemble and observations belong to different distributions and an effect of overdispersion is produced. This effect is explained by a “centering” of the variability around the event *k*. Thus, the ESF of the observed event *k* is placed in the center of the ensemble hindcast ESFs and the probabilities of exceedance occupy intermediate positions. For this reason, even if we were able to know the value of *c*^{est}_{k} in a forecasting framework, results show that the best choice is to generate the ensemble members using *c*_{mean}.

In conclusion, we highlight the importance of interpreting the VRHs, looking also to the ECDF of *r̃*_{k}. In fact, in the event-based hindcasts, when the lower thresholds are analyzed, the histogram shape is not artificially affected (most of *r̃*_{k} = 0) and overdispersion can be detected. As the threshold increases, the shape of the histograms becomes more uniform and overdispersion cannot be easily detected. In such cases, the ECDF of *r̃*_{k} informs us that the uniform shape has been artificially caused by random assignments of the rank.

### b. Experiment 2: Single calibration relation

Experiment 2 is designed to demonstrate how calibration relations between model parameters and a meteorological observable at coarse scale can take into account model sampling variability leading to consistent ensemble members.

In this case, we generate a set of observed events with different intermittency properties, adopting the calibration relation between STRAIN parameter *c* and the precipitation rate at the coarse scale *R* provided by Eq. (5). For purposes of this study, we select a calibration relation found to be valid by Deidda et al. (2004), where *c*_{∞} = 1.1, *a* = 0.85, and *γ* = 1.35. Parameter *β* was found to be fairly constant at *e*^{−1}. This calibration relation *c* = *c*(*R*) is plotted in Fig. 7 through a dashed black line. In this experiment, the set of observed events is generated using the calibration relation in the following way: first, 400 precipitation rates at the coarse scale *R _{k}* (

*k*= 1, . . . , 400) are randomly drawn, in an interval between 0.25 and 5 mm h

^{−1}, from an exponential distribution fitted to the

*R*values analyzed by Deidda et al. (2004) to mimic the occurrence of an observed large-scale event. Then, the set of parameters

*c*=

_{k}*c*(

*R*) is calculated using Eq. (5) and used to generate 400 high-resolution precipitation events through the STRAIN model with five downscaling levels.

_{k}As in experiment 1, given the observed events, we attempt to calibrate STRAIN parameters that will be used in the hindcasting phase, assuming no knowledge about their origin. For this purpose, parameter *c* is estimated on each observed event *k*. In Fig. 7, the 400 *c*^{est}_{k} estimates are plotted versus the *R _{k}* of the correspondent

*k*th event through dots. Because of intrinsic model sampling variability, each

*c*

^{est}

_{k}estimated on the event

*k*is different from the parameter

*c*=

_{k}*c*(

*R*) used for the generation of the event

_{k}*k*itself. In particular,

*c*

^{est}

_{k}estimates result in a Gaussian-like distribution around

*c*. The set of 400

_{k}*c*

^{est}

_{k}is then used to fit Eq. (5), obtaining the parameters

*c*

_{∞}= 1.1,

*a*= 0.82, and

*γ*= 1.29. This new calibration relation,

*c*=

*c*

^{cal}(

*R*), plotted in Fig. 7 with a black solid line, is very close to the calibration relationship

*c*=

*c*(

*R*) used for the generation of the observed events.

In experiment 2, hindcasts of the 400 observations are carried out according to the event-based and mean-based calibration modes. In addition, we test the “functional-based” calibration mode, where the ensemble members hindcasting each event *k* are generated using the parameter value *c*^{cal}_{k} = *c*^{cal}(*R*_{k}). In all modes, 100 ensemble members are generated to hindcast each observation and VRHs of the exceedance probabilities are then constructed for precipitation thresholds *i** = 50, 75, 100, 150, 250, and 500 mm h^{−1} (selected to obtain all the possible ECDF of *r̃*_{k} behaviors). Results are shown in Figs. 8 –10 for the event-based, mean-based, and functional-based calibration modes, respectively.

Results for the event-based calibration mode (Fig. 8) display similar behavior as detected in experiment 1 (overdispersion), because of the same centering effect. In the mean-based calibration mode (Fig. 9), the histograms for the lower thresholds (*i** = 50, 75, and 100 mm h^{−1}) are more populated in the lowest and in the highest ranks (*U* shaped), revealing an effect of forecast underdispersion. The histogram’s shape for the higher thresholds (*i** = 150, 250, and 500 mm h^{−1}) is affected by randomly assigned ranks. Finally, the functional-based calibration mode results (Fig. 10) display a uniform histogram shape for every precipitation threshold and the consistency condition is respected.

The effect of underdispersion for the mean-based calibration mode implies that ensemble members are more frequently close to each other and distant from the observation. Indeed, the sampling variability of the STRAIN model with a single parameter *c*_{mean} is not able to fully explain the variability of the 400 observed events. As a result, the ESF of several observed events will result far away from the corresponding set of ensemble hindcasting ESFs.

In contrast, when the parameter *c*^{cal}_{k} is used to generate the ensemble members (functional-based calibration mode), the consistency condition is achieved because observations and ensemble belong to the same distribution. In fact, the new calibration relation *c* = *c*^{cal}(*R*) is very close to *c* = *c*(*R*) used to generate the observed events, implying *c*^{cal}_{k} ≈ *c*_{k}, for each *k*th event.

In summary, this synthetic experiment was set to mimic the observed link between downscaling model parameters and coarse-scale rainfall rates. In this case, the functional-based calibration mode was the only one able to capture the downscaling model sampling variability and to generate consistent ensemble members.

### c. Experiment 3: Multiple calibration relations

Precipitation events can have different physical origins (e.g., convective or stratiform) and, in principle, one could expect that different calibration relations should be determined. Therefore, experiment 3 was designed to analyze the working of the VRH in this scenario. The set of observed events is generated starting from two different calibration relations *c* = *c*^{1}(*R*) and *c* = *c*^{2}(*R*), both described by Eq. (5) with paramaters *c*_{∞} = 1.13, *a* = 0.88, and *γ* = 1.49; and *c*_{∞} = 0.90, *a* = 0.55, and *γ* = 1.00 (Figs. 11a,b, respectively). We selected these calibration relations to mimic different precipitation mechanisms. However, this selection is used only to explain the potential impact of multiple relationships, and the values selected for *c*_{∞}, *a*, and *γ* are not representative of real climatologies, as studies investigating this phenomenon have not yet been conducted. A total of 400 observed events is generated using the STRAIN model with five downscaling levels starting from 400 precipitation rates at the coarse scale *R _{k}* drawn by the same exponential distribution simulating the large-scale event occurrence adopted in experiment 2. The set of 400 parameters

*c*is here determined by introducing 200

*R*values in calibration relationship

_{k}*c*=

*c*

^{1}(

*R*) and the other 200

*R*in calibration relationship

_{k}*c*=

*c*

^{2}(

*R*).

Assuming no a priori knowledge of the method used to generate the set of observed events, we then use these events to calibrate STRAIN parameters for the subsequent hindcasting phase, adopting the event-based, mean-based, and functional-based calibration modes.

In the event-based calibration mode, we adopt the parameters *c*^{est}_{k} estimated in each event *k* to generate the set of ensemble hindcasts. The 200 values *c*^{est}_{k} estimated from events generated by relations *c* = *c*^{1}(*R*) are plotted with circles versus *R _{k}* in Fig. 11a, while the 200 values

*c*

^{est}

_{k}coming from

*c*=

*c*

^{2}(

*R*) are plotted using asterisks in Fig. 11b. In the mean-based calibration mode, each observed event is hindcasted using the mean value

*c*

_{mean}of the 400

*c*

^{est}

_{k}estimates, plotted all together in Fig. 11c. In the functional-based calibration mode, we interpret the behavior of the estimates

*c*

^{est}

_{k}with respect to

*R*by estimating only a single calibration relationship

*c*=

*c*

^{cal}(

*R*), which ignores differences in precipitation type (solid line in Fig. 11c). This relation is then used to determine the set of 400 parameters

*c*

^{cal}

_{k}=

*c*

^{cal}(

*R*

_{k}) for the generation of the ensemble members.

For each calibration mode, 100 ensemble members are produced to hindcast each observation and VRHs of the exceedance probabilities are constructed for the same precipitation thresholds tested in experiment 2. Results of the verification procedure for the event-based and mean-based calibration modes (not shown) are very similar to those obtained in the second experiment.

Results of the functional-based calibration mode are shown in Fig. 12. Histograms for low *i**, whose shape is not affected by random assignment of the ranks are more populated in the lowest and the highest ranks (U shaped), revealing an effect of underdispersion. In particular, the observed events from calibration relations *c* = *c*^{1}(*R*) and *c* = *c*^{2}(*R*) cause a population mainly of high or low ranks, since the observed ESF is placed far away from the ensemble ESFs. Therefore, a downscaling model adopting the functional-based calibration mode based on a single relation is not able to fully interpret the proper variability of events drawn by different calibration relations.

In the case that different precipitation mechanisms lead to different calibration relations, ensemble consistency should be reached by classifying the available observed events according to their physical origin (e.g., stratiform or convective or on the basis of the synoptic pattern), and then by estimating different calibration relations able to simulate the variability of each family of events. Each observed event should then be forecasted using the storm-dependent calibration relation.

## 5. Conclusions

Advanced hydrometeorological forecasting systems for ensemble streamflow predictions are based on schemes that include the combined use of meteorological and hydrological models as well as downscaling models and data assimilation systems. The complexity of such schemes requires the creation and testing of rigorous verification methods to evaluate the sources of uncertainty involved.

In this paper, we have developed a verification method for one of the elements used in hydrometeorological systems, the precipitation downscaling models. Such models start from coarse scale information furnished by NWP models and provide ensembles of spatiotemporal precipitation fields at high resolution. Downscaling models based on the multifractal theory are able to reproduce observed intermittency and small-scale variability. Their operation is usually assured by calibration relations linking their few parameters with a coarse meteorological observable or predictand, such as the mean rainfall rate.

The verification procedure here proposed is based on the generalization of the verification rank histogram, a graphical device used in applied meteorology to test the consistency hypothesis (i.e., ensemble and observation are drawn from the same distribution) of univariate variables. Since downscaling models reproduce the probability of precipitation at high resolution, they cannot be verified in a deterministic way but should be tested by evaluating their ability in reproducing the statistical behavior of precipitation. Therefore, the univariate variable adopted here is the probability of exceedance of a fixed precipitation threshold *i** calculated from each spatiotemporal field.

The generalization of verification rank histograms has been performed as follows: first, a precipitation threshold *i** is fixed. Then, for each event, the exceedance probabilities of *i** are calculated for the *N*_{ens} ensemble and the observed fields and the position *p* of the observed exceedance probability is found. Finally, the rank histogram is built with the normalized ranks *r*, defined as the cumulative frequency of *p* computed for each verification event. The procedure is repeated for different values of *i**, spanning a range interesting for hydrometeorological applications. As *i** increases, the observed and ensemble exceedance probabilities can be equal to zero in a certain number of verification events, so that the ranks are randomly assigned. As a consequence, the histogram shape becomes artificially uniform, making the detection of model forecast deficiencies more difficult. To avoid a possible erroneous evaluation of model performance, we have introduced a graphical method based on the interpretation of the ECDF of a variable *r̃*_{k} accounting for random assignment.

The verification procedure has been applied and tested using the STRAIN downscaling model. The model depends on two parameters, *c* and *β*, and is able to simulate homogeneous precipitation fields in a self-similar framework. Three numerical experiments have been carried out in controlled conditions according to the following approach: the STRAIN model has been used first to generate “observed events” with selected values of parameters *c* and *β*. These observed events have been used to calibrate STRAIN parameters assuming no a priori knowledge of the method used to generate them. Finally, each observed event has been hindcasted using STRAIN according to three calibration modes for parameter determination: event based, mean based, and functional based. Results of the three experiments permit us to derive the following conclusions:

If we consider a hindcast framework and we generate the ensemble members adopting the parameter

*c*^{est}_{k}estimated on the same event*k*to be hindcasted (at first sight, the best possible solution to simulate the observed event), the model returns overdispersed forecasts. This is because model sampling variability is not accounted for in parameter calibration and a centering of ensemble members around the observation is produced.The use of a calibration relation linking model parameters with a meteorological observable at coarse scale may allow model sampling variability to be taken into account leading to consistent members.

When observed events display a large variability that a single calibration relation is not able to explain, underdispersed forecasts are produced. For example, this variability can be due to different physical origins or different synoptic conditions generating the events. To reach consistency, it would be necessary first to classify the events according to their physical origin and then to estimate storm-dependent calibration relations.

We remark that systematic analyses of the effects of precipitation type on scale-invariance statistical properties have not been yet conducted. We believe that this can be an interesting topic for future studies and the proposed verification method can be an useful tool to assess the need for single or multiple calibration relations.

Finally, we highlight that the verification method, tested with a homogeneous and isotropic model, can be applied whatever the downscaling method used since the method does not refer to the generation mechanism of the model. A slight modification needs to be adopted only in the case of a downscaling model reproducing spatial heterogeneity, because, in this case, the analysis should be carried out in each location rather than in the entire spatial domain.

## Acknowledgments

The authors thank three anonymous reviewers for comments that improved the quality of this manuscript.

## REFERENCES

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**.**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

## Footnotes

*Corresponding author address:* Giuseppe Mascaro, Dipartimento di Ingegneria del Territorio, Università di Cagliari, Cagliari, Italy. Email: gmascaro@unica.it