## Abstract

Intensity–duration–frequency (IDF) analyses of rainfall extremes provide critical information to mitigate, manage, and adapt to urban flooding. The accuracy and uncertainty of IDF analyses depend on the availability of historical rainfall records, which are more accessible at daily resolution and, quite often, are very sparse in developing countries. In this work, we quantify performances of different IDF models as a function of the number of available high-resolution (*N*_{τ}) and daily (*N*_{24h}) rain gauges. For this aim, we apply a cross-validation framework that is based on Monte Carlo bootstrapping experiments on records of 223 high-resolution gauges in central Arizona. We test five IDF models based on (two) local, (one) regional, and (two) scaling frequency analyses of annual rainfall maxima from 30-min to 24-h durations with the generalized extreme value (GEV) distribution. All models exhibit similar performances in simulating observed quantiles associated with return periods up to 30 years. When *N*_{τ} > 10, local and regional models have the best accuracy; bias correcting the GEV shape parameter for record length is recommended to estimate quantiles for large return periods. The uncertainty of all models, evaluated via Monte Carlo experiments, is very large when *N*_{τ} ≤ 5; however, if *N*_{24h} ≥ 10 additional daily gauges are available, the uncertainty is greatly reduced and accuracy is increased by applying simple scaling models, which infer estimates on subdaily rainfall statistics from information at daily scale. For all models, performances depend on the ability to capture the elevation control on their parameters. Although our work is site specific, its results provide insights to conduct future IDF analyses, especially in regions with sparse data.

## 1. Introduction

Urban flooding is being increasingly recognized as a pressing global challenge because of growing population concentration in urban regions, the rapid conversion of natural landscapes into impervious surfaces, and climate change (National Academies of Sciences, Engineering, and Medicine 2019; Zhang et al. 2018). As a result of the complex combination of these factors, urban areas have been suffering significant socioeconomic impacts from flooding not only due to catastrophic events (e.g., hurricanes; NWS 2019), but also after more frequent and less intense storms (University of Maryland College Park and Texas A&M University 2018). A key task to mitigate, manage, and adapt to urban flooding is the characterization of rainfall extreme statistics at different durations, with focus on short time accumulations (<24 h) that—in several cases—represent the hydrologic response times of urban watersheds. In water resources engineering, this statistical information is synthesized through intensity–duration–frequency (IDF) curves, which provide an estimate of rainfall intensities *i*(*T*_{R}, *τ*) for different time durations *τ* and frequencies *F*, usually expressed as return periods *T*_{R} = 1/(1 − *F*).

The generation of IDF curves has been the subject of several studies, mainly in the areas of technical and stochastic hydrology (e.g., Burlando and Rosso 1996; Koutsoyiannis et al. 1998; Madsen et al. 2002; Requena et al. 2019; Tyralis and Langousis 2019; and many others). When rainfall records are available at sufficiently fine temporal resolutions at a given site, the most common approach to conduct local IDF analyses consists of estimating *i*(*T*_{R}, *τ*) via frequency analysis of annual rainfall maxima at different *τ* with parametric statistical distributions. For example, as suggested by the extreme value theory (Coles 2001), the generalized extreme value (GEV) distribution has been shown to be able to model well annual rainfall maxima at several locations (e.g., Papalexiou and Koutsoyiannis 2013; Blanchet et al. 2016). If historical rainfall records are available at multiple sites, regional IDF curves are often generated by (i) spatially interpolating *i*(*T*_{R}, *τ*) or parameters of the statistical distributions from local or at-site estimations, or (ii) applying regionalization techniques that merge rain gauges into homogeneous regions to increase robustness in the estimate of the statistical distribution parameters (Hosking and Wallis 1997). As a notable example, the regionalization technique proposed by Hosking and Wallis (1997) was adopted in the National Oceanic and Atmospheric Administration (NOAA) Atlas 14 to generate IDF curves for most of the conterminous United States (CONUS; Bonnin et al. 2004).

The accuracy and uncertainty of IDF analyses are dependent on quality and availability of historical records. In developed countries, rainfall data are in general available; however, data availability is higher at daily scale compared to subdaily resolutions in terms of both spatial density and record length. As a result, the uncertainty of IDF estimates tends to be higher for subdaily durations. In several developing countries, where urbanization proceeds at fast rates (Cohen 2006) and IDF curves are urgently needed, the problem of data availability is even more critical, since rainfall records are available at sparse locations (Sane et al. 2018) and almost exclusively at daily resolution (Lumbroso et al. 2011; Liew et al. 2014). These challenges have significantly limited the ability to carry out IDF analyses, so that recent studies have proposed the use of satellite (Ombadi et al. 2018), reanalysis products (Courty et al. 2019), and weather radars (Marra and Morin 2015) to complement observations from sparse rain gauges.

A strategy to address the lack of rain gauges at subdaily resolution is to infer estimates of *i*(*T*_{R}, *τ*) for *τ* < 24 h from information on the rainfall statistics at the daily scale, which—as said—is more largely available (Koutsoyiannis and Foufoula-Georgiou 1993; Burlando and Rosso 1996; Menabde et al. 1999; Koutsoyiannis et al. 1998). A class of statistical models, known as scaling models, utilizes the evidence of temporal scaling properties of rainfall processes (Lovejoy and Schertzer 1985; Deidda et al. 1999; Venugopal et al. 1999; Mascaro et al. 2013, 2014) to derive simple relations linking parameters of the probability distributions of rainfall maxima at different subdaily durations to those estimated at daily scale. While the theoretical background of scaling models has been outlined more than two decades ago (e.g., Burlando and Rosso 1996), these techniques have been mainly applied for analyses at distinct locations, because of the lack of networks of long-term high-resolution rain gauges. Notable exceptions are the studies by Yu et al. (2004), Borga et al. (2005), Bara et al. (2010), and Blanchet et al. (2016) who used scaling models to derive regional IDF relationships in areas of northern Taiwan (~7400 km^{2}), northern Italy (~6400 km^{2}), western Slovakia (~14 000 km^{2}), and southern France (~38 000 km^{2}) with 46, 91, 19, and ~300 rain gauges, respectively. More recently, Sane et al. (2018) applied a scaling model with 14 rain gauges to derive IDF curves for the entire country of Senegal (~197 000 km^{2}).

The present study has the main goal of rigorously testing the accuracy of different models for IDF analysis as a function of the number of rain gauges available at different temporal resolutions. To pursue this goal, we use data from a high-resolution network of 223 gauges with record lengths from 20 to 40 years located in an area of ~29 600 km^{2} in central Arizona including the Phoenix metropolitan region (Mascaro 2017, 2018). We apply existing IDF models based on local (or at-site), regional, and scaling frequency analyses with the GEV distribution. In addition, we (i) evaluate the effect of bias correcting local estimates of the GEV shape parameter for record length (Papalexiou and Koutsoyiannis 2013) and (ii) propose and test for the first time a simple IDF model that applies the scaling theory to the regional growth curves defined by Hosking and Wallis (1997). The IDF models vary in terms of number of regional and spatially interpolated parameters. While all models require high-resolution rainfall records for their calibration, scaling models also take advantage of the availability of additional daily gauges. We first quantify the performance of all methods using all available gauges. Performance is assessed through metrics measuring the ability to simulate empirical and at-site quantiles. We then apply a cross-validation framework based on Monte Carlo bootstrapping experiments to quantify the uncertainty of model performances under different combinations of available high-resolution and daily rain gauges. Finally, we discuss our findings in the context of the rainfall regime of our study region and investigate the physical factors affecting performances of the different models. Results of this work will be useful, in future applications, to guide the selection of the technique for IDF generation that provides the best accuracy and lowest uncertainty under a given combination of available daily and high-resolution rain gauges.

## 2. Study area and dataset

The study area is located in a region in central Arizona that includes the Phoenix metropolitan area (Fig. 1a). The majority of the region is part of the Sonoran Desert where elevation is low (from 200 to 600 m MSL), while a minor portion is located in the transition zone to the Colorado Plateau, where elevation reaches up to 2000 m MSL in the Mogollon Rim. Climate is arid and characterized by a strong seasonality affecting the rainfall regime. In the summer months from July to September, the occurrence of the North American monsoon leads to localized convective thunderstorms with high rainfall intensity over short durations (<1 h). From November to March (hereinafter called winter), prolonged dry periods are interrupted by cold fronts that cause widespread storms lasting for up to multiple days. As a result of these different rainfall-generating mechanisms, annual rainfall maxima occur mainly in winter for *τ* ≥ 12 h and almost exclusively in summer for *τ* ≤ 2 h (Fig. 1b). Mascaro (2017, 2018) found also that annual rainfall maxima tend to increase with elevation and that the orographic control is more significant for larger time accumulations.

We use records of the Automated Local Evaluation in Real Time (ALERT) rain gauge network of the Flood Control District of the Maricopa County (FCDMC), installed to monitor in real time regional and localized storms, and support flood and flash-flood forecasting. The gauges have been gradually deployed since the beginning of the 1980s; as of 2020, their number is 365. In this study, 223 gauges with more than 20 years of observation are selected for the analysis (Figs. 1c,d). These gauges cover an area of ~29 600 km^{2}, with a mean density of ~1 gauge every 100 km^{2} that increases to 4.3 gauges in urban areas covering ~2000 km^{2}. The gauge elevations range from 220 to 2325 m MSL with the majority (195) installed below 800 m. All gauges are of the tipping-bucket type, with an accumulated rain depth of 1 mm. More details on this dataset and the characteristics of the study region are provided by Mascaro (2017, 2018), who used these rainfall records to investigate the spatiotemporal variability of rainfall statistical properties and the seasonality of daily extremes.

The FCDMC provides gauge data in raw format containing the tipping instants in seconds. We derive the cumulative rainfall signal for each year from the tipping instants with the technique described by Mascaro et al. (2013; see the appendix), which allows avoiding the discretization caused by the box counting method, that is often applied, and generating instead a smoother and more realistic signal. From this, the annual maximum rainfall intensities (mm h^{−1}) are extracted for *τ* = 30 min, 1, 2, 3, 6, 12, and 24 h through a moving window with duration *τ*. These durations are selected because a single scaling regime, which is needed to apply the IDF scaling models, emerges from 24 h to 30 min, consistent with previous applications (Menabde et al. 1999). We note that another regime appears in many stations for durations below 30 min, which are typical of single storms and cells. The existence of separate regimes at lower durations has been also reported by Innocenti et al. (2017) in an analysis of ~2700 stations in North America. In particular, these authors found a breaking point at ~1 h, which is higher than ours; this difference can be attributed to the diverse resolution of their datasets (15-min rainfall data with resolution of 2.54 mm). The approach proposed by Blanchet et al. (2016), in turn based on Papalexiou and Koutsoyiannis (2013), is adopted to retain or discard the annual rainfall maxima in years with missing data. The idea of this method is that, even if some data are missing in a given year, the annual maximum could have still been observed on that year and this value should be preserved. Assuming that *N* years without missing data are available and that a given year has a fraction *f* of missing data, the method involves the following steps: (i) the annual maxima of the *N* years are sorted, (ii) for a year with missing data, the annual maximum rainfall is extracted and its rank in the *N* sorted values is found, and (iii) this record is discarded or preserved if its rank is below or above *fN*, respectively. This procedure is applied for each duration *τ*. The resulting dataset includes a total of 44 781 annual rainfall maxima for seven durations over 223 gauges.

## 3. Methods

### a. The GEV distribution

The GEV distribution is used to describe the frequency of annual rainfall maxima at different durations *τ* in all IDF models tested here. If *I*_{τ} is the random variable of the rainfall maxima, the cumulative distribution function (CDF) of the GEV is defined as

where *k*_{τ} ∈ (−∞, +∞) is the shape parameter, *μ*_{τ} ∈ (−∞, +∞) is the location parameter, and *σ*_{τ} > 0 is the scale parameter. For *k*_{τ} = 0, the GEV is of type I or Gumbel (exponential) and *x* is defined in the set −∞ < *x* < ∞. For *k*_{τ} > 0, the GEV is of type II or Fréchet and is “heavy tailed” (subexponential); in this case, *μ*_{τ} − (*σ*_{τ}/*k*_{τ}) ≤ *x* ≤ ∞. If *k*_{τ} < 0, the GEV is of type III or Weibull and is bounded above so that −∞ < *x* ≤ *μ*_{τ} − (*σ*_{τ}/*k*_{τ}). The quantile of the GEV for a given return period *T*_{R} = 1/(1 − *F*) is computed as

Prior to applying the IDF models, we assess the suitability of the GEV through L-moments ratio diagrams and goodness-of-fit (GOF) tests, including Lilliefors, Anderson–Darling, and Cramér–von Mises. The GOF tests are applied locally following the modification proposed by Deidda and Puliga (2006) to account for the data discretization at 1-mm resolution. To account for results of multiple tests, the false discovery ratio test of Wilks (2006) is then used to evaluate the field or global significance with the resulting *p* values. The false discovery ratio test is performed at a global significance level *α*_{global} = 0.05; to consider the presence of spatial correlation of the rainfall records (Mascaro 2017), we conduct the local test at the significance level *α*_{local} = 2*·α*_{global}, as suggested by the synthetic experiments of Wilks (2016; his Fig. 4). Since some of the IDF models tested here are based on regionalization techniques, the GOF metric *Z* proposed by Hosking and Wallis (1997) is also computed. This metric quantifies the degree to which the L-moments of the fitted GEV distribution match the regional averages of the sampling L-moments. The fitted GEV is considered adequate if |*Z*| ≤ 1.64.

### b. Statistical models of rainfall IDF analysis

We apply two IDF models based on at-site estimations, one IDF model based on a regionalization technique, and two IDF simple scaling models. The models are described in the next sections, and their main features are summarized in Table 1.

#### 1) IDF models based on local parameter estimation

The simplest IDF model (hereinafter labeled as At-site) consists of locally estimating the GEV parameters *k*_{τ}, *μ*_{τ}, and *σ*_{τ} for each *τ*, using the probability-weighted moments (PWM) method, which has been shown to perform better than alternative techniques for relatively short sample sizes (Hosking and Wallis 1997). In an analysis of long-term global daily rainfall, Papalexiou and Koutsoyiannis (2013) showed that the Fréchet distribution (*k*_{τ} > 0) is the best suited to model annual maxima and proposed a set of analytical equations to bias correct *k*_{τ} as a function of the sample size. Here, to account for the sample size effect, an additional at-site IDF model is introduced where *k*_{τ} at each gauge is bias corrected using the relations proposed by Papalexiou and Koutsoyiannis (2013), and *μ*_{τ}, *σ*_{τ} are reestimated conditioned on the bias-corrected *k*_{τ}. This IDF model is labeled At-siteBC (or At-site bias corrected). For both At-site and At-siteBC models, estimates of *i*(*T*_{R}, *τ*) at ungauged locations can be obtained by spatially interpolating *k*_{τ}, *μ*_{τ}, and *σ*_{τ} and, then, applying Eq. (2). If *D* is the total number of durations, the application of these methods requires then 3*D* maps (Table 1).

#### 2) IDF model based on regional parameter estimation

Regionalization techniques have been proposed to increase the robustness of parameter estimation by combining records of multiple stations located in a region where the underlying statistical distribution of the annual rainfall maxima can be considered the same except for the mean. Here, we use the regionalization technique based on the index flood procedure proposed by Hosking and Wallis (1997) and used in several studies (Fowler and Kilsby 2003; Bonnin et al. 2004; Liu et al. 2015; García-Marín et al. 2019; among others). This method requires the estimation of a regional dimensionless distribution, known as growth curve, using the weighted averages of the L-moments calculated from records of all stations in the region (the weights are the sample sizes at each station). Local estimates of *i*(*T*_{R}, *τ*) are then obtained by multiplying the *T*_{R} quantiles of the growth curve, *i*_{GC}(*T*_{R}, *τ*), by the index flood, a variable representative of local rainfall characteristics. In this study, the GEV distribution is used to model the growth curve and the mean annual rainfall maxima at duration *τ*, $x\xaf\tau $, is utilized as index flood, leading to

The application of the regionalization method at ungauged sites can be achieved by spatially interpolating the sample index floods at available stations. This IDF model is labeled as Reg; its application requires 2*D* regional parameters (two GEV parameters of the growth curves for *D* durations; note that, since the mean of the growth curve is 1, the third parameter is calculated as a function of the first two) and *D* maps of $x\xaf\tau $ (Table 1). Here, we apply the Reg IDF model using a single homogenous region, whose existence is tested by computing, for each considered duration, the heterogeneity measure *H** proposed by Hosking and Wallis (1997) and described in the appendix.

#### 3) Local and regional IDF scaling models

A brief theoretical overview of simple scaling models is first provided. Based on empirical findings, Koutsoyiannis et al. (1998) proposed a mathematical formula to link *i*(*T*_{R}, *τ*) with *τ* and *T*_{R}:

where *a*(*T*_{R}) is determined from the probability distribution of the maximum rainfall intensity (e.g., the GEV), and *θ* ≥ 0 and 0 < *η* ≤ 1 are nonnegative coefficients to be estimated at the target site. By separating *T*_{R} from *τ*, an important advantage of Eq. (4) is that it allows relating rainfall intensities at different durations (e.g., *τ* and *τ*_{0}) for the same *T*_{R}:

For *θ* = 0, it has been shown (e.g., Burlando and Rosso 1996; Blanchet et al. 2016) that, when applied to random variables, Eq. (5) corresponds to the hypothesis of simple scaling:

where the symbol $=d$ indicates equality in distribution and *η* is the scaling exponent. Equation (6), in turn, implies that moments of order *q* are related as

A simple scaling model can be derived from Eq. (7) using the GEV distribution under the assumption of a constant shape parameter for all durations (i.e., *k*_{τ} = *k**). In this case, Eq. (7) leads to the following scaling relations for *μ*_{τ} and *σ*_{τ}:

Equations (8) imply that, if the GEV parameters are known at a coarse scale *τ*_{0} (here, 24 h), then GEV parameters can be derived at smaller scales *τ* via the scaling exponent *η*. Recently, the accuracy of this scaling model was tested by Innocenti et al. (2017) on ~2700 gauges located in North America, showing adequate performances over different scaling regimes from 1 h to 7 days. In this study, we follow the application of the GEV scaling model of Blanchet et al. (2016), which requires the estimation of *μ*_{τ0}, *σ*_{τ0}, *k**, and *η* at each gauge by maximizing the log-likelihood function that includes all durations at the same time (Blanchet et al. 2016). To apply this model (hereinafter labeled Sc) at ungauged sites, the four parameters have to be spatially interpolated; as a result, the model requires four spatial maps (Table 1). When available, records of additional daily gauges can be used to improve the accuracy of the spatial maps of *μ*_{τ0}, *σ*_{τ0}, and *k****.

In addition to the at-site scaling model Sc, we develop here (to our knowledge, for the first time) a regional simple scaling model by combining the scaling relations in Eqs. (5) and (7) to Eq. (3) linking the quantiles of local distribution and growth curve. Using Eq. (3), the ratio between the quantiles of the local distributions for two durations *τ* and *τ*_{0} can be expressed as

Per Eq. (5) with *θ* = 0: *i*(*T*_{R}, *τ*)/*i*(*T*_{R}, *τ*_{0}) = (*τ*/*τ*_{0})^{−η}. It is also easy to note that, for *q* = 1, Eq. (7) implies that

As a result, Eq. (9) reduces to *i*_{GC}(*T*_{R}, *τ*) = *i*_{GC}(*T*_{R}, *τ*_{0}), which shows that, under the simple scaling hypothesis, the regional growth curve is the same for all durations. This implies that rainfall intensity quantiles can be derived at each location and for any duration *τ* as a function of $x\xaf\tau 0$, the scaling exponent *η* of Eq. (10), and the GEV parameters of the single growth curve:

This IDF model, labeled RegSc, requires two regional parameters of the GEV distribution modeling the single growth curve and two spatial maps of $x\xaf\tau 0$ and *η* (Table 1). The availability of additional daily rain gauges allows improving the accuracy of the spatial maps of $x\xaf\tau 0$.

### c. Monte Carlo bootstrapping simulations

The IDF models are assessed as a function of different numbers of available high-resolution and daily rain gauges through cross validation with Monte Carlo simulations based on bootstrap resampling. The simulations consist of the following steps:

First

*N*_{val}= 100 high-resolution gauges are randomly selected from the*N*= 223 available and are used as independent sites for validation.From the (

*N*−*N*_{val}) remaining gauges,*N*_{τ}stations are randomly selected as high-resolution gauges and used to calibrate all models. Performance metrics (see section 3d) are computed at the*N*_{val}validation gauges.From the (

*N*−*N*_{val}−*N*_{τ}) remaining gauges,*N*_{24h}gauges are randomly extracted and used as additional daily gauges. To quantify the value of increasing information available at daily scale,*N*_{24h}= 1 gauge is initially selected and, then, other gauges are recursively added until*N*_{24h}= 50, resulting in the tested values*N*_{24h}= 1, 2, …, 10, 20, …, 50. This implies that, for example, the gauges selected for*N*_{24h}= 30 include the stations extracted for*N*_{24h}= 20. Note that daily records are simply obtained by aggregating the high-resolution records.For a given combination of

*N*_{τ}and*N*_{24h}stations, the Sc and RegSc scaling models are calibrated using the*N*_{τ}high-resolution and*N*_{24h}daily stations and performance metrics at the calibration gauges are quantified.Step 1 is repeated 10 times to characterize the sampling variability of the validation gauges. For each

*N*_{val}, step 2 is repeated 10 times for eight values of*N*_{τ}ranging from 1 to 50 to characterize the sampling variability of the high-resolution gauges once the validation stations are fixed. For given*N*_{val}and*N*_{τ}, steps 3 and 4 are repeated 10 times to characterize the sampling variability of the daily gauges. Overall, 10 × 10 × 8 × 10 = 8000 Monte Carlo iterations are performed.

The application of the IDF models at ungauged sites (i.e., the *N*_{val} validation gauges) involves the spatial interpolation of the model parameters (see Table 1 and section 3b), which can be achieved with several techniques (see, e.g., Watkins et al. 2005). Since evaluating the performance of the spatial interpolation method is out of the scope of this study, here we apply the inverse distance weighting interpolation method, which is widely known and easy to implement while being relatively accurate in our region (analyses not shown).

### d. Model diagnostic

The statistical models are diagnosed through a set of metrics proposed by Hosking and Wallis (1997). For a given *T*_{R}, rainfall duration *τ*, and site *j*, the *m*th IDF model returns the quantile of annual maximum rainfall intensity $ijm\u2061(TR,\u2009\tau )$. The relative bias from a reference intensity $ijREF\u2061(TR,\u2009\tau )$ is defined as

To provide an overall picture of the model performance, the mean relative bias and relative root-mean-square error can be respectively computed as

where *N* is the total number of validation stations. Note that, for simplicity, the symbols *m*, *τ*, and *T*_{R} have been omitted in RB_{j}, RB, and RRMSE. Here, the performance metrics are computed using two references: (i) the quantiles of the empirical distributions for *T*_{R} up to 30 years and (ii) the quantiles of the At-site IDF model for *T*_{R} extending through 200 years. In such a way, we test the models’ ability to simulate observed records and quantiles derived for large, nonobserved *T*_{R}, that are often needed for infrastructure design.

## 4. Results

### a. Evaluation of the GEV hypothesis

Figures 2a–c show the L-moment ratio diagrams for records of annual rainfall maxima at *τ* = 1, 6, and 24 h, respectively, of the 223 rain gauges (results are similar at other durations). In all cases, the sample estimates are scattered around the GEV line, suggesting that this distribution appears to be best candidate to model annual rainfall maxima at all durations. This is corroborated by the facts that (i) stations with shorter (longer) records tend to be located farther from (closer to) the GEV line, and (ii) the sample means are located very close to or lie exactly on the GEV line. Results of the GOF tests confirm the suitability of the GEV distribution with at-site and bias-corrected parameters (see section 4b). For both estimation methods and all durations, the false discovery ratio test applied to all GOF tests indicates that the global null hypothesis (i.e., the sample is drawn from a GEV distribution) cannot be rejected at *α*_{global} = 0.05. We also find that the regional GOF metric *Z* of Hosking and Wallis (1997) is always lower in absolute value than the suggested threshold of 1.64 (*Z* ≤ 0.81 for 1 ≤ *τ* ≤ 24 h and *Z* = 1.56 for *τ* = 30 min), thus indicating that the null hypothesis of a regional GEV cannot be rejected for any durations.

### b. Estimation of local and regional GEV parameters

The local estimation of the GEV parameters conducted to apply the At-site and At-siteBC models is summarized in Table 2, which reports minimum and maximum parameter values for all *τ* (boxplots are also reported in Fig. S1 of the online supplemental material). Figure 3 also shows the relations between shape parameter *k*_{τ} and sample size for three representative durations. In all cases, the sample values of *k*_{τ} have mean larger than 0 and variability that is captured very well by the relation proposed by Papalexiou and Koutsoyiannis (2013), depicted by the shaded area. The bias correction of *k*_{τ} significantly reduces the variability of the shape parameter, as shown by the histograms and the values in Table 2.

To apply the Reg IDF model, we first compute the heterogeneity measure *H**** and find this metric to range from 1.03 to 1.21 across the durations, suggesting that the hypothesis of a single operationally homogeneous region can be sustained. This is further supported by comparing visually the L-moment ratio diagrams of (i) the observed samples (Figs. 2a–c) and (ii) synthetic samples generated by a kappa distribution fitted to the regional L-moments ratios (Figs. 2d–f; see appendix for details). The regional GEV parameters of the growth curves for the different durations are presented in Table 3, along with the means used for the single growth curve in the RegSc IDF model.

### c. Evidence of simple scaling and estimation of GEV parameters for the scaling models

We find evidence of scaling in the range from *τ* = 30 min to *τ* = 24 h for *μ*_{τ}, *σ*_{τ} and mean of annual rainfall maxima $x\xaf\tau $. This is quantified by the coefficient of determination, *R*^{2}, of the linear regression between the log-transformed variables in Eqs. (8a) and (8b) and (10) being larger than 0.99 for *μ*_{τ} and $x\xaf\tau $ and 0.90 for *σ*_{τ}. As an example, Fig. 4 shows evidence of scaling for three representative gauges with L-moments ratios close to regional mean (identifier 52800) and in the upper-right (identifier 68200) and lower-left (identifier 41500) corners of the L-moments ratio diagram. The slope of the regression line provides an estimate of the scaling exponent *η*. The distributions of *η* computed for all gauges are presented as boxplots in Fig. 5. No evident differences exist between At-site and At-siteBC estimates. The *η* estimates for *μ*_{τ} are lower relative to those for *σ*_{τ} (mean *η* of 0.77 and 0.83 for *μ*_{τ} and *σ*_{τ}, respectively) and are characterized by less variability. While relatively small, these differences may affect performance of the simple scaling IDF models, which assume the same value of *η* for both location and scale parameters (see discussion in section 5). The distributions of *η* for $x\xaf\tau $ and *μ*_{τ} are very similar.

Based on the evidence of scaling, we estimate parameters of the Sc model at each gauge, including *k**, *μ*_{τ0}, *σ*_{τ0} (*τ*_{0} = 24 h), and the single scaling exponent, *η*. As reported in Table 2 (italic font), the ranges of *μ*_{τ0} and *σ*_{τ0} are very close to those of the local estimates at the same duration. In contrast, the range of *k** spans the values of all local estimates of this parameter across all durations. This result is expected because the estimate of a constant shape parameter is affected by the variability of rain rates at all durations. Figure 5 shows that *η* for Sc displays the same intergauge variability of the *η* estimates for *μ*_{τ} and $x\xaf\tau $.

### d. Performance of IDF models using all gauges

The IDF models are first applied using all *N* = 223 available gauges. Figure 6 presents examples of the comparison between the GEV parameterized with the five models and empirical CDFs for the same three gauges shown in Fig. 4. Results are shown for *τ* = 1, 6, and 24 h. At gauge 52800 (Figs. 6a–c), whose L-moments are close to the regional means at all durations, all models capture well the observed records, apart from a slight underestimation of the right tail by the scaling models (Sc and Reg1Sc) for *τ* = 1 h; that is, the lines are located to the left of the circles for larger values of the rainfall rate. At gauge 68200 (Figs. 6d–f), which is characterized by higher L-skewness and L-kurtosis than the regional means, all models capture well the bulk of the empirical distributions, but the regional models (Reg and RegSc) underestimate the right tails at all durations. At gauge 68200 (Figs. 6g–i), whose L-skewness and L-kurtosis are lower than the regional means, all models reproduce reasonably well all empirical distributions. Performance metrics computed on all gauges are summarized in Fig. 7 through heat maps showing RD and RRMSE for each model as a function of durations and return periods. Performances are relatively similar across all models when compared to the observed quantiles up to *T*_{R} = 30 years. When At-site quantiles are used as reference, the At-siteBC and Reg models exhibit the best performance with negligible (slightly positive) RB but higher (smaller) RRMSE for the Reg (At-siteBC) model. The scaling models (Sc and RegSc) have a larger range of RB with both positive and negative values, and RRMSE up to ~35% for *T*_{R} = 200 years. This lower accuracy using all available gauges is expected since these models depend on a lower number of parameters. These results provide an overview of model accuracy by quantifying their performance at the same sites used for model calibration. In the Monte Carlo bootstrapping experiments described next, model accuracy is tested using 100 randomly selected validation sites and varying the number of calibration gauges.

### e. Performance of IDF models as a function of the number of gauges

For the sake of conciseness, we present results of the Monte Carlo simulations for selected values of *τ*, *T*_{R}, *N*_{τ}, and *N*_{24h} that are representative of the overall outcomes. In the following figures, the simulation uncertainty is visualized through the mean and 90% confidence intervals of the metrics and by reporting results for three randomly chosen simulations. We begin by showing in Fig. 8 how performances change with *T*_{R} for *τ* = 30 min, *N*_{τ} = 5 high-resolution gauges and no additional daily gauges. When compared to the observed quantiles (Fig. 8a), all models display similar results, with RB decreasing (i.e., the bias becomes more negative) and RRMSE increasing as *T*_{R} rises. Performances are worse for the scaling models that have more negative RB and larger uncertainty, that is, the width of the 90% confidence interval is comparatively larger. It is worth noting that the At-site and At-siteBC models have very similar accuracy, implying that bias correcting the shape parameter does not significantly impact the ability of simulating the observed quantiles for *T*_{R} ≤ 30 years. Performances for larger *T*_{R} are explored using the At-site model as reference for the metrics (Fig. 8b). RRMSE increases significantly for *T*_{R} > 50 years, with similar values across all models. The At-site and Reg models are, on average, unbiased. For the At-siteBC model, RB increases with *T*_{R} because bias correcting the shape parameter makes the tail of the GEV distributions heavier leading to larger quantiles at higher *T*_{R}. This also results in lower uncertainty. The scaling models exhibit instead negative mean RB for *T*_{R} > 50 years (more significantly, for the Sc model).

As a next step, we investigate how performances vary with *τ*, reporting results in Figs. 9a and 9b for *N*_{τ} = 5 gauges, no additional daily gauges, and *T*_{R} = 10 and 100 years for the observed and At-site reference quantiles, respectively. When compared with the empirical quantiles, performances are similar for At-site, At-siteBC, and Reg models, with small variations of the metrics across the durations except for a slightly negative RB for *τ* < 12 h. The scaling models (Sc and RegSc) exhibit instead significant negative biases for *τ* < 12 h; their performances degrade particularly at *τ* = 1 h where the mean RB and RRMSE reach about −15% and 22%, respectively. Interestingly, for all models the simulation uncertainty increases with *τ*. These findings are largely in line with results for the case of *T*_{R} = 100 years with At-site quantiles used as reference (Fig. 9b), with the exceptions that (i) the At-site and At-siteBC models are positively biased, (ii) the Reg model is unbiased, and (iii) for all models, RRMSE for *τ* = 6 h is the largest.

The effect of the number of available gauges on model performances is explored in Figs. 10–12 . We first evaluate how accuracy changes with *N*_{τ} when no additional daily gauges are used, showing in Figs. 10a and 10b results for *τ* = 30 min and *T*_{R} = 10 (100 years) for the observed (At-site) reference quantiles. In all cases, (i) the uncertainty across the simulations is very large for *N*_{τ} = 1 gauge and decreases dramatically up to *N*_{τ} = 5 and at a lower rate for *N*_{τ} > 5, remaining constant after *N*_{τ} = 50 gauges (not shown); (ii) the average RB is constant across *N*_{τ}; and (iii) the mean RRMSE decreases as *N*_{τ} grows, reducing its value of ~30% when *N*_{τ} increases from 1 to 50. We subsequently investigate the value of additional information provided by daily rain gauges, considering the case of *N*_{τ} = 1 and *N*_{24h} = 10 that mimics the plausible situation of a single high-resolution gauge available (e.g., at the closest airport) plus additional daily gauges. Figure 11 displays the relations between metrics and *T*_{R} for At-Site and scaling models with *N*_{24h} = 0 and 10, using the At-Site quantiles as reference. It is apparent that adding ten daily gauges to the calibration of the scaling models results in a dramatic reduction of uncertainty, as quantified by the width of the 90% confidence interval decreasing by ~50%, and RRMSE, with reductions ranging from −7.6% for *T*_{R} = 10 years to −22% for *T*_{R} = 200 years. To summarize results of the simulations, Fig. 12 presents heat maps of mean and width of the 90% confidence intervals of RB and RRMSE for all models as a function of *N*_{τ} and *N*_{24h}, for *τ* = 30 min and *T*_{R} = 100 years. The key messages are (i) when *N*_{24h} = 0, results do not change significantly for *N*_{τ} ≥ 10; and (ii) the largest improvement introduced by the scaling models is for *N*_{τ} ≤ 5 and *N*_{24h} ≥ 10.

## 5. Discussion

We first interpret our findings in the context of the rainfall climatology of the study region. Following previous studies indicating that elevation exerts a moderate to significant control on rainfall processes and statistical properties in this area (Mascaro 2017, 2018), we investigate the relations between IDF model parameters and gauge elevation, *z*. Figures 13a–c show these relations for the GEV parameters of the At-site and Sc models for *τ* = 24 h, chosen as an example, while Fig. 13d presents the corresponding correlation coefficient (CC) as a function of *τ* (results are similar for the other models). The spatial maps of the parameters for two representative durations are also reported in Fig. S2 of the online supplemental material. The effect of elevation is important for *μ*_{τ} (0.67 ≤ CC ≤ 0.80) and, to a lesser extent, *σ*_{τ} (0.26 ≤ CC ≤ 0.65); for both parameters, the *p* values of the linear regressions are always lower than 0.01. The shape parameter *k*_{τ} does not exhibit instead significant orographic or geographic control. This is also true for the scaling exponent *η* as displayed in Fig. 13e and quantified by the high *p* values of the linear regressions with *z* (not shown). These findings suggest that terrain affects mainly the mean of the distributions of annual rainfall maxima, but it does not induce important effects on the shape of the distributions (i.e., the right tail) and the rainfall scaling properties. This result is consistent with Blanchet et al. (2016) for location and scale parameters, while it differs for shape parameter and scaling exponent that, in their study region, are affected by topography and distance from the coast.

The different elevation controls on GEV parameters and scaling exponent are a main factor impacting IDF model performances. In general, since all IDF models require estimates of *μ*_{τ} and *σ*_{τ} that are affected by terrain, the ability to reproduce extreme rainfall in this region depends on the elevation range of the available high-resolution gauges. In particular, At-site, At-siteBC and Reg models use information on parameter variability based on gauge elevation to estimate quantiles at each duration. Scaling models derive instead quantiles at the different durations through the scaling exponent *η*, which exhibits spatial random variations. As a result, when a sufficient number of high-resolution gauges is available to capture the orographic control on *μ*_{τ} and *σ*_{τ} (e.g., *N*_{τ} ≥ 10), At-site, At-siteBC and Reg models overperform scaling models because, for these, that number of gauges does not allow describing in detail the spatial variability of *η*. An additional factor conditioning the accuracy of simple scaling models is the adoption of a single scaling exponent, which may not hold at all gauges, as shown in Fig. 5.

## 6. Conclusions

The comparison of local, regional, and simple scaling models for rainfall IDF analysis conducted in this study using 223 rain gauges in central Arizona makes it possible to draw the following recommendations supporting the generation of IDF curves in other regions:

Performances of IDF models are impacted by the control of rainfall regime and local geographic features on the spatial variability of model parameters. In our study region, elevation controls the mean of the annual rainfall maxima distributions, which is related to the GEV scale and location parameters. The shape parameter and scaling exponent do not exhibit instead clear spatial patterns. These features lead to distinct impacts on model performance that may vary at other sites with different rainfall regimes. Future work based on synthetic experiments should investigate the relative importance of each model parameter under different scenarios where the corresponding spatial variability is assumed to be random or linked to geographic features (e.g., latitude, longitude, elevation, and distance from the coast).

As expected, model uncertainty and RRMSE decrease with the number of high-resolution gauges

*N*_{τ}used for calibration. This reduction is dramatic from*N*_{τ}= 1 to*N*_{τ}= 5, less significant for*N*_{τ}> 5, and negligible after*N*_{τ}= 50. The mean bias is mostly constant with*N*_{τ}*.*All models exhibit similar performances in the ability to simulate observed quantiles of relatively short records (≤30 years), apart from the scaling models that display slightly negative biases at shorter durations.

If

*N*_{τ}is large enough to allow capturing with sufficient details the spatial variability of the GEV scale and location parameters at different durations, the At-site, At-siteBC and Reg models have the best performances when compared to quantiles of the GEV distributions estimated locally (the At-site quantiles). In our study region, this occurs when*N*_{τ}≥ 10. The Reg and At-site model are unbiased and slightly positively biased, respectively. The bias correction of the shape parameter based on global rainfall records performed in the At-siteBC model leads to positive biases at larger*T*_{R}*.*Since these return periods are used for design of critical infrastructure, the use of the At-siteBC model is recommended. Future work should focus on (i) investigating whether the relations for correcting the bias of shape parameter estimates change for durations below 24 h and (ii) incorporating the shape parameter bias correction in regionalization techniques.The accuracy of scaling models is affected by two main factors, including the ability to capture the spatial variability of the scaling exponent, and the possibility that the assumption of a single scaling exponent may not hold. This may require the use of more complicated multiple scaling models (Burlando and Rosso 1996; Van de Vyver 2018). In our study site, simple scaling IDF models overperform local and regional models when

*N*_{τ}≤ 5 and more than 10 additional daily gauges are available. These conditions can be very common in developing countries.

## Acknowledgments

The author thanks Dr. Juliette Blanchet, two anonymous reviewers, and the peer-review editor for their constructive comments that helped to improve the quality of the paper. This work has been supported by the National Science Foundation (NSF) Award “SCC: Community-Based Automated Information for Urban Flooding” (Award 1831475). The support by NSF (Grant EAR-1928724) and NASA (Grant 80NSSC19K0726) to organize the 12th International Precipitation Conference (IPC12), Irvine, California, June 2019, and produce the IPC12 special collection of papers is also gratefully acknowledged. The author thanks Stephen D. Waters from the Flood Control District of Maricopa County for providing the rainfall data of the network. FCDMC data are also available online (https://www.fcd.maricopa.gov/625/Rainfall-Data).

### APPENDIX

#### Heterogeneity Measure

Following Hosking and Wallis (1997), let *t*^{R}, $t3R$, and $t4R$ be the regional L-coefficient of variation (L-CV), L-skewness, and L-kurtosis, computed by averaging the sampling L-moments ratios of records of *N* gauges with weights given by the sample size. Let also *V* be the weighted standard deviation of the sample L-CVs (labeled *V*_{sam}). The computation of the heterogeneity measure requires the following steps:

The kappa distribution is fitted to the regional L-moment ratios. This four-parameter distribution is chosen for its ability to represent many distributions used for rainfall extremes. In such a way, no commitment is made on the underlying distribution.

Then

*N*_{ens}= 2000 statistical simulations are conducted. In each simulation,*N*variates are generated from the kappa distribution using the same record lengths of the original gauges. Figures 2d–f show examples of scatterplots of L-skewness versus L-kurtosis for one simulation of kappa covariates for three durations.Calculate

*V*for each simulation, and from these 2000 values compute the mean,*μ*_{V}*.*Compute the heterogeneity measure

*H****=*V*_{sam}/*μ*_{V}, which allows detection of operationally homogeneous regions when a large number of gauges is used, as in this case where*N*= 223.

Values of *H**** close to 1 indicate that the sample L-CVs exhibit similar variability to those generated by a single homogenous region.

## REFERENCES

*J. Hydrol. Hydromech.*

*J. Hydrol.*

*Semiarid Southwest (Arizona, Southeast California, Nevada, New Mexico, Utah).*Vol. 1,

*Precipitation-Frequency Atlas of the United States*

*Nat. Hazards*

*J. Hydrol.*

*Technol. Soc.*

*An Introduction to Statistical Modeling of Extreme Values*

*Environ. Res. Lett.*

*Phys. Chem. Earth*

*Water Resour. Res.*

*Int. J. Climatol.*

*J. Hydrol.*

*Regional Frequency Analysis*

*Hydrol. Earth Syst. Sci.*

*Water Resour. Res.*

*J. Hydrol.*

*Geosci. Lett.*

*Nat. Hazards*

*Water Resour. Res.*

*J. Flood Risk Manage.*

*Water Resour. Res.*

*J. Hydrol.*

*J. Hydrometeor.*

*J. Hydrol.*

*Hydrol. Earth Syst. Sci.*

*J. Appl. Meteor. Climatol.*

*Water Resour. Res.*

*Framing the Challenge of Urban Flooding in the United States*

*Water Resour. Res.*

*Water Resour. Res.*

*Hydrol. Processes*

*Hazards Earth Syst. Sci*

*Stochastic Environ. Res. Risk Assess.*

*Hydrol. Processes*

*J. Geophys. Res.*

*J. Amer. Water Resour. Assoc.*

*J. Appl. Meteor. Climatol.*

*Bull. Amer. Meteor. Soc.*

*J. Hydrol.*

*Nature*