## Abstract

To satisfy a wide range of end users, rainfall ensemble forecasts have to be skillful for both low precipitation and extreme events. We introduce local statistical postprocessing methods based on quantile regression forests and gradient forests, with a semiparametric extension for heavy-tailed distributions. These hybrid methods make use of the forest-based outputs to fit a parametric distribution that is suitable to model jointly low, medium, and heavy rainfall intensities. Our goal is to improve ensemble quality and value for all rainfall intensities. The proposed methods are applied to daily 51-h forecasts of 6-h accumulated precipitation from 2012 to 2015 over France using the Météo-France ensemble prediction system called Prévision d’Ensemble ARPEGE (PEARP). They are verified with a cross-validation strategy and compete favorably with state-of-the-art methods such as the analog ensemble or ensemble model output statistics. Our methods do not assume any parametric link between the variables to calibrate and possible covariates. They do not require any variable selection step and can make use of the more than 60 available predictors, such as summary statistics of the raw ensemble, deterministic forecasts of other parameters of interest, or probabilities of convective rainfall. In addition to improvements in overall performance, the hybrid forest-based procedures produce the largest skill improvements for forecasting heavy rainfall events.

## 1. Introduction

Accurately forecasting weather is paramount for a wide range of end users (e.g., air traffic controllers, emergency managers, and energy providers) (see, e.g., Pinson et al. 2007; Zamo et al. 2014). In meteorology, ensemble forecasts try to quantify forecast uncertainties due to observation errors and the incomplete physical representation of the atmosphere. Despite recent developments in national meteorological services, ensemble forecasts still suffer from bias and underdispersion (see, e.g., Hamill and Colucci 1997; Hagedorn et al. 2012; Baran and Lerch 2018). Consequently, they need to be postprocessed in order to alleviate bias and misdispersion (Hamill 2019). Moreover, not all meteorological variables are equal in terms of forecast and calibration. In particular, Hemri et al. (2014) highlighted how difficult skillful rainfall forecasting remains.

### a. Existing postprocessing methods and their rainfall-specific applications

At least two types of statistical methods have emerged in the last decades: analog ensemble (e.g., Delle Monache et al. 2013; Junk et al. 2015; Keller et al. 2017; Hamill and Whitaker 2006; Alessandrini et al. 2015; Eckel and Delle Monache 2016) and ensemble model output statistics (EMOS; see, e.g., Wilks 2015; Scheuerer and König 2014; Scheuerer and Möller 2015; Gneiting et al. 2005). The analog ensemble is fully nonparametric and consists of finding similar atmospheric situations in the past and using them to improve the present forecast. In contrast, EMOS belongs to the family of parametric regression schemes. If *y* represents the weather variable of interest and $x = (x_1, \ldots, x_m)$ denotes the corresponding *m* ensemble member forecasts, then the EMOS predictive distribution is simply a distribution whose parameters depend on the values of $x$.

Less conventional approaches have also been studied recently. For example, Van Schaeybroeck and Vannitsem (2015) investigated member-by-member postprocessing techniques, and Taillardat et al. (2016) found that quantile regression forests (QRF) techniques performed well for temperature and wind speed forecasting.

Modeling precipitation distributions is a challenge in itself: precipitation is a mixture of zeros (dry events) and positive intensities (i.e., rainfall amounts for wet events), and the latter have a skewed distribution. For daily precipitation, extended logistic regression is frequently applied (see, e.g., Hamill et al. 2008; Roulin and Vannitsem 2012; Ben Bouallègue 2013). Bayesian model averaging techniques (Raftery et al. 2005; Sloughter et al. 2007) have also been used in rainfall forecasting, with a gamma fit often applied to cube-root-transformed precipitation accumulations. As for the analog and EMOS techniques, they have been applied to calibrate daily rainfall (see, e.g., Hamill and Whitaker 2006; Scheuerer 2014; Scheuerer and Hamill 2015).

It appears that one popular and flexible choice to model rainfall amounts is to use the gamma distribution or to build on it. Most studies use predictors such as those listed in Table 1. For example, Scheuerer and Hamill (2015) and Baran and Nemoda (2016), in a rainfall calibration context, employed the censored-shifted gamma (CSG) probability distribution function (pdf), defined for any censoring level $\delta > 0$ and any positive *y* by

$$ f(y) = G_{k,\theta}(\delta)\,\mathbb{1}\{y = 0\} + g_{k,\theta}(y + \delta)\,\mathbb{1}\{y > 0\}, \qquad (1) $$

where the positive constants $k$ and $\theta$ are the two gamma parameters, $g_{k,\theta}$ and $G_{k,\theta}$ denote the associated gamma pdf and cdf, and the probability $G_{k,\theta}(\delta)$ represents the mass of the gamma distribution below *δ*. The probabilities of zero and positive precipitation are thus treated together. One possible drawback of the CSG is that heavy daily and subdaily rainfall may not always have an upper tail with the exponential decay of a gamma distribution, but rather a polynomial one. This latter point is a key element in any weather risk analysis (see, e.g., Katz et al. 2002; De Haan and Ferreira 2006).

To bring the desirable flexibility in modeling upper-tail behavior to a rainfall EMOS context, Scheuerer (2014) worked with a so-called censored generalized extreme value (CGEV) pdf, defined for any positive *y* by

$$ f(y) = G(0)\,\mathbb{1}\{y = 0\} + g(y)\,\mathbb{1}\{y > 0\}, \qquad (2) $$

in terms of the pdf *g* associated with the classical generalized extreme value (GEV) distribution. Besides, the point mass $G(0)$ corresponds to the probability of zero precipitation. Recall that the latter GEV distribution has a cumulative distribution function (cdf) *G* defined by

$$ G(y) = \exp\left\{ -\left[ 1 + \xi \left( \frac{y - \mu}{\sigma} \right) \right]_{+}^{-1/\xi} \right\}, \qquad (3) $$

with location *μ*, scale *σ* > 0, and shape *ξ*.

### b. Forest-based methods in numerical weather prediction

To our knowledge, the first application of random forest algorithms (Breiman 2001) in the area of numerical weather prediction occurred in 2009. Gagne et al. (2009) used this technique for the classification of convective areas and, later on, to produce hail occurrence forecasts (Gagne et al. 2017). Recently, Herman and Schumacher (2018) applied random forests to probabilistic forecasts of extreme precipitation. Zamo et al. (2014) used both random forests and QRF for photovoltaic electricity production, using meteorological variables. In a postprocessing context, Taillardat et al. (2016) showed that QRF performs better than EMOS for the postprocessing of temperature and wind speed. One may wonder if QRF could favorably compete with EMOS and the analog ensemble for rainfall calibration. This question is particularly relevant since methodological advances have recently been made concerning random forests and quantile regressions. In particular, Athey et al. (2016) proposed an innovative way, called gradient forests (GF), of using forests for quantile regression.

### c. Limits of forest-based methods, and a semiparametric procedure for rainfall calibration

As pointed out in Taillardat et al. (2016) and Rasp and Lerch (2018), one drawback of data-driven approaches like QRF and GF is that their intrinsic nonparametric nature makes them unable to predict beyond the largest recorded rainfall. To circumvent this limit, we propose here to combine random forest techniques with a parametric distribution.

Indeed, in order to get distributions representing not only heavy rainfall but also low precipitation amounts, Naveau et al. (2016) recently proposed a class of models referred to as the extended generalized Pareto (EGP) that allows a smooth transition between generalized Pareto (GP)-type tails and the middle part (bulk) of the distribution. It bypasses the complex threshold selection step used to define extremes. Low precipitation can be modeled by a gamma distribution, while the Pareto distribution better fits heavy rainfall. The benefits of using (1) and (2) are kept, without their drawbacks. In other words, our model for the precipitation pdf is

$$ f(y) = (1 - \pi)\,\mathbb{1}\{y = 0\} + \pi\, f_{\mathrm{EGP}}(y)\,\mathbb{1}\{y > 0\}, \qquad (4) $$

where $f_{\mathrm{EGP}}$ is the pdf associated with the EGP cdf $F(y) = \left[ H_\xi(y/\sigma) \right]^{\kappa}$, with $H_\xi$ the cdf of the GP distribution. In contrast to (1) and (2), the probability weight *π* is not obtained by censoring; it is just a parameter, independent of any censoring level. The differences between this distribution and the gamma one are illustrated in Fig. 1 of Naveau et al. (2016).
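As a minimal illustration of the EGP family used in (4), the sketch below (in Python, with function names of our choosing) implements the cdf and quantile function of the member $F(y) = [H_\xi(y/\sigma)]^{\kappa}$ for $\xi > 0$, plus the mixture cdf with a point mass at zero; it is a sketch under these assumptions, not the paper's code:

```python
import numpy as np

def egp_cdf(y, kappa, sigma, xi):
    """cdf of the EGP family (i) of Naveau et al. (2016):
    F(y) = [H_xi(y/sigma)]^kappa, with H_xi the GP cdf (xi > 0 assumed)."""
    z = np.asarray(y, dtype=float) / sigma
    h = 1.0 - (1.0 + xi * z) ** (-1.0 / xi)  # generalized Pareto cdf
    return h ** kappa

def egp_quantile(u, kappa, sigma, xi):
    """Inverse of egp_cdf, useful for simulation and quantile forecasts."""
    u = np.asarray(u, dtype=float)
    return (sigma / xi) * ((1.0 - u ** (1.0 / kappa)) ** (-xi) - 1.0)

def rainfall_cdf(y, pi_rain, kappa, sigma, xi):
    """Mixture cdf in the spirit of (4): point mass 1 - pi_rain at zero,
    EGP bulk and tail for the positive intensities."""
    return (1.0 - pi_rain) + pi_rain * egp_cdf(y, kappa, sigma, xi)
```

The parameter *κ* controls the lower tail (small intensities) while *ξ* controls the polynomial decay of the upper tail, which is what distinguishes this family from the gamma-based model (1).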

Hence, combining local random-forest-based postprocessing techniques with such a parametric family aims at predicting extreme events, and this provides an interesting path to improve prediction beyond the largest values of the sample at hand.

### d. Main contributions of this paper and outline

In this study, we will focus on 6-h rainfall amounts in France (denoted later by “rr6”) because this is the unit of interest of the ensemble forecast system of Météo-France. We propose to implement and test QRF and GF methods (and their semiparametric extensions) for station-wise rainfall calibration and compare them with other approaches such as EMOS and analog ensemble. The EMOS approach is implemented with three different parametric pdfs, respectively defined by (1), (2), and (4). Besides comparing these three EMOS models, it is natural to wonder if other nonparametric approaches such as analog ensemble are competitive.

This article is organized as follows. In section 2, we recall the basic ingredients to create quantile regression forests and gradient forests, as well as the calibration process recently introduced by Athey et al. (2016) for quantile regression through GF. The way the trees are combined with the EGP pdf defined by (4) is then detailed. In section 3, the state-of-the-art EMOS procedures are sketched, and the integration of the EGP pdf within an EMOS scheme is explained. Section 4 presents the analog ensemble technique. The different approaches are summarized in Table 2 and implemented in section 5. Therein, the test bed dataset of 86 French weather stations and the French ensemble forecast system of Météo-France called Prévision d’Ensemble ARPEGE (PEARP; Descamps et al. 2015) are described, and the verification tools used in this study are presented. Then, we assess and compare each method, with special interest in heavy rainfall, in section 6. The paper closes with a discussion in section 7.

## 2. Quantile regression forests, gradient forests, and semiparametric tail extension

### a. Quantile regression forests

Given a sample of predictor–response pairs, say $(x_i, y_i)$ for $i = 1, \ldots, n$, classical regression techniques connect the conditional mean of a response variable *Y* to a given set of predictors *X*. The quantile regression forest method introduced by Meinshausen (2006) also consists in building a link, but between an empirical cdf and the outputs of binary decision trees. Before explaining this particular cdf, we need to recall how trees are constructed.

A random forest is an aggregation of randomized trees based on bootstrap aggregation on the one hand, and on classification and regression trees (CART; Breiman et al. 1984; Breiman 1996) on the other hand. These trees are built on a bootstrap copy of the samples by recursively maximizing a splitting rule. Let $O$ denote the group of observations to be divided into two subgroups, say $O_1$ and $O_2$. For each group, we can infer its homogeneity defined by

$$ H(O_j) = \sum_{i \in O_j} \left( y_i - \bar{y}_{O_j} \right)^2, $$

where $\bar{y}_{O_j}$ corresponds to the sample mean in $O_j$. To determine if this splitting choice is optimal, the homogeneities $H(O_1)$ and $H(O_2)$ are compared to the one of $O$. For example, if wind speed is one predictor in *X* and if dividing low and large winds can better explain rainfall, then the cutting value, say *s*, will be the one that maximizes

$$ H(O) - \left[ H\big(O_1(s)\big) + H\big(O_2(s)\big) \right] $$

over the candidate predictors in $V$, where $V$ is a random subset of the predictors in the predictors' space. This dichotomy procedure goes on until a stopping criterion is reached. This criterion is usually defined by a minimum number of observations in the final groups, called leaves. For a better understanding, an example of a binary regression tree is provided at the top of Fig. 1. Each tree is elaborated from a bootstrap sample of the training data. The method is called "random forest" since each split of each tree is built on a random subset of the predictors (Breiman 2001). Binary regression trees can be viewed as decision trees, each node being the criterion used to split the data and each final leaf giving the predicted value. For example, if a new set of predictors *x* (represented by a blue cross in Fig. 1) is available, one can find the final leaf of each tree that corresponds to the values of *x* and the associated observations $y_i$. The conditional cdf introduced by Meinshausen (2006) can then be computed as

$$ \hat{F}(y \mid x) = \sum_{i=1}^{n} w_i(x)\, \mathbb{1}\{ y_i \le y \}, \qquad (6) $$

where the weights $w_i(x)$ are deduced from the presence of $(x_i, y_i)$ in a final leaf of each tree when one follows the path determined by *x*. An example of a five-tree forest is given at the bottom of Fig. 1. The interested reader is referred to Taillardat et al. (2016) for an application of this approach to ensemble forecasts of temperature and wind speed.
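A minimal sketch of Meinshausen's weighting scheme, assuming the leaf assignments of the trees are already available (here as plain integer arrays rather than a fitted forest; the function names are ours):

```python
import numpy as np

def qrf_weights(train_leaves, query_leaves):
    """Meinshausen (2006) weights. train_leaves has shape (n, T): the leaf
    reached by each of the n training samples in each of T trees;
    query_leaves (length T) is the leaf path of the new predictor vector x."""
    n, T = train_leaves.shape
    w = np.zeros(n)
    for t in range(T):
        in_leaf = train_leaves[:, t] == query_leaves[t]
        w[in_leaf] += 1.0 / in_leaf.sum()  # equal weight inside each leaf
    return w / T                            # average over trees

def qrf_cdf(y_grid, y_train, weights):
    """Predicted cdf (6): F_hat(y | x) = sum_i w_i(x) 1{y_i <= y}."""
    return np.array([(weights * (y_train <= y)).sum() for y in y_grid])
```

By construction the weights sum to one, so (6) is a proper cdf supported on the observed training values, which is precisely the limitation addressed by the tail extension of section 2c.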

### b. Gradient forests

Meinshausen (2006) proposed a splitting rule using CART regression splits. Arguing that this splitting rule is not tailored to the quantile regression context, Athey et al. (2016) proposed another optimization scheme. Instead of maximizing the variance heterogeneity of the children nodes, one maximizes the criterion

$$ \tilde{\Delta}(O_1, O_2) = \sum_{j=1}^{2} \frac{1}{|O_j|} \left( \sum_{i \in O_j} \mathbb{1}\{ y_i > \hat{\theta}_q \} \right)^2, $$

where the indicator function is equal to one when $y_i$ is greater than $\hat{\theta}_q$, the *q*th quantile of the observations of the parent node. The proposed criterion is the equivalent of a measure of homogeneity given a target functional. In the special case of least squares regression, the indicator pseudo-outcome becomes the residual $y_i - \bar{y}$, and the criterion becomes equivalent to the CART variance-reduction rule. In this special case, building a gradient tree is equivalent to building a standard CART regression tree.

The terminology of "gradient forests" (or also generalized random forests) is suggested because the choice of the pseudo-outcomes $\mathbb{1}\{ y_i > \hat{\theta}_q \}$ is linked with a gradient-based approximation of the quantile function. Indeed, for quantile regression, this target functional is nondifferentiable. This technique using gradients is computationally feasible (far cheaper than using the quantile function directly), an advantage not to be neglected when dealing with nonparametric techniques. Note here that for each split, the order *q* of the quantile is chosen among given orders. A variant of Fig. 1 could have been provided just by modifying the rules used to split the observation set.
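The split criterion above can be sketched as follows; `best_split` scans the cut points of a single predictor for one quantile order *q*. The function names are ours, and ties and minimum leaf sizes are ignored in this toy version:

```python
import numpy as np

def gf_quantile_split_score(y_left, y_right, theta_parent):
    """Athey et al. style criterion for a quantile split: pseudo-outcomes are
    rho_i = 1{y_i > theta_parent}; score = sum_j (sum_i rho_i)^2 / |O_j|."""
    score = 0.0
    for part in (y_left, y_right):
        rho = (np.asarray(part) > theta_parent).astype(float)
        score += rho.sum() ** 2 / len(part)
    return score

def best_split(x, y, q=0.5):
    """Scan all cut points of one predictor; return the cut maximizing the score."""
    theta = np.quantile(y, q)          # q-th quantile of the parent node
    cuts = np.unique(x)[:-1]           # candidate cutting values
    scores = []
    for s in cuts:
        left, right = y[x <= s], y[x > s]
        scores.append(gf_quantile_split_score(left, right, theta))
    k = int(np.argmax(scores))
    return cuts[k], scores[k]
```

On data where the response jumps at some predictor value, the criterion is maximized at that cut, because concentrating the exceedances of the parent quantile in one child maximizes the squared sums.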

### c. Extension of the QRF and GF methods

As mentioned in the introduction, the predicted cdf defined by (6) cannot predict values that are not in the learning sample. This can be a strong limitation if the learning sample is small, if rare events are of interest, or both. The GF method exhibits the same weakness. When the aim is to model rainfall parametrically, the EGP pdf defined by (4) appears to be a good candidate. It allows more flexibility in the fitting than the CSG or CGEV, and encompasses their tail behaviors. Indeed, this distribution, with its four parameters *π*, *κ*, *σ*, and *ξ*, is able to represent both low and heavy rainfall, and it provides satisfying results in practice (see, e.g., Naveau et al. 2016). In terms of inference, a simple, robust, and fast method of moments can be applied. One cannot apply a CRPS minimization scheme here because at this stage one only has the predicted cdf coming from the QRF and/or GF outputs. Probability weighted moments (PWM) of a given random variable, say *Y*, with survival function $\bar{F} = 1 - F$, can be expressed as

$$ \mu_j = E\left[ Y\, \bar{F}(Y)^j \right], \quad j = 0, 1, 2, \ldots; $$

see Hosking and Wallis (1987). If *Y* follows an EGP pdf defined by (4), then these PWMs admit closed-form expressions involving the beta function (Naveau et al. 2016). Knowing the PWM triplet $(\mu_0, \mu_1, \mu_2)$ is equivalent to knowing the parameter vector $(\kappa, \sigma, \xi)$. Hence, we just need to estimate this triplet of moments. For any given forest and for one raw forecast *x*, it is possible to estimate the distribution of *Y* given *x* by the empirical cdf defined by (6). Then, we can plug this empirical cdf into the PWM expressions to get the estimates $(\hat{\mu}_0, \hat{\mu}_1, \hat{\mu}_2)$.

This leads to the estimates $(\hat{\kappa}, \hat{\sigma}, \hat{\xi})$ and consequently to the estimated predictive pdf via (4). Note that the probability of no rain is simply inferred by counting the number of dry events in the corresponding trees. In the following, this technique is called "EGP TAIL," despite the fact that the whole distribution is fitted from the QRF and GF trees. We present this method as a semiparametric one insofar as we use a sample coming from a nonparametric technique in order to fit a parametric distribution.
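The moment identity $\mu_j = E[Y \bar{F}(Y)^j]$ behind this fit can be checked numerically. The sketch below estimates PWMs from a sample with plotting positions and compares them with a quadrature of the model PWMs; the closed-form inversion used in the paper is not reproduced here, and the function names are ours:

```python
import numpy as np

def egp_quantile(u, kappa, sigma, xi):
    # quantile function of EGP family (i); used here only to build test samples
    return (sigma / xi) * ((1.0 - np.asarray(u) ** (1.0 / kappa)) ** (-xi) - 1.0)

def empirical_pwm(sample, j):
    """mu_j = E[Y * Fbar(Y)^j], estimated with plotting positions i/(n+1)."""
    y = np.sort(np.asarray(sample, dtype=float))
    n = len(y)
    surv = 1.0 - np.arange(1, n + 1) / (n + 1.0)  # survival at each order stat
    return np.mean(y * surv ** j)

def model_pwm(j, kappa, sigma, xi, m=200000):
    """mu_j = int_0^1 Q(u) (1-u)^j du, by midpoint quadrature."""
    u = (np.arange(m) + 0.5) / m
    return np.sum(egp_quantile(u, kappa, sigma, xi) * (1.0 - u) ** j) / m
```

In the hybrid methods, `empirical_pwm` would be fed with the (weighted) QRF or GF output sample for a given forecast *x*, and the resulting triplet inverted to obtain $(\hat{\kappa}, \hat{\sigma}, \hat{\xi})$.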

## 3. Ensemble model output statistics and EGP

For now, three parametric pdfs have been highlighted. We propose to test an EMOS scheme for each distribution. The settings of these methods are presented for the CSG and CGEV. To the best of our knowledge, this is the first time that the EGP distribution has been tested in an EMOS scheme. Table 3 summarizes the regression models that perform best (and thus are used) for each distribution. Note that different variable selection algorithms have been tried but did not improve the performance of these EMOS methods; see the appendix for further details.

### a. Censored-shifted gamma and censored generalized extreme value distributions

Different EMOS models have been proposed for the CSG and CGEV pdfs, respectively defined by (1) and (2), by regressing their parameters on the ensemble values. More precisely, Baran and Nemoda (2016) used the CSG pdf by letting the mean and variance depend linearly on the raw ensemble values and their mean, respectively. The coefficients of this regression were estimated by minimizing the continuous ranked probability score (CRPS; see Hersbach 2000; Scheuerer and Hamill 2015). The same strategy can be applied to fit the CGEV pdf (see, e.g., Hemri et al. 2014). Scheuerer (2014) modeled the scale parameter *σ* in (2) as an affine function of the ensemble mean absolute deviation rather than of the raw ensemble mean or variance. Another point to emphasize is that the shape parameter *ξ* was considered invariant in space in Hemri et al. (2014). The regression models used are sketched in Table 3.
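The minimum-CRPS estimation principle can be sketched with a deliberately simplified Gaussian EMOS (not the CSG or CGEV models used in the paper), since the Gaussian CRPS has a well-known closed form (Gneiting et al. 2005). Function names are ours and SciPy is assumed available:

```python
import numpy as np
from math import erf, exp, pi, sqrt
from scipy.optimize import minimize

def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast (Gneiting et al. 2005)."""
    z = (y - mu) / sigma
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))      # standard normal cdf
    phi = exp(-0.5 * z * z) / sqrt(2.0 * pi)    # standard normal pdf
    return sigma * (z * (2.0 * Phi - 1.0) + 2.0 * phi - 1.0 / sqrt(pi))

def fit_emos(ens_mean, ens_var, obs):
    """Minimum-CRPS estimation of mu = a + b*mean and sigma^2 = c + d*var."""
    def mean_crps(params):
        a, b, c, d = params
        mu = a + b * ens_mean
        sig = np.sqrt(np.maximum(c + d * ens_var, 1e-6))  # keep sigma positive
        return np.mean([crps_normal(m, s, y) for m, s, y in zip(mu, sig, obs)])
    res = minimize(mean_crps, x0=[0.0, 1.0, 1.0, 1.0], method="Nelder-Mead",
                   options={"maxiter": 2000, "fatol": 1e-8, "xatol": 1e-6})
    return res.x
```

On synthetic data with a known linear relation between ensemble mean and observation, the minimization recovers the regression coefficients to within sampling error; the CSG, CGEV, and EGP schemes of the paper follow the same principle with their respective closed-form CRPS expressions.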

### b. Extended generalized Pareto distribution

In this part, we explain how an EMOS approach can be built with the EGP pdf defined by (4), and we also point out common features of and differences from the two EMOS schemes with the CSG and CGEV. The scale parameter *σ* in (4) is estimated in the same way as with the CGEV. The presence of the parameter *κ* allows an additional degree of freedom but may lead to a more difficult estimation of the parameters. Note that for the distribution defined by (4) the CRPS admits a closed form involving the incomplete beta and the beta functions. The expectation of the EGP is mainly driven by the product *κσ*. Consequently, we model *κ* as an affine function of the predictors divided by *σ*. As France has a diverse climate, it is not reasonable to assume a constant shape parameter among all locations (see Fig. 2). In addition, minimizing the CRPS to infer different shape parameters may be inefficient (see, e.g., Friederichs and Thorarinsdottir 2012). To estimate *ξ* at each location, we simply use the PWM inference scheme described in Papastathopoulos and Tawn (2013), applied to the observed sample climatology. This first step is done before any calibration procedure. Then, to complete the estimation of the parameters in (4), the probability *π* is modeled as an affine function of the raw ensemble probability of rain, and the affine coefficients are also estimated by CRPS minimization.

## 4. Analog ensemble

Contrary to EMOS, this technique is data driven, since one does not make any parametric assumption on the variable to calibrate. An analog for a given location and forecast lead time is defined as a past prediction, from the same model, that has similar values for selected features of the current model forecast. The analog ensemble consists in finding these closest past forecasts, according to a given metric on the predictors' space, to build an analog-based ensemble (see, e.g., Hamill and Whitaker 2006). We assume here that close forecasts lead to close observations. Making use of analogs requires choosing both the set of predictors and the metric. Concerning the metric, several have been tried, such as the Euclidean or the Mahalanobis distance, but they have been outperformed by the metric provided in Delle Monache et al. (2013):

$$ \left\| F_t, A_{t'} \right\| = \sum_{i=1}^{N_v} \frac{w_i}{\sigma_{f_i}} \sqrt{ \sum_{j=-\tilde{T}}^{\tilde{T}} \left( F_{i,t+j} - A_{i,t'+j} \right)^2 }, $$

where $F_{i,t}$ represents the current forecast of predictor *i* at time *t* for a given location, and $A_{i,t'}$ is the analog candidate at another time *t′* at this same location. The number of predictors is $N_v$, and $\tilde{T}$ is half the time window used to search for analogs. We standardize the distance by the standard deviation $\sigma_{f_i}$ of each predictor, calculated on the learning sample for the considered location. In this study we take $\tilde{T} = 4$ (with 6-h time steps), so the time window is ±24 h around the forecast to calibrate. This distance has the advantage of being flow dependent and thus defines a real weather regime associated with the search for analogs. Note that one could weight the different predictors through the $w_i$; we fixed $w_i = 1$ for all predictors in a first method (Analogs). Two other weighting techniques have been tested: one using (as in Zhou and Zhai 2016) the absolute value of the correlation coefficient between predictors and the response variable (Analogs_COR), and one based on the frequency of predictors' occurrences in the variable selection algorithm described in the appendix (Analogs_VSF). A fourth method takes only the predictors coming from Table 1 (Analogs_C). Note finally that other weighting techniques have been considered (Horton et al. 2017; Keller et al. 2017), but we did not use them in this study because of their computational cost. In the same vein as Fig. 1, Fig. 3 provides an example of the analog ensemble method in a two-dimensional space.
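A direct transcription of this metric, with our function names and forecasts stored as predictor × time arrays, might look as follows:

```python
import numpy as np

def analog_distance(F, A, t, t_prime, sd, weights=None, half_window=4):
    """Flow-dependent analog metric of Delle Monache et al. (2013).
    F, A: arrays of shape (n_predictors, n_times) holding the current and
    candidate past forecasts; sd: per-predictor standard deviation computed
    on the learning sample; half_window: half the search window, in time steps."""
    n_pred, _ = F.shape
    w = np.ones(n_pred) if weights is None else np.asarray(weights)
    j = np.arange(-half_window, half_window + 1)
    d = 0.0
    for i in range(n_pred):
        diff = F[i, t + j] - A[i, t_prime + j]
        d += (w[i] / sd[i]) * np.sqrt(np.sum(diff ** 2))
    return d
```

Ranking all admissible past times *t′* by this distance and keeping the observations attached to the closest ones yields the analog-based ensemble; the weighting variants of the paper only change the `weights` argument.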

## 5. Application on the PEARP ensemble prediction system

### a. Data description

Our rainfall dataset corresponds to 6-h rainfall amounts ("rr6") measured at 86 French weather stations, together with forecasts from the 35-member ensemble system PEARP (Descamps et al. 2015) at a 51-h lead time for the 1800 UTC initialization. PEARP is the global ensemble prediction system of Météo-France, producing forecasts up to 4.5 days at 4 different initialization times. The ensemble uses a TL798C2.4 configuration and has 90 vertical levels from 14 m up to 1 hPa. This ensemble is hydrostatic and has a resolution of about 10 km over western Europe. Our period of interest spans the 4 years from 1 January 2012 to 31 December 2015. The numerical model outputs are interpolated to each station, and the rainfall observations are based on quality-controlled rain gauges. In addition, about 70% of the stations are located in a marine climate (Köppen classification: Cfb). These stations are mostly represented with green dots in Fig. 2.

### b. Sets of predictors used

The choice of the predictors in Table 1 comes from the studies of both Hemri et al. (2014) and Scheuerer and Hamill (2015). This is a classical set of predictors for the postprocessing of rainfall amounts. The whole set of available predictors is listed in Table 4. It includes summary statistics and probabilities from the rainfall ensemble, but also deterministic forecasts of variables highly linked with (potentially high) precipitation amounts, and statistics on other weather variables of the ensemble, in order to get information about predictability. As for the EMOS schemes, sets of predictors based on variable selection algorithms have been tested. They have not been retained for this study since the gain in predictive performance was negative. See the appendix for details.

### c. Verification of ensembles

We recall here some facts about the scores used in this study.

#### 1) Reliability

Reliability between observations and a predictive distribution can be checked by calculating $F(Y)$, where *Y* is the observation and *F* is the cdf of the associated predictive distribution. Subject to calibration, the random variable $F(Y)$ has a standard uniform distribution (Gneiting and Katzfuss 2014), and we can check ensemble bias by comparing its mean to 1/2 and ensemble dispersion by comparing its variance to 1/12. This approach is applied to a ranked ensemble forecast using the discrete random variable *Z* given by the normalized rank of the observation within the ensemble. Subject to calibration, *Z* has a discrete standard uniform distribution with mean 1/2 and, after normalization, a variance of 1. Deviations of the mean from 0.5 indicate positive or negative bias, and deviations of the normalized variance from 1.0 indicate over- or underdispersion.

Another tool used to assess calibration is the entropy of the rank histogram. For a calibrated system the entropy is maximum and equals 1. This index allows ranking forecasts according to their overall reliability. Tribus (1969) showed that the entropy is an indicator of reliability and of the Bayes factor linked to the Bayesian psi-test. It is also a proper measure of reliability used in the divergence score described in Weijs et al. (2010) and Roulston and Smith (2002).
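These three reliability diagnostics can be computed together from the observation ranks; the sketch below uses our own function name, ignores ties between the observation and ensemble members, and normalizes the entropy by its maximum:

```python
import numpy as np

def rank_histogram_stats(obs, ens):
    """Bias, dispersion, and entropy diagnostics of the rank histogram.
    obs: (n,) observations; ens: (n, M) ensemble members.
    Ranks take values in {1, ..., M+1}."""
    n, M = ens.shape
    ranks = 1 + (ens < obs[:, None]).sum(axis=1)   # ties ignored in this sketch
    z = (ranks - 1) / M                            # maps ranks to [0, 1]
    mean = z.mean()                                # calibrated: ~0.5
    nvar = z.var() / ((M + 2) / (12.0 * M))        # calibrated: ~1.0
    p = np.bincount(ranks, minlength=M + 2)[1:] / n
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum() / np.log(M + 1)  # calibrated: ~1.0
    return mean, nvar, entropy
```

For a perfectly calibrated system the ranks are uniform over the M + 1 bins, so the three statistics converge to 0.5, 1.0, and 1.0 as the verification sample grows.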

These quantities are closely related to rank histograms (Hamill 2001), which are discrete versions of probability integral transform (PIT) histograms. To test the flatness of these histograms, Jolliffe and Primo (2008) proposed a test accounting for the slope and the shape of rank histograms. In a recent work, Zamo (2016) extended this idea to account for the presence of waves in histograms, as seen in Scheuerer and Hamill (2015) and Taillardat et al. (2016). A more complete test can thus be implemented that checks each histogram for flatness. Such a test is called the Jolliffe–Primo–Zamo (JPZ) test.

#### 2) Scoring rules

Following Gneiting et al. (2007), Gneiting and Raftery (2007), and Bröcker and Smith (2007), scoring rules assign numerical scores to probabilistic forecasts and form attractive summary measures of predictive performance, since they address calibration and sharpness simultaneously. These scores are generally negatively oriented, and we wish to minimize them. A *proper* scoring rule is designed such that the expected value of the score is minimized by the perfect forecast (i.e., when the observation is drawn from the same distribution as the predictive distribution). The continuous ranked probability score (CRPS; Matheson and Winkler 1976; Hersbach 2000) is defined directly in terms of the predictive cdf *F* as

$$ \mathrm{CRPS}(F, y) = \int_{-\infty}^{+\infty} \left[ F(z) - \mathbb{1}\{ z \ge y \} \right]^2 \, dz. $$

All CRPS values have been computed following the recommendations of Zamo (2016, chapter 3). One can also define the skill score in terms of the CRPS between an ensemble prediction system A and a baseline B as $\mathrm{CRPSS} = 1 - \mathrm{CRPS}_A / \mathrm{CRPS}_B$. The CRPSS is positive if and only if system A is better than system B for the CRPS scoring rule.
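For an ensemble forecast, the CRPS reduces to the well-known kernel form $E|X - y| - \tfrac{1}{2} E|X - X'|$. The sketch below implements this naive estimator together with the skill score; note that Zamo (2016) recommends estimators correcting the finite-ensemble bias of this version, which we do not reproduce here:

```python
import numpy as np

def crps_ensemble(members, y):
    """Kernel form of the CRPS for an ensemble: E|X - y| - 0.5 * E|X - X'|."""
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))  # all member pairs
    return term1 - term2

def crpss(crps_a, crps_b):
    """Skill of system A against baseline B; positive iff A beats B."""
    return 1.0 - crps_a / crps_b
```

Averaging `crps_ensemble` over all forecast cases and plugging the averages into `crpss` with the raw ensemble as baseline gives the skill scores reported in Table 5.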

#### 3) Zooming on extremes

Finding a way to assess the quality of ensemble forecasts for extreme and rare events is quite difficult, as seen in Williams et al. (2014) in a comparison of ensemble calibration methods for extreme events. Weighted scoring rules can be adopted, as done in Gneiting and Ranjan (2011) and Lerch et al. (2017), but there are two main issues. First, the ranking of the compared methods depends on the weight function used, as already noted in Gneiting and Ranjan (2011). Second, concentrating the weight on such rare events removes the discriminating power of scoring rules; the same issue is encountered with the Brier score (Brier 1950) for high thresholds. Moreover, reliability cannot be soundly measured here since there are, by definition, not enough extreme cases.

We have finally decided to focus on two questions matching forecasters' needs: First, what is the discriminating power of our forecasts for extreme events in terms of binary decisions? Second, what is the risk of missing an extreme event? ROC curves answer these questions. They are given for a specified rain event: consider a fixed threshold *s* and the contingency table associated with the binary forecast of exceeding *s*. Recall that the ROC curve then plots the probability of detection (or hit rate) as a function of the probability of false detection (or false alarm rate). A "good" prediction must maximize hit rates and minimize false alarms (see, e.g., Jolliffe and Stephenson 2012). We prefer to pay attention to the *value* of a forecast more than to its *quality*. According to Murphy (1993), the *value* can be defined as the ability of the forecast to help users make better decisions. The *quality* of a forecast can be summarized by the area under the *modeled* ROC curve (classically denoted AUC), with some potential drawbacks exhibited by Lobo et al. (2008) and Hand (2009). Zhu et al. (2002) made a link between optimal decision thresholds, value, and cost/loss ratios.
In particular, they showed that the value of a forecast is maximized for the "climatological" threshold and then equals the hit rate minus the false alarm rate, which is the maximum of the Peirce skill score (Manzato 2007). The Peirce skill score is also known as the true skill statistic or the Hanssen and Kuipers discriminant. This quantity corresponds to the upper-left corner of ROC curves, and it is therefore used here to assess forecast value.
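The maximum Peirce skill score can be sketched as follows, scanning decision thresholds on the forecast probability of the event; the function names and the threshold grid are ours:

```python
import numpy as np

def peirce_skill_score(prob_forecast, occurred, p_threshold):
    """Hit rate minus false alarm rate for the decision 'warn if prob >= threshold'."""
    warn = np.asarray(prob_forecast) >= p_threshold
    occ = np.asarray(occurred).astype(bool)
    hits = (warn & occ).sum()
    misses = (~warn & occ).sum()
    false_alarms = (warn & ~occ).sum()
    correct_negatives = (~warn & ~occ).sum()
    hit_rate = hits / max(hits + misses, 1)
    false_alarm_rate = false_alarms / max(false_alarms + correct_negatives, 1)
    return hit_rate - false_alarm_rate

def max_pss(prob_forecast, occurred, thresholds=np.linspace(0.05, 0.95, 19)):
    """Forecast value: maximum Peirce skill score over decision thresholds,
    i.e., the point closest to the upper-left corner of the ROC curve."""
    return max(peirce_skill_score(prob_forecast, occurred, t) for t in thresholds)
```

A perfectly discriminating forecast reaches a maximum of 1, while a forecast with no discriminating power stays at 0, which is how the methods are ranked for heavy rainfall in section 6.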

### d. Framework

Verification has been performed over the entire forecast period. For a fair comparison, each method has to be tuned optimally. EMOS uses all the data available for each day (the 4 years minus the forecast day as the training period). The same strategy is used to fit the analog ensemble. This leave-one-out cross validation could have led to overfitting, but the autocorrelations of the observed rainfall intensities are negligible (sample autocorrelation functions available upon request). QRF and GF employ a leave-one-month-out cross validation: each month of the 4 years is kept in turn as validation data while the rest of the 4 years is used for learning. The same daily validation strategy as for EMOS and the analog ensemble has also been tried but leads to insignificant improvements in overall performance. For each QRF/GF predicted cdf, a sample is drawn from this distribution to fit the EGP for the tail extension methods. The tuning algorithm for EMOS is stopped after a few iterations in order to avoid overfitting, as suggested in Scheuerer (2014) concerning the parameter estimation. Table 3 sums up the optimal estimation strategies (with respect to CRPS averages on the verification period) that we have found for each distribution.

A total of 12 methods are competing, as sketched in Table 2: the raw ensemble, 4 analog ensembles, 3 EMOS, 2 forest-based methods (1 QRF and 1 GF), and 2 tail-extended forest-based methods (1 QRF and 1 GF). A brief example of the effects of calibration on a single case is given in Fig. 4. The scores used concern (i) reliability performance, measured by the mean, the normalized variance, and the entropy of the rank histograms; (ii) global performance (calibration and sharpness), measured by the CRPS; (iii) the gain in CRPS compared to the raw ensemble, measured by the CRPS skill score using the raw ensemble as the baseline; and (iv) the maximum of the Peirce skill score among rainfall intensities.

## 6. Results

For the assessment of reliability, Table 5 provides measures of bias, dispersion, and flatness of the rank histograms. Moreover, Fig. 5 exhibits the rank histograms over the 86 stations and the compliance of each method with the JPZ test. The raw ensemble is clearly biased and underdispersed. The majority of the methods show noteworthy reliability. The tail extension methods are slightly overdispersed, which is confirmed both by the dome-shaped rank histograms and by the measures of variance and entropy of the rank histograms. This slight overdispersion may be due to the method tending to spread out the QRF and GF outputs, the latter being themselves well dispersed. EMOS_EGP still shows bias and underdispersion, despite its potential ability to model both low and high rainfall intensities. A possible explanation is that adding parameters to estimate [especially in (4)] makes an efficient fit more difficult to obtain. This phenomenon can also be linked to the fact that a reduced set of predictors is optimal for EMOS: the fewer coefficients one has to estimate, the better one fits the distribution.

The global performance of postprocessing is illustrated by the averages of the CRPS and CRPSS in Table 5. Regarding the significance of the results, note that the maximal estimation error of the CRPS is under 6.1 × 10^{−3}. The four analog ensembles show a rather poor CRPS, even though they exhibit reliability. Nevertheless, we can notice that weighting the predictors, especially with a nonlinear variable selection algorithm (Analogs_VSF), improves this method. The EMOS postprocessed ensembles share with QRF and GF an improved CRPS. The tail-extended methods improve the CRPS further, which can be explained by an improved skill for extreme events.

To summarize, according to the CRPSS, the forest-based methods yield the best improvement with respect to the raw ensemble. This improvement is, however, smaller (around 10% in this study) than for other weather variables (see Taillardat et al. 2016), where postprocessed ensembles improved on the raw ensemble by 70% for surface temperature and 50% for wind speed in terms of CRPSS. This corroborates the conclusion of Hemri et al. (2014) that rainfall amounts are challenging to calibrate. Compared to EMOS, the GF and hybrid tail methods show an improvement in CRPSS of about 2%, less than in Taillardat et al. (2016). One should not forget that the CRPS is influenced by all the cases where rain is neither observed nor predicted (the no-rain frequency is 77% in our dataset), which artificially tightens the scores. The analog ensemble appears less effective, which may be attributable to the data depth of only 4 years in our study. Indeed, this nonparametric technique is data driven (as are QRF and GF) and needs more data to be effective (see, e.g., Van den Dool 1994).
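For completeness, the CRPS of an m-member ensemble and the corresponding skill score can be sketched with the standard energy-form estimator E|X − y| − ½E|X − X′| (function names are ours):

```python
import numpy as np

def crps_ensemble(members, y):
    """CRPS of one m-member ensemble against the observation y,
    using the energy-form estimator E|X - y| - 0.5 E|X - X'|."""
    members = np.asarray(members, dtype=float)
    term1 = np.abs(members - y).mean()
    # mean absolute difference over all member pairs (i, j)
    term2 = np.abs(members[:, None] - members[None, :]).mean()
    return term1 - 0.5 * term2

def crpss(crps_method, crps_baseline):
    """CRPS skill score: 1 means a perfect forecast, 0 no gain
    over the baseline (here, the raw ensemble)."""
    return 1.0 - crps_method / crps_baseline
```

For a one-member (deterministic) forecast the CRPS reduces to the absolute error, and an ensemble collapsed onto the observation scores 0.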

Concerning heavier intensities, Fig. 6 shows the benefit of the tail extension for the forest-based methods. Several features visible in Fig. 7 also appear in Fig. 6: the analogs lack resolution (which is consistent with their poor CRPS despite reliability), while the other postprocessed methods compete favorably with the raw ensemble, even those that cannot extrapolate beyond observed values, such as QRF and GF. Figure 8 also sheds light on the value of ensemble predictions across rainfall intensities. The analog ensemble methods suffer from poor resolution and thus have poor skill. Note that above the threshold of 15 mm (6 h)^{−1}, the methods that parameterize heavy tails (EMOS_GEV, EMOS_EGP, QRF EGP TAIL, and GF EGP TAIL) prevail. The tail extension methods thus show their benefit in this context of binary decision making.
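The binary-decision diagnostic in (iv), the Peirce skill score maximized over rainfall intensities, can be sketched as follows; the ensemble exceedance frequency serves as the forecast probability, the probability trigger `p` is a free choice, and all names are ours:

```python
import numpy as np

def peirce_skill_score(prob_forecast, obs_event, p=0.5):
    """Peirce skill score H - F for a binary event, issuing a 'yes'
    when the forecast probability reaches the trigger p.
    (Assumes both events and nonevents occur in the sample.)"""
    pred = prob_forecast >= p
    hits = np.sum(pred & obs_event)
    misses = np.sum(~pred & obs_event)
    false_alarms = np.sum(pred & ~obs_event)
    correct_neg = np.sum(~pred & ~obs_event)
    hit_rate = hits / (hits + misses)
    false_alarm_rate = false_alarms / (false_alarms + correct_neg)
    return hit_rate - false_alarm_rate

def pss_over_thresholds(ensembles, obs, thresholds, p=0.5):
    """PSS of the event {rain > t} for each intensity threshold t,
    taking the fraction of members above t as the forecast probability."""
    scores = []
    for t in thresholds:
        prob = (ensembles > t).mean(axis=1)
        scores.append(peirce_skill_score(prob, obs > t, p))
    return np.array(scores)
```

A perfect ensemble scores 1 at every threshold, while a forecast with no resolution scores 0, which is the behavior reported for the analogs at high intensities.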

## 7. Discussion

Throughout this study, we have seen that forest-based techniques compete favorably with EMOS techniques. It is encouraging that, relative to EMOS, QRF and GF reach nearly the same kind of improvement for rainfall amounts as for temperature and wind speed [see Taillardat et al. (2016), their Figs. 6 and 13]. It would be interesting to test these methods (especially GF) on smoother variables. One should remember that the scores are influenced by the high proportion of well-forecast dry events. Thus, the difficulty of calibrating rainfall is as much a matter of score behavior as of the predictability of the variable itself.

Inspired by the paradigm of Gneiting et al. (2007), *maximizing the sharpness of the predictive distributions subject to calibration*, the leitmotif of this paper could be summarized as *maximizing the value of predictive distributions for high-impact events subject to a satisfying overall performance*. In this context, the tail extension of QRF and GF generates ensembles better tailored to catching higher rainfall intensities. In addition, reliability as well as resolution remain quite stable when extending the tail, so the paradigm introduced above is preserved. The tail extension can be viewed as a semiparametric technique in which the output of the forest-based methods is used to fit a distribution. This kind of procedure can be connected to the work of Junk et al. (2015), who use analogs on EMOS inputs. An interesting prospect would be to bring forest-based methods into this context.
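To make the semiparametric idea concrete, here is a hedged sketch in which the upper quantiles produced by QRF or GF are replaced by a fitted heavy tail. The paper fits an EGP distribution; this sketch uses a plain generalized Pareto tail and a simple shape-grid least-squares fit, so the fitting scheme and all names are ours, not the authors' implementation:

```python
import numpy as np

def gpd_quantile(p, u, sigma, xi):
    """Quantile function of a GPD with location u, for exceedance level p."""
    if abs(xi) < 1e-8:
        return u + sigma * (-np.log1p(-p))
    return u + sigma / xi * ((1.0 - p) ** (-xi) - 1.0)

def fit_tail(forest_quantiles, probs, p_u=0.9):
    """Fit a GPD tail to the forest-predicted upper quantiles.

    forest_quantiles: quantiles Q(p) predicted by QRF/GF at levels `probs`
    p_u: probability level above which the parametric tail takes over
    Returns (u, sigma, xi) of the fitted tail.
    """
    probs = np.asarray(probs, dtype=float)
    q = np.asarray(forest_quantiles, dtype=float)
    u = np.interp(p_u, probs, q)          # tail threshold from the forest
    mask = probs > p_u
    pc = (probs[mask] - p_u) / (1.0 - p_u)  # conditional levels above u
    target = q[mask]
    best = None
    for xi in np.linspace(-0.3, 0.9, 121):      # grid over the shape
        basis = gpd_quantile(pc, 0.0, 1.0, xi)  # unit-scale tail quantiles
        sigma = basis @ (target - u) / (basis @ basis)  # LS scale
        err = np.sum((u + sigma * basis - target) ** 2)
        if best is None or err < best[0]:
            best = (err, u, sigma, xi)
    return best[1:]
```

Below the level `p_u` the forest quantiles are kept as is; above it, `gpd_quantile` with the fitted parameters supplies quantiles that can extrapolate beyond the largest observed rainfall, which QRF and GF alone cannot do.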

One of the advantages of distribution-free calibration (analog ensemble, QRF, and GF) is that no assumption is made about the variables to calibrate. This point is especially relevant for rainfall amounts, for which EMOS techniques have to be studied using different distributions, inference schemes, and sets of predictors. In this sense, the recent mixing method of Baran and Lerch (2016) looks appealing. An alternative solution may consist of calibrating (standardized) anomalies of precipitation rather than precipitation itself. Such an idea is investigated in Dabernig et al. (2017).

Another positive aspect of the forest-based methods is that no predictor selection step is needed. Concerning the analog ensemble, our results suggest that the approach of Genuer et al. (2010) could be a cheaper alternative to brute-force algorithms such as that of Keller et al. (2017) for weighting the predictors. For analog techniques, the complete set of predictors gives the best results. In contrast, the choice of the set of predictors is still an open issue for EMOS techniques applied to precipitation. For variables that are easier to calibrate, Messner et al. (2017) show that variable selection can be effective for the parametric postprocessing of temperature.

A natural perspective regarding spatial calibration and trajectory recovery could be to make use of block regression techniques, as done in Zamo et al. (2016), or of ensemble copula coupling and the Schaake shuffle, as suggested by Bremnes (2007), Schefzik (2016), and van Straaten et al. (2018). For further studies on forest-based methods, it would be desirable to work with radar precipitation intensities: this study makes use of rain gauge data, which limits the range of (heavy) rainfall captured, and further studies and verification could be done within an observation-gridded calibration framework. Finally, more and more weather services are working on merging forecasts from different sources (multimodel ensembles). In this context, an attractive procedure could be to combine raw ensembles and different postprocessing methods via sequential aggregation (Mallet 2010; Thorey et al. 2017) in order to further improve the forecasts.

## Acknowledgments

Part of the work of P. Naveau has been supported by the ANR-DADA, LEFE-INSU-Multirisk, AMERISKA, A2C2, CHAVANA and Extremoscope projects. This work has been supported by Energy oriented Centre of Excellence (EoCoE), Grant Agreement 676629, funded within the Horizon2020 framework of the European Union. This work has been supported by the LABEX MILYON (ANR-10-LABX-0070) of Université de Lyon, within the program “Investissements d’Avenir” (ANR-11-IDEX-0007) operated by the French National Research Agency (ANR). Thanks to Julie Tibshirani, Susan Athey, and Stefan Wager for providing gradient-forest source package. Funding information: LABEX MILYON, Investissements d’Avenir and DADA, ANR, Grants ANR-10-LABX-0070, ANR-11-IDEX-0007, and ANR-13-JS06-0007; EoCoE, Horizon2020, Grant 676629; A2C2, ERC, Grant 338965.

### APPENDIX

#### Variable Selection for EMOS and Analog Ensemble Weighting

Most of the parameters in EMOS and the distance used in the analogs can be inferred using different sets of predictors. Contrary to the QRF and GF methods, where adding a useless predictor does not usually affect predictive performance (under several constraints on the number of informative predictors), EMOS and the analogs are more sensitive to it. We have therefore investigated methods that keep the most informative meteorological variables and guarantee the best predictive performance.

A natural choice is to consider the classical Akaike information criterion and the Bayesian information criterion (Schwarz 1978; Akaike 1998), but the resulting selection kept too many predictors from the initial set. This is an issue for EMOS methods. For example, the location parameter of the EMOS distributions is a linear fit of the predictors, and this fit should not rely on many predictors if the estimation is to remain a well-conditioned problem. This critical aspect of numerical analysis is probably why Scheuerer (2014) proposes to stop the coefficient estimation after a few iterations. Moreover, the predictors tend to be highly correlated and thus suffer from collinearity. Finally, the set of predictors that shows the best performance for EMOS is detailed in Table 1 and confirms the choices made by previous studies on rainfall postprocessing.

The algorithm of Genuer et al. (2010) has also been considered. It is appealing because it uses random forests and retains predictors without redundancy of information: for example, it can eliminate correlated predictors even if they are informative. A reduced set of predictors is generally obtained, which avoids the misestimation generated by multicollinearity. The variable selection method used here is one among many; readers interested in variable selection with random forests can refer to Genuer et al. (2010) for detailed explanations. More precisely, the variable selection algorithm keeps the leading predictors (a maximum of 4 of them), which form the set of predictors for each location. Figure A1 shows the ranked frequency of each chosen predictor, as used for the Analogs_VSF method; predictors that are never retained are omitted from the figure. Only one-third of the whole set of predictors is retained in at least 10% of the cases. Moreover, predictors representing central and extreme tendencies are preferred. Some predictors unrelated to rainfall amounts also appear, such as CAPE, FX, or HU (Table 4), which is not surprising since these parameters are correlated with storms. Although not shown here, when the MEAN variable is not selected, either MED or CTRL takes its place in the set, which indicates that the algorithm mostly avoids potential correlations among predictors. The results of the variable selection algorithm therefore seem sound. Last but not least, the predictors of Table 1 are often chosen, which confirms both the robustness of the algorithm and the relevance of previous studies on the choice of predictors for precipitation.
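The criterion behind such forest-based selection is permutation importance: a predictor matters if shuffling its column degrades predictive accuracy. Below is a model-agnostic numpy sketch of that criterion; the forest is abstracted behind a generic `predict` callable, and all names are ours (the test uses an illustrative linear stand-in, not a forest):

```python
import numpy as np

def permutation_importance(predict, X, y, rng, n_repeats=5):
    """Importance of each predictor, measured as the mean increase in
    MSE when that column is randomly permuted (the criterion behind
    forest-based selection schemes such as Genuer et al.'s)."""
    base = np.mean((predict(X) - y) ** 2)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break column j only
            imp[j] += np.mean((predict(Xp) - y) ** 2) - base
    return imp / n_repeats

def select_predictors(predict, X, y, rng, max_keep=4):
    """Keep the (at most) max_keep most important predictors, mirroring
    the per-location cap of 4 used for Analogs_VSF."""
    imp = permutation_importance(predict, X, y, rng)
    order = np.argsort(imp)[::-1]
    return [j for j in order[:max_keep] if imp[j] > 0]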

## REFERENCES

Akaike, H., 1998: Information theory and an extension of the maximum likelihood principle. *Selected Papers of Hirotugu Akaike*, E. Parzen, K. Tanabe, and G. Kitagawa, Eds., Springer, 199–213.

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone, 1984: *Classification and Regression Trees*. CRC Press, 369 pp.

de Haan, L., and A. Ferreira, 2006: *Extreme Value Theory: An Introduction*. Springer, 418 pp.

*Statistical Postprocessing of Ensemble Forecasts*, S. Vannitsem, D. Wilks, and J. Messner, Eds., Elsevier, 187–217.

Jolliffe, I. T., and D. B. Stephenson, Eds., 2012: *Forecast Verification: A Practitioner’s Guide in Atmospheric Science*. John Wiley & Sons, 543 pp.

Tribus, M., 1969: *Rational Descriptions, Decisions and Designs*. Pergamon Press, 478 pp.

## Footnotes

© 2019 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).