1. Introduction
Accurately forecasting weather is paramount for a wide range of end-users (e.g., air traffic controllers, emergency managers, and energy providers) (see, e.g., Pinson et al. 2007; Zamo et al. 2014). In meteorology, ensemble forecasts aim to quantify forecast uncertainty arising from observation errors and from the incomplete physical representation of the atmosphere. Despite recent developments in national meteorological services, ensemble forecasts still suffer from bias and underdispersion (see, e.g., Hamill and Colucci 1997; Hagedorn et al. 2012; Baran and Lerch 2018). Consequently, they need to be postprocessed in order to alleviate bias and misdispersion (Hamill 2019). Moreover, not all meteorological variables are equally easy to forecast and calibrate. In particular, Hemri et al. (2014) highlighted that rainfall remains among the most difficult variables to forecast skillfully.
a. Existing postprocessing methods and their rainfall-specific applications
At least two types of statistical methods have emerged in the last decades: analog ensemble (e.g., Delle Monache et al. 2013; Junk et al. 2015; Keller et al. 2017; Hamill and Whitaker 2006; Alessandrini et al. 2015; Eckel and Delle Monache 2016) and ensemble model output statistics (EMOS; see, e.g., Wilks 2015; Scheuerer and König 2014; Scheuerer and Möller 2015; Gneiting et al. 2005). The analog ensemble is fully nonparametric and consists of finding similar atmospheric situations in the past and using them to improve the present forecast. In contrast, EMOS belongs to the family of parametric regression schemes. If y represents the weather variable of interest and
Less conventional approaches have also been studied recently. For example, Van Schaeybroeck and Vannitsem (2015) investigated member-by-member postprocessing techniques, and Taillardat et al. (2016) found that quantile regression forests (QRF) techniques performed well for temperature and wind speed forecasting.
Modeling precipitation distributions is a challenge in itself. Precipitation is a mixture of zeros (dry events) and positive intensities (i.e., rainfall amounts for wet events), the latter having a skewed distribution. For daily precipitation, extended logistic regression is frequently applied (see, e.g., Hamill et al. 2008; Roulin and Vannitsem 2012; Ben Bouallègue 2013). Bayesian model averaging techniques (Raftery et al. 2005; Sloughter et al. 2007) have also been used in rainfall forecasting, with a gamma fit often applied to cube-root-transformed precipitation accumulations. As for the analog and EMOS techniques, they have been applied to calibrate daily rainfall (see, e.g., Hamill and Whitaker 2006; Scheuerer 2014; Scheuerer and Hamill 2015).
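To make this mixture structure concrete, the sketch below (not one of the methods compared in this paper) estimates the dry probability empirically and fits a gamma distribution to the positive accumulations; the sample values are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 6-h rainfall accumulations (mm), mixing dry and wet cases.
rr6 = np.array([0.0, 0.0, 0.2, 0.0, 1.4, 0.0, 3.6, 0.0, 0.0, 7.9, 0.6, 0.0, 12.3, 0.0])

p_dry = np.mean(rr6 == 0.0)        # point mass at zero (dry events)
wet = rr6[rr6 > 0.0]               # positive intensities (wet events), skewed

# Fit a gamma distribution to the wet amounts (location fixed at 0).
shape, _, scale = stats.gamma.fit(wet, floc=0.0)

# Probability of exceeding 5 mm in 6 h under this simple mixture model.
p_exceed_5mm = (1.0 - p_dry) * stats.gamma.sf(5.0, shape, scale=scale)
print(p_dry, shape, scale, p_exceed_5mm)
```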





Subset representing the most classical predictors.








b. Forest-based methods in numerical weather prediction
To our knowledge, the first application of random forest algorithms (Breiman 2001) in the area of numerical weather prediction occurred in 2009. Gagne et al. (2009) used this technique for the classification of convective areas and, later on, to produce hail occurrence forecasts (Gagne et al. 2017). Recently, Herman and Schumacher (2018) applied random forests to probabilistic forecasts of extreme precipitation. Zamo et al. (2014) used both random forests and QRF to forecast photovoltaic electricity production from meteorological variables. In a postprocessing context, Taillardat et al. (2016) showed that QRF performs better than EMOS for the postprocessing of temperature and wind speed. One may wonder whether QRF could favorably compete with EMOS and the analog ensemble for rainfall calibration. This question is particularly relevant since methodological advances have recently been made concerning random forests and quantile regression. In particular, Athey et al. (2016) proposed an innovative way of using forests for quantile regression, called gradient forests (GF).
c. Limits of forest-based methods, and a semiparametric procedure for rainfall calibration
As pointed out in Taillardat et al. (2016) and Rasp and Lerch (2018), one drawback of data-driven approaches like QRF and GF is that their intrinsically nonparametric nature makes them unable to predict beyond the largest recorded rainfall. To circumvent this limit, we propose here to combine random forest techniques with a parametric distribution.





Hence, enabling local random-forest-based postprocessing techniques to target extreme events provides an interesting path toward improving prediction beyond the largest values of the sample at hand.
d. Main contributions of this paper and outline
In this study, we focus on 6-h rainfall amounts in France (denoted hereafter by “rr6”) because this is the accumulation period of interest for the ensemble forecast system of Météo-France. We propose to implement and test the QRF and GF methods (and their semiparametric extensions) for stationwise rainfall calibration and to compare them with other approaches such as EMOS and the analog ensemble. The EMOS approach is implemented with three different parametric pdfs, respectively defined by (1), (2), and (4). Besides comparing these three EMOS models, it is natural to wonder whether other nonparametric approaches such as the analog ensemble are competitive.
This article is organized as follows. In section 2, we recall the basic ingredients needed to build quantile regression forests and gradient forests, as well as the calibration process recently introduced by Athey et al. (2016) for quantile regression through GF. The way the trees are combined with the EGP pdf defined by (4) is then detailed. In section 3, the state-of-the-art EMOS procedures are sketched, and we explain how the EGP pdf is integrated within an EMOS scheme. Section 4 presents the analog ensemble technique. The different approaches are summarized in Table 2 and implemented in section 5. Therein, the test bed dataset of 86 French weather stations and the ensemble forecast system of Météo-France, called Prévision d’Ensemble ARPEGE (PEARP; Descamps et al. 2015), are described, and the verification tools used in this study are presented. Then, we assess and compare each method, with a special interest in heavy rainfall (section 6). The paper closes with a discussion in section 7.
Postprocessing methods involved in this study.


2. Quantile regression forests, gradient forests, and semiparametric tail extension
a. Quantile regression forests
Given a sample of predictors–response pairs, say




















Two-dimension example of a (top) binary regression tree and (bottom) five-tree forest. A binary decision tree is built from a bootstrap sample of the data at hand. Successive dichotomies (lines splitting the plane) are made according to a criterion based on observations’ homogeneity. For a new set of predictors (the blue cross), the path leading to the corresponding observations is followed. The predicted CDF is the aggregation of the result of each tree.
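The caption above summarizes the QRF mechanism. As a minimal sketch of the Meinshausen (2006) idea (not the exact implementation used in this study), the code below fits a standard random forest with scikit-learn, weights the training observations that share leaves with a new predictor vector, and reads quantiles from the resulting weighted empirical CDF; the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                                         # synthetic predictors
y = np.maximum(0.0, X[:, 0] + rng.gamma(2.0, 1.0, 500) - 2.0)         # skewed, zero-inflated response

forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=10, random_state=0)
forest.fit(X, y)

def qrf_quantiles(forest, X_train, y_train, x_new, probs):
    """Meinshausen-style quantiles: weight training points by leaf co-occurrence with x_new."""
    leaves_train = forest.apply(X_train)              # (n_train, n_trees) leaf indices
    leaves_new = forest.apply(x_new.reshape(1, -1))   # (1, n_trees)
    weights = np.zeros(len(y_train))
    for t in range(leaves_train.shape[1]):
        in_leaf = leaves_train[:, t] == leaves_new[0, t]
        weights[in_leaf] += 1.0 / in_leaf.sum()       # each tree spreads weight over its leaf
    weights /= leaves_train.shape[1]
    order = np.argsort(y_train)
    cdf = np.cumsum(weights[order])                   # weighted empirical CDF of the response
    return np.interp(probs, cdf, y_train[order])

print(qrf_quantiles(forest, X, y, X[0], probs=np.array([0.1, 0.5, 0.9])))
```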
b. Gradient forests














c. Extension of the QRF and GF methods




















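As a complement to this subsection, here is a minimal sketch of one way to extend a forest-based predictive distribution with an EGP tail: a sample standing in for draws from a QRF/GF predictive CDF is fitted by maximum likelihood, assuming the simplest extended GP family of Naveau et al. (2016), F(x) = [1 − (1 + ξx/σ)^(−1/ξ)]^κ. Whether this matches the exact form of (4) cannot be checked here, and probability-weighted moments is an alternative estimation method.

```python
import numpy as np
from scipy.optimize import minimize

def egp_neg_loglik(theta, x):
    """Negative log-likelihood of the simplest extended GP family of Naveau et al. (2016):
    F(x) = [1 - (1 + xi*x/sigma)^(-1/xi)]^kappa, for x > 0 and kappa, sigma, xi > 0."""
    kappa, sigma, xi = np.exp(theta)              # log-parameterization keeps parameters positive
    z = 1.0 + xi * x / sigma
    if np.any(z <= 0.0):
        return np.inf
    H = 1.0 - z ** (-1.0 / xi)                    # GPD cdf
    h = (1.0 / sigma) * z ** (-1.0 / xi - 1.0)    # GPD pdf
    return -np.sum(np.log(kappa) + (kappa - 1.0) * np.log(H) + np.log(h))

# Hypothetical positive sample standing in for draws from a QRF/GF predictive CDF (wet part only).
rng = np.random.default_rng(1)
sample = rng.gamma(shape=0.8, scale=3.0, size=400)

fit = minimize(egp_neg_loglik, x0=np.log([1.0, 1.0, 0.2]), args=(sample,), method="Nelder-Mead")
kappa, sigma, xi = np.exp(fit.x)

def egp_quantile(p, kappa, sigma, xi):
    """Quantile function of the fitted EGP, used to extend the forest-based tail."""
    return (sigma / xi) * ((1.0 - p ** (1.0 / kappa)) ** (-xi) - 1.0)

print(kappa, sigma, xi)
print(egp_quantile(np.array([0.9, 0.99, 0.999]), kappa, sigma, xi))
```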
3. Ensemble model output statistics and EGP
So far, three parametric pdfs have been highlighted. We propose to test an EMOS scheme for each distribution. The settings of these methods are presented for the CSG and CGEV distributions. To the best of our knowledge, this is the first time that the EGP distribution is tested in an EMOS scheme. Table 3 summarizes the regression models that perform best (and thus are used) for each distribution. Note that different variable selection algorithms have been tried but did not improve the performance of these EMOS methods; see the appendix for further details.
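To make the minimum-CRPS estimation referred to in Table 3 concrete, here is a minimal sketch for a censored shifted gamma EMOS model. The CRPS is approximated by Monte Carlo rather than by the closed form of Scheuerer and Hamill (2015), and the affine links between the gamma mean/standard deviation and the ensemble mean/MAD, as well as the data, are illustrative assumptions only.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 200
ens_mean = rng.gamma(1.2, 2.0, n)                                # hypothetical ensemble mean of rr6 (mm)
ens_mad = 0.5 + 0.5 * rng.random(n) * ens_mean                   # hypothetical ensemble MAD
obs = np.maximum(0.0, ens_mean + rng.normal(0.0, 1.5, n) - 1.0)  # hypothetical observations

u = rng.random((n, 100))                                         # common random numbers: smooth objective

def mean_crps_csg(theta):
    """Monte Carlo mean CRPS of a censored shifted gamma whose mean/sd are affine in ensemble stats."""
    a0, a1, b0, b1, delta = theta
    m = np.maximum(a0 + a1 * ens_mean, 1e-3)                     # mean of the underlying gamma
    s = np.maximum(b0 + b1 * ens_mad, 1e-3)                      # std dev of the underlying gamma
    shape, scale = (m / s) ** 2, s ** 2 / m
    # Draws from the forecast distribution: gamma, shifted by delta, left-censored at zero.
    x = np.maximum(stats.gamma.ppf(u, shape[:, None], scale=scale[:, None]) - delta, 0.0)
    term1 = np.mean(np.abs(x - obs[:, None]), axis=1)            # E|X - y|
    term2 = 0.5 * np.mean(np.abs(x - x[:, ::-1]), axis=1)        # 0.5 E|X - X'| (paired columns)
    return np.mean(term1 - term2)

fit = minimize(mean_crps_csg, x0=np.array([0.5, 1.0, 0.5, 1.0, 0.5]), method="Nelder-Mead")
print(fit.x)   # fitted coefficients (a0, a1, b0, b1, delta)
```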
Optimal strategies (with respect to CRPS averages on the verification period) for parameter estimation and choice of variables in the EMOS framework. MAD stands for mean absolute deviation of the raw ensemble. These findings corroborate previous studies on rainfall postprocessing.


a. Censored-shifted gamma and censored generalized extreme value distributions
Different EMOS models have been proposed for the CSG and CGEV pdfs, respectively, defined by (1) and (2), by regressing their parameters on the ensemble values. More precisely, Baran and Nemoda (2016) used the CSG pdf by letting the mean
b. Extended generalized Pareto distribution











Spatial values of ξ among locations. These parameters are computed using the 4 years of observations available. The estimation is made via the probability weighted moments method described in Papastathopoulos and Tawn (2013).
4. Analog ensemble













Two-dimension example of an analog ensemble method. For a new set of predictors (the blue cross), the closest analog observations according to a given distance are kept to get the predictive distribution.
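The caption above describes the principle. A minimal sketch, under assumed ingredients (standardized predictors, a weighted Euclidean distance, and a fixed number of analogs), could look as follows; all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
X_hist = rng.normal(size=(1000, 6))                                        # archived predictor vectors (standardized)
y_hist = np.maximum(0.0, X_hist[:, 0] + rng.gamma(2.0, 1.0, 1000) - 2.0)   # archived rr6 observations
x_new = rng.normal(size=6)                                                 # predictors of the day to calibrate

def analog_ensemble(x_new, X_hist, y_hist, n_analogs=35, weights=None):
    """Return the observations of the closest archived situations (weighted Euclidean distance)."""
    w = np.ones(X_hist.shape[1]) if weights is None else weights
    dist = np.sqrt(np.sum(w * (X_hist - x_new) ** 2, axis=1))
    idx = np.argsort(dist)[:n_analogs]
    return y_hist[idx]                                # predictive distribution = analog observations

members = analog_ensemble(x_new, X_hist, y_hist)
print(np.quantile(members, [0.1, 0.5, 0.9]))
```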
5. Application on the PEARP ensemble prediction system
a. Data description
Our rainfall dataset corresponds to 6-h rainfall amounts (“rr6”) observed at 86 French weather stations and forecast by the 35-member ensemble system PEARP (Descamps et al. 2015) at a 51-h lead time for the 1800 UTC initialization. PEARP is the global ensemble prediction system of Météo-France, producing forecasts up to 4.5 days ahead from 4 initialization times per day. The ensemble is based on a TL798C2.4 configuration and has 90 vertical levels from 14 m up to 1 hPa. It is hydrostatic and has a resolution of about 10 km over western Europe. Our period of interest spans the 4 years from 1 January 2012 to 31 December 2015 at the 86 French stations. The numerical model fields are interpolated to each station, and the rainfall observations come from quality-controlled rain gauges. In addition, about 70% of the stations are located in a marine climate (Köppen classification: Cfb); these stations are mostly represented with green dots in Fig. 2.
b. Sets of predictors used
The choice of predictors in Table 1 follows the studies of Hemri et al. (2014) and Scheuerer and Hamill (2015); this is a classical set of predictors for the postprocessing of rainfall amounts. The whole set of available predictors is listed in Table 4. It includes summary statistics and probabilities from the rainfall ensemble, but also deterministic forecasts of variables strongly linked with (potentially high) precipitation amounts, and statistics on other weather variables of the ensemble, in order to provide information about predictability. As for the EMOS schemes, sets of predictors obtained from variable selection algorithms have been tested; they are not used in this study since they degraded predictive performance. See the appendix for details.
Set of all available predictors.


c. Verification of ensembles
We recall here some facts about the scores used in this study.
1) Reliability
Reliability between observations and a predictive distribution can be checked by calculating
Another tool used to assess calibration is the entropy
These quantities are closely related to rank histograms (Hamill 2001), which are the discrete version of probability integral transform (PIT) histograms. Under the null hypothesis that these histograms are flat, Jolliffe and Primo (2008) proposed a test accounting for the slope and the shape of rank histograms. In a recent work, Zamo (2016) extended this idea to also account for the presence of waves in histograms, as seen in Scheuerer and Hamill (2015) and Taillardat et al. (2016). A more complete flatness test can thus be implemented for each histogram; we call it the Jolliffe–Primo–Zamo (JPZ) test.
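To fix ideas, the sketch below computes a rank histogram and the three reliability summaries used later (mean, normalized variance, and entropy of the ranks) on synthetic ensemble forecasts; the normalizations shown are common conventions and may differ slightly from those of the paper, and the JPZ flatness test itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(4)
n_cases, n_members = 2000, 35
ens = rng.gamma(1.1, 2.0, size=(n_cases, n_members))            # synthetic ensemble forecasts
obs = rng.gamma(1.1, 2.0, size=n_cases)                         # synthetic observations (here consistent)

# Rank of each observation within its ensemble (1 .. n_members + 1); ties are ignored here.
ranks = 1 + np.sum(ens < obs[:, None], axis=1)

n_bins = n_members + 1
freq = np.bincount(ranks - 1, minlength=n_bins) / n_cases       # rank histogram

mean_rank = ranks.mean() / ((n_bins + 1) / 2.0)                 # 1 for an unbiased ensemble
var_rank = ranks.var() / ((n_bins ** 2 - 1) / 12.0)             # 1 for a well-dispersed ensemble
entropy = -np.sum(freq[freq > 0] * np.log(freq[freq > 0])) / np.log(n_bins)   # 1 for a flat histogram

print(mean_rank, var_rank, entropy)
```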
2) Scoring rules



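As a complement to the scoring rules discussed in this subsection, the standard sample-based estimator of the continuous ranked probability score for an ensemble forecast (Gneiting and Raftery 2007) can be sketched as follows; the member values are hypothetical and this is the usual estimator, not necessarily the exact variant used in the study.

```python
import numpy as np

def crps_ensemble(members, y):
    """Sample-based CRPS estimator: E|X - y| - 0.5 E|X - X'| over the ensemble members."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - y))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

ens = np.array([0.0, 0.0, 0.4, 1.2, 2.5, 6.0])   # hypothetical rr6 members (mm)
print(crps_ensemble(ens, y=1.0))
```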
3) Zooming on extremes
Finding a way to assess the quality of ensemble forecasts for extreme and rare events is difficult, as seen in Williams et al. (2014) in a comparison of ensemble calibration methods for extreme events. Weighted scoring rules can be adopted, as done in Gneiting and Ranjan (2011) and Lerch et al. (2017), but there are two main issues. First, the ranking of the compared methods depends on the weight function used, as already noted in Gneiting and Ranjan (2011). Second, putting weight on such rare events reduces the discriminative power of scoring rules; the same issue is encountered with the Brier score (Brier 1950) for high thresholds. Moreover, reliability is not meaningful here since there are, by definition, not enough extreme cases to measure it. We have finally decided to focus on two questions matching forecasters’ needs: First, what is the discriminative power of our forecasts for extreme events in terms of binary decisions? Second, what is the risk of missing an extreme event? ROC curves answer these questions. They are given for a specified rain event. Consider a fixed threshold s and the contingency table associated with the predictor
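As an illustration of this binary-decision viewpoint, the sketch below derives hit rates and false alarm rates from ensemble exceedance probabilities for an assumed event threshold and computes the maximum Peirce skill score over decision thresholds (Manzato 2007); the data and the threshold are synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)
n_cases, n_members, s = 5000, 35, 15.0                       # s: rain event threshold, mm per 6 h
ens = rng.gamma(0.6, 4.0, size=(n_cases, n_members))
obs = rng.gamma(0.6, 4.0, size=n_cases)

event = obs > s                                              # observed event
prob = np.mean(ens > s, axis=1)                              # forecast probability of the event

hit_rates, false_alarm_rates = [], []
for p in np.unique(np.concatenate(([0.0], prob, [1.0]))):    # sweep decision thresholds
    warn = prob >= p
    hits = np.sum(warn & event)
    misses = np.sum(~warn & event)
    false_alarms = np.sum(warn & ~event)
    correct_neg = np.sum(~warn & ~event)
    hit_rates.append(hits / max(hits + misses, 1))
    false_alarm_rates.append(false_alarms / max(false_alarms + correct_neg, 1))

peirce = np.array(hit_rates) - np.array(false_alarm_rates)   # PSS = H - F for each decision rule
print(peirce.max())                                          # maximum Peirce skill score
```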
d. Framework
Verification has been performed over the entire forecast period. For a fair comparison, each method has to be tuned optimally. EMOS uses all the data available for each day (the 4 years minus the forecast day as a training period), and the same strategy is used to fit the analog ensemble. This leave-one-out cross validation could have led to overfitting, but the autocorrelations of the observed rainfall intensities are negligible (sample autocorrelation functions available upon request). QRF and GF employ a leave-one-month-out cross validation: each month of the 4 years is kept in turn as validation data while the rest of the 4 years is used for learning. The same daily validation strategy as for EMOS and the analog ensemble has also been tested but leads to insignificant improvements in overall performance. For each QRF/GF predicted cdf, a sample is drawn from the predicted distribution to fit the EGP for the tail-extension methods. The tuning algorithm for EMOS is stopped after a few iterations in order to avoid overfitting, as suggested in Scheuerer (2014) concerning the parameter estimation. Table 3 sums up the optimal estimation strategies (with respect to CRPS averages on the verification period) that we have found for each distribution.
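For concreteness, here is a minimal sketch of the leave-one-month-out splitting used for QRF and GF, assuming the forecast dates are held in a pandas DatetimeIndex (all names are hypothetical):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2012-01-01", "2015-12-31", freq="D")    # 4-yr verification period
months = dates.to_period("M")

for held_out in months.unique():
    test_idx = np.where(months == held_out)[0]                 # one month kept for validation
    train_idx = np.where(months != held_out)[0]                # the rest of the 4 years for learning
    # fit the forest on train_idx and predict on test_idx here
    print(held_out, len(train_idx), len(test_idx))
```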
A total of 12 methods are compared, as sketched in Table 2: the raw ensemble, 4 analog ensembles, 3 EMOS, 2 forest-based methods (1 QRF and 1 GF), and 2 tail-extended forest-based methods (1 QRF and 1 GF). A brief example of the effects of calibration on a single case is given in Fig. 4. The scores used concern (i) reliability performance, measured by the mean, the normalized variance, and the entropy of the rank histograms, denoted by

Effects of the calibration on a single forecast. Only one method of each family is shown for the clarity of the figure. The densities of nonparametric methods are represented using Gaussian kernels. The analog ensemble shows some wiggles. The EMOS_GEV fits the corresponding distribution. The bulk of the QRF EGP TAIL density is noticeable and leads to a heavier distribution tail.
6. Results
For the assessment of reliability, Table 5 provides measures of bias, dispersion, and flatness of the rank histograms. Moreover, Fig. 5 exhibits the rank histograms at the 86 stations and the compliance of each method with the JPZ test. The raw ensemble is clearly biased and underdispersed. The majority of the methods show noteworthy reliability. The tail-extension methods are slightly overdispersed, which is confirmed both by the dome-shaped rank histograms and by the measures of variance and entropy of the rank histograms. This slight overdispersion may be due to the tail-extension step, which tends to spread out the QRF and GF outputs, the latter being already well dispersed. EMOS_EGP still shows bias and underdispersion, despite its potential ability to model both low and high rainfall intensities. A possible explanation could be that adding parameters to estimate [especially
Comparing performance statistics for different postprocessing methods for 6-h rainfall forecasts in France. The mean CRPS estimations come from bootstrap replicates, and the estimation error is less than



Boxplots of rank histograms for each technique according to the locations. The proportion of rank histograms for which the JPZ test does not reject the flatness hypothesis is also provided. The results confirm Table 5.
The global performance of postprocessing is illustrated by the averages of CRPS and CRPSS in Table 5. Regarding the significance of the results, note that the maximal estimation error of the CRPS is under 6.1 × 10−3. The four analog ensembles show a rather poor CRPS, even though they exhibit reliability. Nevertheless, we notice that weighting the predictors, especially with a nonlinear variable selection algorithm (Analogs_VSF), improves this method. The EMOS postprocessed ensembles share with QRF and GF an improved CRPS. The tail-extended methods improve the CRPS further, which can be explained by an improved skill for extreme events.
To summarize, the best improvement with respect to the raw ensemble is obtained by the forest-based methods, according to the CRPSS. This improvement is, however, smaller (around 10% in this study) than for other weather variables (see Taillardat et al. 2016), where the postprocessed ensembles improved on the raw ensemble by 70% for surface temperature and 50% for wind speed according to the CRPSS. This corroborates the conclusion of Hemri et al. (2014) that rainfall amounts are challenging to calibrate. Compared to EMOS, GF and the hybrid tail methods improve the CRPSS by about 2%, which is less than in Taillardat et al. (2016). One should not forget here that the CRPS is influenced by all the cases where rain is neither observed nor predicted (the no-rain frequency is 77% in our dataset); this leads to scores that are artificially tightened. The analog ensemble appears less skillful, which may be attributable to the data depth of only 4 years in our study. Indeed, this nonparametric technique is data driven (as are QRF and GF) and needs more data to be effective (see, e.g., Van den Dool 1994).
Concerning heavier intensities, Fig. 6 shows the benefit of the tail extension for forest-based methods. Several features seen in Fig. 7 are also visible in Fig. 6: the analogs lack resolution (which is consistent with their poor CRPS despite reliability), and the other postprocessed methods compete favorably with the raw ensemble, even those that cannot extrapolate observed values, such as QRF and GF. Figure 8 also sheds light on the value of ensemble predictions across rainfall intensities. The analog ensemble methods suffer from poor resolution and thus have poor skill. Note that above the threshold of 15 mm (6 h)−1, the methods that parameterize heavy tails (EMOS_GEV, EMOS_EGP, QRF EGP TAIL, and GF EGP TAIL) prevail. The tail-extension methods show their benefit in this binary-decision context.

ROC curves for the “rr6 >15 mm” rain event, corresponding to exceeding the quantile of order 0.995. A “good” prediction must maximize hit rate and minimize false alarms. The analog ensemble lacks resolution. Tail extension methods show their gain in a binary decision context.

ROC curves for the “rr6 > 0.2 mm” rain event. A “good” prediction must maximize hit rate and minimize false alarms. The analog ensemble lacks resolution. Note that there is no improvement of postprocessed methods compared to the raw ensemble.

Maximum of the Peirce skill score among thresholds. This score assesses the value of ensemble forecasts. Above the threshold of 15 mm (6 h)−1, the methods that model heavy tails (EMOS_GEV, EMOS_EGP, QRF EGP TAIL, and GF EGP TAIL) prevail.
7. Discussion
Throughout this study, we have seen that forest-based techniques compete favorably with EMOS techniques. It is encouraging to see that, compared to EMOS, QRF and GF reach nearly the same kind of improvement for rainfall amounts as for temperature and wind speed [see Taillardat et al. (2016), their Figs. 6 and 13]. It could be interesting to test these methods (especially GF) on smoother variables. One should remember that the scores are influenced by the high proportion of well-forecast dry events. Thus, one can say that the difficulty of calibrating rainfall is as much a matter of score behavior as of predictability of the variable itself.
Inspired by the paradigm of Gneiting et al. (2007), maximizing the sharpness of the predictive distributions subject to calibration, the leitmotif of this paper could be summarized as maximizing the value of predictive distributions for high-impact events subject to a satisfying overall performance. In this context, the tail extension of QRF and GF generates ensembles better tailored to catching higher rainfall intensities. In addition, reliability as well as resolution remain quite stable when extending the tail, so the paradigm introduced above still holds. The tail extension can be viewed as a semiparametric technique in which the result of forest-based methods is used to fit a distribution. This kind of procedure can be connected to the work of Junk et al. (2015), who use analogs on EMOS inputs. An interesting prospect would be to bring forest-based methods into this context.
One of the advantages of distribution-free calibration (analog ensemble, QRF, and GF) is that no assumption is made on the variable to calibrate. This point is particularly relevant for rainfall amounts, for which EMOS techniques have to be studied using different distributions, inference schemes, and sets of predictors. In this sense, the recent mixing method of Baran and Lerch (2016) looks appealing. An alternative solution may consist of calibrating (standardized) anomalies of precipitation rather than precipitation itself; such an idea is investigated in Dabernig et al. (2017).
Another positive aspect of the forest-based methods is that there is no need for predictor selection. Concerning the analog ensemble, our results suggest that the work of Genuer et al. (2010) could be a cheaper alternative to brute-force algorithms such as in Keller et al. (2017) for the weighting of predictors. For analog techniques, the complete set of predictors gives the best results. In contrast, the choice of the set of predictors is still an ongoing issue for EMOS techniques regarding precipitation. For variables that are easier to calibrate, Messner et al. (2017) show that some variable selection can be effective for the parametric postprocessing of temperature.
A natural perspective regarding spatial calibration and trajectory recovery could be to make use of block regression techniques as done in Zamo et al. (2016), or of ensemble copula coupling and the Schaake shuffle, as suggested by Bremnes (2007), Schefzik (2016), and van Straaten et al. (2018). For further studies on forest-based methods, it would be desirable to work with radar precipitation intensities. This study makes use of rain gauge data, which limits the range of (heavy) rainfall captured, and further studies and verification could be done within a gridded-observation calibration framework. Finally, more and more weather services are working on merging forecasts from different sources (multimodel ensembles). In this context, an attractive procedure could be to combine raw ensembles and different postprocessing methods via sequential aggregation (Mallet 2010; Thorey et al. 2017) in order to further improve forecasts.
Acknowledgments
Part of the work of P. Naveau has been supported by the ANR-DADA, LEFE-INSU-Multirisk, AMERISKA, A2C2, CHAVANA and Extremoscope projects. This work has been supported by Energy oriented Centre of Excellence (EoCoE), Grant Agreement 676629, funded within the Horizon2020 framework of the European Union. This work has been supported by the LABEX MILYON (ANR-10-LABX-0070) of Université de Lyon, within the program “Investissements d’Avenir” (ANR-11-IDEX-0007) operated by the French National Research Agency (ANR). Thanks to Julie Tibshirani, Susan Athey, and Stefan Wager for providing gradient-forest source package. Funding information: LABEX MILYON, Investissements d’Avenir and DADA, ANR, Grants ANR-10-LABX-0070, ANR-11-IDEX-0007, and ANR-13-JS06-0007; EoCoE, Horizon2020, Grant 676629; A2C2, ERC, Grant 338965.
APPENDIX
Variable Selection for EMOS and Analog Ensemble Weighting
Most of the parameters in EMOS and the distance used in the analogs can be inferred using different sets of predictors. Contrary to the QRF and GF methods, where the addition of a useless predictor does not usually affect the predictive performance (under some constraints on the number of informative predictors), EMOS and the analogs are more sensitive to it. Methods that keep the most informative meteorological variables and guarantee the best predictive performance have therefore been investigated.
A natural choice is to consider the classical Akaike information criterion and Bayesian information criterion (Akaike 1998; Schwarz 1978), but the resulting selection kept too many predictors from the initial set. This is an issue for the EMOS methods: for example, the location parameter of the EMOS distributions is a linear fit of the predictors, and this fit should not rely on too many predictors in order to remain a well-conditioned problem. This critical aspect of numerical analysis is probably why Scheuerer (2014) proposes to stop the coefficient estimation after a few iterations. Moreover, the predictors tend to be highly correlated and suffer from collinearity. Finally, the set of predictors that shows the best performance in EMOS is detailed in Table 1 and confirms the choices made by previous studies on rainfall postprocessing.
The algorithm of Genuer et al. (2010) has also been considered. Such an algorithm is appealing since it uses random forests and retains predictors without redundancy of information; for example, it can eliminate correlated predictors even if they are informative. A reduced set of predictors is generally obtained, which avoids the misestimation caused by multicollinearity. The variable selection method used here is one among many others; readers interested in variable selection using random forests can refer to Genuer et al. (2010) for detailed explanations. To be more precise, the variable selection algorithm is used to keep the leading predictors (a maximum of 4 of them), which form the set of predictors for each location. Figure A1 shows the ranked frequency of each chosen predictor; this selection is used for the Analogs_VSF method. Predictors that are never retained are not shown in this figure. We can see that only one-third of the whole set of predictors is retained in at least 10% of the cases. Moreover, predictors representing central and extreme tendencies are preferred. Some predictors other than rainfall amounts appear, such as CAPE, FX, or HU (Table 4); this is not surprising since these parameters are correlated with storms. It is not shown here, but when the MEAN variable is not selected, either MED or CTRL appears in the set, which shows that the algorithm mostly avoids potential correlations among predictors. The results of the variable selection algorithm therefore seem sound. Last but not least, one notices that the predictors of Table 1 are often chosen, which confirms both the robustness of the algorithm and the relevance of previous studies on precipitation concerning the choice of predictors.
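As a rough, simplified stand-in for the procedure of Genuer et al. (2010) (not their actual algorithm), the sketch below ranks predictors by random forest permutation importance and greedily keeps at most four of them while discarding predictors strongly correlated with those already kept; predictor names, data, and thresholds are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(6)
names = ["MEAN", "MED", "CTRL", "Q90", "CAPE", "FX", "HU"]     # subset of Table 4, for illustration
X = rng.normal(size=(800, len(names)))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=800)                 # MED strongly correlated with MEAN
y = np.maximum(0.0, X[:, 0] + 0.5 * X[:, 4] + rng.gamma(2.0, 1.0, 800) - 2.0)

forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
imp = permutation_importance(forest, X, y, n_repeats=10, random_state=0).importances_mean

kept = []
for j in np.argsort(imp)[::-1]:                                # most important predictors first
    corr_with_kept = [abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) for k in kept]
    if all(c < 0.9 for c in corr_with_kept):                   # drop redundant (collinear) predictors
        kept.append(j)
    if len(kept) == 4:                                         # keep at most 4 predictors, as in the text
        break
print([names[j] for j in kept])
```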

Frequency of predictors’ occurrence in the variable selection algorithm. Variables representing central and extreme tendencies are preferred. Some covariates like CAPE, FX, or HU can be retained. It is interesting to see that only one-third of the predictors of the set is kept in more than 10% of the cases.
REFERENCES
Akaike, H., 1998: Information theory and an extension of the maximum likelihood principle. Selected Papers of Hirotugu Akaike, E. Parzen, K. Tanabe, and G. Kitagawa, Eds., Springer, 199–213.
Alessandrini, S., L. Delle Monache, S. Sperati, and G. Cervone, 2015: An analog ensemble for short-term probabilistic solar power forecast. Appl. Energy, 157, 95–110, https://doi.org/10.1016/j.apenergy.2015.08.011.
Athey, S., J. Tibshirani, and S. Wager, 2016: Generalized random forests. arXiv preprint arXiv:1610.01271.
Baran, S., and S. Lerch, 2016: Mixture EMOS model for calibrating ensemble forecasts of wind speed. Environmetrics, 27, 116–130, https://doi.org/10.1002/env.2380.
Baran, S., and D. Nemoda, 2016: Censored and shifted gamma distribution based EMOS model for probabilistic quantitative precipitation forecasting. Environmetrics, 27, 280–292, https://doi.org/10.1002/env.2391.
Baran, S., and S. Lerch, 2018: Combining predictive distributions for the statistical post-processing of ensemble forecasts. Int. J. Forecasting, 34, 477–496, https://doi.org/10.1016/j.ijforecast.2018.01.005.
Ben Bouallègue, Z., 2013: Calibrated short-range ensemble precipitation forecasts using extended logistic regression with interaction terms. Wea. Forecasting, 28, 515–524, https://doi.org/10.1175/WAF-D-12-00062.1.
Breiman, L., 1996: Bagging predictors. Mach. Learn., 24 (2), 123–140.
Breiman, L., 2001: Random forests. Mach. Learn., 45, 5–32, https://doi.org/10.1023/A:1010933404324.
Breiman, L., J. Friedman, C. J. Stone, and R. A. Olshen, 1984: Classification and Regression Trees. CRC Press, 369 pp.
Bremnes, J., 2007: Improved calibration of precipitation forecasts using ensemble techniques. Part 2: Statistical calibration methods. Norwegian Meteorological Institute Tech. Rep. 04/2007, 38 pp.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
Bröcker, J., and L. A. Smith, 2007: Scoring probabilistic forecasts: The importance of being proper. Wea. Forecasting, 22, 382–388, https://doi.org/10.1175/WAF966.1.
Dabernig, M., G. J. Mayr, J. W. Messner, and A. Zeileis, 2017: Spatial ensemble post-processing with standardized anomalies. Quart. J. Roy. Meteor. Soc., 143, 909–916, https://doi.org/10.1002/qj.2975.
De Haan, L., and A. Ferreira, 2006: Extreme Value Theory: An Introduction. Springer, 418 pp.
Delle Monache, L., F. A. Eckel, D. L. Rife, B. Nagarajan, and K. Searight, 2013: Probabilistic weather prediction with an analog ensemble. Mon. Wea. Rev., 141, 3498–3516, https://doi.org/10.1175/MWR-D-12-00281.1.
Descamps, L., C. Labadie, A. Joly, E. Bazile, P. Arbogast, and P. Cébron, 2015: PEARP, the Météo-France short-range ensemble prediction system. Quart. J. Roy. Meteor. Soc., 141, 1671–1685, https://doi.org/10.1002/qj.2469.
Eckel, F. A., and L. Delle Monache, 2016: A hybrid NWP–analog ensemble. Mon. Wea. Rev., 144, 897–911, https://doi.org/10.1175/MWR-D-15-0096.1.
Friederichs, P., and T. L. Thorarinsdottir, 2012: Forecast verification for extreme value distributions with an application to probabilistic peak wind prediction. Environmetrics, 23, 579–594, https://doi.org/10.1002/env.2176.
Gagne, D. J., A. McGovern, and J. Brotzge, 2009: Classification of convective areas using decision trees. J. Atmos. Oceanic Technol., 26, 1341–1353, https://doi.org/10.1175/2008JTECHA1205.1.
Gagne, D. J., A. McGovern, S. E. Haupt, R. A. Sobash, J. K. Williams, and M. Xue, 2017: Storm-based probabilistic hail forecasting with machine learning applied to convection-allowing ensembles. Wea. Forecasting, 32, 1819–1840, https://doi.org/10.1175/WAF-D-17-0010.1.
Genuer, R., J.-M. Poggi, and C. Tuleau-Malot, 2010: Variable selection using random forests. Pattern Recognit. Lett., 31, 2225–2236, https://doi.org/10.1016/j.patrec.2010.03.014.
Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, https://doi.org/10.1198/016214506000001437.
Gneiting, T., and R. Ranjan, 2011: Comparing density forecasts using threshold-and quantile-weighted scoring rules. J. Bus. Econ. Stat., 29, 411–422, https://doi.org/10.1198/jbes.2010.08110.
Gneiting, T., and M. Katzfuss, 2014: Probabilistic forecasting. Annu. Rev. Stat. Appl., 1, 125–151, https://doi.org/10.1146/annurev-statistics-062713-085831.
Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118, https://doi.org/10.1175/MWR2904.1.
Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. J. Roy. Stat. Soc. Series B Stat. Methodol., 69, 243–268, https://doi.org/10.1111/j.1467-9868.2007.00587.x.
Hagedorn, R., R. Buizza, T. M. Hamill, M. Leutbecher, and T. Palmer, 2012: Comparing TIGGE multimodel forecasts with reforecast-calibrated ECMWF ensemble forecasts. Quart. J. Roy. Meteor. Soc., 138, 1814–1827, https://doi.org/10.1002/qj.1895.
Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560, https://doi.org/10.1175/1520-0493(2001)129<0550:IORHFV>2.0.CO;2.
Hamill, T. M., 2019: Practical aspects of statistical postprocessing. Statistical Postprocessing of Ensemble Forecasts, S. Vannitsem, D. Wilks, and J. Messner, Eds., Elsevier, 187–217.
Hamill, T. M., and S. J. Colucci, 1997: Verification of ETA-RSM short-range ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327, https://doi.org/10.1175/1520-0493(1997)125<1312:VOERSR>2.0.CO;2.
Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. Mon. Wea. Rev., 134, 3209–3229, https://doi.org/10.1175/MWR3237.1.
Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. Mon. Wea. Rev., 136, 2620–2632, https://doi.org/10.1175/2007MWR2411.1.
Hand, D. J., 2009: Measuring classifier performance: A coherent alternative to the area under the ROC curve. Mach. Learn., 77, 103–123, https://doi.org/10.1007/s10994-009-5119-5.
Hemri, S., M. Scheuerer, F. Pappenberger, K. Bogner, and T. Haiden, 2014: Trends in the predictive performance of raw ensemble weather forecasts. Geophys. Res. Lett., 41, 9197–9205, https://doi.org/10.1002/2014GL062472.
Herman, G. R., and R. S. Schumacher, 2018: Money doesn’t grow on trees, but forecasts do: Forecasting extreme precipitation with random forests. Mon. Wea. Rev., 146, 1571–1600, https://doi.org/10.1175/MWR-D-17-0250.1.
Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559–570, https://doi.org/10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.
Horton, P., M. Jaboyedoff, and C. Obled, 2017: Global optimization of an analog method by means of genetic algorithms. Mon. Wea. Rev., 145, 1275–1294, https://doi.org/10.1175/MWR-D-16-0093.1.
Hosking, J. R., and J. R. Wallis, 1987: Parameter and quantile estimation for the generalized Pareto distribution. Technometrics, 29, 339–349, https://doi.org/10.1080/00401706.1987.10488243.
Jolliffe, I. T., and C. Primo, 2008: Evaluating rank histograms using decompositions of the chi-square test statistic. Mon. Wea. Rev., 136, 2133–2139, https://doi.org/10.1175/2007MWR2219.1.
Jolliffe, I. T., and D. B. Stephenson, 2012: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley & Sons, 543 pp.
Junk, C., L. Delle Monache, and S. Alessandrini, 2015: Analog-based ensemble model output statistics. Mon. Wea. Rev., 143, 2909–2917, https://doi.org/10.1175/MWR-D-15-0095.1.
Katz, R. W., M. B. Parlange, and P. Naveau, 2002: Statistics of extremes in hydrology. Adv. Water Resour., 25, 1287–1304, https://doi.org/10.1016/S0309-1708(02)00056-8.
Keller, J. D., L. Delle Monache, and S. Alessandrini, 2017: Statistical downscaling of a high-resolution precipitation reanalysis using the analog ensemble method. J. Appl. Meteor. Climatol., 56, 2081–2095, https://doi.org/10.1175/JAMC-D-16-0380.1.
Lerch, S., T. L. Thorarinsdottir, F. Ravazzolo, and T. Gneiting, 2017: Forecaster’s dilemma: Extreme events and forecast evaluation. Stat. Sci., 32, 106–127, https://doi.org/10.1214/16-STS588.
Lobo, J. M., A. Jiménez-Valverde, and R. Real, 2008: AUC: A misleading measure of the performance of predictive distribution models. Global Ecol. Biogeogr., 17, 145–151, https://doi.org/10.1111/j.1466-8238.2007.00358.x.
Mallet, V., 2010: Ensemble forecast of analyses: Coupling data assimilation and sequential aggregation. J. Geophys. Res., 115, D24303, https://doi.org/10.1029/2010JD014259.
Manzato, A., 2007: A note on the maximum Peirce skill score. Wea. Forecasting, 22, 1148–1154, https://doi.org/10.1175/WAF1041.1.
Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. Manage. Sci., 22, 1087–1096, https://doi.org/10.1287/mnsc.22.10.1087.
Meinshausen, N., 2006: Quantile regression forests. J. Mach. Learn. Res., 7, 983–999.
Messner, J. W., G. J. Mayr, and A. Zeileis, 2017: Nonhomogeneous boosting for predictor selection in ensemble postprocessing. Mon. Wea. Rev., 145, 137–147, https://doi.org/10.1175/MWR-D-16-0088.1.
Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293, https://doi.org/10.1175/1520-0434(1993)008<0281:WIAGFA>2.0.CO;2.
Naveau, P., R. Huser, P. Ribereau, and A. Hannart, 2016: Modeling jointly low, moderate, and heavy rainfall intensities without a threshold selection. Water Resour. Res., 52, 2753–2769, https://doi.org/10.1002/2015WR018552.
Papastathopoulos, I., and J. A. Tawn, 2013: Extended generalised Pareto models for tail estimation. J. Stat. Plan. Inference, 143, 131–143, https://doi.org/10.1016/j.jspi.2012.07.001.
Pinson, P., C. Chevallier, and G. N. Kariniotakis, 2007: Trading wind generation from short-term probabilistic forecasts of wind power. IEEE Trans. Power Syst., 22, 1148–1156, https://doi.org/10.1109/TPWRS.2007.901117.
Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133, 1155–1174, https://doi.org/10.1175/MWR2906.1.
Rasp, S., and S. Lerch, 2018: Neural networks for postprocessing ensemble weather forecasts. Mon. Wea. Rev., 146, 3885–3900, https://doi.org/10.1175/MWR-D-18-0187.1.
Roulin, E., and S. Vannitsem, 2012: Postprocessing of ensemble precipitation predictions with extended logistic regression based on hindcasts. Mon. Wea. Rev., 140, 874–888, https://doi.org/10.1175/MWR-D-11-00062.1.
Roulston, M. S., and L. A. Smith, 2002: Evaluating probabilistic forecasts using information theory. Mon. Wea. Rev., 130, 1653–1660, https://doi.org/10.1175/1520-0493(2002)130<1653:EPFUIT>2.0.CO;2.
Schefzik, R., 2016: Combining parametric low-dimensional ensemble postprocessing with reordering methods. Quart. J. Roy. Meteor. Soc., 142, 2463–2477, https://doi.org/10.1002/qj.2839.
Scheuerer, M., 2014: Probabilistic quantitative precipitation forecasting using ensemble model output statistics. Quart. J. Roy. Meteor. Soc., 140, 1086–1096, https://doi.org/10.1002/qj.2183.
Scheuerer, M., and G. König, 2014: Gridded, locally calibrated, probabilistic temperature forecasts based on ensemble model output statistics. Quart. J. Roy. Meteor. Soc., 140, 2582–2590, https://doi.org/10.1002/qj.2323.
Scheuerer, M., and T. M. Hamill, 2015: Statistical postprocessing of ensemble precipitation forecasts by fitting censored, shifted gamma distributions. Mon. Wea. Rev., 143, 4578–4596, https://doi.org/10.1175/MWR-D-15-0061.1.
Scheuerer, M., and D. Möller, 2015: Probabilistic wind speed forecasting on a grid based on ensemble model output statistics. Ann. Appl. Stat., 9, 1328–1349, https://doi.org/10.1214/15-AOAS843.
Schwarz, G., 1978: Estimating the dimension of a model. Ann. Stat., 6, 461–464, https://doi.org/10.1214/aos/1176344136.
Sloughter, J. M. L., A. E. Raftery, T. Gneiting, and C. Fraley, 2007: Probabilistic quantitative precipitation forecasting using Bayesian model averaging. Mon. Wea. Rev., 135, 3209–3220, https://doi.org/10.1175/MWR3441.1.
Taillardat, M., O. Mestre, M. Zamo, and P. Naveau, 2016: Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Mon. Wea. Rev., 144, 2375–2393, https://doi.org/10.1175/MWR-D-15-0260.1.
Thorey, J., V. Mallet, and P. Baudin, 2017: Online learning with the continuous ranked probability score for ensemble forecasting. Quart. J. Roy. Meteor. Soc., 143, 521–529, https://doi.org/10.1002/qj.2940.
Tribus, M., 1969: Rational Descriptions, Decisions and Designs. Pergamon Press, 478 pp.
Van den Dool, H., 1994: Searching for analogues, how long must we wait? Tellus, 46A, 314–324, https://doi.org/10.3402/tellusa.v46i3.15481.
Van Schaeybroeck, B., and S. Vannitsem, 2015: Ensemble post-processing using member-by-member approaches: Theoretical aspects. Quart. J. Roy. Meteor. Soc., 141, 807–818, https://doi.org/10.1002/qj.2397.
van Straaten, C., K. Whan, and M. Schmeits, 2018: Statistical postprocessing and multivariate structuring of high-resolution ensemble precipitation forecasts. J. Hydrometeor., 19, 1815–1833, https://doi.org/10.1175/JHM-D-18-0105.1.
Weijs, S. V., R. Van Nooijen, and N. Van De Giesen, 2010: Kullback–Leibler divergence as a forecast skill score with classic reliability-resolution-uncertainty decomposition. Mon. Wea. Rev., 138, 3387–3399, https://doi.org/10.1175/2010MWR3229.1.
Wilks, D. S., 2015: Multivariate ensemble model output statistics using empirical copulas. Quart. J. Roy. Meteor. Soc., 141, 945–952, https://doi.org/10.1002/qj.2414.
Williams, R., C. Ferro, and F. Kwasniok, 2014: A comparison of ensemble post-processing methods for extreme events. Quart. J. Roy. Meteor. Soc., 140, 1112–1120, https://doi.org/10.1002/qj.2198.
Zamo, M., 2016: Statistical post-processing of deterministic and ensemble windspeed forecasts on a grid. Ph.D. thesis, Université Paris-Saclay.
Zamo, M., O. Mestre, P. Arbogast, and O. Pannekoucke, 2014: A benchmark of statistical regression methods for short-term forecasting of photovoltaic electricity production. Part II: Probabilistic forecast of daily production. Sol. Energy, 105, 804–816, https://doi.org/10.1016/j.solener.2014.03.026.
Zamo, M., L. Bel, O. Mestre, and J. Stein, 2016: Improved gridded wind speed forecasts by statistical postprocessing of numerical models with block regression. Wea. Forecasting, 31, 1929–1945, https://doi.org/10.1175/WAF-D-16-0052.1.
Zhou, B., and P. Zhai, 2016: A new forecast model based on the analog method for persistent extreme precipitation. Wea. Forecasting, 31, 1325–1341, https://doi.org/10.1175/WAF-D-15-0174.1.
Zhu, Y., Z. Toth, R. Wobus, D. Richardson, and K. Mylne, 2002: The economic value of ensemble-based weather forecasts. Bull. Amer. Meteor. Soc., 83, 73–83, https://doi.org/10.1175/1520-0477(2002)083<0073:TEVOEB>2.3.CO;2.