## 1. Introduction

To benefit from the full potential of an ensemble prediction system, statistical postprocessing is a necessary step. Based on the evaluation of historical error characteristics, postprocessing addresses systematic model deficiencies. Statistical adjustments to the numerical forecasts aim at producing reliable probabilistic forecasts that are as sharp as possible (Gneiting et al. 2007).

To transform raw ensemble precipitation forecasts into well-calibrated probabilities, a variety of techniques exist. Logistic regression, a standard method with readily understood characteristics, has the advantage of addressing probabilities directly without any assumption about the underlying distribution functions. Logistic regression is demonstrated to be well suited to the calibration of medium-range ensemble precipitation forecasts (Hamill et al. 2008). Using the ensemble mean as a predictor, it performs as well as or better than more complex approaches (Wilks and Hamill 2007). Bentzien and Friederichs (2012) have shown that logistic regression also performed well within the context of short-range ensemble precipitation forecasts considering the first-guess probabilities as predictors.

The extension introduced by Wilks (2009) made logistic regression even more attractive. Extended logistic regression allows one to derive full probability distributions. By explicitly including the predictive threshold as part of the prediction equation, one single equation is used to derive calibrated probabilities whatever the threshold of interest. The extended approach not only yields mutually consistent probabilities but also considerably reduces the number of parameters that have to be estimated. Wilks (2009) has shown that, with one single “primary” predictor (the ensemble mean) and one “unification” predictor (the predictive threshold), extended logistic regression achieves a better level of performance than does standard logistic regression when the training period is limited. In an intercomparison study, Schmeits and Kok (2010) have also illustrated the competitiveness of extended logistic regression with respect to other calibration techniques.

The scheme proposed by Wilks (2009) unifies the postprocessing equations, introducing the predictive threshold as the only unification predictor. As a consequence, the regression coefficients are constant with the threshold. In terms of the underlying logistic curve, only the offset is modeled as a linear function of the threshold, while the steepness is threshold independent. Moreover, considering two predictors, the relative weight of one predictor with respect to the other cannot be described as a function of the threshold. This rigidity of the scheme can also make the results sensitive to the threshold discretization.

To solve those problems, interaction terms are introduced into the extended logistic regression equation. New predictors enrich the original scheme in order to describe the influence of the unification predictor on the primary predictors. In fact, an interaction term attempts to describe how the effect of one predictor (e.g., the ensemble mean) depends on the value of a second predictor (the threshold). The additional predictors of the new extended logistic scheme are then simply defined as the product of the threshold with the primary predictors. The interaction effect in logistic regression is not a new concept and has already been investigated in social sciences (Jaccard 2001).

Extended logistic regression with interaction terms is here applied to precipitation forecasts derived from a limited-area high-resolution ensemble prediction system. COSMO-DE-EPS is a 20-member ensemble based on the convection-permitting German-focused Consortium for Small-Scale Modeling (COSMO-DE) model. The calibration of the precipitation forecasts is based on one single primary predictor, the ensemble mean. Verification is performed over a 3-month period covering the summer of 2011 and focusing on 6-h accumulated precipitation for forecast lead times of 6, 12, and 18 h, where the large amount of data allows one to investigate thresholds up to 20 mm. The aim of this study is, first, to show that extended logistic regression is also well suited within the framework of convection-permitting ensembles when applied to short-range precipitation forecasts. Second, different schemes of the extended logistic regression are compared to illustrate the impact of the interaction terms on the results. Section 2 describes the dataset and the verification strategy. Section 3 formalizes the introduction of interaction terms in the extended logistic regression equation. Section 4 presents an analysis of the regression parameters and verification results. We conclude this paper in section 5.

## 2. Dataset and verification

### a. Dataset

Developed at the German weather service [Deutscher Wetterdienst (DWD)], COSMO-DE-EPS is an ensemble prediction system based on the convection-permitting model COSMO-DE, a 2.8-km grid-spacing configuration of the COSMO model (Steppeler et al. 2003; Baldauf et al. 2011). The COSMO-DE domain is centered over Germany, with a rotated regular longitude–latitude grid, 421 × 461 grid points, and 50 model levels. COSMO-DE-EPS includes variations of lateral boundary conditions, model physics, and initial conditions [for details about the generation of ensemble members, please see Gebhardt et al. (2011) and Peralta et al. (2012)]. The preoperational version of COSMO-DE-EPS comprises 20 members and probabilities are generated applying a frequentist approach. The forecasts have a lead time up to 21 h and follow an update cycle of 3 h.
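As a minimal sketch of the frequentist approach mentioned above, the raw probability of exceeding a threshold can be taken as the fraction of ensemble members above it. The function name and the member values below are illustrative assumptions and are not taken from the COSMO-DE-EPS software.

```python
def raw_ensemble_probability(members, threshold):
    """Fraction of ensemble members strictly exceeding the threshold."""
    if not members:
        raise ValueError("ensemble must contain at least one member")
    exceed = sum(1 for value in members if value > threshold)
    return exceed / len(members)


# Hypothetical 20-member forecast of 6-h precipitation (mm) at one grid point.
members = [0.0, 0.0, 0.2, 0.4, 0.5, 0.8, 1.1, 1.3, 1.9, 2.2,
           2.4, 3.0, 3.5, 4.1, 4.8, 5.6, 6.3, 7.9, 9.4, 12.0]

# One probability per threshold in mm (6 h)^-1.
probabilities = {t: raw_ensemble_probability(members, t)
                 for t in (0.1, 1, 2, 5, 10, 20)}
```

With 20 members, such a probability can only take 21 distinct values (0%, 5%, …, 100%).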

In this paper, accumulated precipitation for a time range of 6 h derived from the 0000 UTC run is investigated. Three forecast lead times (6, 12, and 18 h) and six thresholds [0.1, 1, 2, 5, 10, and 20 mm (6 h)^{−1}] are calibrated and verified using gauge-adjusted radar precipitation estimates [radar online adjustment (RADOLAN)] as the observational dataset. Those observations combine hourly point measurements at precipitation stations with the areal precipitation data of 16 weather radars (Weigl and Winterrath 2009). The verification domain covers Germany and the verification period is the summer (June–August) of 2011. Verification is performed at the model grid scale.

### b. Verification strategy

The reliability diagram compares the forecast probability and relative observed frequency of an event for a set of probability categories. The forecast probabilities are here divided into 21 categories defined as follows: 0%–2.5%, 2.5%–7.5%, … , 92.5%–97.5%, and 97.5%–100%. The performance of the calibration schemes can be judged qualitatively, perfect reliability being sketched by the diagonal line in the diagram. The 5% and 95% confidence intervals are also plotted, derived from a 1000-member block bootstrap sample considering each day as a separate block of fully independent data [similar to Hamill et al. (2008)]. The frequency of usage of each probability category, plotted as part of the reliability diagram, illustrates the sharpness of the probabilistic forecasts.
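The binning described above can be sketched as follows. This is an illustrative implementation under stated assumptions: the function names are hypothetical, outcomes are coded as 0/1, and a probability lying exactly on a category edge is assigned by floating-point rounding.

```python
def category_index(prob):
    """Map a probability in [0, 1] to one of 21 categories.

    Categories are centred on 0%, 5%, ..., 100%, with edges at
    2.5%, 7.5%, ..., 97.5%; the two end categories are half as wide.
    """
    return min(20, int((prob * 100.0 + 2.5) // 5.0))


def reliability_curve(probs, outcomes):
    """Return (mean forecast, observed frequency, frequency of usage) per used category."""
    counts = [0] * 21
    prob_sums = [0.0] * 21
    event_sums = [0] * 21
    for p, o in zip(probs, outcomes):
        k = category_index(p)
        counts[k] += 1
        prob_sums[k] += p
        event_sums[k] += o
    total = len(probs)
    curve = []
    for k in range(21):
        if counts[k] == 0:
            continue  # unused categories are not plotted
        curve.append((prob_sums[k] / counts[k],
                      event_sums[k] / counts[k],
                      counts[k] / total))
    return curve


# Hypothetical forecasts and binary observations (1 = threshold exceeded).
example_probs = [0.0, 0.0, 0.05, 0.05, 0.5, 0.5, 1.0, 1.0]
example_obs = [0, 0, 0, 1, 1, 0, 1, 1]
example_curve = reliability_curve(example_probs, example_obs)
```

A perfectly reliable forecast would give equal first and second entries in every tuple; the third entry is the frequency of usage that illustrates sharpness.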

The benefit of calibration is measured with the Brier skill score (BSS),

$$\mathrm{BSS} = 1 - \frac{\mathrm{BS}_{calEPS}}{\mathrm{BS}_{rawEPS}},$$

where BS_{rawEPS} and BS_{calEPS} are the Brier scores of the raw ensemble and calibrated forecasts, respectively. Choosing the raw ensemble as a reference forecast for the BSS calculation allows one to estimate the benefit of postprocessing per se. Similarly, reliability gain *G*_{Rel} and sharpness loss *L*_{Sha} are defined as

$$G_{Rel} = 1 - \frac{\mathrm{Rel}_{calEPS}}{\mathrm{Rel}_{rawEPS}}, \qquad L_{Sha} = 1 - \frac{\mathrm{Sha}_{calEPS}}{\mathrm{Sha}_{rawEPS}},$$

where “Rel” is the reliability term of the Brier score decomposition. The reliability gain has a value of 1 for perfectly reliable forecasts (Rel_{calEPS} = 0). “Sha” indicates the sharpness calculated as the mean weighted squared distance of the frequency of usage to the climatological frequency (Mason 2004). The sharpness loss has a value of 1 for climatological forecasts (Sha_{calEPS} = 0). Since the climatological frequency is here calculated as the mean frequency over the verification period, Sha corresponds to the variance of the forecasts’ frequency-of-use distribution. Moreover, for a perfectly reliable forecast, Sha is equivalent to the resolution term of the Brier score decomposition. Indeed, if the reliability curve corresponds to the diagonal, the sharpness and resolution are both estimated as the mean weighted squared distance of the reliability curve to the climatological frequency.
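The scores discussed in this section can be sketched in a few lines. The sketch assumes that the skill score, the reliability gain, and the sharpness loss all take the form 1 − (calibrated value)/(raw value), which is consistent with the limiting values quoted in the text; the function names, the 5% category width, and the toy data are illustrative assumptions.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def reliability_and_sharpness(probs, outcomes, width=0.05):
    """Return (Rel, Sha) from forecasts grouped into categories of the given width.

    Rel: weighted squared distance of mean forecast to observed frequency.
    Sha: weighted squared distance of mean forecast to the climatological frequency.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n  # climatological frequency of the sample
    bins = {}
    for p, o in zip(probs, outcomes):
        k = round(p / width)
        count, p_sum, o_sum = bins.get(k, (0, 0.0, 0))
        bins[k] = (count + 1, p_sum + p, o_sum + o)
    rel = sha = 0.0
    for count, p_sum, o_sum in bins.values():
        weight = count / n
        p_mean = p_sum / count
        rel += weight * (p_mean - o_sum / count) ** 2
        sha += weight * (p_mean - base_rate) ** 2
    return rel, sha


def relative_change(calibrated, raw):
    """1 - calibrated/raw, used here for BSS, reliability gain, and sharpness loss."""
    return 1.0 - calibrated / raw


# Toy data: an overconfident raw forecast and a less sharp calibrated one.
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
raw = [0.0] * 4 + [1.0] * 4
cal = [0.25] * 4 + [0.75] * 4

rel_raw, sha_raw = reliability_and_sharpness(raw, outcomes)
rel_cal, sha_cal = reliability_and_sharpness(cal, outcomes)
bss = relative_change(brier_score(cal, outcomes), brier_score(raw, outcomes))
gain_rel = relative_change(rel_cal, rel_raw)
loss_sha = relative_change(sha_cal, sha_raw)
```

In this toy case the calibrated forecast is perfectly reliable, so the reliability gain reaches 1, while part of the raw sharpness is lost.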

The verification strategy hereafter adopted first consists of checking the quality of a postprocessing method in terms of its impact on the forecast reliability: the gain in reliability has to reach values near 1 in order to demonstrate the appropriateness of the approach. Next, successful postprocessing techniques have to be compared in terms of their impact on the forecast sharpness: the lower the loss in sharpness, the higher the usefulness of the calibrated forecasts.

## 3. Postprocessing

### a. Logistic regression schemes

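We define *p* as the probability that the variable total precipitation exceeds a certain threshold *T*. The logistic regression derives calibrated probabilities through the equation

$$p = \frac{\exp(z)}{1 + \exp(z)}, \tag{1}$$

where *z* is a linear function of *Np* predictors *x* (hereafter called primary predictors),

$$z = \beta_0 + \sum_{i=1}^{Np} \beta_i x_i, \tag{2}$$

where *β*_{0} is called the intercept and *β*_{1}, … , *β*_{Np} are the regression coefficients. Those regression parameters are estimated using the maximum likelihood method (Wilks 2006). Equation (1) can be written as

$$\ln\left(\frac{p}{1 - p}\right) = z, \tag{3}$$

where the left-hand side is the log odds. In standard logistic regression, the parameters *β*_{0}, … , *β*_{Np} are estimated for each threshold of interest:

$$\ln\left(\frac{p}{1 - p}\right) = \beta_0(T) + \sum_{i=1}^{Np} \beta_i(T)\, x_i. \tag{4}$$

In extended logistic regression, a function *g*(*T*) formalizes the role of the threshold as the predictor:

$$\ln\left(\frac{p}{1 - p}\right) = \beta'_0 + \sum_{i=1}^{Np} \beta'_i x_i + g(T), \tag{5}$$

where the *β*′ are threshold independent. To preserve the consistency of the calibrated probabilities (i.e., that the probabilities are lower for increasing thresholds), *g*(*T*) must be decreasing with the threshold. In other words, it has to satisfy the equation

$$\frac{\partial g(T)}{\partial T} < 0. \tag{6}$$

The simplest form of *g*(*T*) is to consider only *T* (or a transformation of *T*) as a predictor:

$$g(T) = \gamma_0 T^{q_T}, \tag{7}$$

with *γ*_{0} the regression coefficient associated with *T* and *q*_{T} the power transformation applied to *T*. Equation (5) can be rewritten as

$$\ln\left(\frac{p}{1 - p}\right) = \left(\beta'_0 + \gamma_0 T^{q_T}\right) + \sum_{i=1}^{Np} \beta'_i x_i. \tag{8}$$

Comparing Eqs. (4) and (8), we note that this scheme of the extended logistic regression [introduced by Wilks (2009) and denoted hereafter as ExLR1] considers *β*_{0} evolving linearly with *T*^{q_T},

$$\beta_0(T) = \beta'_0 + \gamma_0 T^{q_T}, \tag{9}$$

while the regression coefficients are threshold independent:

$$\beta_i(T) = \beta'_i. \tag{10}$$

Another form of *g*(*T*) consists of including interaction terms. New predictors are then defined as the product of the threshold with the primary predictors:

$$g(T) = \gamma_0 T^{q_T} + \sum_{i=1}^{Np} \gamma_i T^{q_T} x_i. \tag{11}$$

With this scheme (denoted hereafter as ExLR2), the regression coefficients also evolve linearly with the transformed threshold,

$$\beta_i(T) = \beta'_i + \gamma_i T^{q_T}, \tag{12}$$

with *γ*_{i} the regression coefficient associated with the interaction term *T*^{q_T}*x*_{i}. Using the scheme with interaction terms, a point, line, or surface in the predictor space (considering one, two, or several predictors, respectively) can exist and define its limit of usage. In the particular case where *Np* = 1 (one single primary predictor), Eq. (6) can be written as

$$\gamma_0 + \gamma_1 x_1 < 0. \tag{13}$$

Since *γ*_{0} has to be negative [*x*_{1} = 0 in Eq. (13)] and considering nonnegative predictors, the scheme always stays valid if

$$\gamma_1 \le 0. \tag{14}$$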
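In practice, a single extended regression is fitted on a training set in which every forecast–observation pair is repeated once per threshold. A minimal sketch of that stacking step is given below; the function name, the row layout, and the default exponent of 0.5 for the threshold transformation are assumptions made for illustration, and the primary predictor is taken as already transformed.

```python
def stacked_training_rows(ens_means, observations, thresholds,
                          q_t=0.5, interaction=True):
    """Pool all thresholds into one list of (predictors, outcome) rows.

    Each row holds [1, x, T**q_t] and, if requested, the interaction
    predictor x * T**q_t. The outcome is 1 when the observation exceeds T.
    """
    rows = []
    for x, obs in zip(ens_means, observations):
        for t in thresholds:
            t_trans = t ** q_t
            predictors = [1.0, x, t_trans]       # intercept, primary, unification
            if interaction:
                predictors.append(x * t_trans)   # interaction predictor
            rows.append((predictors, 1 if obs > t else 0))
    return rows


# Hypothetical case: ensemble mean 4 mm, observation 3 mm, two thresholds.
rows_with = stacked_training_rows([4.0], [3.0], [1.0, 4.0])
rows_without = stacked_training_rows([4.0], [3.0], [1.0, 4.0], interaction=False)
```

Any maximum likelihood logistic regression routine can then be applied to these rows; dropping the last column gives the scheme without interaction terms.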

### b. Application

*q* is defined empirically. After broad investigations, it has been noted that the transformation exponent must be adapted as a function of the precipitation accumulation time, and not as a function of the verification period (e.g., summer or winter season). For 6-h precipitation, we found that a value of 0.25 provides better results than other suggested values (0.5 or ⅓).

*w*^{(j)} is then calculated for each training subsample *j* as a function of the predictor mean value

The training period is a sliding window of the previous *Nt* cases of forecasts and verifying observations. The regression parameters are updated daily. The results based on those independent datasets are shown in section 4b. Investigation into the optimal length of the training period has shown that increasing the training period to more than 45 days does not bring any improvement to the final results [similar to Bentzien and Friederichs (2012)]. Accordingly, *Nt* is set to 45. The block bootstrap method described in section 2b is also applied during the estimation of the regression parameters. Random combinations of the training days allow one to derive mean and standard deviation values of the regression parameters.
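The sliding window and the day-wise block bootstrap can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, each day is represented as a list of cases, and a simple mean stands in for the regression fit that would be applied to each resample.

```python
import random
import statistics


def training_window(days, target_index, n_train=45):
    """Return the n_train days preceding target_index (fewer at the start of the record)."""
    return days[max(0, target_index - n_train):target_index]


def block_bootstrap(days, estimator, n_resamples=1000, seed=0):
    """Resample whole days with replacement; return mean and std of the estimator."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample = [rng.choice(days) for _ in days]
        # Each day is one block: all of its cases are kept together.
        cases = [case for day in sample for case in day]
        estimates.append(estimator(cases))
    return statistics.mean(estimates), statistics.stdev(estimates)


# Hypothetical record: four days with two cases each.
days = [[0.2, 0.4], [1.0, 0.0], [0.6, 0.8], [0.1, 0.3]]
mean_est, std_est = block_bootstrap(days, statistics.mean, n_resamples=200)
```

Replacing `statistics.mean` with a function that fits the regression and returns one coefficient would give the mean and standard deviation of that coefficient over the resamples.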

## 4. Results

### a. Regression parameters

To illustrate the dependence of the regression parameters on the threshold, standard logistic regression is applied. Intercepts and regression coefficients are then estimated independently for each threshold. The selected training period is the whole summer of 2011 in order to look at the trend over the period of interest. The regression parameters estimated here are not used for calibration or optimization purposes but just to illustrate the theoretical statements of the previous section. Figure 1 plots the value of the regression parameters on a square-root scale of the predictive threshold for the three investigated forecast lead times.

The intercepts decrease linearly with the square root of the threshold. The transformation of the threshold with *q*_{T} = 0.5 corresponds to the transformation found empirically by Wilks (2009). Thanks to this transformation, the influence of the threshold on the intercepts can be modeled through a linear function, as in Eq. (9).

The regression coefficients *β*_{1} increase with the transformed threshold, also showing a linear progression. The slope of the curve varies with the forecast lead time and *β*_{1} can be considered to be constant with the threshold for a lead time of 18 h. For this particular lead time, extended logistic regression without interaction terms is certainly sufficient since in that case *β*_{1} can be seen as being threshold independent. The introduction of the interaction term (the product of the transformed threshold and ensemble mean as the predictor) models the relationship between the threshold and regression coefficients in a more flexible way. Through a linear function as in Eq. (12), the influence of *T*^{0.5} on *β*_{1} can be described for any forecast lead time. Moreover, the limited variability of the coefficients within the training sample (illustrated by the vertical lines in Fig. 1b) indicates that adding one more predictor in the regression equation should not lead to overfitting but should help to describe an existing relationship.

Applying now the extended logistic regression to the same training period, we can derive the regression parameters for the two schemes ExLR1 and ExLR2. For a lead time of 6 h, the log odds [Eq. (3)] as a function of the ensemble mean are plotted in Fig. 2. For different thresholds, the lines are parallel for ExLR1 when no interaction terms are used (Fig. 2a), while different slopes characterize the use of the interaction term (Fig. 2b). In the latter case, since the regression lines are not constrained to be parallel in log-odds space, a limit of consistency can exist [expressed by Eq. (6)], which corresponds to the converging point of the regression functions. In this example, the lines in Fig. 2b cross for an ensemble mean of around 1300 mm. This value is far beyond any realistic precipitation forecast, but this may not always be the case and must be checked for each implementation. We can also note that, in the case where the regression lines diverge in ℜ+, the scheme ExLR2 is valid for any nonnegative predictor.
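The check recommended above can be sketched for one primary predictor. The sketch assumes log odds of the form b0 + b1·x + (g0 + g1·x)·T^q, where g0 and g1 are hypothetical names for the coefficients of the transformed threshold and of the interaction term; the numerical values are invented for illustration and are not the fitted parameters of this study.

```python
import math


def log_odds(x, threshold, b0, b1, g0, g1, q_t=0.5):
    """Log odds of exceeding `threshold` for primary predictor value x."""
    t_trans = threshold ** q_t
    return b0 + b1 * x + (g0 + g1 * x) * t_trans


def consistency_limit(g0, g1):
    """Predictor value at which the lines for all thresholds intersect.

    Returns math.inf when the lines never meet for nonnegative x.
    """
    if g0 >= 0:
        raise ValueError("g0 must be negative for probabilities to decrease with T")
    if g1 <= 0:
        return math.inf
    return -g0 / g1


# Hypothetical coefficients (b0, b1, g0, g1).
coeffs = (1.0, 0.5, -2.0, 0.125)
limit = consistency_limit(coeffs[2], coeffs[3])
```

Below the limit, the log odds decrease with the threshold as required; above it, the ordering of the probabilities is reversed, so the limit should lie well beyond the range of realistic predictor values.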

Fig. 2. Logistic regressions on the log-odds scale for thresholds of 0.1, 1, 2, 5, and 10 mm (6 h)^{−1} (from bottom line to top line). The training period covers the summer of 2011 and the forecast lead time is 6 h. (a) The standard extended logistic regression scheme shows parallel lines while (b) the extended logistic regression scheme with an interaction effect shows different inclinations for each threshold.

Citation: Weather and Forecasting 28, 2; 10.1175/WAF-D-12-00062.1

Using the same parameters as for the previous figure, we can quantify the impact of the interaction term on the amplitude of the calibrated probabilities. Probability distribution functions are plotted in Fig. 3. The major differences between the distributions derived from the two schemes are visible for small thresholds combined with small ensemble mean values and high thresholds combined with high ensemble mean values. This last configuration is certainly the most interesting one. For high thresholds, higher probabilities (smaller 1 − *p* values in Fig. 3) are derived for high values of the ensemble mean from the extended logistic regression if the interaction term is used rather than neglected.
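Because the threshold enters the regression as a predictor, the calibrated distribution can be evaluated at any threshold and inverted for quantiles. A minimal sketch follows, assuming log odds of the form b0 + b1·x + (g0 + g1·x)·T^q with hypothetical coefficient names and invented values; it is not the fitted model of this study.

```python
import math


def exceedance_probability(x, threshold, coeffs, q_t=0.5):
    """Calibrated probability p that precipitation exceeds `threshold`."""
    b0, b1, g0, g1 = coeffs
    z = b0 + b1 * x + (g0 + g1 * x) * threshold ** q_t
    return 1.0 / (1.0 + math.exp(-z))


def quantile(x, level, coeffs, q_t=0.5):
    """Threshold T at which the non-exceedance probability 1 - p equals `level`."""
    b0, b1, g0, g1 = coeffs
    slope = g0 + g1 * x
    if slope >= 0:
        raise ValueError("x lies outside the consistent range of the scheme")
    target = math.log((1.0 - level) / level)   # log odds of p = 1 - level
    t_trans = (target - b0 - b1 * x) / slope
    # A negative solution means the level is already reached at T = 0.
    return max(t_trans, 0.0) ** (1.0 / q_t)


# Hypothetical coefficients without interaction effect (g1 = 0).
coeffs = (1.0, 0.5, -1.0, 0.0)
median = quantile(2.0, 0.5, coeffs)
cdf = [1.0 - exceedance_probability(2.0, t, coeffs) for t in (0.1, 1, 2, 5, 10, 20)]
```

The list `cdf` holds 1 − *p* at the six thresholds used in this study for one value of the predictor.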

Fig. 3. Probability distribution functions for selected ensemble mean values: 0.1, 1, 2, 5, and 10 mm (top curve to bottom curve). The dashed and solid lines are derived from the extended logistic regression scheme without an interaction term (scheme ExLR1) and with an interaction term (scheme ExLR2), respectively. The training period covers the summer of 2011 and the forecast lead time is 6 h.

### b. Calibrated probabilities

Operational use of the postprocessing schemes is here mimicked using independent training samples for the estimation of the regression parameters, as described in section 3b. The performances of the calibration schemes are first assessed qualitatively through reliability diagrams for thresholds from 0.1 to 20 mm (6 h)^{−1}. Figure 4 shows the reliability and frequency-of-usage curves, comparing the raw ensemble forecast and the calibrated probabilities derived from ExLR1 and ExLR2. The three forecast lead times of interest are here considered together. Independently of the extended logistic regression scheme used to derive the calibrated probabilities, the probabilistic forecasts are significantly improved by postprocessing. The performances are similar for the two schemes looking at small and intermediate thresholds up to 5 mm (6 h)^{−1}, showing underforecasting for the 0.1 mm (6 h)^{−1} threshold and reliability curves close to the diagonal otherwise. For higher thresholds, a tendency to underforecast is perceived for ExLR1 but not for ExLR2. The main difference between the two methods appears to be greater sharpness for ExLR2 and, in particular, the more frequent use of larger probabilities for the larger precipitation amounts. However, the small sample sizes [i.e., only a small number of forecasts express a probability greater than 50% for a threshold of 20 mm (6 h)^{−1}] imply large sampling variations. Quantitative measures of skill and of postprocessing impact on forecast attributes will help to better compare the performances of the two schemes.

Fig. 4. Reliability of raw and calibrated COSMO-DE-EPS forecasts for thresholds of (a) 0.1, (b) 1, (c) 2, (d) 5, (e) 10, and (f) 20 mm (6 h)^{−1}. The light gray curves represent the raw ensemble, the dark gray curves the calibrated ensemble using extended logistic regression without interaction terms (scheme ExLR1), and the black curves the calibrated ensemble using extended logistic regression with interaction terms (scheme ExLR2). The 5% and 95% confidence intervals using a block bootstrap resampling technique are also shown. The inset plots denote the frequency of use of each probability category, the climatological frequencies being represented by vertical lines.

Figure 5 shows the benefit of calibration in terms of BSS as a function of the forecast lead time for the same six thresholds. The mean BSS is always positive with values ranging from 5% to 20%. As a general feature, the calibration has a major positive impact for shorter forecast lead times, when the lack of spread is more pronounced. For high thresholds, the tendency is the opposite (greater improvement for longer forecast lead times) but those results must be interpreted with regard to the loss in sharpness (Fig. 7). As expected from the analysis of the regression parameters, the introduction of the interaction terms has an impact only for forecast lead times of 6 and 12 h but not for 18 h, and for small and high thresholds but not for intermediate ones. The improvement brought by ExLR2 is mainly perceived for short forecast lead times and high thresholds. Looking at the confidence intervals (especially in Fig. 5f), ExLR2 appears to be more robust than ExLR1, which shows larger day-to-day performance variability.

Fig. 5. BSS showing the benefit of calibration as a function of the lead time for thresholds of (a) 0.1, (b) 1, (c) 2, (d) 5, (e) 10, and (f) 20 mm (6 h)^{−1}. The gray and black lines refer to calibrated precipitation forecasts using extended logistic regression without an interaction term (scheme ExLR1) and with an interaction term (scheme ExLR2), respectively. The vertical bars indicate the 5% and 95% confidence intervals using a block bootstrap resampling technique.

Figure 6 shows the gain in reliability after calibration with respect to the raw ensemble forecast. The mean gains are above 90% for thresholds between 1 and 10 mm (6 h)^{−1}, which demonstrates the ability of extended logistic regression to considerably improve the forecast reliability. For the smallest threshold [0.1 mm (6 h)^{−1}], the performance is less satisfactory, as we have seen in the reliability diagram (Fig. 4a), but ExLR2 is able to perform better than ExLR1. Similarly, for the 20 mm (6 h)^{−1} threshold, ExLR2 performs better than ExLR1 for 6- and 12-h lead times, achieving reliability gains of greater than 85%. For high thresholds (Figs. 6e and 6f), the confidence intervals show that ExLR2 is well adapted for the calibration of the whole summer period, while ExLR1 may have poor performance and even deteriorate the forecast reliability in certain cases.

Fig. 6. As in Fig. 5, but for gain in reliability.

Figure 7 shows the sharpness loss due to calibration as a function of the forecast lead time for the same thresholds as Figs. 5 and 6. The loss of sharpness tends to decrease with the forecast lead time for small thresholds (Figs. 7a–c) and to increase with the forecast lead time for high thresholds (Figs. 7d–f). The loss of sharpness can be interpreted as the additional uncertainty (or spread) introduced by postprocessing, which compensates for the underdispersiveness of the raw ensemble (more important at the beginning of the forecast lead time). The predictability of precipitation events decreases rapidly with forecast lead time and threshold, and calibrated probabilities tend to converge to climatological frequencies. Climatological forecasts are reliable but they are not useful to the forecasters since they are less sharp than any other forecast. Nevertheless, for thresholds of 10 and 20 mm (6 h)^{−1}, ExLR2 is able to significantly limit the loss of sharpness compared to ExLR1, performing at the same time equally well or better in terms of reliability (Figs. 6e and 6f).

Fig. 7. As in Fig. 5, but for loss in sharpness.

## 5. Conclusions

In this paper, extended logistic regression with interaction terms is formulated, discussed, and applied to ensemble precipitation forecasts. Interaction terms are introduced into the extended scheme in order to describe the influence of the unification term (the predictive threshold) on the primary predictors. With the proposed extended scheme, the regression functions are not constrained to be parallel in the log-odds space. More flexibility is provided for the description of the relationship between threshold and predictors, but for each implementation, it has to be checked that the regression functions do not converge within physically relevant ranges.

Extended logistic regression is applied to short-range precipitation forecasts derived from COSMO-DE-EPS, a convection-permitting ensemble prediction system. The ensemble mean is considered to be the single primary predictor. Using interaction terms, a linear function describes the influence of the predictive threshold on the regression coefficient. The use of a second primary predictor has not been investigated here but, in its general formulation [Eqs. (11) and (12)], extended logistic regression with interaction terms enables us to also describe explicitly the relative weight of each primary predictor as a function of the threshold.

Considering independent training periods of 45 days, calibration is applied using two schemes of the extended logistic regression approach: one without an interaction term and one with an interaction term. Verification results performed over a 3-month summer period first show clear improvement of the probabilistic forecasts after calibration. Second, the use of the interaction term significantly improves the sharpness of the calibrated probabilities for high thresholds. Its ability to produce reliable probabilistic forecasts in a robust way and its relatively good performance in terms of sharpness make extended logistic regression with interaction terms attractive for applications.

## Acknowledgments

Nina Schuhen, Christoph Gebhardt, and two anonymous reviewers are thanked for their valuable comments on draft versions of this manuscript.

## REFERENCES

Baldauf, M., A. Seifert, J. Förstner, D. Majewski, M. Raschendorfer, and T. Reinhardt, 2011: Operational convective-scale numerical weather prediction with the COSMO model. *Mon. Wea. Rev.*, **139**, 3887–3905.

Bentzien, S., and P. Friederichs, 2012: Generating and calibrating probabilistic quantitative precipitation forecasts from the high-resolution NWP model COSMO-DE. *Wea. Forecasting*, **27**, 988–1002.

Gebhardt, C., S. E. Theis, M. Paulat, and Z. Ben Bouallègue, 2011: Uncertainties in COSMO-DE precipitation forecasts introduced by model perturbations and variation of lateral boundaries. *Atmos. Res.*, **100**, 168–177.

Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration, and sharpness. *J. Roy. Stat. Soc.*, **69B**, 243–268.

Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. *Mon. Wea. Rev.*, **136**, 2620–2632.

Jaccard, J., 2001: *Interaction Effects in Logistic Regression.* Quantitative Applications in the Social Sciences, Sage University Paper 135, 70 pp.

Mason, S. J., 2004: On using “climatology” as a reference strategy in the Brier and ranked probability skill scores. *Mon. Wea. Rev.*, **132**, 1891–1895.

Peralta, C., Z. Ben Bouallègue, S. E. Theis, and C. Gebhardt, 2012: Accounting for initial condition uncertainties in COSMO-DE-EPS. *J. Geophys. Res.*, **117**, D07108, doi:10.1029/2011JD016581.

Schmeits, M. J., and K. J. Kok, 2010: A comparison between raw ensemble output, (modified) Bayesian model averaging, and extended logistic regression using ECMWF ensemble precipitation reforecasts. *Mon. Wea. Rev.*, **138**, 4199–4211.

Steppeler, J., G. Doms, U. Schättler, H. W. Bitzer, A. Gassmann, U. Damrath, and G. Gregoric, 2003: Meso-gamma scale forecasts using the nonhydrostatic model LM. *Meteor. Atmos. Phys.*, **82**, 75–96.

Weigl, E., and T. Winterrath, 2009: Radargestützte Niederschlagsanalyse und -vorhersage (RADOLAN, RADVOR-OP). *PROMET*, **35**, 78–86.

Wilks, D. S., 2006: *Statistical Methods in the Atmospheric Sciences.* 2nd ed. Academic Press, 627 pp.

Wilks, D. S., 2009: Extending logistic regression to provide full-probability-distribution MOS forecasts. *Meteor. Appl.*, **16**, 361–368.

Wilks, D. S., and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. *Mon. Wea. Rev.*, **135**, 2379–2390.