## 1. Introduction

In many situations, it is possible to have access to several probabilistic forecasts of the same event (Clemen 1989; Graham 1996; Ariely et al. 2000; Winkler and Poses 1993). As these forecasts might be provided by independent models, nonnegligible differences can be observed. It is then necessary to find a combination of all forecasts for decision makers. Keeping the probabilistic forecast that performs best for some specific scores, thus dropping the others, is not an optimal choice. It is sometimes worth keeping the information of relatively poor probabilistic forecasts regarding these same specific scores, provided there is some degree of statistical independence between the forecasts.

Recently, the rise of artificial neural networks (ANN) for making predictions in various fields has also emphasized the power of forecast combination techniques. It can be observed for various Kaggle challenges (Pavlyshenko 2018) that the best-performing ANN architectures (i.e., those with the highest generalization capability) are actually aggregations of several individual ones (Chollet 2017). In the field of weather forecasting, the performance of aggregation methods has long been investigated and highlighted (Sanders 1963; Bosart 1975; Vislocky and Fritsch 1995; Baars and Mass 2005; Hamill et al. 2008; Ranjan and Gneiting 2010; Gneiting and Ranjan 2013). It is therefore legitimate to wonder whether there is an efficient strategy to aggregate probabilistic forecasts in order to capture most of the relevant features of the individual ones.

Several methods for combining probabilistic forecasts have been proposed in the literature. They either combine subjective forecasts made by meteorologists or objective ones from numerical weather prediction (NWP) models. Most of these techniques rely on a linearly weighted average of the probabilistic forecasts. For example, Sanders (1963) has suggested the use of the equally weighted average of 12 subjective probabilistic forecasts as a combination method. In this case study, it has been shown that this new aggregated probabilistic forecast had a positive Brier skill score relative to the climatological forecast, but, more surprisingly, relative to the best forecaster of the group as well. Vislocky and Fritsch (1995) investigated the average of two postprocessed [with a model output statistics (MOS) method] objective forecasts derived from two different high-resolution models. They concluded that the combination product had a higher skill than the two individual MOS forecasts, allowing one to provide reliable forecasts for higher lead times regarding temperature, wind speed, probability of cloud and precipitation amount. Other works related to a linearly weighted average aggregation of probabilistic forecasts include Winkler et al. (1977), Gyakum (1986), Baars and Mass (2005), and Hamill et al. (2008).

Ranjan and Gneiting (2010) have proved that a linearly weighted combination of distinct probabilistic forecasts is not the best combination strategy. In general it leads to uncalibrated forecasts, regardless of whether the underlying individual forecasts are calibrated or not. This important theoretical result does not state that such a combination would necessarily decrease the forecast skill of the combined forecasts below the forecast skill of the initial forecasts, but rather that it is suboptimal and can potentially be improved by using a nonlinear transformation instead. Thus, it does not contradict the other empirical results described in the previous paragraph. As a consequence, Ranjan and Gneiting (2010) suggested a beta-transformed linearly weighted combination of several forecasts. Their numerical results have highlighted some significant improvements in the reliability and sharpness of the forecasts compared to the classic linearly weighted average. The beta-transformed linearly weighted combination has later been adapted in Bassetti et al. (2018) for the combination of predictive probability distributions. For a comparison of methods for the combination of predictive distributions see Baran and Lerch (2018).

Following Ranjan and Gneiting’s work, the goal of the present paper is twofold: 1) to give another theoretical interpretation of calibrated and sharp combined probabilistic forecasts, and 2) to propose a nonlinear combination that enables one to significantly increase the forecast quality for a dichotomous event. The dichotomous event considered in this paper is that of precipitation above 0.1 mm h^{−1}. The suggested model is applied to two forecasts (called Ensemble-MOS and RadVOR) developed by Deutscher Wetterdienst (DWD), Germany’s National Meteorological Service. Ensemble-MOS is a short-term probabilistic forecast (up to 21 h), while RadVOR provides predictions for up to 2 h. Generally, RadVOR has better forecast scores for very short lead times, whereas for longer lead times Ensemble-MOS forecasts are preferably used. The proposed combination model is aimed at capturing most of the information of the two initial forecasts while achieving a seamless transition between both precipitation forecasts across several lead times; see Bowler et al. (2006), Golding (1998), and Kober et al. (2012).

The rest of the paper is organized as follows. In section 2, the Ensemble-MOS and RadVOR forecast data are described. A method is proposed for the transformation of the deterministic RadVOR forecasts into point probabilities, see Theis et al. (2005). Moreover, rain gauge adjusted radar precipitation measurements are presented as they are used for validation purposes. In section 3, the notions of calibration and sharpness are defined. Some theoretical considerations on calibrated and sharp probabilistic forecasts are also presented. In section 4, our model is described for the combination of two probabilistic forecasts. Then, in section 5, the proposed model is numerically validated. Finally, in section 6 it is shown that the developed method can also be applied to the combination of so-called area probabilities. The paper closes with a conclusion and an outlook to some future developments in section 7.

## 2. Data

### a. Ensemble-MOS

Ensemble-MOS of DWD is a model output statistics (MOS) system specialized for the optimization and calibration of probabilistic forecasts based on ensemble systems. In this paper it is applied to COSMO-DE-EPS, the ensemble system of the high-resolution convection-permitting model COSMO-DE of DWD. Ensemble products as mean and standard deviation for a set of model fields are used as predictors in multiple linear and logistic regressions against conventional synoptic observations including rain gauges, especially for precipitation forecasts. Ensemble-MOS forecasts based on 5 years of training data (2011–15) were used in order to provide precipitation forecasts from May to July 2016 with lead times from 1 to 21 h on a 20 km × 20 km grid.

### b. RadVOR

#### 1) Deterministic forecasts

DWD runs an operational quantitative precipitation estimation (QPE) system, called RADOLAN (Weigl and Winterrath 2010). The DWD radar network provides the basis for optimized national composites of current radar reflectivities to be generated on a 5-min update cycle. RADOLAN then combines empirical *Z*–*R* relationships with real-time rainfall gauge measurements from the synoptic station network to yield a calibrated best estimate of current rainfall rates.

For the purposes of providing forecasts and warnings of potential heavy rainfall on nowcasting time scales, DWD has developed a follow-on operational system, called RadVOR (Winterrath et al. 2012), which gives quantitative rainfall forecasts (QPF) for the next 2 h with an update cycle of 5 min. The rainfall estimates from RADOLAN are extrapolated forward in time with the aid of an optimized rainfall displacement vector field. This field is calculated via a mapping of precipitation patterns in successive image data, taking different spatial motion scales into account and using satellite motion vectors to add stability, for example in areas where no precipitation is present. RadVOR provides moving rainfall estimates in 5-min forecast steps on a 1 km × 1 km grid over the whole territory of Germany as well as summing up rainfall totals for the first and second forecast hours.

#### 2) Transformation of deterministic forecasts to probabilistic forecasts

A method is outlined to convert the deterministic RadVOR forecasts to hourly point probabilities on the same grid as the Ensemble-MOS forecasts in order to unify the format of both forecasts.

##### (i) Aggregation of RadVOR forecasts in time

While Ensemble-MOS provides predictions for time intervals of 60 min, RadVOR has a forecast interval of 5-min length. To unify the forecast lengths, all RadVOR forecasts within 1 h are aggregated by summation. The result is a deterministic prediction of precipitation amounts for one complete hour.

Recall that in this paper lead times up to +6 h are considered, although RadVOR only produces forecasts up to +2 h. Thus, when determining RadVOR forecasts for lead times above +2 h, the last available 5-min prediction is inserted repeatedly. This means that for periods with a lead time between +2 and +3 h, some of the 5-min predictions are identical. Aggregated predictions for periods with a lead time larger than +3 h are all identical and consist of the sum of 12 identical 5-min predictions. It is to be expected that this approach (compared to an aggregation of 12 different 5-min intervals) leads to concentrated peaks of precipitation and therefore to a biased forecast.
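As an illustration, the temporal aggregation with persistence padding described above can be sketched as follows. This is a minimal sketch, not DWD code; array shapes and function names are our own, and it assumes the hourly windows are aligned with the RadVOR run:

```python
import numpy as np

def hourly_sum(five_min_fields, lead_hour):
    """Sum twelve 5-min precipitation fields into one hourly amount.

    five_min_fields: chronological list of 2D arrays; RadVOR provides
        24 of them, covering the first two forecast hours.
    lead_hour: forecast hour to aggregate (1, 2, 3, ...).  Beyond the
        available fields, the last one is repeated (persistence).
    """
    last = len(five_min_fields) - 1
    steps = [five_min_fields[min(k, last)]
             for k in range(12 * (lead_hour - 1), 12 * lead_hour)]
    return np.sum(steps, axis=0)
```

For lead hours beyond the +2-h horizon all 12 summands are identical, which reproduces the concentrated precipitation peaks mentioned above.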

It has been tested how well the hourly forecasts would perform if the last 12 available 5-min forecasts would be used repeatedly instead for higher lead times. This alternative approach leads to a smaller bias of −0.005 for lead times from +2 to +6 h, but the Brier skill score and the reliability are significantly worse.

It should be noted that the development of a more sophisticated transformation from deterministic to probabilistic forecasts is outside the scope of this paper. The transformed RadVOR forecast merely serves as uncalibrated initial forecast for the proposed combination method. Furthermore, the decision to consider lead times longer than +2 h was made once it turned out that the combination of both forecasts is feasible for up to +6 h. The RadVOR forecast still holds some valuable information for higher lead times, even if a persistence-based extrapolation for up to +6 h seems not completely satisfactory from a meteorological perspective.

##### (ii) Local averaging

First, the aggregated hourly RadVOR precipitation amounts on the 1 km × 1 km grid are binarized with respect to the threshold of 0.1 mm. Let *V*(*r*′) denote this binarized value for a grid point *r*′ ∈ *R*′ on the 1 km × 1 km grid *R*′ and let *R* denote the 20 km × 20 km grid. Finally, a weighted average of the binarized values is computed for each grid point *r* ∈ *R* using the following formula for the weights: *w*(*r*, *r*′) = ||*r* − *r*′||^{−1.75}, where ||⋅|| is the Euclidean distance. The exponent −1.75 has been chosen empirically from the set {−1, −1.25, …, −2.75, −3}, because it achieved the best reliability for the lead time +1 h. The resulting average is considered as the probability for the exceedance of 0.1 mm of precipitation. Since the influence of *V*(*r*′) on this probability decreases with the distance between *r* and *r*′, only grid points with ||*r* − *r*′|| ≤ 50 km are considered.
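This inverse-distance averaging can be sketched as follows. The handling of the center point *r*′ = *r*, which would yield an infinite weight, is our own assumption:

```python
import numpy as np

def exceedance_probability(r, fine_points, v, exponent=-1.75, cutoff=50.0):
    """Weighted average of binarized 1-km values V(r') around a coarse
    grid point r, with weights w(r, r') = ||r - r'||**exponent and a
    50-km cutoff.  Coordinates are in kilometers.
    """
    dist = np.linalg.norm(np.asarray(fine_points, float) - np.asarray(r, float), axis=1)
    mask = (dist > 0.0) & (dist <= cutoff)  # exclude r' = r to avoid a zero distance
    w = dist[mask] ** exponent
    return float(np.sum(w * np.asarray(v, float)[mask]) / np.sum(w))
```

If all binarized values within the cutoff equal 1, the resulting probability is 1, as expected for a weighted average.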

### c. Calibrated hourly radar measurements

To validate the results obtained in this paper, rain gauge adjusted radar precipitation measurements are used. The measurements were made by the German operational radar network of DWD (Winterrath et al. 2012), which covers Germany with 16 radar sites that provide scans in intervals of 5 min.

The rate of precipitation is derived by transforming the measured radar reflectivities based on empirical reflectivity–precipitation rate (*Z*–*R*) relationships, whereas 0.1 mm h^{−1} of precipitation is the minimum amount that can be detected. To improve accuracy, the precipitation amounts are adjusted according to the measurements of about 1300 rain gauges that are located at meteorological measurement sites. Finally, pixel artifacts, which may occur in radar scans, are removed by a clutter filter as proposed by Winterrath and Rosenow (2007).
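The exact *Z*–*R* relationships used operationally by DWD are not specified here; as a purely illustrative stand-in, the classical Marshall–Palmer power law *Z* = 200*R*^{1.6} can be inverted as follows:

```python
import numpy as np

def rain_rate(dbz, a=200.0, b=1.6):
    """Invert a Z-R power law Z = a * R**b for the rain rate R (mm/h).

    dbz: reflectivity in dBZ, i.e., Z = 10**(dbz/10) in mm^6 m^-3.
    a, b: Marshall-Palmer coefficients, used here for illustration only;
        the operational DWD relationships differ.
    """
    z = 10.0 ** (np.asarray(dbz, float) / 10.0)
    return (z / a) ** (1.0 / b)
```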

## 3. Mathematical background

Let (Ω, 𝒜, ℙ) denote the underlying probability space, where Ω is the set of outcomes, 𝒜 is a *σ*-algebra of subsets of Ω, and ℙ is a probability measure on 𝒜. All random variables considered in this section are defined on this probability space.

### a. Self-calibration as an optimal combination approach

Let *P* be a continuous random variable taking values in the unit interval [0, 1], and *Y* be a dichotomous random variable taking as values 1 with probability *q* and 0 with probability 1 − *q*, where 0 ≤ *q* ≤ 1. The random variable *P* represents a probabilistic forecast for the event *Y* = 1 (i.e., that the amount of precipitation exceeds the threshold *T* = 0.1 mm).

The forecast *P* is said to be calibrated if

ℙ(*Y* = 1 | *P*) = *P* almost surely,  (1)

i.e., *P* coincides with the conditional probability that the event *Y* = 1 occurs, given the probabilistic forecast *P*. Analogously, Eq. (1) can be rewritten as

E(*Y* | *P*) = *P* almost surely,  (2)

i.e., *P* is the conditional expectation of *Y* given *P*. This notion of calibration means that the information delivered by the probabilistic forecast *P* is *reliable*, see also Murphy and Winkler (1977, 1987). A direct consequence of Eq. (2) is that on average the forecast provides the probability of appearance of the event *Y* = 1 [i.e., E(*P*) = E(*Y*) = *q*], so that a calibrated forecast is necessarily unbiased. If *P* is uncalibrated, then

E(*Y* | *P*) = *f*(*P*),  (3)

where *f* is an unknown deterministic function. Besides, from basic properties of conditional expectation, the random variable *f*(*P*) is itself calibrated (see appendix A for some mathematical background). Naturally, *f*(*P*) is called the *self-calibrated version* of *P*. More generally, if *P*_{1}, …, *P*_{n} are *n* probabilistic forecasts, then

*f*(*P*_{1}, …, *P*_{n}) = E(*Y* | *P*_{1}, …, *P*_{n})  (4)

is called the self-calibrated version of the *n* probabilistic forecasts.

The notion of calibration is an important property that a probabilistic forecast should exhibit. However, the notion of calibration is not sufficient for characterizing the skill of a forecast. For example, the climatological forecast *P*, which predicts the average probability of precipitation only, is perfectly calibrated but not a useful prediction. Therefore, assuming calibration, the notion of *sharpness* makes it possible to discriminate the useful informative forecasts (Gneiting et al. 2007).

The sharpness is defined as the variance Var(*P*) of the forecast *P* and corresponds to the dispersion of the forecast from the forecast average. The sharper a forecast, the more *P* takes values close to 0 and 1; hence, the higher the variance. Note that sharpness alone is not a measure for forecast quality, since sharpness is only a property of the distribution of the predicted probabilities but is not affected by how accurate these probabilities are.

It turns out that the self-calibrated version *f*(*P*) of *P* is the sharpest probabilistic forecast among all calibrated ones that depend on *P*, in the sense that it is the solution of

Var[*f*(*P*)] = max{Var[*g*(*P*)]: *g* ∈ *G* and *g*(*P*) is calibrated},  (5)

where *G* is the set of deterministic functions *g*: [0, 1] → [0, 1] such that *g*(*P*) is a well-defined random variable. The proof of Eq. (5) is given in appendix B. This result generalizes naturally to the self-calibrated version *f*(*P*_{1}, …, *P*_{n}) of several probabilistic forecasts *P*_{1}, …, *P*_{n}. Note that in Ranjan and Gneiting (2010) it has been proven that a linear combination of *n* forecasts given by *w*_{1}*P*_{1} + ⋯ + *w*_{n}*P*_{n}, where *w*_{1}, …, *w*_{n} are some weights, lacks calibration and sharpness compared to the self-calibrated version of the forecasts. Our approach is more general in that it combines the initial forecasts in a nonlinear way and considers interactions between them.

Moreover, the self-calibrated version is the best approximation of *Y* with respect to the *L*_{2}-norm:

E{[*Y* − *f*(*P*_{1}, …, *P*_{n})]^{2}} = min{E[(*Y* − *Z*)^{2}]: *Z* is *σ*(*P*_{1}, …, *P*_{n})-measurable},  (6)

i.e., *f*(*P*_{1}, …, *P*_{n}) is the orthogonal projection of *Y* on the space of *σ*(*P*_{1}, …, *P*_{n})-measurable random variables, where *σ*(*P*_{1}, …, *P*_{n}) is the sub-*σ*-algebra generated by *P*_{1}, …, *P*_{n}. Equation (6) means that *f* minimizes the expected Brier score (see section 5) and also any strictly proper scoring rule, as proven by Ranjan and Gneiting (2010).

For all of these reasons, the self-calibrated version of any set of probabilistic forecasts is the best combination method to employ. However, in general the self-calibrated version *f* of forecasts is unknown and therefore intractable: in practice it is not possible to have a closed-form formula for the function *f* (only its existence is ensured). Therefore, some parametric assumptions are usually made on *f*.
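The optimality of the self-calibrated version can be illustrated with a small Monte Carlo experiment; the quadratic miscalibration *f*(*P*) = *P*^{2} below is an arbitrary choice made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(size=200_000)            # forecast P, uniform on [0, 1]
y = rng.uniform(size=p.size) < p ** 2    # event occurs with probability f(P) = P**2

brier_raw = np.mean((p - y) ** 2)        # Brier score of the uncalibrated P
brier_self = np.mean((p ** 2 - y) ** 2)  # Brier score of the self-calibrated f(P)
assert brier_self < brier_raw            # the conditional expectation scores better
```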

### b. Parametric types of combination

A first natural parametric model for *f* is the *linear pool f*_{LP} defined by

*f*_{LP}(*P*_{1}, …, *P*_{n}) = *w*_{1}*P*_{1} + ⋯ + *w*_{n}*P*_{n},  (7)

where the weights *w*_{1}, …, *w*_{n} are such that 0 ≤ *w*_{i} ≤ 1 for each *i* and *w*_{1} + ⋯ + *w*_{n} = 1. As discussed in section 1, the linear pool is in general not calibrated. To compensate for this deficiency, Ranjan and Gneiting (2010) proposed the beta-transformed linear pool *f*_{BLP}, where

*f*_{BLP}(*P*_{1}, …, *P*_{n}) = *H*_{α,β}(*w*_{1}*P*_{1} + ⋯ + *w*_{n}*P*_{n}).  (8)

The function *H*_{α,β} in Eq. (8) is the cumulative distribution function of the beta distribution with shape parameters *α* > 0 and *β* > 0 defined by

*H*_{α,β}(*x*) = ∫_{0}^{x} *t*^{α−1}(1 − *t*)^{β−1}/*B*(*α*, *β*) d*t*, *x* ∈ [0, 1],  (9)

where *B*(*α*, *β*) denotes the beta function. The beta transformation acts as a flexible nonlinear recalibration of the linearly pooled forecasts *P*_{1}, …, *P*_{n}.
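A sketch of the beta-transformed linear pool for two forecasts follows. To stay self-contained, the beta CDF is approximated here by midpoint-rule integration rather than by calling a statistics library:

```python
import numpy as np
from math import gamma

def beta_cdf(x, a, b, n=20_000):
    """CDF of the beta distribution, approximated by the midpoint rule."""
    t = (np.arange(n) + 0.5) / n * x
    dens = t ** (a - 1) * (1 - t) ** (b - 1) * gamma(a + b) / (gamma(a) * gamma(b))
    return float(np.sum(dens) * x / n)

def beta_linear_pool(p1, p2, w1, a, b):
    """f_BLP(p1, p2) = H_{a,b}(w1*p1 + (1 - w1)*p2) for two forecasts."""
    return beta_cdf(w1 * p1 + (1 - w1) * p2, a, b)
```

For *α* = *β* = 1 the transformation is the identity, and the plain linear pool is recovered.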

In the present study, a new type of approximation is proposed for the self-calibrated version of two probabilistic forecasts that leads to a reliable and sharp forecast as highlighted in section 5. The approximation is based on the logistic transformation of a nonlinear combination of the underlying initial probabilistic forecasts with some interaction terms. This approximation of *f* is described in detail in the next section.

## 4. Generalized logit combination

The approximation of a conditional expectation of a dichotomous random variable *Y* given a set of predictors *P*_{1}, …, *P*_{n} is often achieved with a so-called logit model (or logistic regression). In the literature, this model has been used for MOS methods in order to postprocess ensemble members returned by a probabilistic forecast (Hamill et al. 2008; Wilks 2009; Ben Bouallègue 2013). In the present paper, a more general version of the logit model is proposed to approximate the self-calibrated version of a set of probabilistic forecasts. More specifically, the approximation is explicitly detailed for the combination of two probabilistic forecasts that generally give different predictions.

### a. Logit combination with triangular functions

For a given set of probabilistic forecasts *P*_{1}, …, *P*_{n}, the *standard logit model* is given as follows:

*f*_{L}(*P*_{1}, …, *P*_{n}) = *σ*(*a* + *b*_{1}*P*_{1} + ⋯ + *b*_{n}*P*_{n}),  (10)

where *σ*(*x*) = 1/[1 + exp(−*x*)] is the sigmoid function and the coefficients *a* and *b*_{1}, …, *b*_{n} are some model parameters. Note that *a* is usually called the intercept of the model.

In practice, the initial probabilistic forecasts *P*_{i} are not necessarily well calibrated. In such a situation, the standard combination model given by Eq. (10) may lead to an uncalibrated forecast, as the sigmoid function of the simple linear pool is not flexible enough to compensate for the possible underestimation and overestimation of the *P*_{i}’s (see Fig. 1 for an example of deviations). To mitigate these effects, each probabilistic forecast *P*_{i} is split into several predictors *ϕ*_{0}(*P*_{i}), …, *ϕ*_{m}(*P*_{i}), where the functions *ϕ*_{0}, *ϕ*_{1}, …, *ϕ*_{m} are given by

*ϕ*_{j}(*x*) = max{0, 1 − |*mx* − *j*|}, *x* ∈ [0, 1],  (11)

for each *j* ∈ {0, 1, …, *m*}. These functions are called *triangular functions*. In Fig. 2 a set of triangular functions is shown for *m* = 5. Noticing that *ϕ*_{0}(*x*) + *ϕ*_{1}(*x*) + ⋯ + *ϕ*_{m}(*x*) = 1 for all *x* ∈ [0, 1], the intercept coefficient becomes unnecessary and the logit model of Eq. (10) transforms into a more flexible model *f*_{LT}(*P*_{1}, …, *P*_{n}) based on the triangular functions *ϕ*_{0}, …, *ϕ*_{m}:

*f*_{LT}(*P*_{1}, …, *P*_{n}) = *σ*(∑_{i=1}^{n} ∑_{j=0}^{m} *b*_{ij}*ϕ*_{j}(*P*_{i})),  (12)

where the coefficients *b*_{ij} are model parameters.
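The triangular functions and the resulting combination model can be sketched as follows; the formula for *ϕ*_{j} is our reading of the construction described above:

```python
import numpy as np

def phi(j, m, x):
    """Triangular function with peak 1 at x = j/m and support width 2/m."""
    return np.maximum(0.0, 1.0 - np.abs(m * np.asarray(x, float) - j))

def f_LT(p, b):
    """Logit combination with triangular functions: sigmoid of
    sum_i sum_j b[i, j] * phi_j(p[i]).  b has shape (n, m + 1)."""
    n, m_plus_1 = b.shape
    s = sum(b[i, j] * phi(j, m_plus_1 - 1, p[i])
            for i in range(n) for j in range(m_plus_1))
    return 1.0 / (1.0 + np.exp(-s))

# The phi_j form a partition of unity on [0, 1], so no intercept is needed:
x = np.linspace(0.0, 1.0, 21)
assert np.allclose(sum(phi(j, 5, x) for j in range(6)), 1.0)
```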

In the special case *n* = 1, the logit combination model stated in Eq. (12) takes the following form:

*f*_{LT}(*P*_{1}) = *σ*(*w*_{0}*ϕ*_{0}(*P*_{1}) + *w*_{1}*ϕ*_{1}(*P*_{1}) + ⋯ + *w*_{m}*ϕ*_{m}(*P*_{1})),  (13)

where *w*_{0}, *w*_{1}, …, *w*_{m} are some parameters and the family of triangular functions *ϕ*_{0}, *ϕ*_{1}, …, *ϕ*_{m} is constructed in such a way that the expression *w*_{0}*ϕ*_{0}(*P*_{1}) + ⋯ + *w*_{m}*ϕ*_{m}(*P*_{1}) is the piecewise linear interpolation of the points (0/*m*, *w*_{0}), (1/*m*, *w*_{1}), …, (*m*/*m*, *w*_{m}), which transforms the values of *P*_{1} accordingly. In this way, the model given in Eq. (13) is able to compensate over- and underestimations for different values of *P*_{1} at the same time.
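This interpolation property can be verified directly; the node values *w*_{j} below are arbitrary:

```python
import numpy as np

m = 4
w = np.array([0.0, 2.0, -1.0, 0.5, 3.0])  # node values w_0, ..., w_m

def pre_activation(x):
    """sum_j w_j * phi_j(x) with triangular phi_j peaking at j/m."""
    j = np.arange(m + 1)
    return float(np.sum(w * np.maximum(0.0, 1.0 - np.abs(m * x - j))))

# Exact reproduction of the node values at x = j/m ...
assert all(np.isclose(pre_activation(k / m), w[k]) for k in range(m + 1))
# ... and linear interpolation between neighboring nodes:
assert np.isclose(pre_activation(0.125), 0.5 * (w[0] + w[1]))
```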

### b. Interaction terms

In the following, the combination of two probabilistic forecasts *P*_{1} and *P*_{2} is considered. Let *m* be the chosen number of triangular functions. Figure 3 shows the effects of single triangular functions on the output of the combination model. The output of the combination model *f*_{LT} for the crossing points (0.1, 0.1), (0.1, 0.8), (0.5, 0.1), and (0.5, 0.8) in the bottom-left subplot is fully determined by the coefficients of the four triangular functions. Although there are four points and four coefficients, it is generally impossible to find a set of coefficients such that the model output for these four points matches an arbitrary set of four probabilities (i.e., the model can choose the four coefficients so that the probabilities of only three of the four points are correctly predicted); see appendix C for a proof. To be able to make correct predictions for all four points, the model needs more degrees of freedom. For this purpose, some *interaction terms* of the forecasts *P*_{1} and *P*_{2} are considered, which consist of the four functions *γ*_{1}, *γ*_{2}, *γ*_{3}, *γ*_{4} defined on [0, 1]^{2} by

*γ*_{1}(*p*_{1}, *p*_{2}) = *p*_{1}*p*_{2}, *γ*_{2}(*p*_{1}, *p*_{2}) = (1 − *p*_{1})*p*_{2}, *γ*_{3}(*p*_{1}, *p*_{2}) = *p*_{1}(1 − *p*_{2}), *γ*_{4}(*p*_{1}, *p*_{2}) = (1 − *p*_{1})(1 − *p*_{2})

for all *p*_{1}, *p*_{2} ∈ [0, 1].

Combining the triangular functions of Eq. (11) with these interaction terms leads to the following *generalized logit combination model*:

*f*_{LTI}(*P*_{1}, *P*_{2}) = *σ*(∑_{j=0}^{m} [*a*_{1j}*ϕ*_{j}(*P*_{1}) + *a*_{2j}*ϕ*_{j}(*P*_{2}) + ∑_{i=1}^{4} *b*_{ij}*ϕ*_{j}(*γ*_{i}(*P*_{1}, *P*_{2}))]),  (14)

where *a*_{ij} and *b*_{ij} are some model parameters. Thus, there are 6(*m* + 1) parameters to be fitted.
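Putting the pieces together, the generalized logit combination model can be sketched as follows; this is our reconstruction of Eq. (14), and the parameter layout is an assumption:

```python
import numpy as np

def phi(j, m, x):
    """Triangular function with peak 1 at x = j/m."""
    return np.maximum(0.0, 1.0 - np.abs(m * x - j))

def f_LTI(p1, p2, a, b):
    """Generalized logit combination of two forecasts.

    a: (2, m + 1) coefficients for the forecasts P1, P2,
    b: (4, m + 1) coefficients for the interaction terms,
    giving 6 * (m + 1) parameters in total.
    """
    m = a.shape[1] - 1
    gammas = [p1 * p2, (1 - p1) * p2, p1 * (1 - p2), (1 - p1) * (1 - p2)]
    s = 0.0
    for j in range(m + 1):
        s += a[0, j] * phi(j, m, p1) + a[1, j] * phi(j, m, p2)
        s += sum(b[i, j] * phi(j, m, gammas[i]) for i in range(4))
    return 1.0 / (1.0 + np.exp(-s))
```

With all coefficients set to zero, the model outputs the noninformative probability 0.5.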

In the upper-right subplot of Fig. 3, three triangular functions for *γ*_{1} are depicted. The triangular functions of the interaction terms allow the model to choose coefficients for the cases when the two forecasts *P*_{1} and *P*_{2} both predict high probabilities (for *γ*_{1}), both predict low probabilities (for *γ*_{4}), or make diverging predictions (for *γ*_{2} and *γ*_{3}), namely the four corners of [0, 1]^{2}.

It has to be emphasized that the model given in Eq. (14) creates a fine-tuned combination of *P*_{1} and *P*_{2} with interaction terms, but also enables systematically unreliable forecasts to be corrected, as a MOS method would do. A numerical validation of the combination model proposed in Eq. (14) is performed in the next section.

## 5. Numerical validation

In this section, the performance of the combination model proposed in Eq. (14) is analyzed using several validation scores. In particular, the model given in Eq. (14) is compared to the initial probabilistic forecasts (RadVOR and Ensemble-MOS) and also to the standard logit combination model *f*_{L} given in Eq. (10).

### a. Validation scores

Various forecast scores can be used in order to assess the accuracy and the skill of a forecast (Wilks 2006). The following validation scores are considered in this paper: bias, Brier score, Brier skill score, reliability, and reliability diagram.

#### 1) Bias

The *bias* of a probabilistic forecast *P* is defined as the expected difference between the forecast *P* and the dichotomous random variable *Y*, with

Bias(*P*) = E(*P* − *Y*).  (15)

Ideally, a probabilistic forecast *P* makes predictions with a bias close to 0, which indicates that the occurrence of rain is neither overestimated nor underestimated on average. As already mentioned in section 3, a calibrated forecast *P* is necessarily unbiased.

#### 2) Brier score and Brier skill score

The *Brier score* (BS) is given by the expected squared error between the forecast *P* and the dichotomous random variable *Y*:

BS(*P*) = E[(*P* − *Y*)^{2}].  (16)

To assess the skill of a forecast, the *Brier skill score* (BSS) is often used. It is based on a comparison of the Brier score of the forecast and the one of a reference forecast *P*_{ref} used as a *benchmark*:

BSS(*P*) = 1 − BS(*P*)/BS(*P*_{ref}).  (17)

In this paper, the climatological probability *P*_{ref} = *q* of the occurrence of precipitation exceeding the threshold 0.1 mm, computed for the selected period of May–July 2016, is considered as a reference forecast. Note that if the Brier score of the forecast is lower than that of the reference forecast, then the Brier skill score is positive. In this case, the proposed forecast is considered to be skillful.
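Empirical counterparts of both scores, with sample averages replacing expectations, can be written as:

```python
import numpy as np

def brier_score(p, y):
    """Mean squared difference between forecasts p and binary outcomes y."""
    return float(np.mean((np.asarray(p, float) - np.asarray(y, float)) ** 2))

def brier_skill_score(p, y, p_ref):
    """BSS = 1 - BS/BS_ref; positive values indicate skill over p_ref."""
    return 1.0 - brier_score(p, y) / brier_score(p_ref, y)

y = np.array([1, 0, 0, 1, 0])
p = np.array([0.9, 0.2, 0.1, 0.7, 0.3])
clim = np.full(y.size, y.mean())  # climatological reference forecast, q = 0.4
```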

#### 3) Reliability and reliability diagram

The *reliability* score is considered as a measure of conditional bias. Assume that for the probabilistic forecast *P*, *n* predictions *p*_{1}, …, *p*_{n} are available, which correspond to *n* observations *y*_{1}, …, *y*_{n} of the considered event. Denote by *B*_{1}, …, *B*_{I} a partition of the unit interval [0, 1] into *I* subintervals. Each partition component *B*_{i} contains *N*_{i} values of forecasts *p*_{k}, which correspond to the observations *y*_{k} of the event. By *p̄*_{i} the average of the forecast values in *B*_{i} is denoted, and by *ȳ*_{i} the average of the corresponding observations. The reliability (REL) is then defined as the weighted mean of the squared differences between *p̄*_{i} and *ȳ*_{i} over all components *B*_{i}:

REL = (1/*n*) ∑_{i=1}^{I} *N*_{i}(*p̄*_{i} − *ȳ*_{i})^{2}.  (18)

The *reliability diagram* is the graphical representation of the (*p̄*_{i}, *ȳ*_{i}) pairs. The deviation of the reliability diagram from the first bisector of the axes is a qualitative visualization of the reliability. For a quantitative assessment, each reliability diagram is enclosed in a band. The upper and lower ends of the band are the 95% and 5% quantiles of the reliability diagrams for single locations.
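The reliability term can be computed empirically by binning; the clamping of forecasts equal to 1 into the last subinterval is our own edge-handling choice:

```python
import numpy as np

def reliability(p, y, n_bins=10):
    """(1/n) * sum_i N_i * (pbar_i - ybar_i)**2 over a partition of
    [0, 1] into n_bins equal subintervals B_i."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)  # index of B_i
    rel = 0.0
    for i in range(n_bins):
        in_bin = idx == i
        n_i = int(in_bin.sum())
        if n_i:
            rel += n_i * (p[in_bin].mean() - y[in_bin].mean()) ** 2
    return rel / p.size
```

A perfectly calibrated set of forecasts yields REL = 0; smaller values are better.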

### b. Training and testing procedure

For the validation results presented in this section, each forecast has been trained and tested using a *rolling-origin with reoptimization scheme* initially proposed by Armstrong and Grohman (1972). During this procedure, the model is updated with new training data for each hourly step of the time series in chronological order. The point in time *T*, until which the model has been trained, is called the *forecasting origin* and represents the current time in an operational scenario. The forecasting origin splits the data into available data from the past (the training set) and unavailable data from the future (the test set). For each training step, the forecasting origin is moved 1 h forward in time and the model is updated with the new data that became available for training. The update means that the optimization procedure is run with the new available data. At the forecasting origin *T*, the model makes predictions for the future time interval [*T* + *L* − 1, *T* + *L*], where *L* is the chosen lead time in hours. The forecasting origin *T* is rolled over until *T* + *L* ≤ *M*, where *M* is the final time of the dataset. As the forecast quality of the initial forecasts (here RadVOR and Ensemble-MOS) is likely to depend on the lead time, each model has been trained independently for the considered lead times. Therefore, it is possible to assess the accuracy and the skill of the combination model with respect to the lead times.
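The rolling-origin procedure can be sketched generically; `fit` and `predict` are placeholder callables for illustration, not part of the actual implementation:

```python
def rolling_origin(series, first_origin, lead, fit, predict):
    """Rolling-origin evaluation with reoptimization: refit on all data
    up to the origin T, then predict lead steps ahead; repeat with T
    advanced by one step until the target leaves the dataset."""
    predictions = {}
    for T in range(first_origin, len(series) - lead + 1):
        model = fit(series[:T])              # training set: the past only
        predictions[T + lead] = predict(model, T + lead)
    return predictions
```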

The rolling-origin with reoptimization approach enables us to have more testing data when the dataset is not too large and to quantify the amount of data required for the training (Tashman 2000). The next section provides the results of an experimental study of the training procedure for the proposed combination model *f*_{LTI} in Eq. (14).

### c. Evaluation of the fitted model

Before fitting the model to a given dataset, two important parameters, called *hyperparameters*, need to be fixed:

1) the learning rate *η* used in the optimization algorithm for updating the model parameters, where the so-called *stochastic gradient descent algorithm* is considered in the present paper, see also Bottou (2010). The learning rate determines the magnitude of change of the parameters in each training step: a too high learning rate may cause the algorithm to miss the global minimum (or a desirable local minimum), whereas a too small value may result in the algorithm taking a long time to converge or even getting stuck in an undesirable local minimum [see also Goodfellow et al. (2016) for further details];

2) the number *m* of triangular functions *ϕ*_{0}, …, *ϕ*_{m} used in the proposed combination model.

In Fig. 4 the effect of *η* and *m* on the validation scores is shown. It seems that models with a higher number of triangular functions also require a higher learning rate. However, there does not seem to be a combination of hyperparameters that is superior to all others, especially if the same set of hyperparameters is chosen for all lead times. For the results presented in this paper, the hyperparameters of the model *f*_{LTI} have been set to *η* = 0.0005 and *m* = 10, which perform well for all considered forecast scores and all considered lead times. While there are other hyperparameter configurations with a similar performance, it has to be taken into account that the number of model weights increases with an increase of *m* and therefore should be chosen as low as possible.
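A single stochastic gradient descent update for a logit-type model can be sketched as follows, assuming a cross-entropy loss (the text does not state the loss function explicitly):

```python
import numpy as np

def sgd_step(weights, features, y, eta=0.0005):
    """One SGD update: p = sigmoid(w . x); the cross-entropy gradient
    with respect to w is (p - y) * x."""
    p = 1.0 / (1.0 + np.exp(-np.dot(weights, features)))
    return weights - eta * (p - y) * features
```

Here *η* = 0.0005 is the learning rate chosen for *f*_{LTI} above; for *f*_{LTI} the feature vector would collect the 6(*m* + 1) triangular-function values.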

For the standard logit combination model *f*_{L} the appropriate learning rate *η* has been determined in a similar way, by comparing the Brier skill scores for different learning rates, where *η* = 0.0025 performed best for short lead times, *η* = 0.001 for the midrange lead times and *η* = 0.0005 for long lead times. Since the differences were not significant (below 0.001), *η* = 0.001 was chosen for all lead times.

Once the hyperparameters were fixed, the models were fitted to the data using the rolling-origin with reoptimization procedure (see section 5b). Figure 5 visualizes the output of the fitted model *f*_{LTI} and the corresponding observed probabilities. Notice that the proposed combination model gives more significance to forecasts provided by RadVOR for short lead times, while Ensemble-MOS is given more emphasis for longer lead times. This is in accordance with the validation scores since the RadVOR forecasts perform better than Ensemble-MOS forecasts at shorter lead times and worse for the longer lead times (see Figs. 6 and 1).

Figure 7 depicts the distribution of the parameters *a*_{ij} and *b*_{ij} of the fitted combination model *f*_{LTI} introduced in Eq. (14) for the months of June (in red) and July (in blue) with violin plots. In this model, the initial probabilistic forecasts *P*_{1} and *P*_{2} (based on Ensemble-MOS and RadVOR) are split into 11 triangular functions *ϕ*_{0}, …, *ϕ*_{10}, resulting in 11 parameters for each probabilistic forecast. Also, each interaction term *γ*_{1}, *γ*_{2}, *γ*_{3} and *γ*_{4} is decomposed into 11 triangular functions. For each value *x* ∈ {0, 0.1, …, 0.9, 1} on the *x* axis, there is a triangular function *ϕ* with *ϕ*(*x*) = 1, the corresponding parameter of which is depicted at *x* in Fig. 7. For example, for the value *x* = 0 in the RadVOR column, the violin plots in red and blue can be seen as the influence of RadVOR predictions close to the value *x* = 0 on the combination model for the months of June and July, respectively. For the lead time +1 h the RadVOR parameters range from −2 to +1.5, while the Ensemble-MOS parameters are between −0.5 and 0.5. Therefore, the predictions based on RadVOR have a larger influence on the combined forecast. With increasing lead times, Ensemble-MOS parameters spread out further and RadVOR parameters move closer to 0. These observations are consistent with those made regarding Fig. 5. Moreover, the parameters for Ensemble-MOS and *γ*_{1} at *x* = 1 are close to zero because Ensemble-MOS made almost no predictions close to 1 (see the bar plots in Fig. 1 and data plots in Fig. 5). Therefore, these parameters are seldom updated and stay close to 0. It is notable that most parameters show a similar distribution for both months of June and July. Data for the month of May has been omitted due to the warm-up period at the beginning of the training, which leads to different parameter distributions for May in comparison to June and July.
Also, it can be seen that the variance of the parameter distribution increases for longer lead times. This is probably due to increased forecast errors in the initial forecasts. Note that if all 11 weights of a predictor are arranged on a line, then the triangular functions mimic the behavior of a standard logit combination model with one parameter for each initial predictor. However, the ability to choose parameters in a nonlinear way leads to a more general and flexible combination model.

The interaction terms *γ*_{1} and *γ*_{4} take values close to 1 if both initial forecasts agree. In Fig. 7 it can be seen that if both initial forecasts predict precipitation, *γ*_{1} further increases the predicted probability of the model, while if both initial forecasts predict no precipitation, *γ*_{4} decreases the predicted probability further. *γ*_{2} takes values close to 1 if Ensemble-MOS predicts no precipitation, but RadVOR does. For lower lead times, when RadVOR has a high forecast skill, *γ*_{2} further increases the predicted probability of the model. For higher lead times and a lower forecast skill of RadVOR, the weights of *γ*_{2} move closer to zero. Similarly, the slope of *γ*_{3} changes with increasing lead time according to which of the initial forecasts has the higher forecast skill.

The bias, Brier skill score, reliability, and sharpness of the initial forecasts, of the standard logit combination model *f*_{L}, and of the proposed combination model *f*_{LTI} are shown in Fig. 6. The boxplot diagrams represent the variability of the daily scores depending on lead time. They measure the consistency of the probabilistic forecasts from day-to-day predictions: the wider a boxplot diagram is, the less consistent the model is. The continuous lines represent the validation scores over all locations and points in time of the dataset. Note that the Brier skill score of 3 months is not equal to the average daily Brier skill score, which is more sensitive to days with a low Brier skill score. The overall scores for the combination model *f*_{LTI} are significantly better than those for the initial probabilistic forecasts with respect to the Brier skill score and the reliability. Ensemble-MOS shows a slightly increasing bias, RadVOR a negative bias of −2%, and the combination models are almost unbiased for the 3-month average. Moreover, the daily predictions of the proposed model are more consistent than those of the initial forecasts. Besides, the proposed combination model preserves the sharpness for short lead times, but decreases it for longer lead times. Notice that all the scores of *f*_{LTI} are also improved compared to the standard logit combination model. To see the effect of interaction terms on the validation scores, the forecasts have been combined with a model of type *f*_{LT}, which extends the logistic regression model *f*_{L} with triangular functions only. The results show that *f*_{LTI} compared to *f*_{LT} (not shown here) has improved bias, reliability, and sharpness.

Reliability diagrams for these probabilistic forecasts are shown in Fig. 1. The histograms represent the empirical distributions of the probabilistic forecasts. The combination model *f*_{LTI} is significantly more reliable for all lead times than both the initial probabilistic forecasts and the standard logit combination model. Figures 6 and 1 highlight that the *f*_{LTI} combination model achieves higher accuracy and skill than the initial probabilistic forecasts without overly reducing sharpness.

For the results presented in this paper, the combination model *f*_{LTI} has been trained on all point probabilities regardless of their corresponding location. Therefore, the combination model cannot correct local errors that affect only a subset of locations. To assess how well the combination model performs for single stations, the considered forecast scores for each location are shown in Figs. 8–10. Especially for the bias and the Brier skill score, local differences can be observed for the combination model. However, these differences already occur in the initial forecasts and are not introduced by the combination model. In Fig. 10, the local reliability of the combination model is much more homogeneous than that of both initial forecasts.

In Fig. 11, the initial and combined point probabilities are illustrated for a single hour to showcase the seamless transition between both initial forecasts.

### d. Runtime of the fitted model

In addition to validation scores, the runtime of a model is critical for operational use, especially if the initial forecasts have a fast update cycle of a few minutes, as RadVOR does. To benchmark the runtime of the proposed combination model *f*_{LTI}, the model was run on an Intel Core i7-860 (2.8 GHz).

Combining 2210 hourly forecasts for approximately 1370 locations and 8 lead times took 41 min and 11 s, which corresponds to 1.118 s per hourly forecast. This includes reading the initial forecasts from a file, making a prediction for each location, saving the new predictions to a file, and updating the model parameters with the new observations. The transformation of the RadVOR forecasts has not been considered in this evaluation, since it is independent of the combination itself and does not affect the runtime in the general use case of the proposed model *f*_{LTI}.

Note that the model only requires the most recent information of the last hour to make the next prediction and to update the model parameters, which results in the short runtime and a low memory footprint.
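Such a constant-memory update can be sketched as a stochastic-gradient step on the logistic log-loss, in the spirit of Bottou (2010), which the paper cites for model fitting. This is an illustrative sketch under that assumption, not the exact operational update rule, and the feature vector shown is a simplification:

```python
import math

def sgd_update(weights, features, outcome, lr=0.05):
    """One online gradient step on the logistic log-loss.

    Only the newest (features, outcome) pair is needed, so memory use
    stays constant no matter how long the model has been running."""
    z = sum(w * x for w, x in zip(weights, features))
    p = 1.0 / (1.0 + math.exp(-z))           # current predicted probability
    return [w - lr * (p - outcome) * x for w, x in zip(weights, features)]

# Streaming loop: predict, observe, update -- no history is retained.
w = [0.0, 0.0, 0.0]            # bias + one weight per initial forecast
for p1, p2, obs in [(0.8, 0.9, 1), (0.1, 0.2, 0), (0.7, 0.6, 1)]:
    x = [1.0, p1, p2]          # simplified feature vector for this hour
    w = sgd_update(w, x, obs)
```

Because each update touches only the current hour's data, the runtime per forecast stays constant regardless of how much history the model has already seen.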

## 6. Application to area probabilities for warning events

In this section, the wide applicability of the approach proposed in this paper for the calibrated combination of probabilistic precipitation forecasts is demonstrated. More precisely, we show that our approach can also be used for the calibrated combination of so-called area probabilities. Note that most NWP models generate predictions for single points on a certain grid; this is also the case for RadVOR and Ensemble-MOS. In Kriesche et al. (2015), a stochastic geometry model has been introduced that calculates area probabilities based on point probabilities. This model was developed for the generation of weather warnings. For instance, in order to predict the likelihood of flooding, the probability of precipitation anywhere within the catchment area of a river is of interest, without knowing the exact location of the precipitation event. Similarly, emergency forces might be interested in the area probability of critical weather events in their area of responsibility.

An area probability is the probability that a given weather event occurs anywhere within a considered area *A*. From this definition, it follows that area probabilities of a given weather event are at least as large as the probabilities for single points or arbitrary subsets within *A*. Formally, the area probability *p*(*A*) for the occurrence of precipitation anywhere inside *A* has the following representation (see, e.g., Hess et al. 2018):

*p*(*A*) = 1 − exp{−∑_{*s*∈*S*} *a*(*s*) *ν*_{2}[*V*(*s*) ∩ (*A* ⊕ *b*(*o*, *r*))]},

where *S* is the set of points for which point probabilities are given, *V*(*s*) is the Voronoi cell corresponding to location *s*, and *a*(*s*) is a model parameter representing the number of precipitation cells per unit area in *V*(*s*). Furthermore, *ν*_{2}[*A* ⊕ *b*(*o*, *r*)] is the area of the dilated set *A* ⊕ *b*(*o*, *r*), where *A* ⊕ *b*(*o*, *r*) denotes the Minkowski sum of *A* and the disk *b*(*o*, *r*), which is centered at the origin and has some radius *r* > 0 (Chiu et al. 2013). Note that the model parameters *r* and *a*(*s*) for all *s* ∈ *S* are estimated on the basis of the corresponding point probabilities. For further details, we refer to Kriesche et al. (2015, 2017).
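The representation above can be approximated numerically by discretizing the plane. The sketch below is a hypothetical helper using simple grid quadrature, not the exact geometric computation of Kriesche et al. (2015); grid spacing, domain extent, and the station layout are illustrative assumptions:

```python
import math

def area_probability(stations, a, region, r, cell=0.5, extent=20.0):
    """Grid approximation of
    p(A) = 1 - exp(-sum_s a(s) * nu2[V(s) & (A (+) b(o, r))]).

    stations: list of (x, y) points s in S; a: dict station index -> a(s);
    region:   predicate region(x, y) deciding membership in A;
    r:        radius of the precipitation-cell disk b(o, r)."""
    xs = [i * cell for i in range(int(extent / cell))]
    # Grid points of A, used to test membership in the dilated set.
    a_pts = [(x, y) for x in xs for y in xs if region(x, y)]
    measure = {i: 0.0 for i in range(len(stations))}
    for x in xs:
        for y in xs:
            # Dilated set A (+) b(o, r): within distance r of some point of A.
            if not any((x - ax) ** 2 + (y - ay) ** 2 <= r * r
                       for ax, ay in a_pts):
                continue
            # Voronoi cell V(s): the nearest station wins this grid cell.
            s = min(range(len(stations)),
                    key=lambda i: (x - stations[i][0]) ** 2
                                + (y - stations[i][1]) ** 2)
            measure[s] += cell * cell
    lam = sum(a[i] * measure[i] for i in measure)
    return 1.0 - math.exp(-lam)
```

As the formula requires, the result is 0 when all cell densities *a*(*s*) vanish, increases monotonically with them, and always stays below 1.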

In principle, combined area probabilities can be computed in two different ways. Namely, they can be computed

1. based on already combined point probabilities (method 1); or
2. for the point probabilities of each initial forecast separately, with the resulting area probabilities then being combined by the proposed combination model *f*_{LTI} (method 2).

In Fig. 12, the validation scores for area probabilities based on RadVOR, Ensemble-MOS, and their combination are compared; the area probabilities for Ensemble-MOS and RadVOR show a similar behavior as the corresponding point probabilities in Fig. 6. Based on these forecast scores, Fig. 12 shows that method 2 leads to a much smaller bias and better reliability than method 1, whereas the Brier skill score does not show any significant difference. Thus, when computing calibrated area probabilities, method 2 described above should be used.

## 7. Conclusions

The combination model presented in this paper demonstrates significant improvements in forecast accuracy, skill, and consistency with respect to all considered forecast scores. The forecast scores even show a large improvement for lead times at which currently no RadVOR forecasts are available. Both the conversion of deterministic RadVOR predictions to probabilistic forecasts and the fitting of the proposed combination model are computationally cheap and therefore allow for a seamless update of Ensemble-MOS forecasts.

Furthermore, the method has been applied to the combination of area probabilities, which can be used for warning events. The computation of area probabilities is based on a stochastic geometry model using point probabilities. The proposed method has been used to show that area probabilities should first be computed from the point probabilities of each initial forecast and then combined with the combination model.

The combination model has not yet been applied to thresholds other than 0.1 mm. It is likely that a model trained for one threshold would not yield satisfactory results if applied to forecasts of another threshold. Therefore, a separate model would have to be trained for each threshold, which also increases the total number of parameters.

Note that combination models of the type considered in this paper could also be constructed using artificial neural networks (ANN). For such models, there is no need to specify an explicit parametric form relating the underlying initial probabilistic forecasts to the event being predicted. Thus, ANN models may allow for more flexibility. Besides, it may also be possible to train a general ANN for the combination of forecasts that predicts exceedance probabilities not only for one threshold but for several thresholds simultaneously. In this case, the consistency of the calibrated probabilities has to be ensured [i.e., the probabilities have to decrease with increasing threshold; see also Ben Bouallègue (2013)].

The development of such ANN-based combination models for the prediction of several thresholds or a probability distribution will be the subject of a forthcoming paper.

## Acknowledgments

The financial support by Deutscher Wetterdienst (DWD) for the project STOFOR through the extramural research program (EMF) is gratefully acknowledged. The authors also acknowledge support by the state of Baden-Württemberg through bwHPC.

## APPENDIX A

### Calibration

Let *f*(*P*) = E[*Y* | *P*] be the self-calibrated version of a probabilistic forecast model *P*. It can easily be seen that *f*(*P*) is calibrated in the sense of Eq. (2). Namely, since *σ*(*f*(*P*)) ⊆ *σ*(*P*), the tower property of conditional expectation yields

E[*Y* | *f*(*P*)] = E[E[*Y* | *P*] | *f*(*P*)] = E[*f*(*P*) | *f*(*P*)] = *f*(*P*),

where the tower property E[E[*X* | 𝒢] | ℋ] = E[*X* | ℋ] holds for any integrable random variable *X* and sub-*σ*-algebra ℋ ⊆ 𝒢.
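The calibration property of *f*(*P*) = E[*Y* | *P*] can also be checked by simulation. In the sketch below, the discrete forecast values and their true conditional hit rates are arbitrary illustrative assumptions; replacing each forecast value by its conditional hit rate yields an empirically calibrated forecast:

```python
import random

random.seed(1)

# A deliberately miscalibrated forecast P with three discrete values, and
# the (assumed) true conditional probabilities f(P) = E[Y | P].
true_cond = {0.2: 0.4, 0.5: 0.5, 0.9: 0.6}

samples = []
for _ in range(200000):
    p = random.choice(list(true_cond))                 # draw a forecast value
    y = 1 if random.random() < true_cond[p] else 0     # draw the outcome
    samples.append((true_cond[p], y))                  # forecast f(P), not P

# Empirically, E[Y | f(P) = v] is close to v for every value v of f(P):
for v in set(true_cond.values()):
    group = [y for fp, y in samples if fp == v]
    assert abs(sum(group) / len(group) - v) < 0.01
```

The same check applied to the raw forecast *P* would fail for the values 0.2 and 0.9, which is exactly the miscalibration that the mapping *f* removes.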

## APPENDIX B

### Sharpness

It turns out that *f*(*P*) has the maximum variance compared to any other calibrated model *g*(*P*) that is a function of *P*.

Let *g*: [0, 1] → [0, 1] be any deterministic function such that *g*(*P*) is a well-defined random variable, which is calibrated, that is, E[*Y* | *g*(*P*)] = *g*(*P*). For brevity, we write *f* instead of *f*(*P*), and *g* instead of *g*(*P*). First, notice that

E[*Yf*] = E[*f*^{2}],

since conditional expectation is an orthogonal projection of *Y* onto the *L*^{2}-space of square-integrable random variables and *f*(*P*) is *σ*(*P*) measurable. With the same type of argument, one can show that E[*Yg*] = E[*fg*] = E[*g*^{2}]. As calibration implies E[*f*] = E[*g*] = E[*Y*], it follows that

Var(*f*) − Var(*g*) = E[*f*^{2}] − E[*g*^{2}] = E[*f*^{2}] − 2E[*fg*] + E[*g*^{2}] = E[(*f* − *g*)^{2}] ≥ 0.

## APPENDIX C

### Limitation of *f*_{LT}

In this appendix, a limitation of the model *f*_{LT} is shown, which can be resolved with additional coefficients such as those provided by the interaction terms in the combination model *f*_{LTI}. Consider the model *f*_{LT} with two initial forecasts *P*_{1} and *P*_{2}, that is, a logistic model whose linear predictor is the sum of the triangular basis expansions of *P*_{1} and *P*_{2}, with coefficients *λ*_{1,*j*} and *λ*_{2,*j*} for the basis functions *ϕ*_{j} applied to *P*_{1} and *P*_{2}, respectively. The triangular functions *ϕ*_{j} reach their maximum at *j*/*m* with *ϕ*_{j}(*j*/*m*) = 1 for each *j* ∈ {0, …, *m*}. In the case where *P*_{1} and *P*_{2} take values in {0, 1/*m*, …, (*m* − 1)/*m*, 1}, all triangular functions are zero, except for the two triangular functions that take their maximum at *j*_{1}/*m* = *P*_{1} and *j*_{2}/*m* = *P*_{2}. It then holds that *f*_{LT} can be reduced as in Eq. (C1):

logit *f*_{LT}(*j*_{1}/*m*, *j*_{2}/*m*) = *λ*_{1,*j*_{1}} + *λ*_{2,*j*_{2}}. (C1)

Now consider the four grid points (*j*_{1}/*m*, *j*_{2}/*m*), (*j*_{1}/*m*, *j*′_{2}/*m*), (*j*′_{1}/*m*, *j*_{2}/*m*), and (*j*′_{1}/*m*, *j*′_{2}/*m*) with prescribed calibrated probabilities. By Eq. (C1), the additive structure implies on the logit scale that the sum of the values at (*j*_{1}/*m*, *j*_{2}/*m*) and (*j*′_{1}/*m*, *j*′_{2}/*m*) equals the sum of the values at (*j*_{1}/*m*, *j*′_{2}/*m*) and (*j*′_{1}/*m*, *j*_{2}/*m*). If the prescribed probabilities do not fulfill this constraint, *f*_{LT} cannot satisfy the equations for all four points and will have to pick an approximate solution.
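This four-point obstruction can be made concrete on the logit scale. The target probabilities below are arbitrary illustrative values that violate the additive constraint, so an additive model fitted exactly on three of the points necessarily misses the fourth:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

# Illustrative targets at four grid points (P1, P2) in {0, 1}^2.
targets = {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.6}
z = {k: logit(v) for k, v in targets.items()}

# Fit the additive model z(j1, j2) = w1[j1] + w2[j2] exactly on three
# points (one coefficient may be fixed to zero without loss of generality):
w1 = {0: 0.0, 1: z[(1, 0)] - z[(0, 0)]}
w2 = {0: z[(0, 0)], 1: z[(0, 1)]}

# The fourth point is then forced by additivity and misses its target:
forced = w1[1] + w2[1]
assert abs(forced - z[(1, 1)]) > 1e-9
```

A single interaction coefficient for the pair (*j*′_{1}, *j*′_{2}), as supplied by the interaction terms of *f*_{LTI}, absorbs exactly this residual, which is why the extended model can fit all four points.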

## REFERENCES

Ariely, D., W. Tung Au, R. H. Bender, D. V. Budescu, C. B. Dietz, H. Gu, T. S. Wallsten, and G. Zauberman, 2000: The effects of averaging subjective probability estimates between and within judges. *J. Exp. Psychol. Appl.*, **6**, 130–147, https://doi.org/10.1037/1076-898X.6.2.130.

Armstrong, J. S., and M. C. Grohman, 1972: A comparative study of methods for long-range market forecasting. *Manage. Sci.*, **19**, 211–221, https://doi.org/10.1287/mnsc.19.2.211.

Baars, J. A., and C. F. Mass, 2005: Performance of National Weather Service forecasts compared to operational, consensus, and weighted model output statistics. *Wea. Forecasting*, **20**, 1034–1047, https://doi.org/10.1175/WAF896.1.

Baran, S., and S. Lerch, 2018: Combining predictive distributions for the statistical post-processing of ensemble forecasts. *Int. J. Forecasting*, **34**, 477–496, https://doi.org/10.1016/j.ijforecast.2018.01.005.

Bassetti, F., R. Casarin, and F. Ravazzolo, 2018: Bayesian nonparametric calibration and combination of predictive distributions. *J. Amer. Stat. Assoc.*, **113**, 675–685, https://doi.org/10.1080/01621459.2016.1273117.

Ben Bouallègue, Z., 2013: Calibrated short-range ensemble precipitation forecasts using extended logistic regression with interaction terms. *Wea. Forecasting*, **28**, 515–524, https://doi.org/10.1175/WAF-D-12-00062.1.

Bosart, L. F., 1975: SUNYA experimental results in forecasting daily temperature and precipitation. *Mon. Wea. Rev.*, **103**, 1013–1020, https://doi.org/10.1175/1520-0493(1975)103<1013:SERIFD>2.0.CO;2.

Bottou, L., 2010: Large-scale machine learning with stochastic gradient descent. *Proceedings of COMPSTAT'2010*, Y. Lechevallier and G. Saporta, Eds., Springer, 177–186.

Bowler, N. E., C. E. Pierce, and A. W. Seed, 2006: STEPS: A probabilistic precipitation forecasting scheme which merges an extrapolation nowcast with downscaled NWP. *Quart. J. Roy. Meteor. Soc.*, **132**, 2127–2155, https://doi.org/10.1256/qj.04.100.

Chiu, S. N., D. Stoyan, W. S. Kendall, and J. Mecke, 2013: *Stochastic Geometry and Its Applications*. J. Wiley & Sons, 584 pp.

Chollet, F., 2017: *Deep Learning with Python*. Manning Publications, 384 pp.

Clemen, R. T., 1989: Combining forecasts: A review and annotated bibliography. *Int. J. Forecasting*, **5**, 559–583, https://doi.org/10.1016/0169-2070(89)90012-5.

Clemen, R. T., and R. L. Winkler, 1999: Combining probability distributions from experts in risk analysis. *Risk Anal.*, **19**, 187–203, https://doi.org/10.1111/j.1539-6924.1999.tb00399.x.

Genest, C., and K. J. McConway, 1990: Allocating the weights in the linear opinion pool. *J. Forecasting*, **9**, 53–73, https://doi.org/10.1002/for.3980090106.

Gneiting, T., and R. Ranjan, 2013: Combining predictive distributions. *Electron. J. Stat.*, **7**, 1747–1782, https://doi.org/10.1214/13-EJS823.

Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. *J. Roy. Stat. Soc.*, **B69**, 243–268, https://doi.org/10.1111/j.1467-9868.2007.00587.x.

Golding, B., 1998: Nimrod: A system for generating automated very short range forecasts. *Meteor. Appl.*, **5**, 1–16, https://doi.org/10.1017/S1350482798000577.

Goodfellow, I., Y. Bengio, and A. Courville, 2016: *Deep Learning*. MIT Press, 775 pp.

Graham, J. R., 1996: Is a group of economists better than one? Than none? *J. Bus.*, **69**, 193–232, https://doi.org/10.1086/209688.

Gyakum, J. R., 1986: Experiments in temperature and precipitation forecasting for Illinois. *Wea. Forecasting*, **1**, 77–88, https://doi.org/10.1175/1520-0434(1986)001<0077:EITAPF>2.0.CO;2.

Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. *Mon. Wea. Rev.*, **136**, 2620–2632, https://doi.org/10.1175/2007MWR2411.1.

Hess, R., B. Kriesche, P. Schaumann, B. K. Reichert, and V. Schmidt, 2018: Area precipitation probabilities derived from point forecasts for operational weather and warning service applications. *Quart. J. Roy. Meteor. Soc.*, **144**, 2392–2403, https://doi.org/10.1002/qj.3306.

Kober, K., G. Craig, C. Keil, and A. Dörnbrack, 2012: Blending a probabilistic nowcasting method with a high-resolution numerical weather prediction ensemble for convective precipitation forecasts. *Quart. J. Roy. Meteor. Soc.*, **138**, 755–768, https://doi.org/10.1002/qj.939.

Kriesche, B., R. Hess, B. K. Reichert, and V. Schmidt, 2015: A probabilistic approach to the prediction of area weather events, applied to precipitation. *Spat. Stat.*, **12**, 15–30, https://doi.org/10.1016/j.spasta.2015.01.002.

Kriesche, B., R. Hess, and V. Schmidt, 2017: A point process approach for spatial stochastic modeling of thunderstorm cells. *Probab. Math. Stat.*, **37**, 471–496, https://doi.org/10.19195/0208-4147.37.2.14.

Murphy, A. H., and R. L. Winkler, 1977: Reliability of subjective probability forecasts of precipitation and temperature. *J. Roy. Stat. Soc.*, **C26**, 41–47, https://doi.org/10.2307/2346866.

Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. *Mon. Wea. Rev.*, **115**, 1330–1338, https://doi.org/10.1175/1520-0493(1987)115<1330:AGFFFV>2.0.CO;2.

Pavlyshenko, B., 2018: Using stacking approaches for machine learning models. *2018 IEEE Second Int. Conf. on Data Stream Mining & Processing (DSMP)*, Lviv, Ukraine, IEEE, 255–258, https://doi.org/10.1109/DSMP.2018.8478522.

Ranjan, R., and T. Gneiting, 2010: Combining probability forecasts. *J. Roy. Stat. Soc.*, **B72**, 71–91, https://doi.org/10.1111/j.1467-9868.2009.00726.x.

Sanders, F., 1963: On subjective probability forecasting. *J. Appl. Meteor.*, **2**, 191–201, https://doi.org/10.1175/1520-0450(1963)002<0191:OSPF>2.0.CO;2.

Tashman, L. J., 2000: Out-of-sample tests of forecasting accuracy: An analysis and review. *Int. J. Forecasting*, **16**, 437–450, https://doi.org/10.1016/S0169-2070(00)00065-0.

Theis, S., A. Hense, and U. Damrath, 2005: Probabilistic precipitation forecasts from a deterministic model: A pragmatic approach. *Meteor. Appl.*, **12**, 257–268, https://doi.org/10.1017/S1350482705001763.

Vislocky, R. L., and J. M. Fritsch, 1995: Improved model output statistics forecasts through model consensus. *Bull. Amer. Meteor. Soc.*, **76**, 1157–1164, https://doi.org/10.1175/1520-0477(1995)076<1157:IMOSFT>2.0.CO;2.

Weigl, E., and T. Winterrath, 2010: Radargestützte Niederschlagsanalyse und -vorhersage (RADOLAN, RadVOR-OP). *Promet*, **35**, 78–86.

Wilks, D. S., 2006: *Statistical Methods in the Atmospheric Sciences*. 2nd ed. International Geophysics Series, Vol. 100, Academic Press, 648 pp.

Wilks, D. S., 2009: Extending logistic regression to provide full-probability-distribution MOS forecasts. *Meteor. Appl.*, **16**, 361–368, https://doi.org/10.1002/met.134.

Winkler, R. L., and R. M. Poses, 1993: Evaluating and combining physicians' probabilities of survival in an intensive care unit. *Manage. Sci.*, **39**, 1526–1543, https://doi.org/10.1287/mnsc.39.12.1526.

Winkler, R., A. Murphy, and R. Katz, 1977: The consensus of subjective probability forecasts: Are two, three, …, heads better than one? Preprints, *Fifth Conf. on Probability and Statistics in Atmospheric Sciences*, Boston, MA, Amer. Meteor. Soc., 57–62.

Winterrath, T., and W. Rosenow, 2007: A new module for the tracking of radar-derived precipitation with model-derived winds. *Adv. Geosci.*, **10**, 77–83, https://doi.org/10.5194/adgeo-10-77-2007.

Winterrath, T., W. Rosenow, and E. Weigl, 2012: On the DWD quantitative precipitation analysis and nowcasting system for real-time application in German flood risk management. *IAHS Publ.*, **351**, 323–329.