## 1. Introduction

Important applications such as severe weather warnings or decision making in agriculture, industry, and finance strongly demand accurate weather forecasts. Usually, numerical weather prediction (NWP) models are used to provide these forecasts. Unfortunately, because the current state of the atmosphere is known only approximately and some physical processes are unknown or unresolved, NWP models are always subject to error. To estimate these errors, many forecasting centers nowadays provide ensemble forecasts: several NWP forecasts with perturbed initial conditions and/or different model formulations. However, the perturbed initial conditions do not necessarily represent initial-condition uncertainty (Hamill et al. 2003; Wang and Bishop 2003), and some structural deficiencies in the models are also not accounted for. Thus, the ensemble forecasts usually do not represent the full uncertainty of NWP models and typically need to be statistically postprocessed to achieve well-calibrated probabilistic forecasts.

In the past decade a variety of different ensemble postprocessing methods have been proposed. Examples are ensemble dressing (Roulston and Smith 2003), Bayesian model averaging (Raftery et al. 2005), heteroscedastic linear regression (Gneiting et al. 2005), and logistic regression (Hamill et al. 2004). Comparisons of these and other postprocessing methods (Wilks 2006; Wilks and Hamill 2007) showed that logistic regression performs relatively well. Wilks (2009) extended logistic regression by including the (transformed) predictand thresholds as an additional predictor variable. In addition to requiring fewer coefficients and providing coherent probabilistic forecasts, this extended logistic regression allows derivation of full continuous predictive distributions. Extended logistic regression has been used frequently (Schmeits and Kok 2010; Ruiz and Saulo 2012; Roulin and Vannitsem 2012; Hamill 2012; Ben Bouallègue 2013; Scheuerer 2014; Messner et al. 2014) and has been further extended to additionally account for conditional heteroscedasticity (Messner et al. 2014). Recently, several studies noted that extended logistic regression assumes a conditional logistic distribution for the transformed predictand (Scheuerer 2014; Schefzik et al. 2013; Messner et al. 2014), where this logistic distribution is fitted to selected predictand category probabilities.

In this study we compare (heteroscedastic) extended logistic regression with two closely related regression models from statistics that are particularly popular in econometrics (and more broadly in social sciences):

- (Heteroscedastic) ordered logistic regression also provides coherent forecasts of category probabilities. However, it differs from extended logistic regression in that no continuous distribution is assumed or specified by the model.

- (Heteroscedastic) censored regression also fits conditional logistic distributions to a transformed predictand but employs the full set of training-data points (as opposed to a set of thresholds) for fitting the model.

In section 2 the different statistical models are described in detail. A brief description of the data can be found in section 3. Finally, section 4 presents the results and section 5 provides a summary and discussion.

## 2. Statistical models

This section describes different statistical models to predict conditional probabilities *P*(*y* ≤ *q*_{j} | **x**) of a continuous predictand *y* falling below a threshold *q*_{j}, given a vector of predictor variables **x** = (1, *x*_{1}, *x*_{2}, …)^{T} (i.e., NWP forecasts). Conditional category probabilities of *y* to fall between two thresholds *q*_{a} and *q*_{b} can then easily be derived with *P*(*q*_{a} < *y* ≤ *q*_{b} | **x**) = *P*(*y* ≤ *q*_{b} | **x**) − *P*(*y* ≤ *q*_{a} | **x**).
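As a minimal illustration (not from the paper), the category probability above is simply the difference of two evaluations of the predictive cumulative distribution function; here we use the standard logistic CDF as an example:

```python
import math

def logistic_cdf(u):
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-u))

def interval_probability(cdf, q_a, q_b):
    """P(q_a < y <= q_b) as the difference of two CDF evaluations."""
    return cdf(q_b) - cdf(q_a)

# With a standard logistic predictive distribution, the probability of
# y falling between -1 and 1 is F(1) - F(-1).
p = interval_probability(logistic_cdf, -1.0, 1.0)
```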

### a. Separate logistic regressions (SLR)

In separate logistic regressions the conditional probabilities are modeled as

*P*(*y* ≤ *q*_{j} | **x**) = Λ(**x**^{T}**β**),   (1)

where **β** = (*β*_{0}, *β*_{1}, *β*_{2}, …)^{T} is a coefficient vector and Λ(·) = exp(·)/[1 + exp(·)] is notationally equivalent to the cumulative distribution function of the standard logistic distribution. The coefficient vector **β** is estimated by maximizing the log-likelihood

ℓ = ∑_{i=1}^{*N*} log(*π*_{i}),   (2)

where *N* is the number of events in the dataset and *π*_{i} is the predicted probability of the *i*th observed outcome:

*π*_{i} = *P*(*y* ≤ *q*_{j} | **x**_{i}) if *y*_{i} ≤ *q*_{j} and *π*_{i} = 1 − *P*(*y* ≤ *q*_{j} | **x**_{i}) otherwise.   (3)

Separate coefficient vectors (**β**) are fitted for several thresholds *q*_{j} of interest (e.g., Hamill et al. 2004; Wilks 2006; Wilks and Hamill 2007). This implies that the regression lines for different thresholds can cross, so that for some values of the predictor variables **x**, *P*(*y* ≤ *q*_{a} | **x**) > *P*(*y* ≤ *q*_{b} | **x**) although *q*_{a} < *q*_{b}, which leads to nonsensical negative probabilities for *y* to fall between *q*_{a} and *q*_{b}.
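A toy sketch of separate logistic regressions might look as follows. This is purely illustrative and not the authors' code: the synthetic data, the predictor `ens_mean`, and the plain gradient-ascent fitter are our own assumptions.

```python
import numpy as np

def fit_logistic(X, z, iters=2000, lr=0.1):
    """Fit P(z=1|x) = Lambda(x^T beta) by maximizing the log-likelihood
    with plain gradient ascent (X includes a leading column of ones)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += lr * X.T @ (z - p) / len(z)
    return beta

rng = np.random.default_rng(1)
ens_mean = rng.gamma(2.0, 2.0, size=500)          # toy NWP predictor
y = ens_mean + rng.logistic(0.0, 1.0, size=500)   # toy predictand
X = np.column_stack([np.ones_like(ens_mean), ens_mean])

# One independent coefficient vector per threshold q_j (here: terciles
# of the toy predictand), fitted to the binary event y <= q_j.
thresholds = np.quantile(y, [1 / 3, 2 / 3])
betas = {q: fit_logistic(X, (y <= q).astype(float)) for q in thresholds}
```

Note that nothing ties the fitted coefficient vectors together, which is exactly why the resulting regression lines can cross and produce inconsistent probabilities.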

### b. Heteroscedastic extended logistic regression (HXLR)

Extended logistic regression (Wilks 2009) avoids this deficiency by including a transformation of the threshold itself as an additional predictor:

*P*(*y* ≤ *q*_{j} | **x**) = Λ[*αg*(*q*_{j}) − **x**^{T}**β**],   (4)

where *α* is an additional coefficient that has to be estimated and the transformation *g*( ) is a monotone function. Equation (4) also differs from standard logistic regression, where **β** is estimated separately for each threshold, in that here **β** is the same for all thresholds. Thus, one interpretation of Eq. (4) is that it defines parallel regression lines in log-odds space with equal slope but different intercepts [*θ*_{j} = *αg*(*q*_{j}) − *β*_{0}]. Figure 1 shows examples of these regression curves schematically.

Because of the monotonicity of *g*( ), Eq. (4) provides coherent probability forecasts for arbitrary thresholds *q*_{j} (and not only the thresholds employed for estimating the model). In other words, Eq. (4) can also be interpreted as a cumulative distribution function that describes a full continuous predictive distribution. After some reformulation (see Messner et al. 2014), Eq. (4) can also be written as

*P*(*y* ≤ *q*_{j} | **x**) = Λ{[*g*(*q*_{j}) − **x**^{T}**β**/*α*] / (1/*α*)},   (5)

so that the conditional distribution of *g*(*y*) is a logistic distribution with location parameter **x**^{T}**β**/*α* and scale parameter 1/*α*. Thus, the transformation *g*( ) must be chosen such that the transformed predictand can be assumed to follow a conditional (on the predictors **x**) logistic distribution.

Messner et al. (2014) proposed a heteroscedastic extension that uses an additional vector of predictor variables **z** = (1, *z*_{1}, *z*_{2}, …)^{T} (e.g., the ensemble spread) to directly control the dispersion (variance) of the logistic predictive distribution:

*P*(*y* ≤ *q*_{j} | **x**) = Λ{[*g*(*q*_{j}) − **x**^{T}**γ**] / exp(**z**^{T}**δ**)},   (6)

where **γ** = (*γ*_{0}, *γ*_{1}, *γ*_{2}, …)^{T} and **δ** = (*δ*_{0}, *δ*_{1}, *δ*_{2}, …)^{T} are the coefficient vectors that have to be estimated. The exponential function is used as a simple method to ensure positive values (Messner et al. 2014).

The coefficients **γ** and **δ** are also estimated by maximizing the log-likelihood function given by Eq. (2). However, the probability of the observed outcome for the multicategorical predictand is

*π*_{i} = *P*(*y* ≤ *q*_{1} | **x**_{i}) if *y*_{i} ≤ *q*_{1}; *π*_{i} = *P*(*y* ≤ *q*_{j+1} | **x**_{i}) − *P*(*y* ≤ *q*_{j} | **x**_{i}) if *q*_{j} < *y*_{i} ≤ *q*_{j+1}; and *π*_{i} = 1 − *P*(*y* ≤ *q*_{J} | **x**_{i}) if *y*_{i} > *q*_{J},   (7)

where *J* is the number of thresholds *q*_{j} that have been selected for the fitting.
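The multicategorical log-likelihood of Eqs. (2), (6), and (7) can be sketched as follows. This is an illustrative reimplementation with synthetic data, not the authors' code; the variable names and the identity transformation used in the demo are our assumptions.

```python
import numpy as np

def hxlr_negloglik(params, X, Z, y, thresholds, g=np.sqrt):
    """Negative log-likelihood of (heteroscedastic) extended logistic
    regression: category probabilities are differences of the logistic
    CDF evaluated at the transformed thresholds."""
    k = X.shape[1]
    gamma = params[:k]            # location coefficients
    delta = params[k:]            # scale coefficients (through exp)
    loc = X @ gamma
    scale = np.exp(Z @ delta)
    # cumulative probabilities P(y <= q_j | x) for every threshold
    u = (g(thresholds)[None, :] - loc[:, None]) / scale[:, None]
    cum = 1.0 / (1.0 + np.exp(-u))
    # pad with 0 and 1 so category probabilities are simple differences
    cum = np.hstack([np.zeros((len(y), 1)), cum, np.ones((len(y), 1))])
    cat = np.searchsorted(thresholds, y, side="left")  # observed category
    pi = cum[np.arange(len(y)), cat + 1] - cum[np.arange(len(y)), cat]
    return -np.sum(np.log(np.maximum(pi, 1e-300)))

# Toy demonstration with synthetic data and the identity transformation.
rng = np.random.default_rng(2)
x = rng.gamma(2.0, 2.0, 200)
y = np.maximum(x + rng.logistic(0.0, 1.0, 200), 0.0)
X = np.column_stack([np.ones(200), x])
Z = np.ones((200, 1))
q = np.quantile(y, [0.25, 0.5, 0.75])
nll = hxlr_negloglik(np.array([0.0, 1.0, 0.0]), X, Z, y, q, g=lambda v: v)
```

In practice the parameter vector would be chosen to minimize this function with a numerical optimizer.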

### c. Heteroscedastic ordered logistic regression (HOLR)

Ordered logistic regression is closely related to extended logistic regression; however, separate intercepts *θ*_{j} are fitted for each selected threshold instead of modeling them as a linear function of the (transformed) thresholds:

*P*(*y* ≤ *q*_{j} | **x**) = Λ(*θ*_{j} − **x**^{T}**β**).   (8)

The *θ*_{j} are only constrained to be ordered (*θ*_{1} ≤ *θ*_{2} ≤ ⋯ ≤ *θ*_{J}) for ordered thresholds *q*_{j}. Because the intercepts of the regression lines are fully determined by *θ*_{j}, further intercepts are not needed anymore, so that **x** = (*x*_{1}, *x*_{2}, …)^{T} must not contain any constant. Similar to extended logistic regression, **β** is the same for all thresholds.

The separate intercepts for each threshold imply the estimation of more coefficients than for extended logistic regression. Furthermore, only the probabilities for the thresholds *q*_{j} employed in the estimation can be derived, so that Eq. (8) does not specify full continuous predictive distributions. In return, ordered logistic regression does not assume a continuous distribution for the transformed predictand. Thus, no (possibly nonexistent) transformation has to be determined to fulfill this assumption.
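The ordered-intercept construction can be sketched as follows. Enforcing *θ*_{1} ≤ *θ*_{2} ≤ ⋯ via cumulative exponentials is one common trick for unconstrained optimization; it is our own illustrative choice, not necessarily the authors' implementation.

```python
import numpy as np

def holr_cumprobs(theta_raw, beta, X):
    """Cumulative probabilities of ordered logistic regression.
    Ordering of the intercepts is enforced by parameterizing them as
    theta_1, theta_1 + exp(t_2), theta_1 + exp(t_2) + exp(t_3), ...
    Note that X must not contain a constant column here."""
    theta = np.concatenate(
        [[theta_raw[0]], theta_raw[0] + np.cumsum(np.exp(theta_raw[1:]))]
    )
    # parallel regression lines: same slopes beta, separate intercepts
    eta = theta[None, :] - (X @ beta)[:, None]
    return 1.0 / (1.0 + np.exp(-eta))

# Two toy forecasts with a single predictor and three thresholds.
X = np.array([[0.5], [2.0]])
beta = np.array([1.0])
cp = holr_cumprobs(np.array([-1.0, 0.0, 0.0]), beta, X)
```

By construction, each row of `cp` is nondecreasing across thresholds, so no threshold crossing can occur.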

Analogously to Eq. (6), a heteroscedastic version is obtained by scaling with exp(**z**^{T}**δ**), where **z** = (*z*_{1}, *z*_{2}, …)^{T}:

*P*(*y* ≤ *q*_{j} | **x**) = Λ{[*θ*_{j} − **x**^{T}**γ**] / exp(**z**^{T}**δ**)}.   (9)

Maximum likelihood estimation with the same log-likelihood function as for extended logistic regression [Eqs. (2) and (7)] is used to estimate the coefficients *θ*_{j}, **γ**, and **δ**.

### d. Heteroscedastic censored logistic regression (HCLR)

Censored logistic regression fits the conditional logistic distribution of Eq. (6) directly to the individual (transformed) predictand values in the training data. The *π*_{i} in the log-likelihood [Eq. (2)] are then

*π*_{i} = λ{[*g*(*y*_{i}) − **x**_{i}^{T}**γ**] / exp(**z**_{i}^{T}**δ**)} / exp(**z**_{i}^{T}**δ**),   (10)

where λ[·] denotes the likelihood function of the standard logistic distribution. The likelihood is notationally identical to the probability density function [i.e., the derivative of Eq. (6) with respect to *g*(*q*_{j})], but differs because it is a function of the parameter vectors **γ** and **δ** for a fixed predictand value *y*_{i}, rather than being a function of *y*_{i} given fixed values for **γ** and **δ**. In this way, the *π*_{i} employed for fitting the model are not the likelihoods for predictands falling into discrete intervals, but rather the likelihoods that they take on their exact observed values. This model can also be interpreted as a linear regression model with a (heteroscedastic) logistic error distribution.

To accommodate the nonnegativity of wind speed and precipitation, the predictive distribution is censored at zero: for observations *y*_{i} = 0 the *π*_{i} are replaced by

*π*_{i} = Λ{[*g*(0) − **x**_{i}^{T}**γ**] / exp(**z**_{i}^{T}**δ**)}.   (11)

This heteroscedastic censored logistic regression fits a logistic error distribution with a point mass at zero to the transformed predictand. While such an error distribution seems reasonable for square root transformed precipitation amounts (Scheuerer 2014; Schefzik et al. 2013), other error distributions are usually assumed for wind speed. For example, Thorarinsdottir and Gneiting (2010) proposed to fit a truncated normal distribution to the *untransformed* wind speed. In this case, the logistic distribution in Eqs. (6) and (10) is replaced with a truncated normal distribution and *g*(*y*) is set to *g*(*y*) = *y*. Note that Thorarinsdottir and Gneiting (2010) also called this model heteroscedastic censored regression although the data are actually considered to be truncated rather than censored. In the following we therefore denote this model as heteroscedastic truncated Gaussian regression (HTGR), which we also employ as a benchmark model for wind speed.
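A censored log-likelihood of this kind might be sketched as follows: a logistic density term for positive observations and a logistic CDF term for observations censored at zero. This is an illustrative sketch only; the synthetic data and the identity transformation in the demo are our assumptions.

```python
import numpy as np

def hclr_negloglik(params, X, Z, y, g=np.sqrt):
    """Negative log-likelihood of censored logistic regression:
    logistic density for positive observations and the logistic CDF
    at g(0) for observations censored at zero."""
    k = X.shape[1]
    gamma, delta = params[:k], params[k:]
    loc = X @ gamma
    scale = np.exp(Z @ delta)
    u = (g(y) - loc) / scale
    # density of the logistic distribution, divided by the scale
    dens = np.exp(-u) / (scale * (1.0 + np.exp(-u)) ** 2)
    # CDF contribution for the point mass at zero
    u0 = (g(np.zeros_like(y)) - loc) / scale
    cdf0 = 1.0 / (1.0 + np.exp(-u0))
    ll = np.where(y > 0,
                  np.log(np.maximum(dens, 1e-300)),
                  np.log(np.maximum(cdf0, 1e-300)))
    return -np.sum(ll)

# Toy demonstration with synthetic data and the identity transformation.
rng = np.random.default_rng(3)
x = rng.gamma(2.0, 2.0, 300)
y = np.maximum(x + rng.logistic(0.0, 1.0, 300), 0.0)
X = np.column_stack([np.ones(300), x])
Z = np.ones((300, 1))
nll = hclr_negloglik(np.array([0.0, 1.0, 0.0]), X, Z, y, g=lambda v: v)
```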

### e. Comparison

Table 1 summarizes the major differences between the four logistic regression models presented above. Extended logistic regression (XLR) and censored logistic regression (CLR) (and their heteroscedastic versions HXLR and HCLR, respectively) are essentially the same model and differ only in their parameter estimation. They have the fewest parameters of the compared models but imply continuous distribution assumptions. Ordered logistic regression (OLR) and its heteroscedastic version (HOLR) avoid this continuous distribution assumption but require estimation of more coefficients than (H)XLR and (H)CLR. With its unconstrained slope estimates, separate logistic regression (SLR) is more flexible than OLR but requires estimation of even more coefficients. Figure 1 shows schematic parallel regression lines for XLR, CLR, or OLR. In contrast to these models, regression curves from SLR are not constrained to be parallel and so can cross, which would lead to nonsensical negative probabilities.

Overview of the different logistic regression models with respect to their parameterization and the likelihood. Here *K* is the number of predictor variables (*x*_{1}, *x*_{2}, …, *z*_{1}, *z*_{2}, …) and *J* is the number of thresholds *q*_{j}.

## 3. Data

To compare the presented ensemble postprocessing methods, we used 10-m wind speed observations (10-min average) and 24-h accumulated precipitation amount from the following 10 European weather stations: Wien—Hohe-Warte, Austria (48.249°N, 16.356°E); Paris—Orly, France (48.717°N, 2.383°E); Amsterdam—Schiphol, Netherlands (52.3°N, 4.783°E); Berlin—Tegel, Germany (52.55°N, 13.3°E); Brussels—National, Belgium (50.9°N, 4.533°E); Frankfurt—Main, Germany (50.033°N, 8.583°E); London—Heathrow, United Kingdom (51.467°N, 0.45°W); Lisbon—Geof, Portugal (38.767°N, 9.133°W); Madrid—Barajas, Spain (40.467°N, 3.55°W); and Rome—Fiumicino, Italy (41.8°N, 12.233°E). As input for the statistical models, 10-m wind speed and total precipitation ensemble forecasts from the ECMWF were linearly interpolated from neighboring grid points to the station locations. The data were available from April 2010 to December 2012 (approximately 1000 days) and separate models were fitted for the lead times 24, 48, and 96 h, respectively.

Since the predictands were square root transformed for most regression models (see section 4) we mainly used the mean and standard deviation of *square root transformed* ensemble forecasts as predictor variables. For HTGR the untransformed predictand is used, following Thorarinsdottir and Gneiting (2010). Consequently we employed the mean and standard deviation of the *untransformed* ensemble forecasts as input for this model.

As thresholds *q*_{j} we used the *J* = 9 climatological deciles, estimated for each location and predictand variable separately. Note that for precipitation several deciles are 0 and are merged into one threshold, so that the effective number of thresholds is smaller (e.g., *J* = 4 in Wien—Hohe-Warte and *J* = 5 in Paris—Orly for precipitation).
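The threshold construction can be sketched as follows (toy data; the zero-inflation rate of the synthetic precipitation sample is an arbitrary assumption):

```python
import numpy as np

# Toy precipitation sample: many exact zeros, as at most stations.
rng = np.random.default_rng(4)
precip = np.where(rng.random(1000) < 0.65, 0.0, rng.gamma(0.8, 5.0, 1000))

# Nine climatological deciles as candidate thresholds q_j.
deciles = np.quantile(precip, np.arange(0.1, 1.0, 0.1))

# Merge equal deciles (several are 0 for precipitation), reducing the
# effective number of thresholds J.
thresholds = np.unique(deciles)
```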

We found the ensemble standard deviation to improve the forecasts of all statistical models, indicating useful spread–skill relationships. Therefore, we only show results for the heteroscedastic models in the following. For separate logistic regressions the product of ensemble mean and spread is included as an additional predictor variable (Wilks and Hamill 2007). Table 2 lists in detail the different models that are compared in the following.

List of different statistical models. Here *g*(*y*) is the transformation, **x** are vectors of predictor variables for the location (mean), and **z** are predictor variables for the scale (variance). The *M* and *S* are the mean and standard deviation of square root transformed ensemble forecasts, respectively; and *M*_{r} and *S*_{r} are the mean and standard deviation of the untransformed ensemble forecasts, respectively. For wind speed forecasts *M*, *S*, *M*_{r}, and *S*_{r} are derived from 10-m wind speed ensemble forecasts and for precipitation forecasts *M* and *S* are derived from total precipitation ensemble forecasts.

## 4. Results

Before comparing the performance of the different ensemble postprocessing methods, we show how ordered logistic regression can be used to determine appropriate transformations *g*( ) for extended logistic regression. The crosses and plus signs in Fig. 2 show the fitted intercepts from ordered logistic regression (HOLR) for two predictands and two selected locations. For both locations and variables these plots suggest that the intercepts can be parameterized as being proportional to the square roots of the thresholds. Thus, we fitted HXLR models with the square root transformation *g*(*y*) = √*y*; the resulting intercept functions are also shown in Fig. 2.

Intercepts *θ*_{j} from heteroscedastic ordered (HOLR) and extended logistic regression (HXLR) relative to threshold values, for the locations Wien—Hohe-Warte and Paris—Orly, lead time 48 h, and the predictands (left) wind speed and (right) 24-h accumulated precipitation amount. For better comparability intercepts are normalized with *β*_{1}, respectively. The square root is used as transformation for HXLR

Citation: Monthly Weather Review 142, 8; 10.1175/MWR-D-13-00355.1


The different models are compared with the (discrete) ranked probability score (RPS):

RPS = ∑_{j=1}^{*J*} [*P*(*y* ≤ *q*_{j} | **x**) − *I*(*y* ≤ *q*_{j})]^{2},   (12)

where *J* is the number of thresholds and *I*(·) is the indicator function. For each model, forecast location, and lead time we applied 10-fold cross validation to obtain independent training and test datasets. The data are divided into 10 equally sized blocks, and in each block the RPS is computed for the models trained on the 9 remaining blocks, respectively. Consequently, the effective training data length is approximately 900 days.
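The score and the cross-validation splitting might look as follows (an illustrative sketch; the array layouts are our assumptions):

```python
import numpy as np

def rps(cum_forecast, y, thresholds):
    """Ranked probability score: sum of squared differences between
    predicted and observed cumulative probabilities over all thresholds.
    cum_forecast has one row per case and one column per threshold."""
    obs_cum = (y[:, None] <= thresholds[None, :]).astype(float)  # I(y <= q_j)
    return np.sum((cum_forecast - obs_cum) ** 2, axis=1)

def cv_folds(n, k=10):
    """10-fold cross-validation indices: each block is predicted by a
    model trained on the k-1 remaining blocks."""
    blocks = np.array_split(np.arange(n), k)
    return [(np.concatenate(blocks[:i] + blocks[i + 1:]), blocks[i])
            for i in range(k)]

# One forecast with P(y<=1)=0.2 and P(y<=2)=0.7, verified against y=1.5.
thresholds = np.array([1.0, 2.0])
cum = np.array([[0.2, 0.7]])
score = rps(cum, np.array([1.5]), thresholds)[0]  # 0.2**2 + 0.3**2
```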

Figure 3 shows the RPSS relative to HXLR for different models, lead times, locations, and predictand variables. HOLR performs equally well or slightly better than HXLR for all locations, lead times, and predictand variables. For precipitation in Paris, forecasts of HXLR and HOLR are nearly identical, which is consistent with Fig. 2, where the HXLR intercept function almost perfectly interpolates the HOLR intercepts. SLR generally performs worse than HXLR; exceptions are wind speed forecasts in Wien for 24- and 96-h lead time and precipitation forecasts in Paris for 24-h lead time. However, note that the RPS [Eq. (12)] does not penalize the partly inconsistent forecasts from SLR. HCLR and HTGR also tend to perform worse than HXLR, especially for wind speed. While HTGR is slightly better than HCLR for Paris, there is no clear preference for either of these models in Wien or the aggregated locations. For Fig. 3 we used nine climatological deciles as thresholds *q*_{j} for estimation and verification. We also tested other numbers of climatological quantiles. However, apart from SLR and HOLR reaching slightly better skill for fewer quantiles, the results are very similar and therefore not shown.

Ranked probability skill score (RPSS) relative to heteroscedastic extended logistic regression (HXLR) of (left) wind speed and (right) 24-h accumulated precipitation amount for different models (see Table 2 for details) aggregated over 10 European locations (see text for details) and the selected locations Wien—Hohe-Warte and Paris—Orly. Nine climatological deciles that were computed separately for each forecast location are used as thresholds. Because for precipitation several thresholds are 0 the effective number of deciles is smaller (e.g., 4 in Wien and 5 in Paris). The effective training data length is approximately 900 days. Positive values indicate improvements over HXLR. The solid circles mark the median and the boxes the interquartile ranges of the 250 values from the bootstrapping approach, the whiskers show the most extreme values that are less than 1.5 times the length of the box away from the box, and empty circles are plotted for values that are outside the whiskers.

Citation: Monthly Weather Review 142, 8; 10.1175/MWR-D-13-00355.1


Because the different statistical models differ considerably in their number of estimated coefficients (SLR: 3*J*; HOLR: 2 + *J*; HXLR, HCLR, HTGR: 4), it is also interesting to compare their performance for different training data lengths. Figure 4 shows the RPSS of wind speed and precipitation forecasts for 48-h lead time at Wien and Paris, relative to the raw ensemble interval relative frequencies. As in Fig. 3 the RPS are computed with 10-fold cross validation, but for each test sample only a subset of the remaining data is used for training. Almost all models lose skill with a reduced training dataset. With the largest number of coefficients, SLR clearly loses the most. In contrast, HOLR generally exhibits skill reductions comparable to those of HXLR and HCLR in response to decreasing training data, although more coefficients have to be estimated. Interestingly, for wind speed in Paris the skill of HCLR seems not to depend on the training data length, so that HCLR is superior to the other models for short training datasets.

Ranked probability skill score (RPSS) relative to the raw ensemble (ensemble relative frequencies within each interval) for different training data lengths and models (see Table 2 for details) and lead time 48 h.

Citation: Monthly Weather Review 142, 8; 10.1175/MWR-D-13-00355.1


Continuous ranked probability skill score (CRPSS) relative to heteroscedastic extended logistic regression (HXLR) and their bootstrap sampling distributions for different predictands, models (see Table 2 for details), and lead time 48 h, aggregated over 10 European locations, and the selected locations Wien—Hohe-Warte and Paris—Orly, respectively. Positive values indicate improvements over HXLR.

Citation: Monthly Weather Review 142, 8; 10.1175/MWR-D-13-00355.1


For wind speed, Fig. 5 also shows the CRPSS for HTGR. As in Fig. 3, HCLR and HTGR show similar CRPSS for Wien, while HTGR is slightly preferred for Paris and the aggregated locations, which could indicate that the true error distribution is better approximated by a truncated normal than by a censored transformed logistic distribution.

Since HXLR fits the selected category probabilities, it is also interesting to see how the choice of the thresholds that define these categories affects the quality of the predictive distribution. Figure 6 shows the CRPSS of HCLR relative to HXLR for different numbers of climatological quantiles used to fit HXLR. Since HCLR can be interpreted as HXLR with infinitesimal category intervals, it is not surprising that the CRPS of HXLR and HCLR become more similar as the number of thresholds increases. Although the patterns look similar for both predictand variables, precipitation forecasts lose much more skill for few thresholds.

Continuous ranked probability skill score (CRPSS) of HCLR relative to HXLR for different numbers of climatological quantiles as thresholds, Wien—Hohe-Warte, lead time 48 h, and the predictand variables (left) wind speed and (right) precipitation. Note that the scales in the two panels are different. If two or more quantiles are equal they are merged to one. The shaded areas show the 90% confidence intervals from bootstrapping.

Citation: Monthly Weather Review 142, 8; 10.1175/MWR-D-13-00355.1


Finally, Figs. 7 and 8 show reliability diagrams (e.g., Wilks 2011) for the lower and upper climatological deciles, respectively, for 48-h lead time at Wien. With few exceptions the observed conditional relative frequencies of both predictand variables lie within the 95% consistency intervals (Bröcker and Smith 2007), with only minor differences between the different statistical models. The refinement distributions in Figs. 7 and 8 show the frequencies of the predicted probabilities. As for the calibration functions, the different models show only minor differences. Only for zero precipitation do SLR and HOLR have slightly sharper forecasts than HXLR and HCLR (forecasts more frequently close to 0 and 1).
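The binning behind such reliability diagrams can be sketched as follows (illustrative only; the synthetic, perfectly calibrated forecasts are our assumption):

```python
import numpy as np

def reliability_points(p_forecast, event, n_bins=10, min_count=10):
    """Bin forecast probabilities into 0.1-wide intervals and return, per
    bin, the mean forecast probability and the observed relative
    frequency (bins with fewer than min_count forecasts are skipped)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    which = np.clip(np.digitize(p_forecast, edges) - 1, 0, n_bins - 1)
    pts = []
    for b in range(n_bins):
        sel = which == b
        if np.sum(sel) >= min_count:
            pts.append((np.mean(p_forecast[sel]), np.mean(event[sel])))
    return pts

# Synthetic, perfectly calibrated forecasts: events occur with exactly
# the forecast probability, so the calibration function lies near the
# diagonal.
rng = np.random.default_rng(5)
p = rng.random(5000)
obs = (rng.random(5000) < p).astype(float)
pts = reliability_points(p, obs)
```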

Reliability diagrams for predicted probabilities to fall below the first climatological decile *P*(*y* ≤ *q*_{1} | **x**) for Wien—Hohe-Warte, lead time 48 h, and different models. Forecasts are aggregated in 0.1 probability intervals. Calibration functions for *wind speed* are plotted as red “×” and for *precipitation amount* as green “+” and are only shown for intervals with more than 10 forecasts. Refinement distributions for wind speed are plotted in the bottom-right corner in red and for precipitation in the top-left corner in green. 95% consistency intervals derived from consistency resampling (Bröcker and Smith 2007) are shown as red and green shaded areas, respectively. Note that because of the frequent zero observations *q*_{1} = *q*_{2} = ⋯ = *q*_{6} for precipitation so that *P*(*y* ≤ *q*_{1} | **x**) = *P*(*y* ≤ *q*_{6} | **x**) = *P*(*y* = 0 | **x**).

Citation: Monthly Weather Review 142, 8; 10.1175/MWR-D-13-00355.1


As in Fig. 7, but for predicted probabilities to fall below the upper climatological decile *P*(*y* ≤ *q*_{9} | **x**).

Citation: Monthly Weather Review 142, 8; 10.1175/MWR-D-13-00355.1


## 5. Summary and conclusions

Extended logistic regression fits predictand category probabilities by assuming a conditional logistic distribution for the transformed predictand (Scheuerer 2014; Schefzik et al. 2013; Messner et al. 2014). However, for some applications the transformed predictand cannot be assumed to follow a logistic distribution. Moreover, fitting selected category probabilities implies disregarding available information when the predictand is actually given in continuous form.

In this study we compared extended logistic regression with two closely related regression models from statistics and econometrics. Ordered logistic regression is very similar to extended logistic regression but avoids a continuous distribution assumption. On the other hand, censored logistic regression fits the same model as extended logistic regression but uses each individual predictand value in the training dataset instead of the selected category probabilities. As further benchmark models we also employed separate logistic regressions and a truncated Gaussian regression model (Thorarinsdottir and Gneiting 2010). The performance of the different statistical models was tested with wind speed and precipitation data from 10 European locations and ensemble forecasts from the ECMWF. Overall, the logistic distribution assumption seemed to be quite appropriate for the square root–transformed predictands for both predictand variables. Thus, the performance differences between ordered and extended logistic regression were only minor. However, because no continuous distribution has to be assumed, ordered logistic regression should generally be preferred if solely threshold probabilities are required.

Since extended logistic regression fits selected category probabilities, it is actually not surprising that RPS skills are higher for this model than for censored logistic regression, which fits the full continuous predictive distribution. For the same reason it is unsurprising that censored logistic regression performed better than extended logistic regression according to CRPS skill, which evaluates accuracy of the full predictive distributions.
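One practical attraction of the logistic assumption is that the CRPS of an (uncensored) logistic predictive distribution has a closed form. The following sketch, an illustration with invented numbers rather than the paper's verification code, implements that closed form and a direct numerical integration of the CRPS definition for comparison.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import logistic

def crps_logistic(y, mu, sigma):
    """Closed-form CRPS for a logistic predictive distribution L(mu, sigma)."""
    z = (y - mu) / sigma
    return sigma * (z - 2.0 * logistic.logcdf(z) - 1.0)

def crps_numeric(y, mu, sigma):
    """CRPS by integrating (F(t) - 1{t >= y})^2 over t."""
    F = lambda t: logistic.cdf(t, loc=mu, scale=sigma)
    lo, hi = mu - 50.0 * sigma, mu + 50.0 * sigma
    left, _ = quad(lambda t: F(t) ** 2, lo, y)
    right, _ = quad(lambda t: (F(t) - 1.0) ** 2, y, hi)
    return left + right
```

For example, for a standard logistic forecast and an observation at the median, both routes give 2 ln 2 − 1 ≈ 0.386.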

Extended and censored logistic regression assume censored conditional logistic distributions for the transformed predictand. In contrast, wind speed was assumed to follow a truncated normal distribution in Thorarinsdottir and Gneiting (2010). A comparison between censored and truncated regression models showed that the assumption of a truncated normal distribution resulted in slightly better wind speed forecasts than the assumption of a censored transformed logistic distribution.
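The distinction between censoring and truncation is worth making explicit: a censored distribution collects the probability of all negative latent values as a point mass at zero, whereas a truncated distribution discards that mass and renormalizes the density on the positive half line. A minimal sketch of the two log-densities for a normal latent variable (illustrative only; the parameter values are invented):

```python
import numpy as np
from scipy.stats import norm

def censored_normal_logpdf(y, mu, sigma):
    """Left-censored at 0: point mass at zero absorbs P(latent <= 0)."""
    if y <= 0.0:
        return norm.logcdf(0.0, loc=mu, scale=sigma)  # log of point mass
    return norm.logpdf(y, loc=mu, scale=sigma)

def truncated_normal_logpdf(y, mu, sigma):
    """Truncated at 0: density renormalized by P(latent > 0), no mass below."""
    if y < 0.0:
        return -np.inf
    return norm.logpdf(y, loc=mu, scale=sigma) - norm.logsf(0.0, loc=mu, scale=sigma)
```

Both constructions define valid probability distributions on the nonnegative half line; which one fits better is an empirical question, and for the wind speed data considered here the truncated normal had a slight edge.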

Our results show that the optimal statistical model strongly depends on the intended application. Ordered logistic regression was best suited for category probability predictions for the forecasts considered here, given sufficiently long training series. When the transformed predictand can be assumed to follow a conditional logistic distribution then extended logistic regression provides equally good category probability forecasts while requiring fewer coefficients and additionally specifying full predictive distributions. However, if the primary interest is in predicting full continuous probability distributions, censored or truncated regression models should be preferred because they use the information contained in the training data more fully.

## Acknowledgments

We thank three anonymous reviewers for their valuable comments, which helped to improve this manuscript. This study was supported by the Austrian Science Fund (FWF): L615-N10. The first author was also supported by a Ph.D. scholarship from the University of Innsbruck, Vizerektorat für Forschung. Data from the ECMWF forecasting system were obtained from the ECMWF Data Server.

## APPENDIX

### Computational Details

Our results were obtained on Ubuntu Linux using the statistical software R 2.15.2 (R Core Team 2013). Heteroscedastic extended logistic regression and heteroscedastic censored logistic regression were fitted using the package crch 0.1-0 (Messner and Zeileis 2013). For ordered logistic regression models we used the package ordinal 2012.09-11 (Christensen 2013).

## REFERENCES

Agresti, A., 2002: *Categorical Data Analysis.* 2nd ed. John Wiley & Sons, 734 pp.

Ben Bouallègue, Z., 2013: Calibrated short-range ensemble precipitation forecasts using extended logistic regression with interaction terms. *Wea. Forecasting*, **28**, 515–524, doi:10.1175/WAF-D-12-00062.1.

Bröcker, J., and L. A. Smith, 2007: Increasing the reliability of reliability diagrams. *Wea. Forecasting*, **22**, 651–661, doi:10.1175/WAF993.1.

Christensen, R. H. B., 2013: ordinal: Regression Models for Ordinal Data, version 2013.09-30. R package. [Available online at http://CRAN.R-project.org/package=ordinal.]

Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. *J. Appl. Meteor.*, **8**, 985–987, doi:10.1175/1520-0450(1969)008<0985:ASSFPF>2.0.CO;2.

Gneiting, T., A. E. Raftery, A. H. Westveld, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. *Mon. Wea. Rev.*, **133**, 1098–1118, doi:10.1175/MWR2904.1.

Hamill, T. M., 2012: Verification of TIGGE multimodel and ECMWF reforecast-calibrated probabilistic precipitation forecasts over the contiguous United States. *Mon. Wea. Rev.*, **140**, 2232–2252, doi:10.1175/MWR-D-11-00220.1.

Hamill, T. M., C. Snyder, and J. S. Whitaker, 2003: Ensemble forecasts and the properties of flow-dependent analysis-error covariance singular vectors. *Mon. Wea. Rev.*, **131**, 1741–1758, doi:10.1175//2559.1.

Hamill, T. M., J. S. Whitaker, and X. Wei, 2004: Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. *Mon. Wea. Rev.*, **132**, 1434–1447, doi:10.1175/1520-0493(2004)132<1434:ERIMFS>2.0.CO;2.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. *Wea. Forecasting*, **15**, 559–570, doi:10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.

Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. *Manage. Sci.*, **22**, 1087–1096, doi:10.1287/mnsc.22.10.1087.

Messner, J. W., and A. Zeileis, 2013: crch: Censored Regression with Conditional Heteroscedasticity, version 0.1-0. R package. [Available online at http://CRAN.R-project.org/package=crch.]

Messner, J. W., A. Zeileis, G. J. Mayr, and D. S. Wilks, 2014: Heteroscedastic extended logistic regression for postprocessing of ensemble guidance. *Mon. Wea. Rev.*, **142**, 448–456, doi:10.1175/MWR-D-13-00271.1.

Nelder, J. A., and R. W. M. Wedderburn, 1972: Generalized linear models. *J. Roy. Stat. Soc.*, **135A**, 370–384, doi:10.2307/2344614.

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. *Mon. Wea. Rev.*, **133**, 1155–1174, doi:10.1175/MWR2906.1.

R Core Team, 2013: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [Available online at http://www.R-project.org/.]

Roulin, E., and S. Vannitsem, 2012: Postprocessing of ensemble precipitation predictions with extended logistic regression based on hindcasts. *Mon. Wea. Rev.*, **140**, 874–888, doi:10.1175/MWR-D-11-00062.1.

Roulston, M. S., and L. A. Smith, 2003: Combining dynamical and statistical ensembles. *Tellus*, **55A**, 16–30, doi:10.1034/j.1600-0870.2003.201378.x.

Ruiz, J. J., and C. Saulo, 2012: How sensitive are probabilistic precipitation forecasts to the choice of calibration algorithms and the ensemble generation method? Part I: Sensitivity to calibration methods. *Meteor. Appl.*, **19**, 302–313, doi:10.1002/met.286.

Schefzik, R., T. L. Thorarinsdottir, and T. Gneiting, 2013: Uncertainty quantification in complex simulation models using ensemble copula coupling. *Stat. Sci.*, **28**, 616–640, doi:10.1214/13-STS443.

Scheuerer, M., 2014: Probabilistic quantitative precipitation forecasting using ensemble model output statistics. *Quart. J. Roy. Meteor. Soc.*, **140**, 1086–1096, doi:10.1002/qj.2183.

Schmeits, M. J., and K. J. Kok, 2010: A comparison between raw ensemble output, (modified) Bayesian model averaging, and extended logistic regression using ECMWF ensemble precipitation reforecasts. *Mon. Wea. Rev.*, **138**, 4199–4211, doi:10.1175/2010MWR3285.1.

Thorarinsdottir, T. L., and T. Gneiting, 2010: Probabilistic forecasts of wind speed: Ensemble model output statistics by using heteroscedastic censored regression. *J. Roy. Stat. Soc.*, **173A**, 371–388, doi:10.1111/j.1467-985X.2009.00616.x.

Tobin, J., 1958: Estimation of relationships for limited dependent variables. *Econometrica*, **26**, 24–36, doi:10.2307/1907382.

Wang, X., and C. H. Bishop, 2003: A comparison of breeding and ensemble transform Kalman filter ensemble forecast schemes. *J. Atmos. Sci.*, **60**, 1140–1158, doi:10.1175/1520-0469(2003)060<1140:ACOBAE>2.0.CO;2.

Wilks, D. S., 2006: Comparison of ensemble-MOS methods in the Lorenz '96 setting. *Meteor. Appl.*, **13**, 243–256, doi:10.1017/S1350482706002192.

Wilks, D. S., 2009: Extending logistic regression to provide full-probability-distribution MOS forecasts. *Meteor. Appl.*, **16**, 361–368, doi:10.1002/met.134.

Wilks, D. S., 2011: *Statistical Methods in the Atmospheric Sciences.* 3rd ed. Academic Press, 676 pp.

Wilks, D. S., and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. *Mon. Wea. Rev.*, **135**, 2379–2390, doi:10.1175/MWR3402.1.