This paper presents an approach to postprocess ensemble forecasts for the discrete and bounded weather variable of total cloud cover. Two methods for discrete statistical postprocessing of ensemble predictions are tested: the first approach is based on multinomial logistic regression and the second involves a proportional odds logistic regression model. Applying them to total cloud cover raw ensemble forecasts from the European Centre for Medium-Range Weather Forecasts improves forecast skill significantly. Based on stationwise postprocessing of raw ensemble total cloud cover forecasts for a global set of 3330 stations over the period from 2007 to early 2014, the more parsimonious proportional odds logistic regression model proved to slightly outperform the multinomial logistic regression model.
Forecasts of total cloud cover (TCC) are an important part of numerical weather prediction (NWP) both in terms of model feedbacks and with respect to forecast users in areas such as energy demand and production, agriculture, and tourism. In NWP models cloud cover affects the evolution of the model state through feedback loops on radiative fluxes and heating rates (Köhler 2005; Haiden and Trentmann 2016). Predictions of energy demand and production rely in part on TCC forecasts. Photovoltaic energy forecasting in particular relies on accurate predictions of solar irradiance, which is on a day-to-day basis mainly determined by variations in TCC (Taylor and Buizza 2003; Pelland et al. 2013). Observational astronomy depends on reliable TCC forecasts (Ye and Chen 2013). Other applications of TCC forecasts can be found in agriculture, where they may facilitate irrigation scheduling (Diak et al. 1998), in avalanche forecasting, where the amount of radiational cooling influences the stability of snowpacks (McClung 2002), and in leisure activities where cloudiness influences, for example, the amount of sun protection required (Dixon et al. 2008).
(Total) cloud cover is defined as the “portion of the sky cover that is attributed to clouds…” (American Meteorological Society 2015). Obviously, TCC takes values in $[0, 1]$, and unlike other weather variables, such as temperature or precipitation, TCC is reported and forecast on a discrete space with only a small number of possible values. Usually, observers report TCC as values in $\{0/8, 1/8, \ldots, 8/8\}$, henceforth called octas. At the European Centre for Medium-Range Weather Forecasts (ECMWF) probabilistic TCC forecasts are provided as direct output from the NWP ensemble. The skill of NWP TCC forecasts in the short and medium range is low compared to the forecasts for other meteorological variables like 6-h accumulated precipitation, geopotential, 2-m temperature, or 10-m wind speed (Köhler 2005). In 2004 the high-resolution (HRES) ECMWF TCC forecasts showed skill compared to persistence only up to forecast day 3 over Europe. Furthermore, Haiden and Trentmann (2016) showed that the skill of 24-h HRES TCC forecasts verified against a set of European stations has improved little over the last decade.
The limited skill of direct model output TCC point forecasts is partly due to a representativeness mismatch between models and observations. Areas covered by visual observations typically vary in scale from 10 to 100 km, depending on visibility and topography (Mittermaier 2012). Automated observations as derived from ceilometers measure cloud cover directly overhead. Depending on the wind speed in the cloud layer the scanned area may or may not be representative of the model grid scale. Temporal variability of cloudiness on hourly and subhourly scales presents an additional challenge for predicting instantaneous TCC. As shown by Haiden et al. (2015), the forecast range over which there is positive skill relative to persistence increases from 2–3 days to 5 days if daytime averages rather than instantaneous values of TCC are considered.
The potential benefits of skillful TCC forecasts, together with the relatively low performance of state-of-the-art NWP TCC point forecasts (i.e., forecasts interpolated from the NWP model grid to specific sites), motivate the development of statistical methods to postprocess raw ensemble TCC forecasts. In this study, we focus on global point forecasts of TCC from the ECMWF ensemble forecast system. To take account of the discrete nature of TCC, two discrete statistical postprocessing methods are proposed: a method based on multinomial (or polytomous) logistic regression (MLR; see, e.g., Agresti and Kateri 2011) and a method based on proportional odds logistic regression (POLR; Walker and Duncan 1967; McCullagh 1980; Ananth and Kleinbaum 1997; Messner et al. 2014). In the field of meteorological forecasting, several (postprocessing) approaches based on logistic regression have been proposed over the last 15 years. Applequist et al. (2002) applied logistic regression to produce forecasts of precipitation threshold exceedance probabilities. Hamill et al. (2004) used logistic regression to obtain probabilistic forecasts of temperature and precipitation from ensemble model output statistics. Wilks (2009) proposed extended logistic regression (ELR) as a further development of the approach by Hamill et al. (2004) that provides full predictive distributions from ensemble model output statistics. ELR has been used to postprocess NWP ensemble precipitation (and, much less frequently, wind speed) forecasts in many studies. Schmeits and Kok (2010) compared raw ensemble forecasts from a 20-yr ECMWF precipitation reforecast dataset with Bayesian model averaging (BMA; Raftery et al. 2005) and ELR. While ELR outperformed the raw ensemble only slightly in the case of area-mean precipitation amounts, area-maximum forecast skill was significantly improved by ELR. Furthermore, ELR performed considerably better than BMA and as well as a modified BMA approach by Schmeits and Kok (2010). A similar study by Roulin and Vannitsem (2012) showed that applying ELR led to substantial improvements in skill and mean error of ECMWF precipitation ensemble forecasts for two catchments in Belgium. Likewise, Ben Bouallègue (2013) confirmed the good performance of ELR. However, there are also studies that reveal the limitations of ELR. In a case study comparing eight different postprocessing methods for (ensemble) precipitation forecasts over South America, ELR ranked in the upper midrange among the methods considered (Ruiz and Saulo 2012). Hamill (2012) showed that ELR improved the skill of ECMWF precipitation ensemble forecasts considerably over the United States, but that the multimodel ensemble consisting of the ensemble forecasts from the ECMWF, the Met Office, the National Centers for Environmental Prediction, and the Canadian Meteorological Centre could not be improved much by ELR. Scheuerer (2014) was able to outperform ELR by applying an ensemble model output statistics approach (Gneiting et al. 2005) based on a generalized extreme value distribution. Messner et al. (2014) applied ELR, censored logistic regression, and POLR to ECMWF ensemble wind speed and precipitation forecasts. Their study revealed the good performance of POLR on discrete, categorical sample spaces. However, we are not aware of any study that postprocesses TCC ensemble forecasts based on a logistic regression approach.
As in Hemri et al. (2014), the TCC dataset (T. Haiden 2014, unpublished data) used in this study consists of stationwise daily time series, from January 2002 to March 2014, of forecast–observation pairs at 1200 UTC for lead times up to 10 days. As ECMWF forecasts are issued on the global domain, we have selected 3435 surface synoptic observation (SYNOP) stations that cover the entire globe (except for Australia, which does not report at 1200 UTC) as the observational dataset. Stations with unreliable observation time series are detected and removed according to the following scheme, which is a modification of the approach by Pinson and Hagedorn (2012):
Count the number of days with observed values that are equal to the observations from the previous 10 days. If this number exceeds 20% of the length of the time series, a station is considered to be unreliable.
Additionally, remove stations with recorded observations outside the range [0, 1].
After removing the unreliable stations, 3330 stations remain for the following analyses.
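A minimal sketch of this screening in R is given below. The function name, the reading of the first criterion as a run of identical values, and the input layout (one numeric vector of daily TCC fractions per station) are our assumptions for illustration, not the authors' code:

```r
## Hypothetical quality-control filter; interprets criterion 1 as: the value
## on a given day equals every observation from the previous 10 days.
is_unreliable <- function(obs, lag_max = 10, threshold = 0.20) {
  n <- length(obs)
  repeated <- vapply(seq_len(n), function(i) {
    if (i <= lag_max || is.na(obs[i])) return(FALSE)
    all(obs[i] == obs[(i - lag_max):(i - 1)], na.rm = TRUE)
  }, logical(1))
  ## Criterion 1: more than 20% repeated days; criterion 2: values outside [0, 1].
  mean(repeated) > threshold || any(obs < 0 | obs > 1, na.rm = TRUE)
}
```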
a. Training and verification periods
Prior to introducing the different forecast models, the training periods used for estimating the parameters of the statistical postprocessing models are presented here along with the corresponding verification periods. In line with Hemri et al. (2014), rather long training periods of up to 5 years are applied. Accordingly, the verification period extends from January 2007 to March 2014. The corresponding training periods are selected in a nonseasonal and in a seasonal way. In the nonseasonal approach, for any verification day x the corresponding training period covers the five calendar years prior to the year of day x. For instance, for an arbitrary verification day x in 2009, say 27 June 2009, the corresponding training period lasts from 1 January 2004 to 31 December 2008. The same training period would apply to any other verification day in 2009. In the seasonal approach, the blockwise training periods from the nonseasonal approach are additionally differentiated according to the season of the verification day. For this study, we divide the year into two seasons (April–September and October–March).
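As an illustration, the construction of the two kinds of training periods could look as follows; the function and variable names are ours, and the sketch assumes a vector of class Date for the station record:

```r
## Return the indices of the training days for a given verification day.
training_index <- function(dates, verif_day, seasonal = FALSE) {
  yr <- as.integer(format(verif_day, "%Y"))
  ## Nonseasonal: the five calendar years preceding the verification year,
  ## e.g., 2004-2008 for any verification day in 2009.
  idx <- dates >= as.Date(sprintf("%d-01-01", yr - 5)) &
         dates <= as.Date(sprintf("%d-12-31", yr - 1))
  if (seasonal) {
    ## Seasonal: additionally restrict the training days to the verification
    ## day's half-year (April-September vs October-March).
    in_summer <- function(d) as.integer(format(d, "%m")) %in% 4:9
    idx <- idx & (in_summer(dates) == in_summer(verif_day))
  }
  which(idx)
}
```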
b. Climatological and uniform forecasts
Climatological and uniform forecasts are used as references. The climatological forecasts are constructed stationwise in the same way as the seasonal training periods. That is, for each verification day the climatological forecast corresponds to the empirical distribution of all TCC observations in the same season (winter half-year or summer half-year) within the five calendar years prior to the verification day. The uniform forecasts simply assign a probability of 1/9 to each TCC level in $\{0, 1, \ldots, 8\}$ octas, irrespective of station climatology and NWP model output.
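In R, the two reference forecasts amount to a few lines; `train_obs` is assumed to hold the seasonal training observations of a station in octas (0–8):

```r
## Climatological forecast: empirical distribution of the training observations.
climatological_forecast <- function(train_obs) {
  prop.table(table(factor(train_obs, levels = 0:8)))
}

## Uniform forecast: probability 1/9 for each of the nine TCC states.
uniform_forecast <- setNames(rep(1 / 9, 9), 0:8)
```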
c. Raw ensemble forecasts
The ECMWF TCC forecasts used in this study are issued daily at 1200 UTC from 1 January 2002 to 20 March 2014 and cover lead times from 1 to 10 days. In the following, we will focus mostly on the lead times of 3, 6, and 10 days, which reflect sequentially decreasing predictability and are representative of the other lead times. Besides the HRES run, the ECMWF ensemble forecasting system consists of the 50-member ensemble (ENS) and the control (CTRL) runs. From now on, this 52-member ensemble is called the raw ensemble. Details on the ECMWF forecasting system can be found in Molteni et al. (1996) and Buizza et al. (2007).
d. MLR and POLR
As stated in section 1, statistical postprocessing methods for TCC should take account of the discrete nature of the reported TCC data. Hence, postprocessing methods built around a logistic regression core are natural candidates. Here, we introduce MLR and POLR, which are two different, but closely related, models.
MLR is a direct generalization of binary logistic regression. In the case of TCC, the sample space is restricted to the discrete observations of cloudiness, which take values in $\{0, 1, \ldots, 8\}$ octas. In the following, the different TCC states are denoted by $y_j$, $j = 1, \ldots, J$, with $J = 9$; for instance, $y_1 = 0$ octas refers to a clear sky. Hence, the MLR model has to assign probabilities to the $J$ different states of cloudiness based on raw ensemble statistics as predictors. Here, the raw ensemble comprises the 50 ENS runs as well as the HRES and CTRL runs (i.e., $K = 52$). Accordingly, the first three predictors in the MLR model are the mean of the ENS runs $\bar{x}_{\mathrm{ENS}}$, the HRES run $x_{\mathrm{HRES}}$, and the CTRL run $x_{\mathrm{CTRL}}$. Following Wilks and Hamill (2007) and Hamill et al. (2008), we link the ensemble spread to the MLR model using the ensemble variance $s^2$ as an additional predictor. The ensemble variance is given by

$$s^2 = \frac{1}{K - 1} \sum_{k=1}^{K} (x_k - \bar{x})^2, \qquad (1)$$

where $\bar{x} = \frac{1}{K}\sum_{k=1}^{K} x_k$ denotes the ensemble mean. As the HRES run is technically not part of the ensemble, one could argue that the calculation of $s^2$ should be based only on the 50 ENS members and the CTRL run. However, tests have shown that forecast skill does not change when excluding HRES. Inspired by Scheuerer (2014), we have also tested the ensemble mean difference as a more robust alternative to the ensemble variance, which did not improve forecast skill. Again inspired by Scheuerer (2014), we introduce the predictors $p_0$ and $p_1$, which denote the ratio of ensemble members equal to zero or one, respectively. For instance,

$$p_0 = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\{x_k = 0\}, \qquad (2)$$
where $\mathbb{1}\{\cdot\}$ denotes the indicator function. Selecting now a TCC state $y_{j^*}$ as a pivot and with the vector of predictors $\mathbf{x} = (\bar{x}_{\mathrm{ENS}}, x_{\mathrm{HRES}}, x_{\mathrm{CTRL}}, s^2, p_0, p_1)^{\mathrm{T}}$, the MLR model based on a random variable $Z$ can be written as

$$\log\frac{P(Z = y_j)}{P(Z = y_{j^*})} = \alpha_j + \boldsymbol{\beta}_j^{\mathrm{T}} \mathbf{x}, \qquad j \neq j^*, \qquad (3)$$

where $\boldsymbol{\beta}_j$ is the vector of coefficients for state $y_j$. Though any state could be used as pivot state $y_{j^*}$, we set $j^* = J$. Then, the model in (3) has to be fitted J − 1 times such that the probabilities sum up to 1. Using a suitable training period, this model can be easily estimated using the function multinom of the R package nnet (Ripley and Venables 2014). In this study, two different MLR models are tested: a nonseasonal approach with blockwise training periods (MLR-B) and a seasonal approach with seasonal blockwise training periods (MLR-S).
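The following sketch shows how the MLR model could be fitted with multinom; the data layout (`ens` an n × 50 matrix of ENS members; `hres`, `ctrl`, and `obs` length-n vectors, with `obs` in octas) and the variable names are our assumptions:

```r
library(nnet)

## Build the predictor set described above from the 52-member raw ensemble.
all_runs <- cbind(ens, hres, ctrl)             # all K = 52 runs
ens_mean <- rowMeans(ens)                      # mean of the 50 ENS runs
ens_var  <- apply(all_runs, 1, var)            # ensemble variance, Eq. (1)
p0       <- rowMeans(all_runs == 0)            # fraction of members at 0
p1       <- rowMeans(all_runs == 1)            # fraction of members at 1
train <- data.frame(obs = factor(obs, levels = 0:8),
                    ens_mean, hres, ctrl, ens_var, p0, p1)

## multinom() fits the J - 1 sub-models of Eq. (3) jointly, so the predicted
## probabilities sum to one by construction.
fit_mlr <- multinom(obs ~ ens_mean + hres + ctrl + ens_var + p0 + p1,
                    data = train, trace = FALSE)
probs_mlr <- predict(fit_mlr, newdata = train, type = "probs")  # n x 9 matrix
```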
POLR (Walker and Duncan 1967; McCullagh 1980; Ananth and Kleinbaum 1997) is an alternative to MLR. POLR is well suited for ordinal data like TCC. Since it assumes proportional odds, it requires fewer free parameters. This allows us to add an additional interaction term $I$ to the set of predictors used in the MLR model. This term represents the interaction between the ensemble variance $s^2$ and the deviation of the ensemble mean $\bar{x}$ from 0.5. The rationale behind this is to map $s^2$ to the variance of the postprocessed ensemble in a more natural way than in the MLR model. More specifically, the interaction term is defined as $I = s^2\,\delta$, where $\delta = |\bar{x} - 0.5|$. This formulation is expected to shift extreme TCC forecasts toward the center if $s^2$ is large and at the same time $\bar{x}$ is close to zero or one. Let $Q(y_j) = P(Z \le y_j)$ be the cumulative predictive probability for TCC states up to $y_j$. Then, the POLR model can be written as

$$\log\frac{P(Z \le y_j)}{1 - P(Z \le y_j)} = \theta_j - \boldsymbol{\beta}^{\mathrm{T}} \mathbf{x}, \qquad j = 1, \ldots, J - 1, \qquad (4)$$
where the coefficient $\theta_j$ takes a different value for each state and $\theta_1 < \theta_2 < \cdots < \theta_{J-1}$ are strictly ordered. The coefficients $\boldsymbol{\beta}$ do not change with state. Additionally, $\beta_{\mathrm{ENS}}$, $\beta_{\mathrm{HRES}}$, and $\beta_{\mathrm{CTRL}}$ are constrained to be nonnegative. This is ensured by estimating the model iteratively. In each iteration step negative estimates for $\beta_{\mathrm{ENS}}$, $\beta_{\mathrm{HRES}}$, or $\beta_{\mathrm{CTRL}}$ are set to zero and the model is reestimated without the corresponding predictor. This iterative procedure stops as soon as $\beta_{\mathrm{ENS}}, \beta_{\mathrm{HRES}}, \beta_{\mathrm{CTRL}} \ge 0$. As stated above, the assumption of proportional odds makes POLR much sparser than MLR. For the MLR model $(J - 1)(p + 1)$ coefficients have to be estimated, where $J = 9$ is the number of different states and p denotes the number of predictors not counting the intercept. In case of the POLR model we need only $(J - 1) + p$ coefficients. As for MLR, the nonseasonal model is denoted as POLR-B, and its seasonal counterpart as POLR-S h, where h indicates that it is the full model with all predictors [i.e., $\bar{x}_{\mathrm{ENS}}$, $x_{\mathrm{HRES}}$, $x_{\mathrm{CTRL}}$, $s^2$, $p_0$, $p_1$, and $I$]. To find the best set of predictors, various POLR-S models with different sets of predictors are tested. They are listed in Table 1. POLR is implemented in the function polr of the R package MASS (Venables and Ripley 2002).
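A sketch of the constrained POLR fit is given below; it reuses the `train` data frame from the MLR sketch and adds the interaction term (under our reading of its definition, $I = s^2\,|\bar{x} - 0.5|$). The loop implements the iterative nonnegativity scheme described above; polr parameterizes the model such that nonnegative fitted coefficients correspond to the constraint:

```r
library(MASS)

train$obs   <- ordered(train$obs)                      # polr expects a factor
train$inter <- train$ens_var * abs(train$ens_mean - 0.5)

preds       <- c("ens_mean", "hres", "ctrl", "ens_var", "p0", "p1", "inter")
constrained <- c("ens_mean", "hres", "ctrl")
repeat {
  fit_polr <- polr(reformulate(preds, response = "obs"),
                   data = train, Hess = TRUE)
  ## Drop constrained predictors with negative estimates and refit.
  cf  <- coef(fit_polr)
  bad <- intersect(constrained, names(cf)[cf < 0])
  if (length(bad) == 0) break
  preds <- setdiff(preds, bad)
}
probs_polr <- predict(fit_polr, newdata = train, type = "probs")
```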
To allow numerically trouble-free verification, any forecast distribution $P(Z = y_j)$, $j = 1, \ldots, J$, is slightly modified subsequent to model fitting. Namely, unrealistically low forecast probabilities for a cloud cover state $y_j$ are avoided by setting $P(Z = y_j)$ to $\max\{P(Z = y_j), \varepsilon\}$, where $\varepsilon$ depends on the length $T$ of the training period. The parameter α denotes the probability that state $y_j$ is observed at least once during a period of length $T$ [i.e., $\alpha = 1 - (1 - \varepsilon)^T$]. For this study, we deliberately fix α at a common value for all models, which, because the seasonal training periods are only half as long, leads to a larger $\varepsilon$ for the seasonal models than for the nonseasonal ones. In case of a forecast distribution with $P(Z = y_j) < \varepsilon$ for at least one state $y_j$, the probabilities have to be adjusted slightly such that $\sum_{j=1}^{J} P(Z = y_j) = 1$. We apply this correction to all considered predictive distributions including the raw ensemble and the climatological forecasts.
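A sketch of this floor-and-renormalize step, using the relation $\alpha = 1 - (1 - \varepsilon)^T$ stated above (which assumes independence across training days); the function name and the renormalization by simple rescaling are our choices:

```r
## Raise probabilities below eps to eps, then rescale so they sum to one.
floor_probs <- function(p, alpha, T_train) {
  eps <- 1 - (1 - alpha)^(1 / T_train)  # inverts alpha = 1 - (1 - eps)^T
  p <- pmax(p, eps)
  p / sum(p)
}
```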
e. Example forecasts
Before discussing the results in section 4, four subjectively selected example forecasts for Vienna, Austria, are presented in Fig. 1 to highlight typical properties of the postprocessing. Vienna was chosen as a location in Europe that is situated in the broad transition zones from maritime to continental climate in winter and from Mediterranean to temperate climate in summer. As a result, it experiences a rich and complex cloud climatology that is additionally modulated by orographic effects due to its proximity to the European Alps. For illustrative purposes, raw ensemble forecasts are compared with the corresponding seasonal POLR forecasts that use the complete set of predictors (POLR-S h). A detailed discussion of the different POLR-S models can be found in section 4. The raw ensemble and POLR-S h forecasts bear a strong resemblance. However, POLR-S h seems to move some weight from the extremes (0 or 8 octas) toward the more moderate levels of cloudiness (1–7 octas).
After having introduced the different forecast models, we first evaluate their forecast skill. This is followed by an in-depth assessment of calibration and sharpness for a selected set of models. For a fair comparison of verification scores, raw ensemble and postprocessed forecasts have to be mapped to the space of the observations. The function selected for this mapping influences most of the verification measures. Hence, it is important that the mapping function mimics the procedure of TCC observers, who report ⅛ as soon as a small cloud appears, even if the TCC is only 1%, and ⅞ as soon as there is a small gap somewhere in the cloud layer. This is ensured by applying a nonequidistant mapping function, the details of which can be found in section a of the appendix.
a. Forecast skill
Average skill of the different TCC forecast models is assessed using the log score and the continuous ranked probability score (CRPS) averaged over the entire verification period and all stations. As TCC is a discrete variable, the ranked probability score (RPS; Epstein 1969; Murphy 1969) could be used instead of the CRPS. But since the ordered categories of TCC are not equidistant in the dataset at hand (see above and section a of the appendix), RPS and CRPS would differ slightly. For this study, we have decided to use the CRPS because it allows direct skill comparison with continuous TCC forecasts, which may become available in the future (see also section 5). Both log score and CRPS are proper scoring rules that are negatively oriented (i.e., the lower the score, the higher the forecast skill). While the log score is a local scoring rule that takes only the forecast probability of the materializing observation into account, the CRPS is sensitive to distance in that forecasts with high probabilities attributed to values close to the materializing observation are considered skillful (Gneiting and Raftery 2007). Mathematical formulations of both scores are given in section b of the appendix.

According to Table 2, the raw ensemble outperforms climatological and uniform forecasts in terms of CRPS for lead times of 1, 3, and 6 days, but not for 10 days. In terms of log score, it exhibits very poor performance irrespective of the forecast lag. All MLR and POLR models outperform the climatological, uniform, and raw ensemble forecasts in terms of log score and CRPS at all lead times. In case of MLR, the seasonal model slightly outperforms its nonseasonal counterpart in terms of CRPS, while the log score tends to prefer the nonseasonal model. For POLR, log score and CRPS are more consistent in that both scores indicate slightly better skill of the seasonal model. This is also reflected in Fig. 2, which shows averaged log score and CRPS values with their associated 90% confidence intervals for the raw ensemble, MLR-B, MLR-S, POLR-B, and POLR-S h (i.e., the full model; cf. Table 1). The 90% confidence intervals are obtained by block bootstrapping (Künsch 1989) with block lengths following a geometric distribution whose mean is chosen as a function of the length of the verification period. The block bootstrapping method is implemented in the R package boot (Canty and Ripley 2014).

Comparing POLR-B with MLR-B and POLR-S h with MLR-S reveals a slight advantage of POLR over MLR. Additionally, POLR allows us to make a statement on the relative performance of the ENS, HRES, and CTRL runs. Because of the nonnegativity constraint, the estimates $\hat\beta_{\mathrm{ENS}}$, $\hat\beta_{\mathrm{HRES}}$, and $\hat\beta_{\mathrm{CTRL}}$ can be interpreted as relative weights. As shown in Fig. 3, the ensemble mean $\bar{x}_{\mathrm{ENS}}$ contributes most to the POLR forecast distribution over all lead times, while the control run $x_{\mathrm{CTRL}}$ contributes the least. The high-resolution run shows a quite high average weight at the short lead times, but its importance decreases with increasing forecast lag. This is in line with the findings by Richardson et al. (2015) that decreasing predictability leads to more need for the full ensemble distribution with increasing forecast lag. Note that we have also tested a POLR variant without any constraint on the coefficients. This approach not only destroyed the physical interpretability of the coefficients, it also did not lead to an improvement in forecast skill. Likewise, the coefficients of the MLR model cannot be interpreted easily.
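A sketch of the block-bootstrap confidence intervals with the boot package is shown below; `scores` is assumed to be the daily time series of station-averaged score values, and the mean block length is a tuning choice made here purely for illustration:

```r
library(boot)

## 90% confidence interval for a mean score via the block bootstrap with
## geometrically distributed block lengths (sim = "geom").
ci_mean_score <- function(scores, R = 1000, mean_block_length = 30) {
  bt <- tsboot(scores, statistic = mean, R = R,
               l = mean_block_length, sim = "geom")
  quantile(bt$t, probs = c(0.05, 0.95))
}
```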
As forecast skill, physical interpretability, and model sparsity all favor POLR over MLR, the remainder of this paper focuses on POLR. Knowing that the seasonal POLR models perform best, the different seasonal POLR models are now compared. Comparing the models POLR-S a to POLR-S h, it becomes clear that, in addition to $\bar{x}_{\mathrm{ENS}}$, $x_{\mathrm{HRES}}$, and $x_{\mathrm{CTRL}}$, the fractions of zero ($p_0$) and complete ($p_1$) TCC have to be included in the model. Models c, e, g, and h fulfill this requirement.
To assess the importance of the ensemble variance $s^2$ and the interaction term $I$, we perform an in-depth comparison of the predictive skill of models c, e, g, and h. As the mean verification scores are almost equal, statistical testing is required in order to make sound statements on relative model performance. To this end, a stationwise assessment of significant changes in CRPS and/or log score has been performed using block bootstrapping. To combine log score and CRPS, three cases are distinguished (a sketch of this classification follows the list):
Deterioration: at least one of the two scores (CRPS or log score) is deteriorated, while the other is not improved.
No clear-cut difference: either both scores indicate no change in forecast skill or one of the two scores is improved, while the other is deteriorated.
Improvement: at least one of the two scores is improved, while the other is not deteriorated.
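A minimal sketch of this classification, assuming the two score-specific bootstrap tests have already been reduced to −1 (significant deterioration), 0 (no significant change), or +1 (significant improvement):

```r
## Combine the CRPS and log score test results into the three cases above.
classify_change <- function(crps_sig, logs_sig) {
  s <- c(crps_sig, logs_sig)
  if (any(s == -1) && !any(s ==  1)) return("deterioration")
  if (any(s ==  1) && !any(s == -1)) return("improvement")
  "no clear-cut difference"
}
```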
As we are comparing changes in CRPS and log score simultaneously, a correction for multiple comparisons has to be applied. We set the target type-I error to 0.05. To achieve this, a Bonferroni correction is applied (Bonferroni 1936); in the present example a level of 0.05/2 = 0.025 is used in the individual tests for changes in CRPS and changes in log score, respectively. As for the above confidence interval calculations, the block bootstrapped tests for significant changes in CRPS and/or log score are based on block resamples following a geometric distribution with the same mean block length as above. Models c, e, g, and h are now compared using a forward selection approach. As reported in Table 3, adding the interaction term $I$ leads to greater improvements in skill than adding $s^2$ at a forecast lag of 3 days. The full model h, with an additional inclusion of $s^2$, shows slightly increased skill relative to model g. Hence, the full model h should be preferred for short forecast lags. At a forecast lag of 6 days no clear difference can be observed between the different model versions. At a very long lead time of 10 days the simplest model c seems to perform best. We subjectively select the full model h for the further analyses because it performs best at the short lead times, which are also those with the highest predictability.
b. Calibration and sharpness
Keeping the improvement in skill by TCC postprocessing in mind, calibration and sharpness are now assessed in more detail. Calibration is the degree of statistical consistency between predictive distributions and observations and is verified using the probability integral transform (PIT; Dawid 1984; Diebold et al. 1998; Gneiting et al. 2007). Figure 4 compares the PIT histograms of the raw ensemble, MLR-B, POLR-B, and POLR-S h predictions at forecast lead times of 3, 6, and 10 days. Flat PIT histograms indicate well-calibrated forecast distributions, whereas a ∪ shape is a sign of underdispersion and a ∩ shape is a sign of overdispersion. Pooled over all stations, all postprocessed models are well calibrated. The raw ensemble forecasts are clearly underdispersive at a forecast lag of 3 days and only slightly underdispersive at a forecast lag of 6 days. At a forecast lag of 10 days the PIT histogram of the pooled raw ensemble forecasts is less clear-cut. Nevertheless, it is still less well calibrated than the corresponding postprocessed forecasts.
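Because the forecast distributions are discrete, uniform PIT values require a randomized version of the transform; the sketch below is our implementation of this standard device, drawing uniformly between the predictive CDF just below and at the observed category:

```r
## Randomized PIT for a discrete forecast: probs is the vector of predictive
## probabilities over the nine ordered categories, obs_cat the index (1-9)
## of the observed category. Under calibration the result is uniform on [0, 1].
pit_discrete <- function(probs, obs_cat) {
  cdf   <- cumsum(probs)
  lower <- if (obs_cat == 1) 0 else cdf[obs_cat - 1]
  runif(1, min = lower, max = cdf[obs_cat])
}
```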
Sharpness refers to how concentrated a forecast distribution is (Gneiting et al. 2007) and is assessed here by evaluating the variances and the widths of the centered 90% prediction intervals, pooled over all stations and verification days. As shown in Fig. 5, the raw ensemble provides the sharpest forecasts at a forecast horizon of 3 days. At lead times of 6 and 10 days the sharpness of both the raw ensemble and the postprocessed forecasts is quite poor; however, all postprocessed models are sharper than the raw ensemble. This result is somewhat surprising in that statistical postprocessing improves both calibration and sharpness. Further insight can be obtained by assessing marginal calibration (Gneiting et al. 2007). A forecast is marginally well calibrated if the average predictive cumulative distribution function (CDF) over all verification days equals the empirical CDF of the observations; a marginally well calibrated forecast thus leads to a horizontal marginal calibration graph. Details on the marginal calibration graph can be found in section c of the appendix. Figure 6 shows such graphs for the climatological, the raw ensemble, and the POLR-S h forecasts for a selection of European stations with different TCC climates. As expected, the climatological forecasts show almost perfect marginal calibration. The raw ensemble exhibits poor marginal calibration, even though it is mapped to the observation space in a sound way (see above and section a of the appendix); it assigns too much weight to TCC values of 0 or 8 octas irrespective of station and lead time. Brussels provides a good example: the most frequently observed TCC value there is 7 octas, but the raw ensemble assigns its weight to 8 octas instead, as can be seen from the accentuated negative peak in the marginal calibration graph. POLR-S h performs as well as the climatological forecasts in terms of marginal calibration. Hence, postprocessing conveys a significant improvement in marginal calibration.
5. Discussion and conclusions
Both MLR and POLR prove to be useful methods for postprocessing of raw ensemble TCC forecasts. The results indicate that on average POLR with seasonally estimated model parameters performs best. This postprocessing method clearly improves forecast calibration. To achieve well calibrated forecasts, sharpness has to be reduced at the shorter forecast horizon of 3 days. But surprisingly, sharpness can be improved by postprocessing for the longer forecast lags of 6 and 10 days. Keeping in mind the paradigm stated by Gneiting et al. (2005, 2007) that the goal of statistical postprocessing is to maximize sharpness subject to calibration, the simultaneous improvement in calibration and sharpness is very desirable. This is mostly due to the tendency of the raw ensemble to assign too much weight to cloud cover states of 0 and 8 octas.
The methods presented in this study are designed to postprocess discrete TCC raw ensemble forecasts against SYNOP observations. Depending on the region, TCC observations are recorded automatically or manually, with different observation error characteristics. According to Mittermaier (2012) automated observations may underestimate the amount of high cloud (cirrus), while for human observers there is a tendency to underestimate cloud cover states of 0 and 8 octas. This may partly explain the poor marginal calibration of the raw ensemble when compared to the observations. However, a comparison of results at individual stations with manual observations and with automated observations did not reveal a systematic difference in marginal calibration. As human TCC observers are increasingly replaced by automated observations (Wacker et al. 2015), one would need to know the exact date at which a particular station has been changed from manual to automated for a more detailed analysis of this effect. Currently, SYNOP observations of total cloud cover are mainly automated in western Europe, North America, Australia, New Zealand, Japan, South Africa, and Antarctica. Because of the increasing number of automated stations, continuous TCC observations may become more widely available in the future. As the ECMWF raw ensemble provides TCC forecasts that are continuous on the unit interval, this would allow for continuous verification and postprocessing of TCC raw ensemble predictions and probably further enhance forecast skill. A continuous postprocessing method for predictions of visibility, which is a bounded variable like TCC, has already been implemented by Chmielecki and Raftery (2011).
TCC can be differentiated into low-, medium-, and high-level clouds. Predictive skill of NWP cloud cover forecasts can be different depending on cloud level. For instance, in the lowlands of the greater alpine region, the ECMWF HRES model underestimates persistent low stratus (Haiden and Trentmann 2016). It might be possible to reduce such systematic biases by cloud-level-specific postprocessing. Though a direct inclusion of low-, medium-, and high-level cloud forecasts as predictors in the POLR model [cf. (4)] did not lead to any improvement in forecast skill (results not shown in this paper), further analyses may be beneficial. In particular, a separate postprocessing of each cloud level with training observations differentiated according to cloud level may further increase forecast skill.
To summarize, considering the global set of SYNOP stations covered by this paper, postprocessing of discrete TCC raw ensemble predictions using readily available methods can improve forecast skill significantly. Hence, postprocessing helps to improve the generally low predictive performance of raw ensemble TCC forecasts. Additionally, this study identified the seasonal POLR model as the most skillful TCC postprocessing approach.
S. Hemri gratefully acknowledges the support of the Klaus Tschira Foundation. Furthermore, we are grateful to D. S. Richardson and B. Ingleby of the ECMWF for helpful discussions and inputs. We would like to thank T. Gneiting of the Heidelberg Institute for Theoretical Studies (HITS) and of the Institute for Stochastics at the Karlsruhe Institute of Technology for valuable inputs and comments. Furthermore, we are grateful to M. Scheuerer of NOAA/ESRL, who performed preliminary analyses on postprocessing of TCC forecasts during his short stay at ECMWF. Last but not least, we would like to thank those who commented on our presentations on postprocessing of TCC forecasts at the workshop on statistical postprocessing of ensemble forecasts at HITS in July 2015 and the workshop on forecast calibration/verification at ECMWF in August 2015. Finally, we are grateful to the two anonymous reviewers for their helpful comments.
Methods for Verification
a. TCC mapping
The SYNOP observations dataset at hand reports TCC states as values in $\{0, 1, \ldots, 8\}$ octas. Obviously, CRPS, log score, forecast variance, and the width of the 90% prediction interval are affected by the choice of the verification space. The ECMWF TCC raw ensemble forecasts are continuous in $[0, 1]$. The postprocessed MLR and POLR forecasts are given in nine ordered categories, which can be considered octas. Raw ensemble and postprocessed forecasts are mapped to the observation space according to Table A1.
b. Log score and CRPS
Assuming a TCC forecast distribution $F$ and a corresponding observation $z$, the log score can be written as

$$\mathrm{LogS}(F, z) = -\log f(z), \qquad (A1)$$

where $f(z)$ is the probability assigned to TCC state $z$ by $F$. The CRPS is given by

$$\mathrm{CRPS}(F, z) = \mathrm{E}_F |Z - z| - \tfrac{1}{2}\, \mathrm{E}_F |Z - Z'|, \qquad (A2)$$

where $Z$, $Z'$ are independent random variables with distribution $F$ and finite support (Gneiting and Raftery 2007). If $F$ is a discrete probabilistic TCC forecast on the observed space, which is described in section a of the appendix, the CRPS can be calculated using the following:

$$\mathrm{CRPS}(F, z) = \sum_{j=1}^{J} f(y_j)\,|y_j - z| - \frac{1}{2} \sum_{j=1}^{J} \sum_{k=1}^{J} f(y_j)\, f(y_k)\, |y_j - y_k|. \qquad (A3)$$
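Both scores are straightforward to evaluate for a discrete forecast; in the sketch below, `support` holds the (nonequidistant) numeric values of the nine categories after the mapping of Table A1, and `probs` the corresponding forecast probabilities:

```r
## Log score, Eq. (A1): negative log probability of the observed category.
log_score <- function(probs, obs_cat) -log(probs[obs_cat])

## Discrete CRPS, Eq. (A3): E|Z - z| - 0.5 E|Z - Z'| for Z, Z' ~ F independent.
crps_discrete <- function(probs, support, z) {
  e1 <- sum(probs * abs(support - z))
  e2 <- sum(outer(probs, probs) * abs(outer(support, support, "-")))
  e1 - 0.5 * e2
}
```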
c. Marginal calibration
Let $F_\upsilon$ be the predictive CDF for verification day υ in verification period $V$; then the average predictive CDF for TCC can be written as

$$\bar{F}(y) = \frac{1}{|V|} \sum_{\upsilon \in V} F_\upsilon(y), \qquad (A4)$$

and the empirical CDF of the observations as

$$\hat{G}(y) = \frac{1}{|V|} \sum_{\upsilon \in V} \mathbb{1}\{z_\upsilon \le y\}. \qquad (A5)$$

For a marginally well-calibrated forecast the graph of $\bar{F}(y) - \hat{G}(y)$ describes a horizontal line at zero (Gneiting et al. 2007).
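A sketch of the corresponding computation; `cdf_matrix` is assumed to hold one row per verification day with the predictive CDF evaluated at the mapped category values in `support`, and `obs` the verifying observations on the same scale:

```r
## Marginal calibration graph: average predictive CDF minus empirical CDF,
## evaluated at the mapped TCC values; flat at zero for perfect marginal
## calibration, cf. Eqs. (A4) and (A5).
marginal_calibration <- function(cdf_matrix, obs, support) {
  avg_F <- colMeans(cdf_matrix)
  emp_G <- vapply(support, function(y) mean(obs <= y), numeric(1))
  avg_F - emp_G
}
```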