1. Introduction
Uncertainties in numerical weather predictions (NWP) have many different sources but may be subdivided into two major categories. One part of uncertainty results from the imperfect knowledge of the initial and boundary conditions and the nonlinear nature of the atmospheric system (Lorenz 1993). Ensemble prediction systems account for this uncertainty by estimating a probability distribution of the future weather evolution (Palmer 2000). The second part of uncertainty is caused by the incorrect representation of physical processes by numerical models. This is due to the coarse spatial resolution and the resulting necessity to parameterize subgrid-scale meteorological phenomena (Palmer 2000). This deficiency can lead to substantial biases in the predicted quantities.
Numerous postprocessing techniques have been developed to statistically correct biases in numerical weather prediction model parameters. Assuming no significant changes in climate, those techniques usually compare a set of past model forecasts with observations in order to identify systematic relationships that can be used to correct the current forecast operationally. The successful application of various ensemble calibration techniques has been shown in comparison studies by Wilks (2006a) as well as Wilks and Hamill (2007). However, those techniques are not designed to calibrate rare events, especially rare precipitation events, as the sample size of the training datasets (a few days to weeks) tends to be statistically insufficient. More adequate approaches to calibrate rare events involve retrospective forecasts (reforecasts) from a longer time period, typically a few decades, providing a better sampling of the distributions in question. Several evaluations using reforecasts to calibrate global ensemble prediction systems (EPS) exist (Hamill et al. 2004, 2006; Hamill and Whitaker 2006; Wilks and Hamill 2007; Hamill et al. 2008; Hagedorn et al. 2008). These studies show that the use of reforecasts can lead to clear improvements in forecast skill. Reforecast-based calibrations are, at least in the case of precipitation forecasts, superior to calibration techniques using a short training period (Hamill et al. 2008) and, in addition, give insight into model weaknesses.
For a quantitative, reforecast-based calibration on the model domain, without loss of horizontal resolution, a dense network of observations is required and must be accessible to the producer of the forecast. Also, the observations must cover the same time span as the reforecasts. In practice, this is rarely the case for the majority of model output parameters. Therefore, calibration techniques are often restricted to stations or subdomains of the model area. To circumvent this, the use of model analysis fields instead of observations could be an appealing possibility. Yet, it cannot be assumed that analysis fields are bias free and, as a result, the calibrated forecast would not be bias free either. An alternative approach to forecast calibration is the calculation of the extreme forecast index (EFI; Lalaurette 2003; Zsoter 2006), currently in operational use at the European Centre for Medium-Range Weather Forecasts (ECMWF). This index is a measure of the difference between the cumulative distribution function (CDF) of the model climatology and the actual ensemble forecast. It is used as an indicator of how exceptional the actual forecast is compared to the available history of the model. Assuming that the recurrence time of a forecast event with respect to the model climatology is similar to the recurrence time of the observed event with respect to the observation climatology, the EFI is a calibrated forecast product independent of the availability of observation data. However, the interpretation of the EFI is not unique: different characteristics of the climatology and forecast CDFs can result in the same index. Here, the basic idea of the EFI is extended. A calibrated, probabilistic warning product based on the return periods of forecast events is proposed. Instead of a single number, the forecast consists of an ensemble of return periods or a probability to exceed a return period. The return periods are estimated from the model climatology and are therefore, like the EFI, not subject to systematic model errors. Products inferred from return period–based forecasts are especially useful for warning platforms such as the Mesoscale Alpine Programme Demonstration of Probabilistic Hydrological and Atmospheric Simulation of Flood Events (MAP D-PHASE) platform (Rotach et al. 2009), where raw model output was used to give warnings of extreme events without any additional bias correction from forecasters.
For this study a 30-yr-long reforecast climatology, created with the Consortium for Small Scale Modeling Limited Area Ensemble Prediction System (COSMO-LEPS; Montani et al. 2003; Marsigli et al. 2005), is used to calibrate the ensemble forecasts of the same model. COSMO-LEPS has been shown to be valuable in probabilistic flood forecasting (Marsigli et al. 2004a; Verbunt et al. 2007). Here, it is investigated to what extent those forecasts can be improved by means of calibration. The focus is on forecasts of 24-h total precipitation sums over Switzerland. These forecasts are particularly error prone in orographically complex regions (Buzzi and Foschini 2000; Ferretti et al. 2000; Cherubini et al. 2002), and the results are therefore not representative of the entire model domain. Calibrating precipitation forecasts is a challenging task, especially for short accumulation times, which result in heavily skewed data distributions consisting of many zero-precipitation events and only a few cases of heavy precipitation. Furthermore, it cannot be expected that calibrating precipitation at high horizontal resolution necessarily results in improved forecasts: as the correct positioning of precipitation becomes increasingly important with finer model grids, displacement errors might dominate over the bias error.
The paper is structured as follows. Section 2 presents the reforecast dataset as well as the observation data used for verification. To motivate the calibration of the forecast system, the systematic precipitation errors are illustrated in section 3. Section 4 introduces the calibration method as well as the verification methods used. A new warning product designed for calibrated forecasts is presented in section 5. The skill improvement and the results of some sensitivity studies on the calibration dataset are shown in section 6, followed by a discussion and concluding remarks.
2. Data
a. COSMO-LEPS forecasts
Since February 2006, COSMO-LEPS, the limited-area ensemble of COSMO, has operationally delivered a 16-member ensemble forecast with 132-h lead time for large parts of Europe, initialized once a day at 1200 UTC (Montani et al. 2003; Marsigli et al. 2005). The model has a horizontal mesh width of 0.09° (approximately 10 km), 40 vertical model levels, and 306 × 258 grid points in longitude (12.15°W–36.16°E) and latitude (32.29°–57.13°N), respectively. The initial and boundary conditions are taken from the ECMWF EPS, a global 51-member ensemble initialized twice a day. A clustering algorithm using geopotential height, wind components (u, υ), and humidity over Europe divides the 102 forecast members of two subsequent EPS runs into 16 clusters (Molteni et al. 2001; Marsigli et al. 2008). From each EPS cluster, one representative member is chosen to provide the initial and boundary conditions for driving the limited-area model. To account for model uncertainties in COSMO-LEPS, one of two convective schemes, either the Tiedtke (Tiedtke 1989) or the Kain–Fritsch (Kain 2004) scheme, is chosen randomly for each member. Here, the COSMO-LEPS 24-h precipitation forecasts from 0600 to 0600 UTC are examined, consistent with the standard daily precipitation observation sums. This limits the maximum lead time to 114 h.
For some applications the COSMO-LEPS forecast members are weighted according to the size of the underlying ECMWF clusters. We do not use this weighting here as the properties of clusters from a global low-resolution model might not be transferable to the regional ensemble (Brankovic et al. 2008) and earlier verification studies of COSMO-LEPS have shown no improvement of the precipitation prediction performance in the Alpine region when weighting the members according to the size of the cluster they represent (Marsigli et al. 2004b).
b. COSMO-LEPS reforecasts
The model climatology consists of a set of 30 yr of daily precipitation reforecasts from 1971 to 2000. To get as many statistically independent reforecasts as possible for a given amount of computational resources, we decided to build a long reforecast period using only the control run rather than a shorter period with more than one reforecast member. Each reforecast uses 40-yr ECMWF Re-Analysis (ERA-40) data as initial and boundary conditions and is run up to a lead time of 42 h, starting at 1200 UTC every day. For the model climatology, the 18–42-h precipitation sums are used to avoid spinup effects of the model. This model climatology is used to calibrate forecasts of all lead times, although forecasts at longer lead times might require a reforecast dataset with correspondingly longer lead times. As such reforecasts are not available, we assume no substantial change in the model bias with growing lead time.
Discrepancies in the statistical properties of the COSMO-LEPS forecasts and climatology could arise from the use of the Tiedtke convection scheme alone for the reforecasts, whereas the operational forecast randomly switches between the Tiedtke and the Kain–Fritsch convection schemes. This is expected to affect the precipitation calibration but, as will be seen in section 6, it does not prevent an increase in the skill of the calibrated forecasts.
As COSMO-LEPS is nested in a medium-range EPS, it is particularly useful in the early to medium range (3–5 days; Marsigli et al. 2008). Therefore it is possible, especially for short lead times, that the forecasts are overconfident (i.e., have too little spread) and thus a higher-order calibration (e.g., the additional calibration of the forecast spread) would be beneficial (e.g., Weigel et al. 2009). However, the creation of an ensemble reforecast dataset would be computationally very expensive.
c. Observation climatology
For the verification of calibrated COSMO-LEPS forecasts, an observational dataset of 24-h total precipitation sums (0600–0600 UTC) over Switzerland was composed. To match the model climatology, an observation climatology from 1971 to 2000 was set up on the COSMO-LEPS grid. Rain gauge measurements over Switzerland were interpolated onto the COSMO-LEPS grid using the synergraphic mapping system (SYMAP) method (Frei and Schär 1998; Frei et al. 2006). The number of rain gauges available to create the gridded climatology has changed over time, but it never fell below 450 stations for the 417 grid points (Fig. 1). A minimum of two stations near a grid point was required to interpolate from point observations to the model grid. For each grid point, stations within a range of 2–4 times the mesh width were searched, and the weight of an observation decreased with increasing distance from the grid point. This procedure introduces some spatial autocorrelation in the observation data. For verification purposes, this is not considered critical: the effective resolution of the forecast system, and hence its ability to resolve meteorological phenomena, is lower than the horizontal mesh width and, therefore, lower than that of the interpolated observations.
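For illustration only, a strongly simplified version of this gridding step is sketched below, using plain inverse-distance weighting (this is our own sketch, not the SYMAP implementation of Frei and Schär 1998; the arrays stn_xy, stn_val, and grid_xy, the 1/d**2 weighting, and coordinates in kilometers are assumptions made for the example):

    import numpy as np

    def grid_precip(stn_xy, stn_val, grid_xy, mesh_width=10.0, search_factor=4.0, min_stn=2):
        """Simplified distance-weighted gridding of daily station precipitation.
        Stations within search_factor * mesh_width (km) of a grid point contribute
        with weights decreasing with distance (1/d**2 here); grid points with fewer
        than min_stn stations within reach are left undefined (NaN)."""
        out = np.full(len(grid_xy), np.nan)
        for i, (gx, gy) in enumerate(grid_xy):
            d = np.hypot(stn_xy[:, 0] - gx, stn_xy[:, 1] - gy)
            near = d <= search_factor * mesh_width
            if near.sum() < min_stn:
                continue
            w = 1.0 / np.maximum(d[near], 1e-3) ** 2
            out[i] = np.sum(w * stn_val[near]) / np.sum(w)
        return out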
d. Verification data
The 24-h precipitation observations used for the verification of the calibrated forecast system were composed using the same method as for the observation climatology. The period of April 2006–November 2007 was chosen for verification because only minor changes were introduced to the model during this time.
3. Systematic model errors
The main objective of the calibration of model forecasts is to correct for systematic model errors. The bias in COSMO-LEPS precipitation is substantial, as shown here for the region of Switzerland. For a forecast error analysis, the reforecast climatology and the correspondingly gridded observation climatology are used to calculate the difference in the 80th and 95th quantiles of 24-h precipitation. Table 1 shows the relationship between quantiles and return periods used in this study. Quantiles of 24-h total precipitation can easily be expressed as return periods; if only data from a certain period are used, the return periods relate to this period, not the entire database. The quantiles were chosen such that a sufficient database is available for the verification.
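The relation underlying Table 1 is simply that, for daily values within the considered subset, a quantile q corresponds to an exceedance probability of 1 − q per day and hence to a return period of

    T(q) = 1 / (1 − q) days,

so that Q0.8, Q0.95, and Q0.975 correspond to return periods of 5, 20, and 40 days, respectively (the thresholds used in section 6a).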
To compare seasonal performances, the error was standardized [i.e., divided by the mean precipitation taken from the two climatologies (reforecast and observation)]. Figure 2 shows the model bias for January and July in the period 1971–2000. Precipitation is overestimated in most parts of Switzerland, in particular along the northern and southern slopes of the Alps. The bias is larger in January than in July, and relatively more pronounced for more frequent events (left panels of Fig. 2). Small regions of underestimated precipitation can be identified north of the Alps in January and in the central Alps in July. Altogether, the sign, amplitude, and pattern of the model bias have a complex structure and vary with season. This makes it very difficult for operational forecasters to develop an intuition of the model quality. Also, it becomes obvious that a calibration must take location, intensity, and season into account.
4. Method
a. Calibration method
The method used to calibrate COSMO-LEPS forecasts with reforecasts can be formalized as an empirical transformation of the forecast into the probability space of the underlying reforecast climatology. For each forecast member, the return period is calculated with respect to the model climatology. Return periods derived from an observation climatology would be too long in the case of a positive bias or too short in the case of a negative bias. In contrast, reforecast-based return periods are supposed to be free of systematic biases. The basic assumption underlying this method is that a forecast event, having a certain return period w.r.t. the model climatology, has the same return period w.r.t. a climatology made of observations. The forecast itself might be biased, but the forecast transformed to a return period based on the model climatology should be bias free. In other words, if a forecast event has rarely occurred in the model climatology, the corresponding observed event should also be rare in the observation record. The new forecast is merely an ensemble of return periods and therefore does not require any observational data. In contrast, traditional calibration methods, based on a relation between observations and the model history, can indeed provide calibrated, quantitative forecasts, but only if corresponding observations are available.
Probabilities for the actual COSMO-LEPS forecast are then derived directly from the number of forecast members exceeding a certain return period (e.g., return periods corresponding to warning levels). Each forecast member was attributed an equal probability of 1/16.
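As a minimal sketch of this transformation for a single grid point (our own illustration, not the operational code; clim is assumed to hold the seasonal reforecast subset and members the 16 member forecasts, both as 24-h precipitation sums, and the plotting position n + 1 is our choice):

    import numpy as np

    def return_periods(members, clim):
        """Empirical return period (days) of each forecast member with respect
        to the daily reforecast climatology of the same grid point."""
        clim = np.sort(np.asarray(clim, dtype=float))
        n = clim.size
        ranks = np.searchsorted(clim, members, side="right")  # climatological days <= member value
        cdf = ranks / (n + 1.0)                               # empirical non-exceedance probability
        return 1.0 / (1.0 - cdf)                              # return period in days

    def exceedance_probability(members, clim, warning_return_period):
        """Fraction of equally weighted members (1/16 each) whose return period
        reaches a given warning level, e.g. 20 days for Q0.95."""
        return np.mean(return_periods(members, clim) >= warning_return_period)

A forecast in which, say, 10 of the 16 members exceed the 20-day return period would then be issued as a 62.5% probability for that warning level.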
Different strategies can be followed in order to find a good compromise between reforecast sample size, hence operational feasibility, and accuracy in the estimation of recurrence times. Using only a shorter subset (e.g., a window around the actual forecast date) of the model climatology allows for an earlier start of the calibration after beginning with the reforecast calculations. Yet, the calibration with fewer reforecasts might introduce inaccuracies in the estimate of return periods. The length of the seasonal subset determines how closely the climatology used for calibration follows the seasonal evolution of the forecast variable's distribution. When using a short subset, the seasonal variability of the forecast variable is followed closely and thus the actual forecast and the climatology should have very similar statistical properties. A drawback is the comparatively small amount of available data, resulting in large uncertainties in the estimation of return periods, especially for very rare events. A compromise could be a longer (e.g., seasonal) subset of the model climatology consisting of the 3 months of the season in which the actual forecast is valid. The higher confidence in the estimation of return periods comes along with a potentially lower representativeness of the model climatology for the forecast. Assuming that the seasonal variability of the forecast variable of interest is not strongly pronounced, one could even extend the model climatology to half a year or the entire available set of reforecasts. We performed a sensitivity study (section 6b) to evaluate the effect of the seasonal subset of the model climatology used for calibration on the forecast performance. For this, different subsets of the model climatology are utilized, ranging between 10 and 300 days from each of the 30 reforecast years. The window is always centered on the initial date of the actual forecast. The operational products presented in section 5 were calibrated using a window of ±14 days around the actual forecast initial time only. Also, their verification (section 6) is based on this window.
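A possible way to extract such a window from the reforecast archive is sketched below (the use of pandas, the series name clim, and the circular day-of-year distance used to handle the turn of the year are our assumptions):

    import numpy as np
    import pandas as pd

    def seasonal_subset(clim, init_date, half_width=14):
        """Select all reforecast days whose day of year lies within +/- half_width
        days of the forecast initial date, across all reforecast years."""
        doy = clim.index.dayofyear.to_numpy()
        target = pd.Timestamp(init_date).dayofyear
        dist = np.abs(doy - target)
        dist = np.minimum(dist, 365 - dist)   # circular distance across the year boundary
        return clim[dist <= half_width]

    # clim: pandas Series of daily 18-42-h precipitation reforecasts (1971-2000) at one
    # grid point, indexed by date; seasonal_subset(clim, "2007-05-02") then returns the
    # +/- 14-day window used for the operational products.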
A second part of the sensitivity study (section 6b) investigates the change in the performance of the calibration when using only 1, 2, 5, 10, or 20 yr of reforecasts compared to the entire 30 yr available. This would save CPU time and thus make the calibration less expensive. For the calibration with shorter reforecast periods, a 30-day subset is used.
b. Forecast verification
To quantify the effect of the calibration on the skill of COSMO-LEPS precipitation forecasts, the performance of the calibrated system is compared to that of the uncalibrated system. The evaluation of the model forecasts is done with a dense and consistent set of observations, interpolated onto the model grid over Switzerland. Four different lead times are analyzed (18–42, 42–66, 66–90, and 90–114 h), with the accumulation periods spanning from 0600 to 0600 UTC, in agreement with the observation data. Because of the correction of the systematic model error, we expect the main effect of our calibration to be on the reliability of the forecast system (Wilks 2006b). To quantify the performance of our calibration method, the attributes diagram and the Brier skill score (BSS), in its debiased form (BSSD; Weigel et al. 2007), are used. Note that the verified events are based on quantiles and are therefore different for each grid point. As they share the same climatological frequency, a false skill introduced by a varying climatology is avoided (Hamill and Juras 2006). This is also a necessary requirement for the correct interpretation of the attributes diagram.
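For reference, the scores are written out here (a restatement of the standard definitions, not taken from the verification code). With p_i the forecast probability and o_i ∈ {0, 1} the observed occurrence of the event in case i of N verification cases,

    BS = (1/N) * sum_i (p_i − o_i)^2,

which decomposes (Murphy 1973) into reliability, resolution, and uncertainty,

    BS = REL − RES + UNC,
    REL = (1/N) * sum_k n_k (p_k − obar_k)^2,
    RES = (1/N) * sum_k n_k (obar_k − obar)^2,
    UNC = obar (1 − obar),

where the sums run over the forecast probability bins k containing n_k cases, obar_k is the observed frequency in bin k, and obar the climatological base rate. The skill score against the climatological reference BS_cl = obar(1 − obar) is BSS = 1 − BS/BS_cl; for the finite ensemble of M = 16 members, the debiased version following Weigel et al. (2007) is

    BSSD = 1 − BS / (BS_cl + D),   with D = obar (1 − obar) / M,

where D accounts for the intrinsic unreliability of a finite ensemble.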


The attributes diagram (a reliability diagram including “no skill” and “no resolution” lines) compares the forecast probability and the observed frequency of an event, giving detailed insight into the way the calibration works. A perfect calibration would move the reliability curve onto the “perfect reliability” line of the diagram. If the forecast is unable to identify events with a higher or lower probability than the climatological frequency, the reliability curve equals the no resolution line. Points closer to the perfect reliability line than to the no resolution line, as indicated by the no skill line, contribute positively to the BSS (Wilks 2006b).
To test whether a BSSD is significantly positive and whether two scores of different experiments are significantly different, the bootstrapping technique (Efron 1992) is applied. Because the forecast and observation fields are spatially correlated, the grid points of the verification domain cannot be considered as independent (although the domain could probably be divided into several uncorrelated domains in order to increase sample size) and resampling every grid point for a test statistic would result in overconfident estimates. The spatial correlation was accounted for by resampling only the verification days while retaining the spatial structure (Livezey 2003; Candille et al. 2007; Hamill 1999; Wilks 2006b). This leads to a conservative estimate of the confidence intervals. To reveal differences between distributions of two bootstrapped scores (e.g., calibrated and raw forecasts), the nonparametric Wilcoxon signed rank test for related samples (Wilks 2006b) was applied.
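In code, the day-wise resampling could be sketched as follows (our own illustration; bs_fc and bs_ref are assumed to be precomputed arrays of daily, grid-point-wise Brier scores of the forecast and of the climatological reference, and the debiasing term of section 4b is omitted for brevity):

    import numpy as np
    from scipy.stats import wilcoxon

    def bootstrap_bss(bs_fc, bs_ref, n_boot=200, seed=None):
        """Bootstrap distribution of the Brier skill score obtained by resampling
        whole verification days (rows), which preserves the spatial correlation
        between the grid points (columns)."""
        rng = np.random.default_rng(seed)
        n_days = bs_fc.shape[0]
        scores = np.empty(n_boot)
        for b in range(n_boot):
            days = rng.integers(0, n_days, size=n_days)   # days drawn with replacement
            scores[b] = 1.0 - bs_fc[days].mean() / bs_ref[days].mean()
        return scores

    # Paired test of calibrated vs. raw skill (identical seed pairs the resamples):
    # stat, p = wilcoxon(bootstrap_bss(bs_cal, bs_ref, seed=1),
    #                    bootstrap_bss(bs_raw, bs_ref, seed=1))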
5. Visualization
For applications in operational services, visualization is a key issue. We designed a product that visualizes the warning of extreme events as warngrams, in which the probability to exceed different quantiles is plotted against the forecast lead time. Figure 4 is an example warngram of the forecasts initialized at 1200 UTC 2 May 2007 for the grid point closest to Zurich. In this warngram, each bar gives the probability that the 24-h total precipitation exceeds a quantile, for accumulation windows sliding in steps of 3 h. The different colors of the bars show the different return periods, with more intense colors for less frequent events. Also given, if available, is the amount of precipitation corresponding to the return period as obtained from the observation climatology (here in brackets in the legend). At forecast day 3 the warngram shows a ∼45% probability to exceed an event of 29 mm (24 h)−1, as would happen in 1 out of 3 Mays. The actual precipitation amount on 5 May 2007 was 31.9 mm (24 h)−1.
One advantage of this warngram is that the information on forecast precipitation events is directly presented as warning levels, the most common way to issue a warning of extreme events. Those warning levels are usually based on estimates of return levels from observations and therefore error prone. The warngram presented here is based on return levels taken from the reforecast dataset and is therefore free from systematic model errors [i.e., (more) reliable]. The warngram additionally provides probabilistic information (i.e., the confidence of the forecast system in the forecast, expressed as the height of the bars). This gives the user the opportunity to decide whether or not to take action, depending on his/her sensitivity to false alarms and missed events, respectively.
6. Verification results
a. Warning product verification
The verification was carried out for the period of April 2006–August 2007. During this period COSMO-LEPS was run as a 16-member system without major changes in the model version. To investigate the seasonal difference in the model behavior, winter (November–December 2006 and January–February 2007) and summer (June–July–August 2006 and 2007) months were verified separately.
The attributes diagram of the uncalibrated COSMO-LEPS forecasts (Fig. 5, gray lines) reveals the weaknesses of the forecast system for winter and summer months as well as for short (42 h) and long (90 h) lead times. The difference between summer and winter months can be clearly seen. The uncalibrated forecasts of the winter months are less reliable (higher reliability term). This was expected because of the larger systematic error in winter (Fig. 2). The reliability lines for forecast probabilities greater than 50% are somewhat closer to the perfect reliability line in the winter months, pointing to slightly less overconfident forecasts for longer lead times. A significance test of the single probability bins shows that there is almost no overlap of the reliability line of the uncalibrated forecast with the perfect reliability line. Only forecasts of low probabilities (≤50%) in summer can be considered reliable. A comparison of the reliability lines with the no skill line gives insight into the effect on the Brier score: the unreliable forecasts, especially in the winter months, decrease the skill of the forecast system. We refrain from showing the attributes diagram for the entire verification period, as it approximately shows a blend of the winter and summer diagrams and gives little additional information.
Calibration increases the reliability of the forecast system, especially for forecasts in the winter months (Fig. 5, black lines). Calibrated forecasts in winter are much closer to the perfect reliability line. For longer lead times, most of the forecast probabilities are not significantly different from perfect reliability. For short lead times in winter there is still a significant difference between forecast and perfect reliability. Regarding the no skill line, it can be seen that the calibrated forecasts contribute positively to the Brier skill score compared to a climatological forecast. Forecasts in the summer months are also closer to perfect reliability or, for long lead times, not distinguishable from the perfect reliability line. Yet, as the uncalibrated forecasts already had a relatively high reliability, the improvement is somewhat less salient.
Note that the error bars of the reliability lines should be used to infer the significance of the differences between forecasts and the perfect reliability line. If the significance of the improvement by calibration is to be determined, the difference between both (calibrated and uncalibrated) bootstrapped distributions has to be tested. Hence, overlapping error bars do not necessarily imply an insignificant improvement by calibration. Even if they overlap, the expectation values of the distributions might be significantly different.
Besides the notable improvement of forecast reliability by using a calibration based on reforecasts, an interesting property of the attributes diagrams needs to be mentioned. Forecasts for longer lead times appear to be, in all seasons, more reliable than forecasts for short lead times. A possible explanation could be the setup of COSMO-LEPS, which is designed for the medium range (Marsigli et al. 2005, 2008). At short lead times, the forecast system has too little spread (overconfidence) and the calibration method used here is not meant to correct for that.
Supplementary to the attributes diagrams, Table 2 gives the Brier scores and their reliability and resolution terms for calibrated and uncalibrated forecasts. It can be seen that the Brier score (negatively oriented) is improved strongly by calibrating the system. This can mainly be attributed to improvements in the reliability term (negatively oriented), which is almost corrected to a perfect reliability of zero. The resolution term (positively oriented) is not affected as strongly; still, improvements of about 5%–10% can be seen. This effect is not expected when using simple calibration methods that just draw forecast probabilities toward observed frequencies known from a verification period. It shows that the calibration method used here is capable of improving the forecast system's ability to separate different categories. A reason for the gain in resolution might be the strong bias reduction and, therefore, the assignment of forecasts to new probability categories that were not occupied before.
To quantify the effect of calibration for different lead times and thresholds, we show the skill scores for lead times from 42 up to 114 h and events with return periods of 5, 20, and 40 days (w.r.t. a window of 30 days around the actual forecast date). Again, the results were subdivided into winter and summer (Figs. 6a,b). The uncalibrated forecast system is skillful in both seasons w.r.t. the lower threshold (Q0.8), and its skill decreases with lead time. Uncalibrated forecasts of higher precipitation events (Q0.95 and Q0.975) are less skillful in summer and show no skill in winter. The calibrated COSMO-LEPS forecasts show significantly higher skill scores for all lead times and all thresholds. The strongest improvement is achieved for forecasts issued during the winter months: the relative improvement in skill is about 60% for the lower threshold, and initially unskillful forecasts for the higher thresholds are turned into skillful forecasts by calibration. During the summer season, the forecast skill is improved by about 10% through calibration.
The lower increase of skill for the summer months can be explained by the more convective and localized nature of precipitation events in this season. During summer, systematic errors in the forecast system might be less important than spatial errors, which cannot be corrected for with our calibration method. In addition, as can be seen in Table 2, the forecast system is already reliable in summer and there is not much potential left for the calibration.
To gain insight into the spatial distribution of the skill (BSSD) improvement, the relative difference between the raw and the calibrated COSMO-LEPS forecasts was calculated for lead times of 42 and 114 h as well as for the 80th and 95th quantiles, for each grid point of the verification domain separately (Fig. 7). A significant improvement in forecast skill cannot be found at all grid points. This is mainly due to a low relative difference at these points and the small sample size (518 days), and hence a failed significance test (at the 95% confidence level). Overall, the pattern is very similar to the distribution of the relative model biases (Fig. 2). In regions with the largest bias (central and southern Alps), the improvement by calibrating the forecast is largest. In the lower-altitude regions of the verification domain (showing only little systematic error in precipitation) the skill improvement is either low or not significant. Also, a number of grid points with initially no or negative skill in the uncalibrated forecasts became skillful after calibration (black dots in Fig. 7).
b. Sensitivity study
For potential users of reforecasts it is important to find a reasonable balance between the size of the climatology and the improvement of forecast skill. We tested two possibilities of calibrating with a reduced set of reforecasts.
First, the length of the seasonal subset of the climatology was changed. Instead of 30 days, window sizes of 10–300 days centered on the actual forecast initial date were chosen. In practice, this means that after, for example, a change in the forecast model version, it is not mandatory to build the model climatology over the entire year, but at first only over the subset. As expected, Fig. 8 shows that the optimal choice of the subset depends on the frequency of the event. Frequent events of Q0.8 can be calibrated very well using only a small subset of the reforecasts; the forecast skill does not increase substantially if the climatology is extended to larger subsets. For less frequent events, an extension of the seasonal subset increases the forecast skill. For the rarest event verified here (Q0.975), saturation is reached using a subset of about 150 days around the date of interest. A further increase of the climatology used to 300 days around the forecast initial date results in a degradation of forecast skill. A reason could be the mixing of climatological data from different seasons, resulting in a calibration with data having different precipitation characteristics. For the remaining analysis a subset of 30 days around the initial date of each forecast is chosen for calibration as, in operational service, this presents a reasonable compromise between the gain in forecast skill and the cost of producing calibrated products.
Varying the seasonal subset of the model climatology does not save any CPU time in the long term. Therefore, the effect of reducing the number of reforecast years on the performance of the calibration was investigated. Figure 9 shows the BSSD for the event thresholds of 5-, 20-, and 40-day return periods when varying the reforecast length between 1 and 30 yr. To avoid any bias in the BSSD from always picking the same (especially good or bad) years to calibrate with, the years were, for each forecast, sampled randomly out of the original 30-yr climatology. Not surprisingly, the calibration with only 1–2 out of the 30 possible years of the model climate results in a significantly lower BSSD than a forecast calibrated with all available years of reforecasts. Using such short sets of reforecasts for the calibration results in an even lower forecast skill than that of an uncalibrated forecast. Extending the reforecast period to more than 20 yr does not result in a higher forecast skill for the thresholds investigated here. In line with our expectations, the results suggest that a large set of reforecasts is needed for the calibration of forecasts of extreme events, in order to obtain a good estimate of the return periods. However, if one is interested in a good calibration of frequent events (e.g., 20-day return period or lower), using more than 15 yr of reforecasts does not result in a significant increase in forecast skill.
7. Discussion and conclusions
A good calibration of numerical weather predictions relies on a comprehensive observation database. This is especially true for high-resolution precipitation forecasts. However, observation series of sufficient spatial and temporal resolution are rarely available. Evidence has been given that, for COSMO-LEPS precipitation forecasts, a considerable improvement in skill can be achieved by relating the forecasts to a model climatology rather than to observations. This approach assumes that the recurrence time of a forecast event in the model climate is similar to the recurrence time of the event in the observation climatology. Thus, by expressing the forecast in quantiles (or return periods), existing systematic model errors are corrected while bypassing the need for observations. The calibrated forecasts show an improved reliability for all analyzed precipitation events.
The computational cost of establishing a model climatology of reforecasts is considerable and has been identified as a major drawback of this calibration method. The sensitivity of the calibration performance to the length of the underlying model climatology has been analyzed by means of two different studies. First, by varying the width of the time window the calibration is based on, a subset of about 30 days around the actual forecast day was found to be a good compromise between gaining forecast skill and being able to start the calibration as soon as possible after a change in the NWP model. Second, the effect of reducing the number of years of reforecasts used for calibration was investigated. It has been shown that the potential for saving computational effort depends strongly on the event threshold the forecast user is interested in. If an accurately calibrated model is required for relatively low event thresholds, a small set of reforecasts is sufficient. Users requiring well-calibrated forecasts of rare events should invest in creating a larger model climatology. Here, we have shown that about 20 yr of reforecasts are sufficient to improve the forecast skill for all considered thresholds.
The considerable improvement in forecast skill when calibrating even with a small climatology should motivate operational weather services to invest in the production of reforecasts. The method to calibrate the forecasts can be implemented easily in the operational forecast cycle and, as it does not require any observational data, every forecast parameter can be calibrated over the entire model domain. However, the effectiveness of the presented calibration for parameters other than precipitation remains to be shown, and presenting a forecast in the form of return periods will not be reasonable for every parameter.
Further investigations of the benefit of such a calibration are required. In particular, the benefit should be quantified relative to other approaches that minimize the model bias with only a few recent observations. It could also be tested whether the number of forecast members can be reduced (e.g., using only a 10-member ensemble instead of the original 16 members) in order to save CPU time, which could then be used for the production of the reforecasts. Finally, the impact of the convection schemes on the calibration, especially during the summer months, should be evaluated.
One interesting extension of the calibration method will be to use extreme value statistics for the estimation of return levels from reforecast data. This way, probabilities for the exceedance of extremely rare events could be given, although their return period is greater than the length of the reforecast period. Also, the quality of calibrated forecasts using a reduced and therefore less expensive set of reforecasts in association with applying extreme value statistics could be assessed.
Acknowledgments
This study was funded by the Swiss National Centre for Competence in Research Climate (NCCR Climate), a research instrument of the Swiss National Science Foundation. The reforecasts have been conducted on the HPC facility at the European Centre for Medium-Range Weather Forecasts (ECMWF), Reading, United Kingdom. We thank two independent reviewers for their valuable comments on an earlier version of this paper.
REFERENCES
Brankovic, C., B. Matjacic, S. Ivatek-Šahdan, and R. Buizza, 2008: Downscaling of ECMWF ensemble forecasts for cases of severe weather: Ensemble statistics and cluster analysis. Mon. Wea. Rev., 136, 3323–3342.
Buzzi, A., and L. Foschini, 2000: Mesoscale meteorological features associated with heavy precipitation in the southern alpine region. Meteor. Atmos. Phys., 72, 131–146.
Candille, G., C. Cote, P. L. Houtekamer, and G. Pellerin, 2007: Verification of an ensemble prediction system against observations. Mon. Wea. Rev., 135, 2688–2699.
Cherubini, T., A. Ghelli, and F. Lalaurette, 2002: Verification of precipitation forecasts over the alpine region using a high-density observing network. Wea. Forecasting, 17, 238–249.
Efron, B., 1992: Jackknife-after-bootstrap standard errors and influence functions. J. Roy. Stat. Soc., 54B, 83–127.
Ferretti, R., T. Paolucci, W. Zheng, G. Visconti, and P. Bonelli, 2000: Analyses of the precipitation pattern on the alpine region using different cumulus convection parameterizations. J. Appl. Meteor., 39, 182–200.
Frei, C., and C. Schär, 1998: A precipitation climatology of the Alps from high-resolution rain-gauge observations. Int. J. Climatol., 18, 873–900.
Frei, C., R. Schöll, S. Fukutome, J. Schmidli, and P. L. Vidale, 2006: Future change of precipitation extremes in Europe: Intercomparison of scenarios from regional climate models. J. Geophys. Res., 111, D06105, doi:10.1029/2005JD005965.
Hagedorn, R., T. M. Hamill, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part I: Two-meter temperatures. Mon. Wea. Rev., 136, 2608–2619.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167.
Hamill, T. M., and J. Juras, 2006: Measuring forecast skill: Is it real skill or is it the varying climatology? Quart. J. Roy. Meteor. Soc., 132, 2905–2923.
Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. Mon. Wea. Rev., 134, 3209–3229.
Hamill, T. M., J. S. Whitaker, and X. Wei, 2004: Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. Mon. Wea. Rev., 132, 1434–1447.
Hamill, T. M., J. S. Whitaker, and S. L. Mullen, 2006: Reforecasts: An important dataset for improving weather predictions. Bull. Amer. Meteor. Soc., 87, 33–46.
Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. Mon. Wea. Rev., 136, 2620–2632.
Kain, J. S., 2004: The Kain–Fritsch convective parameterization: An update. J. Appl. Meteor., 43, 170–181.
Lalaurette, F., 2003: Early detection of abnormal weather conditions using a probabilistic extreme forecast index. Quart. J. Roy. Meteor. Soc., 129, 3037–3057.
Livezey, R. E., 2003: Categorical events. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley, 77–96.
Lorenz, E. N., 1993: The Essence of Chaos. University of Washington Press, 227 pp.
Marsigli, C., A. Montani, F. Nerozzi, and T. Paccagnella, 2004a: Probabilistic high-resolution forecast of heavy precipitation over Central Europe. Nat. Hazards Earth Syst. Sci., 4, 315–322.
Marsigli, C., F. Boccanera, A. Montani, F. Nerozzi, and T. Paccagnella, 2004b: COSMO-LEPS verification: First results. COSMO Newsletter, 4. [Available online at http://www.cosmo-model.org/content/model/documentation/newsLetters/default.htm].
Marsigli, C., F. Boccanera, A. Montani, and T. Paccagnella, 2005: The COSMO-LEPS mesoscale ensemble system: Validation of the methodology and verification. Nonlinear Processes Geophys., 12, 527–536.
Marsigli, C., A. Montani, and T. Paccagnella, 2008: A spatial verification method applied to the evaluation of high-resolution ensemble forecasts. Meteor. Appl., 15, 125–143.
Molteni, F., R. Buizza, C. Marsigli, A. Montani, F. Nerozzi, and T. Paccagnella, 2001: A strategy for high-resolution ensemble prediction. I: Definition of representative members and global-model experiments. Quart. J. Roy. Meteor. Soc., 127, 2069–2094.
Montani, A., and Coauthors, 2003: Operational limited-area ensemble forecasts based on the Lokal Modell. ECMWF Newsletter, No. 98, ECMWF, Reading, United Kingdom, 2–7.
Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.
Palmer, T. N., 2000: Predicting uncertainty in forecasts of weather and climate. Rep. Prog. Phys., 63, 71–116.
Rotach, M. W., and Coauthors, 2009: MAP D-PHASE: Real-time demonstration of weather forecast quality in the alpine region. Bull. Amer. Meteor. Soc., 90, 1321–1336.
Tiedtke, M., 1989: A comprehensive mass flux scheme for cumulus parameterization in large-scale models. Mon. Wea. Rev., 117, 1779–1800.
Verbunt, M., A. Walser, J. Gurtz, A. Montani, and C. Schär, 2007: Probabilistic flood forecasting with a limited-area ensemble prediction system: Selected case studies. J. Hydrometeor., 8, 897–909.
Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2007: The discrete Brier and ranked probability skill scores. Mon. Wea. Rev., 135, 118–124.
Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2009: Seasonal ensemble forecasts: Are recalibrated single models better than multimodels? Mon. Wea. Rev., 137, 1460–1479.
Wilks, D. S., 2006a: Comparison of ensemble-MOS methods in the Lorenz ’96 setting. Meteor. Appl., 13, 243–256.
Wilks, D. S., 2006b: Statistical Methods in the Atmospheric Sciences. 2nd ed. Academic Press, 627 pp.
Wilks, D. S., and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. Mon. Wea. Rev., 135, 2379–2390.
Zsoter, E., 2006: Recent developments in extreme weather forecasting. ECMWF Newsletter, No. 107, ECMWF, Reading, United Kingdom, 8–17.

Fig. 1. Verification domain of the calibrated COSMO-LEPS forecasts. The colors show the model topography (m) above sea level on the approximately 10 km × 10 km grid. The blue dots show the stations with 24-h precipitation data available in the time range from 1971 to 2007.

Fig. 2. Relative difference (%) in the (a),(c) 80th and (b),(d) 95th quantile of the COSMO-LEPS 24-h precipitation climatology and observations in (a),(b) January and (c),(d) July from 1971 to 2000. The relative difference is calculated by dividing the difference with the average of model and observation precipitation.

Fig. 3. Illustration of the applied calibration method for an arbitrary forecast parameter x. (left) The frequency distribution of the observations (blue) and the correspondent, positively biased model climatology (red). The actual ensemble forecast is shown by the green lines. A positive model bias would lead to systematically too high forecast probabilities. (right) The cumulative distribution function shows that too many forecast members (here all) would exceed a given threshold xo. If the threshold is adjusted according to the return period of the model climatology xm, the fraction of forecast members exceeding the new threshold decreases [i.e., assuming that the return periods in the observation world and in the model world agree, a forecast of exceeding a return period (derived from the model climatology) has a reduced systematic error].

Fig. 4. Example of a warngram for the calibrated COSMO-LEPS 24-h total precipitation forecast in May. The detailed description can be found in section 5.

Fig. 5. Attributes diagrams for uncalibrated (gray) and calibrated (black lines) COSMO-LEPS Q0.8 precipitation forecasts in (a),(c) winter (November–December 2006 and January–February 2007) and (b),(d) summer (June–August 2006 and 2007). (a),(b) 18–42- and (c),(d) 66–90-h lead-time forecasts. The inlays show the frequency distribution of verification pairs in the different forecast probabilities (0%, 100/16%, 200/16%, … , 100%) for uncalibrated (gray) and the calibrated (black) forecasts. The error bars represent the 5th and 95th quantiles derived by bootstrapping the sample.

Fig. 6. The BSSD for COSMO-LEPS 24-h precipitation forecasts for up to 114-h lead time. Scores of the uncalibrated (gray lines) and the calibrated (black lines) forecasts are shown for (a) winter (November–December 2006 and January–February 2007) and (b) summer (June–August 2006 and 2007) for Q0.8 (circles), Q0.95 (triangles), and Q0.975 (squares). All calibrated scores are significantly greater than the raw scores (α = 0.01).

Fig. 7. Relative improvement in forecast skill (BSSD) in the verification domain. Only significant (α = 0.05) grid points (bootstrapped 200 times, Q05 > 0 or Q95 < 0) are shown. (a) Lead time of 18–42 h, Q80. (b) Lead time of 18–42 h, Q95. (c) Lead time of 90–114 h, Q80. (d) Lead time of 90–114 h, Q95. Verification period is April 2006 to August 2007. The black dots denote grid points with no or negative skill in the uncalibrated forecast.

Fig. 8. BSSD for the 42-h lead time 24-h total precipitation forecast of different return periods dependent on the length of the subset from the model climatology that is used for calibrating the forecast. As reference, a forecast equal to the observation climatology is derived from a 300-day period around the day of interest.

Fig. 9. BSSD for the calibrated COSMO-LEPS with the raw 18–42-h forecast. The climatology was sampled with a window of 30 days around the actual forecast date using 1, 2, 5, 10, 15, 20, and 30 yr of reforecasts, randomly sampled. The horizontal lines show the scores of uncalibrated forecasts of the according quantile threshold.
Table 1. Illustration of the interrelation between return period and quantile for the thresholds used in the verification study.

Table 2. The BS, reliability (REL), and resolution (RES) of calibrated and uncalibrated (in parentheses) COSMO-LEPS 24-h total precipitation forecast in winter 2006/07 and summer 2007 for lead times (LT) of 42 and 90 h.