Droughts diminish crop yields and can lead to severe socioeconomic damages and humanitarian crises (e.g., famine). Hydrologic predictions of soil moisture droughts several months in advance are needed to mitigate the impact of these extreme events. In this study, the performance of a seasonal hydrologic prediction system for soil moisture drought forecasting over Europe is investigated. The prediction system is based on meteorological forecasts of the North American Multi-Model Ensemble (NMME) that are used to drive the mesoscale hydrologic model (mHM). The skill of the NMME-based forecasts is compared against those based on the ensemble streamflow prediction (ESP) approach for the hindcast period of 1983–2009. The NMME-based forecasts exhibit an equitable threat score that is, on average, 69% higher than the ESP-based ones at 6-month lead time. Among the NMME-based forecasts, the full ensemble outperforms the single best-performing model CFSv2, as well as all subensembles. Subensembles, however, could be useful for operational forecasting because they show only minor performance losses (less than 1%) at substantially reduced computational costs (up to 60%). Regardless of the employed forecasting approach, there is considerable variability in the forecasting skill ranging up to 40% in space and time. High skill is observed when forecasts are mainly determined by initial hydrologic conditions. In general, the NMME-based seasonal forecasting system is well suited for a seamless drought prediction system as it outperforms ESP-based forecasts consistently over the entire study domain at all lead times.
Droughts occur worldwide and are among the most devastating natural catastrophes. Droughts are defined as dry anomalies and occur in all compartments of the hydrological cycle (Sheffield and Wood 2011), such as the atmosphere (meteorological drought), streamflow and groundwater (hydrological drought), and root-zone soil moisture (agricultural drought). We focus here on agricultural droughts because they reduce crop yields, leading to substantial socioeconomic damages. For example, the 2003 European drought caused losses on the order of 13 billion euros (COPA-COGECA 2003), whereas in the United States it is estimated that droughts led to damages of 10 billion dollars on average per event (mainly agricultural but also others such as livestock; Smith and Katz 2013; Smith and Matthews 2015). In developing countries, droughts even threaten the livelihood of societies. The 2010/11 drought in the Horn of Africa, for example, led to a severe humanitarian crisis affecting around 12 million people (ReliefWeb 2015; Dutra et al. 2013). Drought early warnings can help to mitigate the impact of these disasters several months in advance, but only if they are based on skillful seasonal forecasting systems.
State-of-the-art seasonal forecasting systems employ either dynamical or statistical frameworks to generate a drought forecast. Statistical frameworks, for example, use conditional distribution functions of observed historical datasets for drought prediction (Madadgar and Moradkhani 2013). Dynamical prediction systems represent the physics of the earth system and typically constitute coupled atmosphere–ocean general circulation models (CGCMs), which provide climate forecasts (CFs) of meteorological variables (e.g., precipitation and air temperature). These forecasts are then used to force a hydrological model that can reliably simulate the land surface components of the hydrological cycle, such as root-zone soil moisture (SM). Previous studies have assessed the forecast skill of experimental prediction systems for specific drought events (Luo and Wood 2007; Dutra et al. 2013) as well as for multidecadal hindcast periods (Shukla and Lettenmaier 2011; Wang et al. 2011; Mo et al. 2012; Yuan et al. 2011, 2013a,b, 2015; Mo and Lettenmaier 2014; Shukla et al. 2014). In these studies, the ensemble streamflow prediction (ESP) approach is frequently used as a benchmark for representing climatological skill (Day 1985). ESP is a statistical method that resamples meteorological forcings from a historic dataset to represent the forcing uncertainty under unknown future conditions. It has been used to discriminate between the impact of initial hydrologic conditions (IHCs) and that of CFs on hydrologic predictions (Wood and Lettenmaier 2008; Shukla and Lettenmaier 2011; Shukla et al. 2013).
Previous studies indicate that SM predictability depends strongly on the region considered. For example, ESP-based SM forecasts in the western United States are as skillful as CF-based ones, while the latter only add value at 1-month lead time (Shukla and Lettenmaier 2011; Mo et al. 2012). In contrast, the National Centers for Environmental Prediction (NCEP) Climate Forecast System, versions 1 and 2 (CFSv1 and CFSv2), provide more skillful SM drought forecasts than ESP in the central and eastern United States at up to 6-months lead time (Yuan et al. 2013b). This might be related to stronger correspondence of drought to El Niño–Southern Oscillation (ENSO) in these regions and thus a higher atmospheric predictability (Mo 2011; Mo and Lyon 2015). A similar finding has been observed by Dutra et al. (2013) for a hindcast of the 2010/11 Horn of Africa drought using the European Centre for Medium-Range Weather Forecasts (ECMWF) seasonal forecasting systems S3 and S4. They reported high predictability for periods associated with a La Niña event and less predictability otherwise. Although such ENSO teleconnections are weaker in Europe, Yuan et al. (2015) observed that CGCM-based drought forecasts exhibit higher skill than ESP-based ones at up to 5-months lead time over the Danube River basin. In that study, the authors employed the recent North American Multi-Model Ensemble (NMME), which comprises 71 realizations of a multi-institutional, multimodel ensemble of climate forecast models at lead times of up to 9–10 months (Kirtman et al. 2014). The spatiotemporal distribution of SM drought forecasting skill using NMME over Europe has, however, not yet been fully evaluated. A high forecasting skill irrespective of the location and lead time is a fundamental requirement for a seamless prediction system.
Few studies focused on drought predictability during particular drought phases, such as the development, onset, and recovery. In one of these, Mo (2011) reported that drought recovery is more difficult to predict in the United States because it evolves on a shorter time scale than the development. Yuan and Wood (2013) reported that NMME models add skill to forecasts of meteorological drought onsets in tropical regions, but not in extratropical ones. In contrast to precipitation, SM drought predictability depends strongly on the IHCs (Wood and Lettenmaier 2008), which are substantially drier during the recovery than during the development phase. This characteristic has not been exploited when investigating the impact of IHCs on SM forecasts.
Multimodel forecasting ensembles such as CFSv2, ECMWF S4, and NMME have ever-increasing ensemble sizes to provide a better estimate of model uncertainty. This implies that they also offer more than one meteorological forcing time series for assessment studies. Nonetheless, most assessment studies focus only on the grand ensemble mean (e.g., Dutra et al. 2013; Mo et al. 2012; Yuan et al. 2013b, 2015). Few studies related the performance of the grand ensemble to that of individual models (Yuan and Wood 2013; Mo and Lettenmaier 2014). Thober and Samaniego (2014) recently showed that investigating subensembles, which do not take all realizations into account, has the potential to increase ensemble performance for reproducing extreme precipitation and temperature indices. Considering the fact that SM predictability is highly dependent on the quality of precipitation forecasts, subensembles could help either to increase the forecasting skill or to reduce computational load for operational forecasts without losing predictability.
Given the current knowledge regarding NMME-based SM drought forecasts over Europe, four research questions constitute the main goal of this study:
1) Are NMME-based drought forecasts more skillful than ESP-based ones over larger parts of the European domain?
2) How is the drought forecasting skill distributed in space and time?
3) How skillful are subensembles in forecasting European droughts in comparison to single NMME models and the full ensemble?
4) How do IHCs impact drought forecasting skill during drought development and recovery?
To address these research questions, the mesoscale hydrologic model (mHM; Samaniego et al. 2010; Kumar et al. 2013b) is used to simulate SM for monthly NMME-based precipitation and air temperature forecasts for the hindcast period of 1983–2009. These NMME-based forecasts are contrasted against those based on the ESP approach, which serve as a benchmark in this study. The mHM-derived SM forecasts are then transformed to a quantile-based soil moisture index (SMI). The SMI lies in the interval [0, 1] and a threshold of 0.2 is used to classify droughts. This cutoff implies that the lower 20% of SM states occurring in a given period (e.g., a month) are considered as drought. Reference SMI fields are created using the observation-based E-OBS (Haylock et al. 2008) to assess the skill of the different forecasting approaches employing the Pearson correlation coefficient R and the equitable threat score (ETS).
2. Methods and datasets
a. Climate forecasts
The forecasting dataset used in this study incorporates realizations of eight global climate models from the NMME with ensemble members varying between 6 and 24 per model [Table 1; see also Kirtman et al. (2014)]. Monthly CFs of precipitation and air temperature are provided globally at a 1° × 1° spatial resolution for lead times of up to 8 months. In total, 101 realizations are used in this study (available from the International Research Institute for Climate and Society). The performance of these models for soil moisture drought forecasts is analyzed for the overlapping hindcast period of 1983–2009. Note that not all of these models participate in the NMME phase-2 real-time dataset (NMME Project 2014). The analysis of the hindcast dataset in this study, however, provides the opportunity to investigate the performance of a large ensemble of seasonal climate model predictions in comparison to that of a simple statistical approach. The analysis is conducted over the European domain covering an area between 35° and 55°N and 10°W and 45°E.
b. Construction of soil moisture forecasts
The well-constrained mHM (Samaniego et al. 2010; Kumar et al. 2013b) is used here to generate gridded estimates of SM fields over the study domain. The mHM is a spatially explicit distributed hydrologic model in which hydrological processes are conceptualized similarly to those of other existing large-scale models like the Variable Infiltration Capacity (VIC) model (Liang et al. 1996) and the Water–Global Assessment and Prognosis (WaterGAP) model (Döll et al. 2003). It is driven by daily gridded fields of precipitation, air temperature, and potential evapotranspiration to simulate different components of the terrestrial hydrological system, such as canopy interception, snow accumulation and melt, soil moisture and infiltration, runoff generation and evapotranspiration, deep percolation and base flow, and flood routing between grid cells. The model is open source (www.ufz.de/mhm), and readers interested in more details may refer to Samaniego et al. (2010). To date, mHM has been successfully applied to several river basins in Germany, North America, and Europe (Samaniego et al. 2010, 2013, 2014; Kumar et al. 2013a,b). In this study, we use a similar model setup with respect to terrain, soil, and land-cover characteristics as that used by Rakovec et al. (2015), who demonstrated the ability of mHM to adequately represent the spatiotemporal dynamics of runoff, evapotranspiration, soil moisture, and total water storage anomaly over a wide range of European river basins.
The reference monthly SM field is obtained by forcing mHM with the observation-based E-OBS (version 8.0; Haylock et al. 2008) during the period 1950–2010. E-OBS is aggregated to 1° grid resolution to be compatible with the resolution of the NMME dataset. This reference SM field is then used to represent IHCs at the beginning of each month during the hindcast period (1983–2009).
Furthermore, E-OBS is used to set up the NMME- and ESP-based forecasts. The ESP forecast ensemble is created by resampling the meteorological dataset (i.e., E-OBS) of the hindcast period for a given target month excluding the year of that month, which is similar to the approach of previous studies (e.g., Twedt et al. 1977; Day 1985; Wood and Lettenmaier 2008; Shukla et al. 2013). In total, the ESP forecasting ensemble consists of 26 members. The spatiotemporal variability of E-OBS is employed to disaggregate NMME-based monthly precipitation forecasts to their corresponding daily values using a multiplicative cascade approach (Thober et al. 2014). This approach preserves the observed spatial patterns at the daily time scale as well as the monthly amount of the forecasted precipitation. Each monthly NMME forecast is stochastically disaggregated to an ensemble of 25 daily realizations, thus increasing the overall ensemble size to 2525 (=101 × 25). The daily weights for disaggregating the monthly temperature forecasts are derived from E-OBS for a given target month. This procedure is similar to the rescaling technique used by Yuan et al. (2015). The rescaled temperature estimates are then also used to adjust potential evapotranspiration, which is calculated using the Hargreaves–Samani approach (Hargreaves and Samani 1985). The daily mHM-derived SM fields for both forecasting systems are then averaged to their monthly estimates. A representative SM field for a given NMME model realization is created by averaging the corresponding estimates derived from the 25 disaggregated meteorological forecasts because there is no significant variability among the latter fields as they are all forced with the same monthly precipitation and air temperature.
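The leave-one-year-out construction of the ESP ensemble described above can be illustrated with a minimal sketch. The function name `esp_member_years` is hypothetical and not part of mHM or any published code; the sketch only selects the years whose observed forcings would be resampled and does not perform the hydrologic simulation itself.

```python
def esp_member_years(target_year, first_year=1983, last_year=2009):
    """Return the hindcast years used as ESP forcing members.

    Every year of the hindcast period except the target year contributes
    one meteorological forcing trace for the target month.
    """
    return [year for year in range(first_year, last_year + 1)
            if year != target_year]


members = esp_member_years(1995)
print(len(members))  # 27 hindcast years minus the target year
```

With the 27-yr hindcast period of 1983–2009, excluding the target year leaves the 26 members stated in the text.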
c. Calculation of soil moisture index
The monthly SM fields are converted into their respective quantiles using a nonparametric kernel density estimation method for the drought analysis. The kernel density is estimated by
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
for a given sample of n SM fractions $x_1, \ldots, x_n$, bandwidth h, and kernel function K. A Gaussian kernel is used in this study and the bandwidth is estimated by an optimization against a cross-validation error estimate [see Samaniego et al. (2013) for details]. The respective quantiles, hereafter denoted as SMI, and the corresponding distribution functions are estimated for each grid cell and calendar month independently. This procedure removes the seasonality of simulated SM and allows the comparability of SMI across locations. An SMI threshold value of 0.2 is used here to identify drought events following previous studies (e.g., Andreadis et al. 2005; Vidal et al. 2010; Sheffield et al. 2012; Samaniego et al. 2013).
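The quantile transformation for a single grid cell and calendar month can be sketched as follows. This is a simplified illustration, not the implementation used in the study: the function names are hypothetical, and the bandwidth is set with the normal reference rule of thumb instead of the cross-validation optimization described in the text. The SMI is obtained by integrating the Gaussian kernel density, which yields an average of Gaussian distribution functions centered on the sample values.

```python
import math
import statistics


def gaussian_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rule_of_thumb_bandwidth(sample):
    """Normal reference rule (ASSUMPTION: the study instead optimizes h
    against a cross-validation error estimate)."""
    n = len(sample)
    return 1.06 * statistics.stdev(sample) * n ** (-1.0 / 5.0)


def smi(value, sample, bandwidth=None):
    """Quantile of `value` under a Gaussian kernel estimate of `sample`."""
    h = rule_of_thumb_bandwidth(sample) if bandwidth is None else bandwidth
    return sum(gaussian_cdf((value - x) / h) for x in sample) / len(sample)


def is_drought(value, sample, threshold=0.2):
    """Classify a soil moisture state as drought if its SMI is below 0.2."""
    return smi(value, sample) < threshold


# Hypothetical SM fractions of one calendar month at one grid cell.
july_sm = [0.10, 0.20, 0.30, 0.40, 0.50]
print(smi(0.30, july_sm), is_drought(0.05, july_sm))
```

Because the sample is fitted separately for each calendar month, a value that is dry for winter may be normal for summer, which is how the index removes the seasonality of SM.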
The monthly SM estimates are converted to their respective standardized anomalies prior to the conversion of SM to SMI to ensure their comparability across different realizations, climate models, and forecasting methods (Koster et al. 2009). The standardized anomalies are obtained by removing the seasonal mean and standard deviation. In this approach, the distribution function is estimated only once using the reference SM anomalies. The forecasted SM anomalies are converted to SMI using this unique distribution function. This procedure provides a fair comparison between NMME- and ESP-based forecasts. In this study, no bias correction is applied to the NMME forecasts because the SMI calculation and the standardization of SM forecasts accounts for biases, particularly in the mean and standard deviation, as long as these biases are small and do not lead to unrealistic model behavior. The standardization of SM has also been exploited in previous studies to ensure comparability among different SM products (Dirmeyer et al. 2004; Koster et al. 2009; Wang et al. 2011). It is worth mentioning that bias correction is crucial for the correct quantification of hydrological fluxes in other applications where even small biases would modify the results substantially, such as streamflow predictions (e.g., Luo et al. 2007; Mo and Lettenmaier 2014).
Three SMI forecasting ensembles are created in this study: two based on NMME forecasts and one based on ESP. The two NMME-based approaches differ with respect to the employed averaging scheme. In the first approach, SMI forecasts are created for all 101 model realizations independently, and these are then averaged to obtain a grand NMME mean for SMI. This approach is referred to as the SMI-averaging scheme. In the second approach, the SM fields are first averaged over all model realizations to create a grand NMME mean for SM. The latter is then transformed to its respective SMI. This approach is referred to as the SM-averaging scheme. These two approaches provide different results because the SMI calculation is a highly nonlinear transformation. Investigating these two averaging schemes will help to determine the best possible NMME drought forecasting skill.
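The difference between the two NMME averaging schemes can be made concrete with a small synthetic experiment. The sketch below rests on several assumptions that are not from the study: it uses an empirical distribution function instead of the kernel estimate, a single time series instead of separate calendar months, and statistically independent realizations, which is an extreme case chosen to make the effect visible. All function names are hypothetical.

```python
import bisect
import random
import statistics


def standardize(series):
    """Remove the mean and scale by the standard deviation."""
    mean = statistics.fmean(series)
    std = statistics.stdev(series)
    return [(x - mean) / std for x in series]


def make_cdf(reference_anomalies):
    """Empirical distribution function of the reference anomalies
    (ASSUMPTION: stands in for the kernel-based estimate)."""
    ordered = sorted(reference_anomalies)
    return lambda x: bisect.bisect_right(ordered, x) / len(ordered)


def average_of_smi(members, cdf):
    """First scheme: one SMI series per realization, then average them."""
    smi_per_member = [[cdf(a) for a in standardize(m)] for m in members]
    return [statistics.fmean(step) for step in zip(*smi_per_member)]


def smi_of_average(members, cdf):
    """Second scheme: average SM over realizations, then one SMI series."""
    mean_sm = [statistics.fmean(step) for step in zip(*members)]
    return [cdf(a) for a in standardize(mean_sm)]


def drought_fraction(smi_series, threshold=0.2):
    return sum(s < threshold for s in smi_series) / len(smi_series)


random.seed(0)
n_steps, n_members = 240, 10
reference = standardize([random.gauss(0.3, 0.05) for _ in range(n_steps)])
cdf = make_cdf(reference)
members = [[random.gauss(0.3, 0.05) for _ in range(n_steps)]
           for _ in range(n_members)]

frac_first = drought_fraction(average_of_smi(members, cdf))
frac_second = drought_fraction(smi_of_average(members, cdf))
print(frac_first, frac_second)
```

Averaging quantiles pulls the result toward 0.5, so the first scheme rarely falls below the 0.2 threshold, whereas standardizing the ensemble-mean SM before the transformation keeps roughly one-fifth of the time steps below it.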
d. Subensemble selection
The NMME-based forecasts are further evaluated with respect to the performance of subensembles, as these might give a better performance than the full ensemble but with a reduced computational demand. There are several subensemble selection methods available to identify the best-performing subensemble, and the backward search algorithm is used in this study as suggested by Thober and Samaniego (2014). This algorithm is computationally efficient because it does not require the evaluation of all possible subensemble combinations. The algorithm is summarized here:
1) Select all NMME models as the first subensemble.
2) Sequentially remove a remaining model from the subensemble and evaluate the corresponding performance (e.g., Pearson correlation coefficient).
3) Repeat step 2 for all remaining models contained in the subensemble.
4) Replace the subensemble with the combination exhibiting the highest performance found in steps 2 and 3.
5) Repeat steps 2–4 until the subensemble contains only a single model.
6) Select the combination with the highest performance as the best-performing subensemble.
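The steps above can be sketched generically as follows. This is an illustrative reading of the listed steps, not the code of Thober and Samaniego (2014). The function `backward_search`, the toy skill values, and the toy scoring function are hypothetical; in practice, `score` would compute a criterion such as R or ETS of the subensemble forecast against the reference SMI.

```python
import math


def backward_search(models, score):
    """Greedy backward elimination over ensemble members.

    models: iterable of model identifiers
    score:  callable mapping a list of models to a performance value
            (higher is better)
    Returns the best subensemble found and its score.
    """
    current = list(models)                       # step 1: full ensemble
    best_subset, best_score = list(current), score(current)
    while len(current) > 1:                      # step 5: down to one model
        # steps 2-3: try removing each remaining model in turn
        candidates = [[m for m in current if m != removed]
                      for removed in current]
        scored = [(score(c), c) for c in candidates]
        # step 4: keep the best reduced combination
        current_score, current = max(scored, key=lambda pair: pair[0])
        if current_score > best_score:           # step 6: track overall best
            best_subset, best_score = list(current), current_score
    return best_subset, best_score


# Hypothetical toy criterion, for illustration only.
toy_skill = {"A": 0.5, "B": 0.4, "C": 0.1}


def toy_score(subset):
    return sum(toy_skill[m] for m in subset) / math.sqrt(len(subset))


best, best_value = backward_search(toy_skill, toy_score)
print(best, round(best_value, 3))
```

For n models, the search evaluates at most n(n + 1)/2 − 1 reduced combinations (35 for eight models) instead of all 2^n − 1 possible subensembles (255 for eight models), which is the source of the computational efficiency noted above.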
3. Results and discussion
a. Representation of spatiotemporal SMI dynamics
The overall skill of the NMME- and ESP-based forecasts to mimic the spatiotemporal dynamics of the reference SMI is analyzed for different lead times using the Pearson correlation coefficient (Fig. 1). Two different averaging schemes have been employed to create the NMME-based forecasts (section 2c). All three methods have a comparably high skill at 1-month lead time (R ≈ 0.9), confirming the strong influence of IHCs on SM forecasts at a short lead time (Wood and Lettenmaier 2008; Shukla et al. 2013). As expected, the forecasting skill decreases with increasing lead time, but the rate of this decrement is method dependent. For instance, the spatially averaged R value for ESP-based forecasts drops from 0.90 at 1-month lead time to 0.32 at 6-months lead time (around 65% loss; Figs. 1g–i). For NMME-based forecasts created by averaging the SMI of the individual realizations (the first averaging scheme in section 2c), the skill decreases from 0.87 to 0.25 (around 71% loss; Figs. 1a–c). This is the strongest decrement among all considered methods, and also the lowest performance at any lead time. In contrast, NMME-based forecasts created by transforming the ensemble-mean SM to SMI (the second averaging scheme in section 2c) have the highest performance and the lowest decrement among all considered methods (around 42% loss; Figs. 1d–f).
The outperformance of the scheme that transforms the ensemble-mean SM to SMI is also present for all four seasons (Table 2). The SMI forecasting skill is highest in winter [December–February (DJF)] for all considered methods. This might be related to the snowpack, which has a high influence on soil moisture development in the following months. NMME-based forecasts also benefit from a higher precipitation forecasting skill during this season, particularly at a 1-month lead time [see Fig. 2 in Mo and Lyon (2015)]. For 1-month and 3-months lead times, low forecasting skills are observed during autumn [September–November (SON)]. Interestingly, these shift to summer [June–August (JJA)] for 6-months lead time. This implies that the forecasting skill is small for forecasts ending at the beginning of winter. This might be related to the fact that higher evapotranspiration during summer and autumn reduces SM persistence during these seasons. Since the ordering of the different methods does not change with season (Table 2), the average forecasting skill over the whole year is investigated in the following analysis.
Although the different forecasting methods yield distinctively different skill, the spatial patterns among the corresponding forecasts are very similar (Figs. 1a–i). This is observed for any lead time. Regions exhibiting consistently higher skill are located for all methods in Poland, northern France, and eastern Ukraine with relatively less skill in the Alps (i.e., northern Italy, Switzerland, and Austria) and in the Pyrenees along the Spanish–French border. These patterns compare remarkably well with those of the persistence map of reference SMI (Figs. 1j–l). A high persistence (i.e., autocorrelation) of reference SMI indicates that SM states are exhibiting a long memory, which induces a high dependence of SMI forecasts on IHCs. In this study, perfect knowledge of IHCs is assumed (i.e., they are the same for all forecasts and the reference dataset), which leads to a high SMI forecasting skill (i.e., a high R) at locations exhibiting high SM persistence. On the contrary, SMI forecasts at locations having a short memory will be more dependent on CFs, and the large uncertainty therein reduces the ability to represent reference SMI dynamics.
NMME precipitation forecasting skill is very low over Europe (Figs. 2a–c), as found in previous studies (Yuan and Wood 2012b; Yuan et al. 2015; Mo and Lyon 2015). It is, however, significant for 1-month lead time. Temperature forecasting skill is comparatively high and does not decrease with increasing lead time (Figs. 2d–f). Notably, the cumulative precipitation and average air temperature are considered at 3- and 6-months lead times because droughts are creeping events that depend more on the integrated forecasting skill than on the forecasting skill of a particular month. It appears that the seasonality helps to achieve a skillful forecast for temperature at a long lead time. ESP-based predictions do not exhibit any skill for temperature and precipitation forecasts because this method uses only climatological information. It is thus not surprising that the relatively high skill in temperature and the significant skill in precipitation for 1-month lead time induce a higher skill in the NMME-based forecasts compared to those based on ESP.
The spatial patterns of forecasting skill show a higher agreement with the reference SMI persistence than with those of meteorological forecasting skill (cf. Fig. 1 and Fig. 2). This highlights the fact that the IHCs have a higher impact on the spatial variability of SMI forecasting skill than the CFs. The latter, however, cause the outperformance of NMME-based forecasts in comparison to ESP-based ones. These results illustrate the complex interactions between IHCs, CFs, and SMI forecasting skill.
In general, the NMME-based forecasts outperform the ESP-based ones by 69%, on average, at 6-months lead time (compare ETS in Figs. 1f,i). A similar outperformance has also been reported by Yuan et al. (2015) using bias-corrected CFs. No bias correction is applied to the CFs in the present study because the SMI calculation using standardized SM anomalies implicitly accounts for biases in SM as long as the obtained SM dynamics are not unrealistic (e.g., a constantly saturated soil). This illustrates that bias correction of state-of-the-art CFs might not be required to obtain a high forecasting skill for SM drought prediction. An analogous finding was reported by Yuan and Wood (2012a) for streamflow, who demonstrated that driving a hydrologic model with raw CFs and subsequently bias correcting the simulated streamflow results in a skillful prediction of the latter.
b. The effect of model averaging
In addition to the initial land surface conditions, the averaging scheme employed to create the NMME-based forecast has a decisive impact on the skill of representing reference SMI dynamics (Figs. 1a–f). Notably, the ensembles created by transforming the ensemble-mean SM to SMI outperform ESP-based forecasts, while the ensembles created by averaging the SMI of the individual realizations do not. This implies that the kind of averaging applied can have large impacts on the conclusions drawn in previous studies investigating the capabilities of ensemble drought prediction systems (Wang et al. 2011; Mo et al. 2012; Mo and Lettenmaier 2014; Yuan et al. 2013b, 2015). The SMI values of individual models are often recast to one of the ensembles in these studies, and the skill of drought prediction systems might be further increased by using averaging schemes that preserve the frequency of SMI values and therefore capture extremes.
The 24-member CFSv2 ensemble is used as one example to illustrate the impact of different averaging schemes on SMI dynamics (Fig. 3). A strong annual cycle can be observed for both the forecasted and the reference SM fractions. The mean SM forecast tends to overestimate the reference one, but the latter is mostly within the uncertainty bound of the forecast (Fig. 3a). The SMI, however, does not exhibit an annual cycle because the climatology of SM is treated separately for each calendar month in the SMI estimation (section 2c). The ensemble SMI forecasts tend to show a similar temporal dynamic as the reference one, but at the expense of an increased model spread compared to their respective SM forecasts (Fig. 3b). Because of the increased model spread for SMI, there is always an SMI forecast that is not under drought at a given forecast date. As a result, the scheme that averages the SMI of the individual realizations does not detect drought events given a 0.2 drought threshold (i.e., no time step is identified to be under drought). The reason is that the average of different SM indices is not a quantile-based index itself. For example, it does not fulfill the condition that 20% of the time steps exhibit an SMI less than 0.2. The scheme that transforms the ensemble-mean SM to SMI captures both the wet and dry extremes better than the SMI-averaging scheme and also preserves the property that 20% of the SMI time steps are below 0.2, which is crucial for drought analysis. The same effect was noticed for the other NMME models. Hence, the averaging scheme based on the SMI of the ensemble-mean SM is used in the further analysis.
c. Subensemble and single model performance for SMI and drought forecasts
Investigating the performance of subensembles is crucial to correctly determine the best possible performance of a given ensemble dataset. The backward selection algorithm proposed by Thober and Samaniego (2014) is used to identify subensembles of decreasing size based on Pearson correlation coefficient and ETS separately. The former criterion accounts for both wet and dry extremes, while the latter is used to measure the skill of forecasts to capture drought events based on a 0.2 SMI threshold (see appendix A for further details of the ETS). The selected subensemble should exhibit a high skill regardless of location and time step considered, which is a basic requirement for a seamless prediction system. Additionally, it is assumed that different subensembles distribute forecasting skill over seasons in a similar fashion, which has been also observed for the different forecasting methods (Table 2). For these reasons, the performance criteria are averaged over space, lead time, and forecasting time step.
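For orientation, the ETS of a drought forecast can be computed from a contingency table of forecast and reference drought states. The sketch below uses the standard definition of the equitable threat score (hits corrected for those expected by chance); the study's appendix A remains the authoritative description of how the score is applied there. The function name and the convention of returning 0 when the denominator vanishes are assumptions made for this illustration.

```python
def equitable_threat_score(forecast_smi, reference_smi, threshold=0.2):
    """ETS for drought events defined by SMI below `threshold`."""
    pairs = list(zip(forecast_smi, reference_smi))
    n = len(pairs)
    hits = sum(f < threshold and r < threshold for f, r in pairs)
    false_alarms = sum(f < threshold and r >= threshold for f, r in pairs)
    misses = sum(f >= threshold and r < threshold for f, r in pairs)
    # Number of hits expected from a random forecast with the same
    # event frequencies.
    hits_random = (hits + false_alarms) * (hits + misses) / n
    denominator = hits + false_alarms + misses - hits_random
    if denominator == 0:
        return 0.0  # ASSUMPTION: convention for the undefined case
    return (hits - hits_random) / denominator


reference = [0.10, 0.50, 0.90, 0.15, 0.60]
print(equitable_threat_score(reference, reference))
print(equitable_threat_score([0.5] * 5, reference))
```

A perfect forecast yields an ETS of 1, and a forecast that never indicates drought yields 0, so the score rewards only correctly forecast drought events beyond those expected by chance.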
The skill of any considered subensemble is higher than that of the single models for both criteria (Fig. 4a). On the contrary, ESP has the lowest performance among all considered approaches for R and only marginally outperforms the worst-performing model (CCSM3) for ETS. CFSv2 is the best-performing model, and the ordering of the single models is the same for R and ETS, with the exception of the second- and third-best models, which swap their places (CanCM3 and GEOS-5). As a consequence, the models selected within the subensembles are quite similar for the two criteria (Fig. 4b). Only the selected subensembles of size six are different by more than one model. For both criteria, the backward search algorithm correctly identifies CFSv2 as the single best-performing model. It is worth noting that the algorithm would select a different model if the best-performing model had been deselected in a previous iteration. Such a result has been reported for the ENSEMBLES dataset (Thober and Samaniego 2014).
The performance of the subensembles decreases monotonically with decreasing ensemble size for both criteria (Fig. 4a). This justifies the approach pursued in previous studies to use the full ensemble, as it exhibits the best possible performance (e.g., Yuan and Wood 2013; Mo and Lettenmaier 2014; Yuan et al. 2015). However, the selected subensembles containing four models require 60% of the computational costs of the full ensemble to achieve a skill that is only 0.3% and 0.5% less than that of the full ensemble for R and ETS, respectively. This highlights that operational forecasting could benefit from using subensembles instead of the full ensemble because of the reduced computational demand. The performance of the full NMME ensemble (NMME8) is contrasted with that of a subensemble containing four models (NMME4) in the following analysis to further illustrate this aspect. Without loss of generality, NMME4 evaluated against ETS is chosen because it shows a performance similar to that evaluated against R (the R value is only 1% less). The four models contained in NMME4 are CFSv2, CanCM3, ECHAMD, and CFSv1 (Fig. 4b). Only two of these models (CFSv2 and CanCM3) are, however, currently operational in the NMME phase 2 (NMME Project 2014).
Although subensembles consistently outperform single models and ESP, the spread of both criteria is relatively narrow. This is due to the fact that the initial hydrologic conditions are the same for all forecasting methods, which reduces the variability among the different soil moisture forecasts. In other words, the high variability in climatic forecasts is dampened while propagating through the hydrologic system exhibiting long memory. It is worth mentioning that substantially different subensemble performances have been observed for atmospheric variables like extreme precipitation indices (Thober and Samaniego 2014).
d. Spatiotemporal distribution of drought forecasting skill
It is desirable for a drought prediction system to be seamless with a high forecasting skill regardless of the location and the lead time. The forecasting skill of most prediction systems, however, varies in space and time (Shukla et al. 2013; Dutra et al. 2013; Yuan and Wood 2013). The spatiotemporal distribution of ETS is analyzed here to understand these variations as well as the factors that influence drought forecasting skill.
Distinctive spatial patterns in ETS are observed for both NMME and ESP (Fig. 5), which are similar to those of the Pearson correlation for the reference SMI dynamics (Figs. 1j–l). This illustrates that the impact of IHCs is also evident for extreme conditions. The differences in ETS between two locations across the study domain are as high as 40% (e.g., difference between Switzerland and Poland at 1-month lead time for NMME8; Fig. 5a). These spatial differences are larger than the differences between the NMME8 and ESP forecasting approaches, which range up to 8% on average at 6-months lead time. It is worth noting that the spatial distribution between NMME8 and NMME4 is very similar (Fig. 5). At 90% of the grid cells, the differences between these two ensemble-based forecasts are smaller than 5% in terms of ETS irrespective of the lead time.
The skill of both the NMME8 and ESP forecasting methods also depends on the forecast date (Figs. 6b–d). The differences between the smallest and highest ETS can also be as high as 40% for both forecasting methods, whereas the maximum difference between NMME8 and ESP forecasts at any given time step is at most 20%. Both forecasting methods, as expected, show lower ETS values at longer lead times, but the rate of decrease is lower for NMME8 than for ESP. This leads to the relative outperformance of 69% on average at 6-month lead time, as discussed above (section 3a). These results illustrate the added value of an ensemble seasonal forecasting system at longer lead times (Mo and Lettenmaier 2014). In general, NMME8 forecasts significantly outperform ESP ones at any location and lead time at a 5% significance level, which has also been reported by Yuan et al. (2015) using VIC over the Danube basin in Europe. This result is obtained by applying a Student’s t test, which has been previously used in drought prediction studies (Wilks 2011; Yuan et al. 2015). A similar result is obtained for the NMME4 subensemble, which requires only 60% of the computational demand as compared to the NMME8 (not shown).
The spread of single model performance is significantly narrower for the full NMME (19% on average) as compared to that of ESP (29% on average) at a 5% significance level (Figs. 6b–d). A similar result is obtained when the same number of samples (forcing members) is evaluated for NMME8 and ESP. The higher uncertainty for the ESP-based forecasts can be mostly attributed to poorly performing forecasts. The spread of ETS for the NMME8-based forecasts is often located within the upper tail of that estimated for the ESP-based ones. The skill of the full NMME is comparable to that of the best-performing model at a given forecast date (i.e., the upper limit of single model spread shown in Figs. 6b–d), which has also been reported for an NMME-based prediction system over the contiguous United States (CONUS; Mo and Lettenmaier 2014). It is worth noting that no single model outperforms all others at all forecasting dates. For example, CFSv2 only outperforms all other models at 20% of all forecasting dates, although it is the overall best-performing model (as discussed above; Fig. 4). This again highlights the advantage of using ensemble-based forecasts over ones based on a single model.
The temporal dynamics of ETS for the full NMME and ESP are quite similar (Figs. 6b–d), which again signifies the role of IHCs for drought predictions. Low ETS values are generally observed during periods of drought recovery with less extensive droughts (e.g., 1988, during autumn 1998, and at the end of 2004; Fig. 6a). Both forecasting methods overestimate the drought extent during these periods, which results in a high false alarm rate and thus reduces ETS. On the contrary, high ETS values are observed during drought development phases (e.g., during 1990, 1994, and summer of 2005). These results illustrate that the drought forecasting skill varies depending on the states of drought events (e.g., drought development and recovery). These are defined in the following section.
e. Forecasting skill during drought development and recovery
To further investigate the forecasting skill during drought development and recovery phases, two drought characteristics are analyzed for major drought events that cover more than 20% of the European domain (e.g., the 1983, 1990, and 2003 droughts; see also Fig. 6a). A drought time step is defined as development (recovery) if it occurs before (after) the peak extent of the respective event. The two characteristics are the drought severity and the area under drought (see appendix B for details). Both of these characteristics are normalized by their corresponding reference estimates (based on E-OBS) to make them comparable among different events. The perfect forecast would correspond to a value of one for both characteristics. The drought characteristics during both phases are calculated for all NMME and ESP ensemble members separately. Finally, a probability density function is estimated jointly for the two characteristics using a kernel estimation method [Eq. (1)] to assess their associated spread, following the procedure used by van Loon et al. (2014).
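The phase definition above can be illustrated with a minimal Python sketch. It assumes that one drought event is given as a list of fractional drought areas for consecutive time steps. The function names and the separate label for the peak time step itself are illustrative assumptions, because the text only assigns the steps before and after the peak.

```python
def is_major_event(area_fraction, min_extent=0.2):
    """True if the event covers more than `min_extent` of the domain at some time step."""
    return max(area_fraction) > min_extent


def classify_drought_phases(area_fraction):
    """Label each time step of one event relative to its peak extent.

    Steps before the peak are 'development', steps after it are 'recovery'.
    The peak step is labeled 'peak' (an assumption of this sketch).
    If the maximum occurs several times, the first occurrence is used.
    """
    peak = max(range(len(area_fraction)), key=lambda t: area_fraction[t])
    labels = []
    for t in range(len(area_fraction)):
        if t < peak:
            labels.append("development")
        elif t > peak:
            labels.append("recovery")
        else:
            labels.append("peak")
    return labels


# Hypothetical event: fraction of the domain under drought per month.
event = [0.05, 0.15, 0.30, 0.22, 0.10]
phases = classify_drought_phases(event)
```

For this hypothetical event, the first two months are labeled as development and the last two as recovery, and the event qualifies as major because its peak extent exceeds 20%.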
In general, the forecasted drought severity matches the median reference one quite well, with deviations less than 20% irrespective of the lead time, drought phase, and forecasting method (horizontal lines in Fig. 7). On the contrary, substantial underestimations in drought area are observed with increasing lead time, reaching up to 55% for NMME8, 51% for NMME4, and 68% for ESP (vertical lines in Fig. 7). Additionally, these are more pronounced during drought development phases than during recovery phases. In summary, the forecasts have more difficulty detecting the location of the reference drought than reproducing its severity. If a drought has been correctly forecasted at a given location, then it is likely that the severity of this event would be comparable to that of the reference one.
The spread of drought severity and area increases with lead time for all forecasting methods (see regions containing 90% of the density in Fig. 7). Expectedly, the relatively larger uncertainty in climatic forecasts at longer lead times causes a higher spread in drought characteristics (Wood and Lettenmaier 2008; Shukla and Lettenmaier 2011). This spread is larger during the drought recovery than during the development phases at a long lead time, which is in agreement with Mo (2011), who reported that drought development is more predictable than drought recovery.
The spread is also remarkably similar for the NMME8- and NMME4-based forecasts. For example, there is a comparable overlap of spread estimated during the drought development and the recovery phases at 3-month lead time. This overlap is considerably different from that observed for ESP-based forecasts (cf. Figs. 7b,e,h). These results illustrate that the NMME4 subensemble also performs similarly to the full NMME during different drought phases, while requiring only 60% of the computational resources.
In general, all forecasting methods underestimate the reference drought severity during the drought development phases at all lead times (Fig. 7). This results from forecasts that are too wet, leading to higher soil moisture index conditions as compared to the relatively drier reference ones. On the contrary, drought severity is overestimated during the drought recovery phases at 3- and 6-month lead times. The forecasts are drier than the reference one in this case. In other words, they are not able to add sufficient SM to recover from the drought. These results illustrate the fundamental influence of IHCs that persist throughout the drought forecasts, leading to a consistent lag of these with respect to the reference soil moisture index dynamics (see also Fig. 3b). This is expected for ESP, as it represents a climatological forecast and the skill is mainly derived from the correct representation of IHCs (Koster et al. 2004; Shukla et al. 2013). The skill of NMME-based forecasts has a similar dependence on the IHCs as ESP, even though NMME models represent physical dynamics of the earth system. They do, however, provide a substantially better forecast for drought area as compared to ESP (Fig. 7).
4. Summary and conclusions
In this study, the skill of a seasonal hydrologic prediction system for soil moisture (SM) drought forecasts is evaluated over Europe for a 27-yr hindcast period (1983–2009). The prediction system is based on meteorological forecasts of the North American Multi-Model Ensemble (NMME) that are used to drive the mesoscale hydrologic model (mHM). The skill of NMME-based forecasts is contrasted with that of the ensemble streamflow prediction (ESP) approach. The obtained SM estimates from both forecasting approaches are transformed to a quantile-based soil moisture index (SMI) to conduct a drought analysis using a 0.2 SMI threshold. Drought prediction skill is quantified in terms of the equitable threat score (ETS) employing a reference SMI field. The latter has been created using the observation-based E-OBS.
NMME-based forecasts significantly outperform ESP-based ones, particularly at a long lead time (i.e., up to 69% higher ETS at 6-month lead time). This is achieved only if the SMI has been calculated for the grand ensemble SM mean. In contrast, the grand ensemble SMI obtained by averaging single NMME model–based SMIs does not outperform the ESP-based one. Among the NMME-based forecasts, the full ensemble outperforms the single models as well as all selected subensembles. There is a considerable variability in the skill of SMI forecasts over Europe (i.e., up to 40% in space and time), regardless of the forecasting approach. This variability is strongly related to the persistence of reference SM, illustrating the strong impact of initial hydrologic conditions (IHCs) on SM drought forecasts. The IHCs are wetter during drought development phases than during drought recovery phases, which induces an underestimation of drought severity during the former and an overestimation during the latter phases.
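The two ways of forming a grand ensemble SMI that are contrasted above can be made concrete with a short Python sketch. It is only an illustration under simplifying assumptions. The SMI is approximated here by an empirical non-exceedance frequency, which is not necessarily the estimator used in this study. Each series serves as its own climatological sample, and the soil moisture values are hypothetical.

```python
def empirical_smi(value, climatology):
    """Fraction of climatological values that do not exceed `value` (simple quantile estimate)."""
    return sum(1 for x in climatology if x <= value) / len(climatology)


def smi_of_ensemble_mean(sm_by_model):
    """Average SM across models first, then transform the mean series to SMI."""
    mean_sm = [sum(vals) / len(vals) for vals in zip(*sm_by_model)]
    return [empirical_smi(v, mean_sm) for v in mean_sm]


def mean_of_model_smis(sm_by_model):
    """Transform each model's SM series to SMI first, then average the SMIs."""
    smis = [[empirical_smi(v, series) for v in series] for series in sm_by_model]
    return [sum(vals) / len(vals) for vals in zip(*smis)]


# Hypothetical SM series (arbitrary units) of two models at four time steps.
sm_by_model = [
    [10, 20, 30, 40],
    [30, 10, 40, 20],
]
smi_from_mean_sm = smi_of_ensemble_mean(sm_by_model)
smi_from_mean_smi = mean_of_model_smis(sm_by_model)
```

In this toy case, the SMI of the ensemble-mean SM spans the full set of quantiles, whereas the averaged SMIs are pulled toward the center of the distribution, so fewer values fall below a fixed threshold such as 0.2. This may contribute to the difference in skill reported above, although the sketch does not establish that.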
The main conclusion of this study is that NMME-based forecasts are useful for seasonal SM drought prediction over Europe, which is in accordance with recent studies for the CONUS and GEWEX river basins using the VIC land surface scheme (Mo and Lettenmaier 2014; Yuan et al. 2015). The NMME-based forecasts are well suited for a seamless prediction system, as their skill is consistently higher than that of ESP-based ones over the entire study domain at all lead times.
The selected subensembles only show performance losses less than 1% on average in comparison to the full ensemble, but at 60% of the computational demand. Subensembles thus provide a promising alternative to the full ensemble and might be useful for operational seasonal SM drought forecasting. The subensemble skill has been averaged over space, lead time, and forecasting time step because the subensemble should exhibit a high skill regardless of the location and time step to be useful for a seamless prediction system. Alternative selection methods, however, could take the spatiotemporal variability of forecasting skill into account in the selection process. They should also test whether the skill of subensembles is stationary in time, which is a crucial requirement for operational forecasting. Moreover, bias correction of raw meteorological data has little impact on SM drought forecasting skill because the calculation of the quantile-based SMI already accounts for systematic biases, particularly in the mean and standard deviation, as long as these do not lead to unrealistic SM dynamics.
The results of this study illustrate the ubiquitous impact of IHCs on SM drought forecasting skill. The uncertainty associated with imperfect IHCs is, however, not considered here. Methods for further evaluating this aspect such as the reverse ESP approach have been investigated in previous studies using observational datasets (Wood and Lettenmaier 2008; Shukla and Lettenmaier 2011; Shukla et al. 2013). With the increase of computational resources, these should also be considered in the evaluation of ensemble SM drought prediction systems, such as those based on the NMME. Future studies could investigate the NMME phase-2 data containing real-time forecasts instead of the hindcast dataset explored in this study.
This study was carried out within the Helmholtz Association climate initiative REKLIM (www.reklim.de) and was supported by the Helmholtz Interdisciplinary Graduate School for Environmental Research (HIGRADE). We acknowledge E-OBS from the EU-FP6 project ENSEMBLES (http://ensembles-eu.metoffice.com) and the data providers in the ECA&D project (http://www.ecad.eu). We also would like to thank the International Research Institute for Climate and Society (IRI; http://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/) for making the NMME dataset available. Moreover, we thank the editor and three anonymous reviewers for their constructive comments that helped to further improve this text.
Appendix A: Equitable Threat Score
Forecast verification for discrete events (e.g., a drought event) is commonly carried out using measures that are based on a 2 × 2 contingency table (Wilks 2011). In this study, we use the ETS as skill measure, which is defined as
$$\mathrm{ETS} = \frac{a - a_{\mathrm{ref}}}{a + b + c - a_{\mathrm{ref}}}, \tag{A1}$$
where $a$ is the number of drought events that occur in both the forecast and the reference dataset (commonly called hits), $b$ is the number of drought events that occur in the forecast but not in the reference dataset (commonly called false alarms), and $c$ is the number of droughts that occur not in the forecast but in the reference dataset (commonly called misses). The variable $a_{\mathrm{ref}}$ is defined as
$$a_{\mathrm{ref}} = \frac{(a + b)(a + c)}{n}, \tag{A2}$$
where $n$ is the total number of time steps. ETS is used in this study because it condenses the hit rate and the false alarm rate into one metric. An ETS of 100% indicates a hit rate of 1 and a false alarm rate of 0, which means that all drought events are forecasted perfectly.
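As a minimal sketch, the ETS can be computed from two binary drought series as follows. The sketch assumes the standard form of the score given by Wilks (2011), in which the number of hits expected by chance, (a + b)(a + c)/n, is subtracted from both the hits and the sum of hits, false alarms, and misses. The function name and the example series are hypothetical.

```python
import math


def equitable_threat_score(forecast, reference):
    """ETS for two equally long sequences of booleans (True = drought)."""
    if len(forecast) != len(reference):
        raise ValueError("forecast and reference must have the same length")
    n = len(forecast)
    a = sum(1 for f, r in zip(forecast, reference) if f and r)      # hits
    b = sum(1 for f, r in zip(forecast, reference) if f and not r)  # false alarms
    c = sum(1 for f, r in zip(forecast, reference) if not f and r)  # misses
    a_ref = (a + b) * (a + c) / n  # hits expected by chance
    denominator = a + b + c - a_ref
    if denominator == 0:
        # Undefined, e.g., if no drought (or only drought) occurs in both series.
        return math.nan
    return (a - a_ref) / denominator


reference = [True, False, True, False]
perfect = equitable_threat_score([True, False, True, False], reference)
no_skill = equitable_threat_score([True, True, False, False], reference)
```

In this example, the perfect forecast yields an ETS of 1 (i.e., 100%), whereas the second forecast produces exactly as many hits as expected by chance and therefore an ETS of 0.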
Appendix B: Drought Severity and Area
Two drought characteristics are evaluated during the drought development and recovery phases. These are the fraction of correctly forecasted drought area and the drought severity of this area. For a given time step $t$, the former is defined as
$$A_t = \frac{a_t}{a_t + c_t}, \tag{B1}$$
where $a_t$ is the number of grid cells under drought both in the forecast and the reference dataset at time step $t$ and $c_t$ is the number of grid cells under drought that occur not in the forecast but in the reference dataset at time step $t$. It is worth mentioning that this area is equivalent to the hit rate estimated over space.
The drought severity is calculated for the grid cells that exhibit a drought both in the forecast and the reference dataset. For a given time step $t$, the drought severity is defined as
$$S_t = \frac{1}{a_t} \sum_{i=1}^{a_t} \left(\tau - \mathrm{SMI}_{i,t}\right)_+, \tag{B2}$$
where $\tau$ is the SMI drought threshold (here 0.2), $(\cdot)_+$ is the positive part function, and $a_t$ is defined as above; the index $i$ runs over the grid cells under drought in both datasets. A large deviation from the drought threshold leads to higher severity, indicating a more severe drought. The severity of the forecast is then normalized by that of the reference dataset [calculated over the same area using Eq. (B2)] to make them comparable among different drought events.
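The two characteristics can be sketched in Python for a single time step. The sketch assumes that a grid cell is under drought if its SMI is strictly below the threshold and that the severity is averaged over the grid cells under drought in both datasets. Both choices, as well as the function name and the SMI values, are assumptions made for illustration.

```python
import math


def drought_area_and_severity(smi_forecast, smi_reference, tau=0.2):
    """Return (area fraction, normalized severity) for one time step.

    Both inputs are equally long sequences of SMI values, one per grid cell.
    """
    # Grid cells under drought in both the forecast and the reference (hits).
    hits = [(f, r) for f, r in zip(smi_forecast, smi_reference)
            if f < tau and r < tau]
    # Grid cells under drought in the reference only (misses).
    misses = sum(1 for f, r in zip(smi_forecast, smi_reference)
                 if f >= tau and r < tau)
    a_t, c_t = len(hits), misses
    if a_t == 0:
        # Area is 0 if only misses occur; severity is undefined without hits.
        area = 0.0 if c_t > 0 else math.nan
        return area, math.nan
    area = a_t / (a_t + c_t)
    # Mean positive part of (tau - SMI) over the hit cells.
    severity_forecast = sum(max(tau - f, 0.0) for f, _ in hits) / a_t
    severity_reference = sum(max(tau - r, 0.0) for _, r in hits) / a_t
    return area, severity_forecast / severity_reference


# Hypothetical SMI values at four grid cells.
area, severity = drought_area_and_severity(
    [0.10, 0.30, 0.05, 0.50],
    [0.15, 0.10, 0.10, 0.60],
)
```

In this example, two of the three reference drought cells are also forecasted to be under drought, giving an area fraction of 2/3. The forecast is drier than the reference in these cells, so the normalized severity is larger than one.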