Accumulated precipitation forecasts are of high socioeconomic importance for agriculturally dominated societies in northern tropical Africa. In this study, the performance of nine operational global ensemble prediction systems (EPSs) is analyzed relative to climatology-based forecasts for 1–5-day accumulated precipitation based on the monsoon seasons during 2007–14 for three regions within northern tropical Africa. To assess the full potential of raw ensemble forecasts across spatial scales, state-of-the-art statistical postprocessing methods were applied in the form of Bayesian model averaging (BMA) and ensemble model output statistics (EMOS), and results were verified against station and spatially aggregated, satellite-based gridded observations. Raw ensemble forecasts are uncalibrated and unreliable, and often underperform relative to climatology, independently of region, accumulation time, monsoon season, and ensemble. The differences between raw ensemble and climatological forecasts are large and partly stem from poor prediction for low precipitation amounts. BMA and EMOS postprocessed forecasts are calibrated, reliable, and strongly improve on the raw ensembles but, somewhat disappointingly, typically do not outperform climatology. Most EPSs exhibit slight improvements over the period 2007–14, but overall they have little added value compared to climatology. The suspicion is that parameterization of convection is a potential cause for the sobering lack of ensemble forecast skill in a region dominated by mesoscale convective systems.
The bulk of precipitation in the tropics is related to moist convection, in contrast to the frontal-dominated extratropics. Because of the small-scale processes involved in the triggering and growth of convective systems, quantitative precipitation forecasts are known to have overall poorer levels of skill in tropical latitudes (Haiden et al. 2012). This can be monitored in quasi–real time via the World Meteorological Organization (WMO) Lead Centre on Verification of Ensemble Prediction System website (http://epsv.kishou.go.jp/EPSv) by comparing deterministic and probabilistic skill scores for 24-h precipitation forecasts for the 20°N–20°S tropical belt with those for the Northern and Southern Hemisphere extratropics. There are hints that precipitation and cloudiness forecasts in the tropics show enhanced skill during regimes of stronger synoptic-scale forcing (Söhne et al. 2008; Davis et al. 2013; Van der Linden et al. 2017) or in regions of orographic forcing (Lafore et al. 2017), but large parts of the tropical landmasses are dominated by convection that initiates from small-scale surface and boundary layer processes and sometimes is organized into mesoscale convective systems (MCSs). The latter depends mostly on the thermodynamic profile and vertical wind shear.
Within this context, northern tropical Africa, particularly the semiarid Sahel, can be considered a region where precipitation forecasting is particularly challenging. The area consists of vast flatlands, where MCSs during boreal summer provide the bulk of the annual rainfall (Mathon et al. 2002; Fink et al. 2006; Houze et al. 2015) and convergence lines in the boundary layer or soil moisture gradients at the kilometer scale can act as triggers for MCSs (Lafore et al. 2017). Sahelian MCSs often take the form of meridionally elongated squall lines with sharp leading edges characterized by heavy rainfall. Synoptic-scale African easterly waves are known to be linked to squall-line occurrence in the western Sahel (Fink and Reiner 2003) and lead to enhanced skill in cloudiness forecasts over West Africa (Söhne et al. 2008).
However, numerical weather prediction (NWP) models are known to have an overall poor ability to predict rainfall systems over northern Africa. For example, the gain in skill by improved initial conditions due to an enhanced upper-air observational network during the 2006 African Monsoon Multidisciplinary Analysis (AMMA) campaign (Parker et al. 2008) was lost in NWP models after 24 h of forecast time, potentially because of the models’ inability to predict the genesis and evolution of convective systems (Fink et al. 2011).
Given the substantial challenges involved in forecasting rainfall in northern Africa, one might hope that ensemble prediction systems (EPSs) provide an accurate assessment of uncertainties and a more useful forecast overall. An ensemble is a set of deterministic forecasts, created by changes in the initial conditions and/or the numerical representation of the atmosphere (Palmer 2002). With clear advantages of ensembles over single deterministic forecasts, EPSs are now run at all major NWP centers, which led to the creation of the TIGGE multimodel ensemble database (Bougeault et al. 2010; Swinbank et al. 2016). TIGGE contains forecasts from up to 10 global EPSs, with the ensemble of the European Centre for Medium-Range Weather Forecasts (ECMWF) being the most prominent and most important contributor (Hagedorn et al. 2012). To our knowledge, this present study is the first to rigorously and systematically assess the quality of ensemble forecasts for precipitation over northern tropical Africa. This is partly related to the fact that for this region ground verification data from rain gauge observations are infrequent on the Global Telecommunication System (GTS), the standard verification data source for NWP centers.
Despite many advances in the generation of EPSs, ensembles share structural deficiencies such as dispersion errors and biases. Statistical postprocessing addresses these deficiencies and realizes the full potential of ensemble forecasts (Gneiting and Raftery 2005). Additionally, it performs implicit downscaling from the model grid resolution to finer resolutions or station locations. The correction of systematic forecast errors is based on (distributional) regression techniques and, depending on the need of the user, several approaches are at hand (Schefzik et al. 2013; Gneiting 2014). Hamill et al. (2004) and Wilks (2009) proposed and extended logistic regression techniques, which yield probabilistic forecasts for the exceedance of thresholds. Here, we will for the first time explore whether established methods such as Bayesian model averaging (BMA; Raftery et al. 2005) and ensemble model output statistics (EMOS; Gneiting et al. 2005), which provide complete probabilistic quantitative precipitation forecasts, can improve precipitation forecasts for Africa.
The ultimate goal of this paper is to provide an exhaustive assessment of our current ability to predict rainfall over northern tropical Africa, considering the skill of raw and postprocessed forecasts from TIGGE. Any skill, if present, would be expected to come from resolved large-scale forcing processes as mentioned above. We examine accumulation periods of 1–5 days for the monsoon seasons 2007–14 and verify against about 21 000 daily rainfall observations from 132 rain gauge stations and satellite-based gridded precipitation observations. Section 2 introduces the TIGGE ensemble, as well as the station and satellite-based observations used for verification. Section 3 describes our benchmark climatological forecast and methods for the evaluation of probabilistic forecasts and explains EMOS and BMA in detail. Results are presented in section 4, where we verify 1-day accumulated ECMWF precipitation forecasts against station observations. This analysis is performed in particular depth and serves as a fundamental exemplar. We also evaluate ECMWF ensemble forecasts at longer accumulation times and for spatial aggregations, before turning to the analysis of all TIGGE subensembles. Implications of our findings and possible alternative methods for forecasting precipitation over northern tropical Africa are discussed in section 5.
The TIGGE multimodel ensemble was set up as part of the THORPEX program in order to “accelerate improvements in the accuracy of 1-day to 2-week high-impact weather forecasts for the benefit of humanity” (Bougeault et al. 2010, p. 1060). Since its start in October 2006, up to 10 global NWP centers have provided their operational ensemble forecasts, which are accessible on a common 0.5° × 0.5° grid. Park et al. (2008) and Bougeault et al. (2010) discuss objectives and the setup of TIGGE, including the participating EPSs, in great detail. They also note early results using the TIGGE ensemble, while Swinbank et al. (2016) report on achievements accomplished over the last decade. Hagedorn et al. (2012) find that a multimodel ensemble composed of the four best participating TIGGE EPSs, which include the ECMWF ensemble, outperforms reforecast-calibrated ECMWF forecasts. For the evaluation of NWP precipitation forecast quality, TIGGE is the most complete and best available data source for the period 2007–14. Table 1 gives an overview of the nine participating TIGGE EPSs that provide accumulated precipitation forecasts.
In addition to the separate evaluation of each participating TIGGE subensemble, we construct a reduced multimodel (RMM) ensemble. For each of the seven subensembles available for the period 2008–13, the RMM ensemble uses the mean of the perturbed members and the control run as individual contributors; for the ECMWF EPS, the high-resolution run is included as well. The RMM ensemble therefore consists of 15 members (seven ensemble means, seven control runs, and the ECMWF high-resolution run). Because postprocessing performs an implicit weighting of all contributions, a manual selection of subensembles as performed by Hagedorn et al. (2012) is not necessary.
Arguably, the ECMWF EPS is the leading example among the TIGGE subensembles (Buizza et al. 2005; Hagedorn et al. 2012; Haiden et al. 2012). It consists of a high-resolution (HRES) run, a control (CNT) run, and 50 perturbed ensemble (ENS) members. The HRES and CNT runs are started from unperturbed initial conditions and differ only in their resolution. The ENS members are started from perturbed initial conditions and have the same resolution as the CNT run. Molteni et al. (1996) and Leutbecher and Palmer (2008) describe the generation and properties of the ECMWF system in detail.
Despite multiple advances in satellite rainfall estimation, station observations of accumulated precipitation remain a reliable and necessary source of information. However, the meteorological station network in tropical Africa is sparse and clustered, and observations of many stations are not distributed through the GTS. The Karlsruhe African Surface Station Database (KASS-D) contains precipitation observations from a variety of networks and sources. Manned stations operated by African national weather services provide the bulk of the 24-h precipitation data. Due to long-standing collaborations with these services and African researchers, KASS-D contains many observations not available in standard, GTS-fed station databases. Within KASS-D, 960 stations have daily accumulated (usually 0600–0600 UTC) precipitation observations.
After excluding stations outside the study domain, and removing sites with less than 80% available observations in any of the monsoon seasons, the remaining 132 stations were subject to quality control, as described in the appendix, and passed these tests. Based on their rainfall climate (e.g., Fink et al. 2017) and geographic clustering, the stations were assigned to three regions, as indicated in Fig. 1, referred to in this paper as West Sahel, East Sahel, and Guinea Coast.
As NWP forecasts are issued for grid cells, the comparison of station observations against gridded forecasts is fraught with problems. To allow for an additional assessment of forecast quality without a gauge-to-gridbox comparison and for areas without station observations, we use satellite-based, gridded precipitation estimates. Based on recent studies, version 7 (and also version 6) of the Tropical Rainfall Measuring Mission (TRMM) 3B42 gridded dataset is regarded as the best available satellite precipitation product, despite a small dry bias (Roca et al. 2010; Maggioni et al. 2016; Engel et al. 2017).
TRMM merges active measurements from the precipitation radar with passive, radar-calibrated information from infrared as well as microwave measurements (Huffman et al. 2007). Based on monthly accumulation sums, TRMM estimates are calibrated against nearby gauge observations. TRMM 3B42-V7 data are available on a 0.25° × 0.25° grid with 3-hourly temporal resolution.
c. Data preprocessing
Based on 1-day accumulated station observations, we derive 2–5-day accumulated precipitation observations by summing over consecutive 1-day observations. As these cover the period from 0600 UTC of the previous day to 0600 UTC of the considered day and as all TIGGE subensembles, except Météo-France (MF), have initialization times different from 0600 UTC, we use the most recent run available at that time and adapt accordingly. Specifically, for the subensembles initialized at 0000 UTC, we use the difference between the 30-h accumulated and the 6-h accumulated precipitation forecasts. For initialization at 1200 UTC, we use the difference between the 42-h accumulated and the 18-h accumulated precipitation forecasts, and for longer accumulation times, we extend this process correspondingly.
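The differencing of run-total accumulations described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name, the dictionary layout keyed by lead time in hours, and the sample values are all hypothetical.

```python
def window_accumulation(accumulated, init_hour_utc, days=1, obs_start_utc=6):
    """Forecast precipitation (mm) for an observation window starting at 0600 UTC.

    `accumulated` maps lead time in hours to precipitation accumulated since
    initialization. This dict layout is a hypothetical stand-in for model output.
    """
    # Lead time (h) at which the observation window opens for this run.
    start = (obs_start_utc - init_hour_utc) % 24
    # The window spans `days` consecutive 24-h periods.
    end = start + 24 * days
    return accumulated[end] - accumulated[start]


# Hypothetical accumulations (mm) for a run initialized at 0000 UTC.
run_00utc = {6: 1.5, 30: 4.0, 54: 7.0}
one_day = window_accumulation(run_00utc, init_hour_utc=0)          # 30 h minus 6 h
two_day = window_accumulation(run_00utc, init_hour_utc=0, days=2)  # 54 h minus 6 h
```

For a run initialized at 1200 UTC the same function selects the 18-h and 42-h lead times, matching the rule stated in the text.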
To obtain forecasts for a specific station location from gridded NWP forecasts, either bilinear interpolation or a nearest-neighbor approach can be used. We use the latter, implying that the forecast for the station is the same as the forecast for the grid cell containing the station. Especially for large gridbox sizes, bilinear interpolation may not be physically persuasive, and the nearest-neighbor approach is more compelling.
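A nearest-neighbor lookup on a regular grid might be sketched as follows. The grid origin, the assumption that grid values sit at cell centers, and the station coordinates are hypothetical choices for illustration; only the 0.5° spacing is taken from the text.

```python
import math


def nearest_grid_index(lat, lon, lat0, lon0, step=0.5):
    """Row and column index of the grid point nearest to a station.

    Assumes a regular grid whose first point is (lat0, lon0) and whose values
    are located at cell centers, so the nearest point is the containing cell.
    """
    # floor(x + 0.5) rounds half up and avoids Python's round-half-to-even.
    row = math.floor((lat - lat0) / step + 0.5)
    col = math.floor((lon - lon0) / step + 0.5)
    return row, col


# Hypothetical station at 12.37N, 1.52W on a grid starting at 0N, 20W.
row, col = nearest_grid_index(12.37, -1.52, lat0=0.0, lon0=-20.0)
cell_lat = 0.0 + row * 0.5    # latitude of the selected grid point
cell_lon = -20.0 + col * 0.5  # longitude of the selected grid point
```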
TRMM observations are temporally aggregated to the same periods as the station observations. As they do not cover the exact same periods, the first and last 3-h TRMM observations are weighted by 0.5. For evaluation on different spatial scales, NWP forecasts and TRMM observations are aggregated to longitude–latitude boxes of 0.25° × 0.25°, 1° × 1°, and 5° × 2°. Because the propagation of precipitation systems is a potential source of error, and these systems move predominantly westward in this environment, the largest box is tailored to assess NWP forecast quality without this potential source of error.
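The half-weighting of the boundary slots can be illustrated with a short Python sketch. It assumes that the input holds 3-hourly amounts in millimeters for consecutive slots whose nominal times run from the start to the end of the window (nine values for one day), so that the first and last slots straddle the window edges. The function name and input layout are hypothetical.

```python
def aggregate_trmm(three_hourly_mm):
    """Sum 3-hourly amounts over a window, half-weighting both boundary slots."""
    if len(three_hourly_mm) < 2:
        raise ValueError("need at least the two boundary slots")
    interior = sum(three_hourly_mm[1:-1])
    return 0.5 * three_hourly_mm[0] + interior + 0.5 * three_hourly_mm[-1]


# Nine hypothetical slots from 0600 UTC to 0600 UTC of the next day.
daily_total = aggregate_trmm([2.0] * 9)
```

With nine equal slots the weights add up to eight, which corresponds to 24 h of 3-hourly data.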
d. Consistency between TRMM and station observations
In light of the dry bias of the TRMM observations, we evaluate the consistency of TRMM and station observations in our datasets. Specifically, we pair each station observation with the TRMM observation for the 0.25° × 0.25° box that contains the station location. Figure 2 shows contingency tables of TRMM and station observations above and below 0.2 mm, respectively, and two-dimensional frequency plots for TRMM and station observations above 0.2 mm, which is our threshold for the distinction between rain and no rain throughout the paper, as discussed in section 3b. For all regions the prevailing case is the one with both TRMM and the station reporting precipitation amounts below 0.2 mm. Among the disagreeing cases, the one with TRMM observing more than 0.2 mm and the station less than 0.2 mm is more frequent, coinciding with the intuition that a station is more likely to miss a precipitation event reported by TRMM than vice versa. The least squares regression lines in the two-dimensional frequency plots illustrate the dry bias of TRMM relative to station observations when both report rain. Overall, the agreement between the station and TRMM observations is fair. Disagreements of the magnitude and type seen here arise for reasons of differing coverage, spatial variability, and retrieval problems, among other concerns, and are compatible with the extant literature (see, e.g., Roca et al. 2010; Engel et al. 2017).
Probabilistic forecasts are meant to provide calibrated information about future events. To be of use, they should satisfy two properties. First, they should convey correct probabilistic statements, in that observations behave like random draws from the forecast distributions. This property is called calibration. Second, among all calibrated forecasts, sharper ones with lesser uncertainty are preferred.
a. Reference forecasts
For the assessment of raw and postprocessed ensemble forecast skill, the availability of a benchmark forecast is essential. Here, we introduce the concept of a probabilistic climatology that consists of the observations during the 30 years prior to the considered year at the considered day of the year and location. This can be understood as a 30-member observation-based ensemble forecast that represents the climatological distribution of rainfall at a given location and date, but does not incorporate dynamic information about the state of the atmosphere. We extend the probabilistic climatology by including observations in a ±2-day window around the considered day and refer to this as the extended probabilistic climatology (EPC). Our findings generally are insensitive to the range of the window being chosen from ±2 to ±20 days as shown in Fig. S1 in the online supplemental material.
Hamill and Juras (2006) note that pooling can lead to a deterioration when performed across data with differing climatologies, leading to a perceived, but incorrect improvement of assessed model forecast skill. In our case, however, neighboring daily climatologies are very similar, and the pooling is performed over a range of ±2 days only. EPC has better forecast quality than standard probabilistic climatologies (Fig. S1) and is used as benchmark in the following. As TRMM observations are available for the period 1998–2014 only, the TRMM-based EPC relies on this period without the considered verification year.
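As a rough illustration, the EPC for one location and date can be assembled as below. The dictionary of daily observations, the function name, and the handling of missing days (they are skipped) are assumptions made for this sketch. The target date is assumed not to be 29 February, which holds for an evaluation period within the monsoon season.

```python
import datetime


def extended_probabilistic_climatology(obs, target, n_years=30, half_window=2):
    """Collect past observations around the target day of the year.

    `obs` maps datetime.date to daily precipitation (mm) at one location.
    The target year itself is never included.
    """
    members = []
    for back in range(1, n_years + 1):
        center = datetime.date(target.year - back, target.month, target.day)
        for offset in range(-half_window, half_window + 1):
            day = center + datetime.timedelta(days=offset)
            if day in obs:  # missing observations are skipped
                members.append(obs[day])
    return members


# Hypothetical record: every July day of 1980-2013 stores its own year.
record = {datetime.date(y, 7, d): float(y)
          for y in range(1980, 2014) for d in range(1, 32)}
epc = extended_probabilistic_climatology(record, datetime.date(2014, 7, 15))
```

With a complete record this yields 30 × 5 = 150 members for a ±2-day window.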
b. Assessing calibration: Unified probability integral transform histograms
Verification rank histograms and probability integral transform (PIT) histograms are standard tools for the assessment of calibration, and we refer the reader to Hamill (2001), Gneiting et al. (2007), and Wilks (2011) for in-depth discussions of their use and interpretation. In a nutshell, for calibrated probabilistic forecasts, rank and PIT histograms are uniform; U-shaped histograms indicate underdispersion, and skewed histograms mark biases.
For an ensemble forecast, the verification rank is the rank of the observation when it is pooled with the m ensemble members; clearly, this is an integer between 1 and m + 1. If k members predict no precipitation, and no precipitation is observed, the rank is randomly drawn between 1 and k + 1. For a probabilistic forecast in the form of a cumulative distribution function (CDF) F and a verifying precipitation accumulation y, the PIT is the value F(y) of the forecast CDF evaluated at the observation. In the case of no precipitation, a value is randomly drawn between 0 and the forecast probability of no precipitation (Sloughter et al. 2007).
In the present study, we compare raw ensemble forecasts to postprocessed forecasts in the form of CDFs, and the TIGGE subensembles have varying numbers of members. We use the term probabilistic quantitative precipitation forecast (PQPF) to denote all these types of forecasts. To allow a compelling visual assessment of calibration in this setting, we introduce the notion of a unified PIT (uPIT). For a forecast in the form of a CDF, the uPIT is simply the PIT. For an ensemble forecast with m members, if the observation has rank i and this rank is unique, the uPIT is a random number from a uniform distribution between (i − 1)/(m + 1) and i/(m + 1). If k members predict no precipitation, and no precipitation is observed, the uPIT is a random number between 0 and (k + 1)/(m + 1). It is readily seen that for a calibrated PQPF the uPIT is uniformly distributed. Hereinafter, we use 20 equally spaced bins to plot uPIT histograms.
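A minimal Python sketch of the uPIT for a raw ensemble follows. It assumes that an observation of rank i among m members maps to the interval from (i − 1)/(m + 1) to i/(m + 1), and that ties between the observation and members (most commonly at zero) are resolved by drawing uniformly over the union of the tied intervals. The function name and the tie handling for nonzero values are choices made for this illustration.

```python
import random


def upit_ensemble(members, obs, rng=random):
    """Unified PIT value of an observation relative to an m-member ensemble."""
    m = len(members)
    below = sum(1 for x in members if x < obs)   # members strictly below obs
    ties = sum(1 for x in members if x == obs)   # members equal to obs
    lower = below / (m + 1)
    upper = (below + ties + 1) / (m + 1)
    # A unique rank i = below + 1 gives the interval [(i - 1)/(m + 1), i/(m + 1)];
    # k zero members with a zero observation give [0, (k + 1)/(m + 1)].
    return rng.uniform(lower, upper)


rng = random.Random(42)                       # fixed seed for reproducibility
ensemble = [0.0, 0.0, 1.0, 5.0]               # hypothetical 4-member forecast (mm)
u_dry = upit_ensemble(ensemble, 0.0, rng)     # drawn from [0, 0.6]
u_wet = upit_ensemble(ensemble, 3.0, rng)     # drawn from [0.6, 0.8]
```

Collecting such values over many forecast cases and binning them into 20 equal bins gives the histogram described in the text.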
Our uPIT histograms focus on calibration regarding the forecasted precipitation amount. However, any PQPF induces a probability of precipitation (PoP) forecast for the binary event of rainfall occurrence at any given threshold value. We use a threshold of 0.2 mm to define rainfall occurrence irrespective of the temporal and spatial aggregation at hand, with the results reported on hereinafter being insensitive to this choice. Reliability, the equivalent of calibration for probability forecasts of binary events, means that events declared to have probability p occur a proportion p of the time. This can be checked empirically in reliability diagrams, where the observed frequency of occurrence is plotted versus the forecast probability (e.g., Wilks 2011).
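The points of a reliability diagram can be computed by binning the forecast probabilities, as in the following sketch. The number of bins and the function name are arbitrary choices for this illustration.

```python
def reliability_curve(probs, events, n_bins=10):
    """Return (mean forecast probability, observed frequency, count) per used bin."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, occurred in zip(probs, events):
        b = min(int(p * n_bins), n_bins - 1)  # p = 1 falls into the last bin
        sums[b] += p
        hits[b] += 1 if occurred else 0
        counts[b] += 1
    return [(sums[b] / counts[b], hits[b] / counts[b], counts[b])
            for b in range(n_bins) if counts[b] > 0]


# Hypothetical PoP forecasts and rain occurrence (amount above 0.2 mm).
curve = reliability_curve([0.05, 0.05, 0.95, 0.95], [False, False, True, True])
```

For a reliable forecast the first two entries of each tuple are close to each other, so the points lie near the diagonal.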
c. Proper scoring rules
For the comparative evaluation of predictive skill, we use proper scoring rules that assess calibration and sharpness simultaneously (Gneiting and Raftery 2007; Wilks 2011). Specifically, the continuous ranked probability score (CRPS) for a PQPF with CDF F and a verifying observation y is defined as

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left[ F(x) - \mathbb{1}(x \ge y) \right]^2 \, dx ,$$

where 1 is an indicator function, equal to 1 if the argument is true and equal to 0 otherwise. From every PQPF, we can extract a deterministic forecast and compute its absolute error (AE). If the deterministic forecast is chosen to be the median of the forecast distribution, the AE can be interpreted as a proper scoring rule (Gneiting 2011; Pinson and Hagedorn 2012). Both the AE and the CRPS are negatively oriented, and they are reported in the unit of the observation (here, millimeters) and so can be compared directly. In fact, if the forecast distribution is a deterministic forecast, the CRPS reduces to the AE (Gneiting and Raftery 2007).
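For a raw ensemble, the forecast CDF is the empirical CDF of the members, and the CRPS can be evaluated without numerical integration through the well-known kernel representation, mean |x_i − y| minus half the mean of |x_i − x_j| over all member pairs (Gneiting and Raftery 2007). The sketch below uses this identity; the function name is hypothetical.

```python
def crps_ensemble(members, y):
    """CRPS of the empirical CDF of `members` for the observation y."""
    m = len(members)
    # Mean absolute error of the members relative to the observation.
    term_obs = sum(abs(x - y) for x in members) / m
    # Mean absolute difference between all ordered member pairs.
    term_spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return term_obs - 0.5 * term_spread


single = crps_ensemble([3.0], 1.0)      # one member: CRPS equals the AE
pair = crps_ensemble([0.0, 2.0], 1.0)   # two members bracketing the observation
```

The single-member case illustrates the statement that the CRPS reduces to the AE for a deterministic forecast.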
With the PoP being an essential component of a PQPF, the evaluation of PoP forecast quality by proper scoring rules is desirable and can be accomplished by means of the Brier score (BS; Brier 1950). For a probability forecast p for a binary event to occur, the negatively oriented BS is $(1 - p)^2$ if the event occurs and $p^2$ if it does not occur.
It is well known that not only the BS, but many proper scoring rules for probability forecasts of binary events exist and that forecast rankings can depend on the choice of the proper scoring rule. However, every proper scoring rule admits a representation as a weighted average over so-called elementary scores or losses $S_\theta$, which can be interpreted economically. Specifically, suppose that we are given a probability forecast p for a binary event and need to make a deterministic forecast of whether or not it will happen. If correct decisions do not incur any costs, a false alarm carries cost θ, and a missed event has cost 1 − θ for some θ ∈ (0, 1), an optimal strategy is to predict that the event will happen when p > θ and to predict that it will not happen when p ≤ θ. The elementary score $S_\theta$ is the loss incurred by this strategy. Ehm et al. (2016) advocate the use of so-called Murphy diagrams, which display, for each forecast considered, the mean elementary score as a function of θ. If a forecast receives a lower elementary score than another for every θ, then it is preferable for any decision-maker and receives lower scores under any proper scoring rule (Ehm et al. 2016). Interestingly, the area under a forecast’s graph in a Murphy diagram equals half its mean BS, and the height of the graph at θ = 1/2 equals half the misclassification rate when false alarms and misses incur equal costs.
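The relation between elementary scores and the Brier score can be checked numerically. The sketch below assumes the decision rule in which the event is predicted when p exceeds θ, with cost θ for a false alarm and 1 − θ for a miss; the function names and the midpoint integration grid are choices made for this illustration.

```python
def elementary_score(p, occurred, theta):
    """Loss of the threshold decision rule at cost parameter theta."""
    if p > theta and not occurred:
        return theta          # false alarm
    if p <= theta and occurred:
        return 1.0 - theta    # missed event
    return 0.0                # correct decision


def murphy_curve(probs, events, thetas):
    """Mean elementary score of a set of forecasts for each theta."""
    n = len(probs)
    return [sum(elementary_score(p, e, t) for p, e in zip(probs, events)) / n
            for t in thetas]


def brier_score(p, occurred):
    return (p - (1.0 if occurred else 0.0)) ** 2


# Midpoint rule on (0, 1) to approximate the area under the Murphy curve.
n_grid = 10000
thetas = [(k + 0.5) / n_grid for k in range(n_grid)]
probs, events = [0.3, 0.8], [True, False]      # hypothetical forecasts
area = sum(murphy_curve(probs, events, thetas)) / n_grid
mean_bs = sum(brier_score(p, e) for p, e in zip(probs, events)) / len(probs)
```

Up to discretization error, `area` agrees with `0.5 * mean_bs`, in line with the statement that the area under a Murphy curve equals half the mean BS.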
A popular graphical tool for the assessment of discrimination ability in binary prediction problems is the receiver operating characteristic (ROC) diagram, for details of which we refer the reader to section 8.4.7 of Wilks (2011). In a nutshell, for any given probability forecast, the ROC curve is a plot of the hit rate versus the false alarm rate as a function of the cutoff value for the binary decision. The area under the ROC curve (AUC) is commonly used as a measure of resolution and discrimination skill, with higher values being preferable. In contrast to Murphy diagrams, which consider both reliability and discrimination and assess the actual value of a forecast in decision-making, ROC curves and AUC values are insensitive to (any lack of) reliability and, therefore, reflect potential skill and value only (Wilks 2011, p. 346).
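The insensitivity of the AUC to reliability can be illustrated with the rank-based form of the AUC: the probability that a randomly chosen event case receives a higher forecast probability than a randomly chosen nonevent case, with ties counted as one-half. The sketch below is a plain pairwise implementation for small samples; the function name and data are hypothetical.

```python
def auc(probs, events):
    """Area under the ROC curve via pairwise comparison of forecasts."""
    pos = [p for p, e in zip(probs, events) if e]       # forecasts when event occurred
    neg = [p for p, e in zip(probs, events) if not e]   # forecasts when it did not
    if not pos or not neg:
        raise ValueError("need at least one event and one nonevent")
    wins = 0.0
    for a in pos:
        for b in neg:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(pos) * len(neg))


probs = [0.9, 0.6, 0.7, 0.2]
events = [True, True, False, False]
auc_raw = auc(probs, events)
# A strictly increasing distortion destroys reliability but leaves the AUC unchanged.
auc_distorted = auc([p ** 3 for p in probs], events)
```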
d. Statistical postprocessing
Statistical postprocessing addresses structural deficiencies of NWP model output. Here, we use the well-established methods of EMOS (Gneiting et al. 2005; Scheuerer 2014) and BMA (Raftery et al. 2005; Sloughter et al. 2007) to correct for systematic errors in ensemble forecasts of precipitation accumulation.
In this section, we review these methods with a focus on the 52-member ECMWF EPS, and we denote the values of its HRES, CNT, and ENS members by $x_{\mathrm{HRES}}$, $x_{\mathrm{CNT}}$, and $x_1, \ldots, x_{50}$, respectively. We write $\bar{x}_{\mathrm{ENS}}$ for the mean of the ENS members and $p_0$ for the fraction of all 52 members that predict no precipitation, and we denote the observed precipitation accumulation by y. Adaptations of the postprocessing schemes to the other TIGGE subensembles and the RMM ensemble are straightforward.
1) Ensemble model output statistics
The idea of the EMOS approach is to convert an ensemble forecast into a parametric distribution, based on the ensemble forecast at hand (Gneiting et al. 2005). Scheuerer (2014) introduced an EMOS approach for precipitation accumulation that relies on the three-parameter family of left-censored generalized extreme value (GEV) distributions. The left-censoring allows for a point mass at zero and the shape parameter for flexible skewness in positive precipitation accumulations.
Briefly, the EMOS predictive distribution based on the ECMWF ensemble is a left-censored GEV distribution. The location parameter of this distribution is a linear function of the HRES forecast $x_{\mathrm{HRES}}$, the CNT forecast $x_{\mathrm{CNT}}$, the ENS mean $\bar{x}_{\mathrm{ENS}}$, and the fraction $p_0$ of members predicting no precipitation, and its scale parameter is a linear function of the ensemble mean difference, which is a more robust measure of ensemble spread than the standard deviation. While all parameters are estimated from training data, the shape parameter does not link to the ensemble values (Scheuerer 2014).
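The ensemble mean difference used as the spread predictor can be computed as sketched below. The normalization by m², with all ordered member pairs included, is our reading of Scheuerer (2014) and should be treated as an assumption, as should the function name.

```python
def ensemble_mean_difference(members):
    """Mean absolute difference over all ordered pairs of ensemble members."""
    m = len(members)
    total = sum(abs(a - b) for a in members for b in members)
    return total / (m * m)


# Hypothetical ensemble (mm) with a single large outlier.
spread = ensemble_mean_difference([0.0, 0.0, 0.0, 8.0])
```

Because differences enter linearly rather than squared, a single extreme member inflates this measure less than it inflates the standard deviation.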
For illustration, Fig. 3a shows an EMOS postprocessed forecast distribution for 5-day accumulated precipitation at Ouagadougou, Burkina Faso. The 52 raw ECMWF ensemble members are represented by blue marks; they include 11 values in excess of 200 mm, with the CNT member being close to 500 mm. The ensemble forecast at hand informs the statistical parameters of the EMOS postprocessed forecast distribution, which includes a tiny point mass at zero, and a censored GEV density for positive precipitation accumulations, with the 90th percentile being at 174 mm.
2) Bayesian model averaging
A BMA predictive distribution is a weighted sum of component distributions, each of which depends on a single ensemble member. For the ECMWF ensemble, the BMA method for precipitation accumulation proposed and studied by Sloughter et al. (2007) and Fraley et al. (2010) implies a statistical model of the form

$$F(y \mid x_{\mathrm{HRES}}, x_{\mathrm{CNT}}, x_1, \ldots, x_{50}) = w_{\mathrm{HRES}} \, G_{\mathrm{HRES}}(y \mid x_{\mathrm{HRES}}) + w_{\mathrm{CNT}} \, G_{\mathrm{CNT}}(y \mid x_{\mathrm{CNT}}) + \frac{w_{\mathrm{ENS}}}{50} \sum_{i=1}^{50} G_{\mathrm{ENS}}(y \mid x_i),$$

with nonnegative weights $w_{\mathrm{HRES}}$, $w_{\mathrm{CNT}}$, and $w_{\mathrm{ENS}}$ that sum to 1 and reflect the members’ performance during the training period. Each of the component distributions, $G_{\mathrm{HRES}}$, $G_{\mathrm{CNT}}$, and $G_{\mathrm{ENS}}$, contains a point mass at zero and a density for positive accumulations. The point mass at zero specifies the probability of no precipitation and is estimated in a logistic regression model, where the cube root of the member forecast and a binary indicator of the member forecast being zero are used as predictor variables. The specification for positive amounts is based on a gamma density for the cube-root-transformed precipitation amount, with a mean that is a linear function of the cube-root-transformed member forecast and a variance that is a linear function of the member forecast. While the statistical coefficients for the mean of the gamma model are estimated for $G_{\mathrm{HRES}}$, $G_{\mathrm{CNT}}$, and $G_{\mathrm{ENS}}$ separately, the coefficients for the variance of the gamma model are shared. To obtain the BMA predictive distribution for the linear precipitation accumulation in millimeters, rather than the cube root thereof, a backtransformation is applied as described by Sloughter et al. (2007).
Figure 3b shows such a BMA postprocessed forecast distribution for the aforementioned forecast case at Ouagadougou. The postprocessed distribution involves a point mass of about 0.01 at zero, and a mixture of power-transformed gamma densities for positive accumulations, with the 90th percentile being at 141 mm. In this example, the BMA and EMOS postprocessed distributions are sharper than the raw ECMWF ensemble, and nevertheless the verifying accumulation is well captured.
Adaptations to the other ensembles considered in this paper are straightforward, as described by Fraley et al. (2010). For example, in the case of the RMM ensemble each of the 15 contributors receives its own component distribution, BMA weight, logistic regression coefficients for the probability of no precipitation, and statistical parameters for the gamma mean model, whereas the coefficients for the gamma variance model are shared.
3) Estimation of statistical parameters
Postprocessing techniques such as EMOS and BMA rely on statistical parameters that need to be estimated from training data, comprising forecast–observation pairs either from the station or TRMM pixel at hand, or from all stations or applicable TRMM pixels within the considered region, and typically from a rolling training period consisting of the n most recent days for which data are available at the initialization time. We employ the regional approach with a rolling training period of n days, which yields superior results, consistent with the literature (e.g., Thorarinsdottir and Gneiting 2010). As shown in Figs. S2–S5 in the supplementary material, our findings are insensitive to the choice of n when using training periods between 20 and 50 days. The local approach requires longer training periods and (in experiments not shown here) then yields very similar results.
For EMOS, parameter estimation is based on CRPS minimization over the training data, which is computationally efficient, as closed expressions for the CRPS under GEV distributions are available (Scheuerer 2014). For BMA, we employ maximum likelihood estimation, implemented via the expectation-maximization (EM) algorithm developed by Sloughter et al. (2007). All computations were performed in R (R Development Core Team 2017) based on the ensembleBMA package (Fraley et al. 2011) and code supplied by M. Scheuerer.
Our annual evaluation period ranges from 1 May to 15 October, covering the wet period of the West African monsoon. The assessment of ECMWF ensemble forecasts is based on monsoon seasons 2007–14, and for the other TIGGE subensembles we restrict our evaluation according to availability as indicated in Table 1.
For verification against station observations, this yields more than 3000, 6000, and 12 000 forecast–observation pairs per monsoon season in East Sahel, West Sahel, and Guinea Coast, respectively. For verification against TRMM observations, we use 30 randomly chosen, nonoverlapping boxes per region at 0.25° × 0.25° and 1° × 1° aggregation and eight sites per region for 5° × 2° longitude–latitude boxes. This covers substantial parts of the study region and results in about 5000 forecast–observation pairs per monsoon season at the smaller aggregation levels and well over 1000 pairs at our highest level.
In section 4a, we study the skill of 1-day accumulated ECMWF raw and postprocessed ensemble precipitation forecasts in detail. Sections 4b and 4c present results and highlight differences for longer accumulation times and spatially aggregated forecasts. Section 4d turns to results for all TIGGE subensembles, and we investigate the gain in predictability through intermodel variability using the RMM ensemble. In our uPIT histograms and reliability diagrams, we show results for the last available monsoon season only (2014), given that operational systems continue to be improving (Hemri et al. 2014).
a. 1-day accumulated ECMWF forecasts
Figure 4 shows uPIT histograms for 1-day accumulated raw and postprocessed ECMWF ensemble and EPC forecasts over West Sahel, East Sahel, and Guinea Coast. The histograms for the raw ensemble indicate strong underdispersion as well as a wet bias (Figs. 4a–c). At Guinea Coast, about 56% of the observations are smaller than the smallest ensemble member, a result that is robust across monsoon seasons. EMOS and BMA postprocessed forecasts generally are calibrated (Figs. 4g–l), as is EPC (Figs. 4d–f), except that the tails of the EMOS predictive distributions are too light. Statistical postprocessing also corrects for the systematically too-high PoP values issued by the raw ECMWF ensemble. As shown in Fig. 5, EMOS and BMA postprocessed PoP forecasts are reliable, but are hardly ever higher than 0.70. Generally, the postprocessed PoP forecasts have reliability and resolution similar to EPC.
Table 2 shows the mean BS, mean CRPS, and mean absolute error (MAE) for the various forecasts and regions, with the scores being averaged across monsoon seasons 2007–14. We use a simple procedure to check whether differences in skill are stable across seasons. If a method has a higher (worse) mean score than EPC during all eight seasons, we mark the score with − −; if it is judged to be worse during seven seasons, we use a −. Similarly, if a method has a smaller (better) mean score than EPC during all seasons, we mark the score as + +; if it performs better during seven seasons, we label it as + in Table 2. Viewed as a (one sided) statistical test of the hypothesis of predictive skill equal to EPC, the associated tail probabilities or p values are $1/2^8 \approx 0.004$ and $9/2^8 \approx 0.035$, respectively. Clearly, the raw ECMWF ensemble underperforms relative to EPC, with the − − designations used throughout, and the EMOS and BMA postprocessed forecasts perform at about the same level as EPC. For the BS, the similar performance of postprocessed and EPC forecasts stems from the fact that not only do postprocessed and EPC forecasts show similar reliability but also similar resolution, as seen from the inset histograms in Figs. 5d–l.
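The tail probabilities of this season-wise comparison can be reproduced with a binomial calculation. The sketch below assumes that, under the hypothesis of equal skill, each of the eight seasons independently favors either forecast with probability one half; the function name is illustrative only.

```python
from math import comb


def sign_test_tail(n_seasons, k_better):
    """P(at least k_better of n_seasons favor one method) under a fair coin."""
    favorable = sum(comb(n_seasons, j) for j in range(k_better, n_seasons + 1))
    return favorable / 2 ** n_seasons


p_all_eight = sign_test_tail(8, 8)   # better (or worse) in all eight seasons
p_seven_plus = sign_test_tail(8, 7)  # better (or worse) in at least seven seasons

print(f"{p_all_eight:.4f} {p_seven_plus:.4f}")
```

Under these assumptions, the double markers correspond to a tail probability of 1/256 and the single markers to 9/256.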
The Murphy diagrams in the top row of Fig. 6 corroborate these findings. For 1-day precipitation occurrence, decision-makers will mostly prefer the climatological reference EPC over the raw ECMWF ensemble, and only some decision-makers will have a slight preference for EMOS or BMA postprocessed forecasts, as compared to EPC. Further light on these issues is shed by the ROC diagrams in the bottom row of Fig. 6. EMOS and BMA PoP forecasts can be interpreted as recalibrated raw ensemble probabilities, and so it is not surprising that for West Sahel and East Sahel, raw and postprocessed forecasts show essentially the same level of discrimination skill, at a level that is slightly superior to EPC. For Guinea Coast, EMOS and BMA have considerably higher AUC values than the raw ensemble, due to the extreme concentration of the raw ensemble probabilities at very high levels, as illustrated in Fig. 5c. In contrast, the Murphy curves are sensitive to calibration and show marked differences between raw and postprocessed forecasts. Overall, these are sobering results, as they suggest that over northern tropical Africa the ECMWF 1-day accumulated precipitation forecasts are hardly of practical use.
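The contrast between discrimination and calibration can be illustrated with a small numerical sketch. It uses hypothetical PoP values and a hypothetical strictly increasing recalibration map (not EMOS or BMA) to show that AUC is unchanged by such a map while the Brier score is not, and that collapsing many forecasts onto a single very high value creates ties that lower AUC.

```python
def auc(probs, events):
    """AUC as the Mann-Whitney probability; tied pairs count one half."""
    pos = [p for p, e in zip(probs, events) if e]
    neg = [p for p, e in zip(probs, events) if not e]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


def brier(probs, events):
    """Mean squared difference between probability and binary outcome."""
    return sum((p - e) ** 2 for p, e in zip(probs, events)) / len(probs)


# Hypothetical raw PoP forecasts that are systematically too high.
raw = [0.95, 0.90, 0.85, 0.99, 0.80, 0.92]
events = [1, 0, 0, 1, 1, 0]

# Hypothetical strictly increasing recalibration (illustrative only).
recalibrated = [p ** 8 for p in raw]

# Hypothetical saturation: everything at or above 0.9 is issued as 1.0.
saturated = [1.0 if p >= 0.9 else p for p in raw]

print(auc(raw, events), auc(recalibrated, events), auc(saturated, events))
print(brier(raw, events), brier(recalibrated, events))
```

In this toy case the ranking of the forecasts, and hence AUC, survives the recalibration, whereas the Brier score improves because the probabilities are pulled toward the observed frequencies.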
What could be possible reasons for the poor performance of the raw forecasts? A number of recent studies have shown that the use of convective parameterization is a first-order error source for realistically representing precipitation, cloudiness, wind, and even the regional-scale monsoon circulation in West Africa together with their respective diurnal cycles (e.g., Pearson et al. 2014; Marsham et al. 2013; Birch et al. 2014; Pantillon et al. 2015). Based on these results, and given that all of the models we investigate use convective schemes, we suspect this aspect to be a major cause of the poor performance we find. A visual comparison of 1-day accumulated precipitation forecasts from ECMWF HRES and TRMM shows that rainfall structures in the model tend to be too widespread and too light, lacking signs of mesoscale organization (see Fig. S6 for an example). Inspection of raw ensemble data suggests that, for both station and TRMM observations, agreement between the forecasts and observations is modest at best. Many observed precipitation events are either not predicted at all, are strongly underpredicted, or are predicted by (almost) all ensemble members (with varying amounts of precipitation), yet are not observed (see Fig. S7 for an illustrative example). In particular, the second point is an indication of a misrepresentation of real-world squall-line systems by the model.
b. Longer accumulation times
One might expect NWP precipitation forecasts to improve relative to EPC at longer accumulation times, as the main focus in forecasting shifts from determining time and location of initiation and subsequent propagation of convection toward determining regions with enhanced or reduced activity, based on large-scale conditions. Longer lead times might also lead to growth in differences between perturbed members and, thus, reduce the raw ensemble underdispersion.
The uPIT histogram in Fig. 7a indicates only slight, if any, improvement in calibration for raw ECMWF 5-day accumulated precipitation forecasts over West Sahel, and the results for the other regions are similar (not shown). Raw ensemble reliability improves at longer accumulation times, verified against either station observations in Fig. 7b, or 5° × 2° TRMM observations in Figs. 7c and 7d, though at a loss of resolution.
Table 3 uses the same settings as in Table 2, but the scores are now for 5-day accumulated precipitation. The raw ECMWF ensemble still underperforms relative to EPC. The EMOS and BMA postprocessed forecasts outperform EPC only slightly, with the differences in scores being small and typically not being stable across monsoon seasons. Despite the change in the underlying forecast problem, even postprocessed ECMWF ensemble forecasts are generally not superior to EPC.
c. Spatially aggregated observations
For the assessment of forecast skill at larger spatial scales, we focus on ECMWF raw and BMA postprocessed ensemble forecasts over West Sahel, evaluated by the Brier score and CRPS. This is due to the similarities in CRPS and MAE results, better performance of BMA compared to EMOS in many instances, and results for West Sahel that are as good for BMA postprocessed forecasts as for East Sahel, and better than for Guinea Coast.
The use of spatially aggregated TRMM observations avoids problems of point-to-pixel comparisons, and at higher aggregation we can assess the forecast quality with minimal propagation error. The dry bias of TRMM disadvantages the raw ensemble compared to EPC and postprocessed forecasts, but does not hinder assessments regarding systematic forecast errors. As illustrated in Fig. 7c, 1-day PoP forecasts from the raw ECMWF ensemble remain unreliable even at the 5° × 2° gridbox scale. It is only under large scales and longer accumulation times simultaneously, when precipitation occurs almost invariably, that raw ensemble PoP forecasts become reliable (Fig. 7d).
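Spatial aggregation of a gridded product can be sketched as a plain box average. The example below assumes a 0.25° source grid and an unweighted mean over the cells in a box; the helper names and the toy grid are hypothetical, and area weighting is ignored.

```python
def cells_per_box(box_deg_lon, box_deg_lat, cell_deg=0.25):
    """Number of grid cells along longitude and latitude inside one box."""
    return round(box_deg_lon / cell_deg), round(box_deg_lat / cell_deg)


def box_mean(grid, row0, col0, n_rows, n_cols):
    """Unweighted mean of grid[row][col] over a rectangular block of cells."""
    cells = [
        grid[r][c]
        for r in range(row0, row0 + n_rows)
        for c in range(col0, col0 + n_cols)
    ]
    return sum(cells) / len(cells)


# A 5 x 2 degree longitude-latitude box on a 0.25 degree grid.
n_lon, n_lat = cells_per_box(5.0, 2.0)

# Hypothetical 2 x 2 patch of 1-day accumulations (mm), averaged to one value.
toy_grid = [[0.0, 4.0], [2.0, 2.0]]
aggregated = box_mean(toy_grid, 0, 0, 2, 2)

print(n_lon, n_lat, aggregated)
```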
Table 4 shows mean Brier and CRPS scores at various spatial aggregations for 1-day precipitation accumulation, verified against TRMM observations. The raw ECMWF ensemble forecast is inferior to EPC at all resolutions and in every single region and season. BMA postprocessed forecasts outperform EPC across aggregation scales, and in every single region and season, but the improvement relative to EPC remains small.
d. TIGGE subensembles and RMM ensemble
In addition to the ECMWF EPS, which we have studied thus far, the TIGGE database contains several more operational subensembles, as listed in Table 1. Figure 8 shows uPIT histograms for the various subensembles and the RMM ensemble for 1-day accumulated precipitation forecasts over West Sahel. All TIGGE subensembles exhibit underdispersion and wet biases, though to strongly varying degrees.
Figure 9 displays Brier and CRPS skill scores relative to EPC for raw and BMA postprocessed TIGGE subensemble and RMM ensemble forecasts during 2007–14, verified against station observations. All raw ensembles underperform relative to EPC, in part drastically so. For most subensembles, a temporal improvement in skill is visible, with the monsoon seasons of 2011–14 revealing higher skill than those during 2007–10. Postprocessing by BMA increases forecast quality. The ECMWF, Korea Meteorological Administration (KMA), NCEP, and UKMO ensembles yield the best postprocessed forecasts, exhibiting small positive skill relative to EPC for most monsoon periods. The BMA postprocessed RMM ensemble outperforms all subensembles as well as EPC, but the improvement is small. As shown in Fig. 10, the mean perturbed forecasts from the ECMWF, UKMO, and NCEP ensembles are the top three contributors to the BMA postprocessed RMM forecast.
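For orientation, a skill score relative to a reference such as EPC can be sketched as follows. The example assumes the standard form 1 − S/S_ref for negatively oriented mean scores and uses the empirical-ensemble representation of the CRPS; the toy ensembles and observations are hypothetical and unrelated to the data in Fig. 9.

```python
def crps_ensemble(members, obs):
    """CRPS of the empirical ensemble distribution: E|X - y| - 0.5 E|X - X'|."""
    m = len(members)
    term1 = sum(abs(x - obs) for x in members) / m
    term2 = sum(abs(x - z) for x in members for z in members) / (2 * m * m)
    return term1 - term2


def mean_crps(ensembles, observations):
    """Average CRPS over all forecast cases."""
    scores = [crps_ensemble(e, y) for e, y in zip(ensembles, observations)]
    return sum(scores) / len(scores)


def skill_score(mean_score, mean_score_ref):
    """Positive if the forecast beats the reference, negative if it is worse."""
    return 1.0 - mean_score / mean_score_ref


# Hypothetical forecast and climatological ensembles (mm) for two cases.
forecast = [[0.0, 1.0, 3.0], [0.0, 0.0, 1.0]]
climatology = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
observations = [1.0, 0.0]

crpss = skill_score(
    mean_crps(forecast, observations), mean_crps(climatology, observations)
)
print(round(crpss, 3))
```

A forecast with twice the mean score of the reference would receive a skill score of −1, which is the sense in which raw ensembles can underperform drastically.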
In further experiments, we have studied raw and postprocessed TIGGE subensemble and RMM ensemble forecasts at accumulation times of up to 5 days and spatial aggregations of up to 5° × 2° grid boxes in TRMM. Our findings generally remain unchanged. The raw ensemble forecasts never reach the quality of the climatological reference EPC. After postprocessing with BMA, the ECMWF ensemble typically becomes the best-performing TIGGE subensemble, showing slightly better scores than EPC when verified against TRMM observations, at all spatial aggregations. The BMA postprocessed RMM forecast depends heavily on the ECMWF mean perturbed forecast and is superior to both EPC and the BMA postprocessed subensemble.
In a first-ever thorough verification study, the quality of operational ensemble precipitation forecasts from different NWP centers was assessed over northern tropical Africa for several years, accumulation periods, and for station and spatially aggregated satellite observations. All raw ensembles exhibit calibration problems in the form of underdispersion and biases and are unreliable at high PoP forecast values. They have lower skill than the climatological reference EPC for the prediction of occurrence and amount of precipitation, with the underperformance being stable across monsoon seasons.
After correcting for systematic errors in the raw ensemble through statistical postprocessing, the ensemble forecasts become reliable and calibrated, but only a few are slightly superior to EPC. While ramifications and developments of both EMOS and BMA might be feasible (see, e.g., Fortin et al. 2006; Scheuerer and Hamill 2015), and training sets could be augmented by using reforecast data (e.g., Di Giuseppe et al. 2013), the respective benefits are likely to be incremental at this time, though as the raw ensemble performance improves, they might become substantial. Not surprisingly, forecast skill tends to be highest for long accumulation times and large spatial aggregations. Overall, raw ensemble forecasts are of no use for the prediction of precipitation over northern tropical Africa, and even EMOS and BMA postprocessed forecasts have little added value compared to EPC.
What are the reasons for this rather disappointing level of performance for the state-of-the-art global EPSs? For 1-day accumulated precipitation forecasts, the ability of an NWP model to resolve the details of convective organization is essential. As all global EPSs use parameterized convection, this clearly limits the forecast skill. The fact that even postprocessed 1-day accumulated ensemble forecasts exhibit no skill relative to EPC implies that ensembles cannot translate information on the current atmospheric state (e.g., tropical waves or influences from the extratropics) into meaningful impacts regarding the occurrence or amount of precipitation. This is robust for verification against station as well as satellite observations and cannot, therefore, be explained by propagation errors.
For longer accumulation times and larger spatial aggregations, the large-scale circulation has a much stronger impact on convective activity, which should weaken the limitations imposed by convective parameterization. The skill of 5-day accumulated precipitation forecasts, however, increases only slightly, if at all, compared to 1-day accumulated forecasts. The most likely reason for this is that squall lines have feedbacks on the large-scale circulation, which are not realistically represented in global NWP models either. Marsham et al. (2013) find that the large-scale monsoon state in (more realistic) simulations with explicit convection differs quite markedly from runs with parameterized convection, even when using the same resolution of 12 km. In the explicit-convection simulation, greater latent and radiative heating to the north weakens the monsoon flow, the diurnal cycle is delayed, and convective cold pools provide an essential component of the monsoon flux. We suspect that some or all of these effects are misrepresented in global EPS forecasts.
The fact that EPS precipitation forecasts are so poor over northern tropical Africa is a strong demonstration of the complexity of the underlying forecast problem. An interesting question within this context is whether poor predictability in the tropics is unique to northern Africa with its strongly organized, weakly synoptically forced rainfall systems.
Furthermore, the lack of skill motivates complementary approaches to predicting precipitation over this region. Little et al. (2009) compare operational NCEP ensemble, climatological, and statistical forecasts for stations in the Thames Valley, United Kingdom. They note that NCEP forecasts outperform climatological forecasts, but demonstrate that statistical forecasts, solely based on past observations, can outperform NCEP forecasts by exploiting spatiotemporal dependencies. These also exist over northern tropical Africa and some additional predictability may stem from large-scale drivers such as convectively coupled waves. Fink and Reiner (2003) note a coupling of the initiation of squall lines to African easterly waves and Wheeler and Kiladis (1999) the influence of large-scale tropical waves, such as Kelvin and equatorial Rossby waves or the Madden–Julian oscillation, on convective activity. Pohl et al. (2009) confirm the relation between the Madden–Julian oscillation and rainfall over West Africa, and Vizy and Cook (2014) demonstrate an impact of potential extratropical wave trains on Sahelian rainfall. Statistical models based on spatiotemporal characteristics of rainfall and extended by such large-scale predictors seem a promising approach for improving precipitation forecasts over our study region, and we expect such forecasts to outperform climatology. This approach will be explored in future work.
As discussed in section 4a, we suspect convective parameterization to be a major cause of the low quality of the model-based forecasts here. Therefore, it would be interesting to test ensembles of convection-permitting NWP model runs, ideally in combination with ensemble data assimilation, but the computational costs are high, and it will take time until a multiyear database will become available for validation studies. Alternatively, it could be tested whether systematic improvements to convection schemes (e.g., Bechtold et al. 2014) do in fact positively impact ensemble forecast quality. Given the growing socioeconomic impact of rainfall in northern tropical Africa with its rain-fed agriculture, statistical and statistical–dynamical approaches should be fostered in parallel in order to improve the predictability of rainfall in this region.
The research leading to these results has been accomplished within project C2 “Prediction of wet and dry periods of the West African Monsoon” of the Transregional Collaborative Research Center SFB/TRR 165 “Waves to Weather” funded by the German Science Foundation (DFG). TG is grateful for support by the Klaus Tschira Foundation. The authors also thank various colleagues and weather services that have over the years contributed to the enrichment of the KASS-D database; special thanks go to Robert Redl for creating the underlying software. We thank Sebastian Lerch for helpful discussions and Alexander Jordan and Michael Scheuerer for providing R code, and we are grateful to Tom Hamill, Ken Mylne, and an anonymous referee for constructive suggestions.
Quality Control for Precipitation Observations within KASS-D
Rainfall exhibits extremely high spatial and temporal variability, which hinders automated quality checks applicable to other meteorological variables such as temperature or pressure. For precipitation, Fiebrich and Crawford (2001) note only a range and a step test. The global range of station-observed 1-day accumulated precipitation is from 0 to 1825 mm. All KASS-D observations passed this test. The step test checks if the difference of neighboring 5-min accumulated precipitation is smaller than 25 mm. For 1-day accumulated precipitation, tests of this type are not meaningful, nor are the persistence tests used by Pinson and Hagedorn (2012) for wind speed.
However, the site-specific climatological distributions of precipitation accumulation should be right skewed (i.e., the median should be smaller than the mean), and in the tropics they should have a point mass at zero (Rodwell et al. 2010). As noted, we only consider stations with more than 80% available observations in any of the monsoon seasons, and all 132 stations thus selected passed these tests.
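The checks described in this appendix can be sketched in a few lines. The example assumes a station series given as a list of 1-day accumulations in millimeters; the function names and the toy series are hypothetical, and only the upper bound of 1825 mm is taken from the text.

```python
from statistics import mean, median

GLOBAL_MAX_1DAY_MM = 1825.0  # upper end of the global range quoted in the text


def passes_range_test(values):
    """All 1-day accumulations lie within the global observed range."""
    return all(0.0 <= v <= GLOBAL_MAX_1DAY_MM for v in values)


def is_right_skewed(values):
    """Right skew in the sense used here: median smaller than mean."""
    return median(values) < mean(values)


def has_point_mass_at_zero(values):
    """At least one dry day is present in the series."""
    return any(v == 0.0 for v in values)


# Hypothetical station series of 1-day accumulated precipitation (mm).
station = [0.0, 0.0, 0.0, 1.2, 0.0, 15.4, 0.0, 3.1, 42.0, 0.0]

station_ok = (
    passes_range_test(station)
    and is_right_skewed(station)
    and has_point_mass_at_zero(station)
)
print(station_ok)
```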
This article is included in the Waves to Weather (W2W) Special Collection.
Specifically, we checked thresholds from 0.0 to 1.0 mm, with minimal differences in findings. Exemplary results are available from the authors upon request.
For this desirable interpretation to be valid, the deterministic forecast needs to be chosen as the median of the forecast distribution. For the mathematical argument and technical details see the review article by Gneiting (2011) and the references therein.
When , either action can be taken. These results are elementary and well known; see Ehm et al. (2016) and the references therein.
Within this context, we take the chance to correct a typographical error in Fraley et al. (2010), where the factor is missing in between the summation signs in their Eq. (5).