Subseasonal precipitation prediction for Africa: forecast evaluation and sources of predictability

This paper evaluates subseasonal precipitation forecasts for Africa using hindcasts from three models (ECMWF, UKMO, and NCEP) participating in the Subseasonal to Seasonal (S2S) prediction project. A variety of verification metrics are employed to assess weekly precipitation forecast quality at lead times of one to four weeks ahead (weeks 1–4) during different seasons. Overall, forecast evaluation indicates more skillful predictions for ECMWF than for the other models and for East Africa than for other regions. Deterministic forecasts show substantial skill reduction in weeks 3–4, linked to lower association and larger underestimation of predicted variance compared to weeks 1–2. Tercile-based probabilistic forecasts reveal similar characteristics for the extreme categories and low quality in the near-normal category. Although discrimination is low in weeks 3–4, probabilistic forecasts still have reasonable skill, especially in wet regions during particular rainy seasons. Forecasts are found to be overconfident for all weeks, indicating the need to apply calibration for more reliable predictions. Forecast quality within the ECMWF model is also linked to the strength of climate drivers' teleconnections, namely El Niño–Southern Oscillation, the Indian Ocean dipole, and the Madden–Julian oscillation. Removing all driver-related precipitation regression patterns from observations and hindcasts reduces forecast quality compared to including all drivers' signals, with more robust effects in regions where a driver strongly relates to precipitation variability. Calibrating forecasts by adding observed regression patterns to hindcasts improves forecast association, particularly for the Madden–Julian oscillation. Results from this study can be used to guide decision-makers and forecasters in disseminating valuable forecast information for different societal activities in Africa.


Introduction
Delivering useful subseasonal forecasts (between 2 weeks and 2 months ahead) remains a great challenge for operational forecasting centers, as this time scale is too long to retain much of the influence of the atmospheric initial conditions and too short for the forced boundary conditions to dominate. The lack of subseasonal precipitation forecast quality over many regions worldwide has been identified by evaluating near real-time forecasts and hindcasts made available by the Subseasonal to Seasonal (S2S) prediction project (Vitart et al. 2017). The goal of the S2S project is to address the predictability gap between medium-range weather predictions and seasonal climate predictions and to improve forecast quality on subseasonal time scales for a range of applications, for instance agriculture, water resource management, and other socioeconomic activities.
The S2S database has been used to evaluate subseasonal precipitation forecasts on a weekly basis (Vigaud et al. 2017a,b; Coelho et al. 2018; de Andrade et al. 2019; among others). For example, de Andrade et al. (2019) evaluated weekly precipitation hindcasts from all models participating in the S2S project, finding the best agreement with precipitation observations during the first two weeks of lead time and the worst quality in subsequent weeks, especially over extratropical regions. Weekly precipitation predictions were also verified over the summer monsoon regions of the Northern Hemisphere and the East Africa–West Asia sector (Vigaud et al. 2017b, 2018), both showing the worst quality at longer lead times (i.e., beyond two weeks).
Despite the poorer precipitation forecast quality within S2S models after the first two weeks of lead time, recent studies have analyzed the role played by particular sources of subseasonal predictability, such as El Niño–Southern Oscillation (ENSO) and the Madden–Julian oscillation (MJO), in modulating the quality of precipitation forecasts. Li and Robertson (2015) evaluated weekly precipitation forecasts as a function of ENSO and MJO metrics, supporting the concept that particular climate driver conditions can promote better subseasonal predictions. Moreover, de Andrade et al. (2019) found a reduction in forecast quality after removing ENSO- and MJO-related precipitation patterns from weekly forecasts. Other drivers can also affect the quality of subseasonal precipitation predictions, for instance tropical–extratropical interactions (Vigaud et al. 2019) and stratosphere–troposphere coupling (Domeisen et al. 2020). For a more comprehensive description of relevant drivers of subseasonal predictability, the reader is referred to Mariotti et al. (2020).
Among the many efforts to improve understanding of subseasonal forecast skill over the past years, one important aspect is forecast verification (Coelho et al. 2019). Several studies have identified areas where the precipitation forecast quality of S2S models could be refined (e.g., Vigaud et al. 2018; de Andrade et al. 2019). However, only a few of those studies have employed detailed verification to analyze the attributes of forecast quality defined in Murphy (1993). Since a single verification score is unable to evaluate different attributes of forecast quality, a set of metrics is required to obtain a fully comprehensive overview of the S2S models' ability to predict subseasonal precipitation (Coelho et al. 2018). Here, weekly precipitation forecast quality from three S2S models is investigated over the African continent, assessing the attributes of deterministic and probabilistic forecast quality using a variety of metrics. Such a comprehensive exploration of subseasonal forecast quality not only has the potential to advance scientific understanding, but also to support forecasters and decision-makers in different sectors of society, improving early warning systems and, ultimately, the lives and livelihoods of millions of people in Africa. Furthermore, an evaluation of how well models capture the relationships of important climate drivers with African precipitation, and the contribution of those relationships to forecast quality, also deserves investigation to deepen our knowledge of the sources of subseasonal predictability. Thus, this study provides an unprecedented weekly precipitation forecast evaluation for Africa, examining different ensemble prediction systems and key drivers modulating high-impact weather events.
Section 2 outlines the datasets and methods employed to evaluate the attributes of forecast quality. Section 2 also provides a description of the methodology used to analyze particular sources of subseasonal predictability and their links to African precipitation forecast quality. The results of deterministic and probabilistic forecast verification are presented in section 3, followed by an assessment of key driver-dependent forecast quality in section 4. A summary and conclusions are given in section 5.

a. S2S hindcasts
Precipitation hindcasts from the S2S database were evaluated for the European Centre for Medium-Range Weather Forecasts (ECMWF), the Met Office (UKMO), and the National Centers for Environmental Prediction (NCEP) models. These hindcasts have different configurations, such as forecast length, spatial resolution, frequency, period, ensemble size, and coupling effects (Table 1); see Vitart et al. (2017) for further details. Moreover, ECMWF and UKMO hindcasts are produced gradually, with model versions updated along with the near real-time forecasts, whereas NCEP hindcasts are fixed for a given model version. We analyzed ECMWF and UKMO hindcasts corresponding to model version dates in 2018.
Four start dates per month were chosen based on the UKMO initializations (the 1st, 9th, 17th, and 25th). We selected the closest start date for the nonmatching ECMWF initializations. This discrepancy in the models' initialization restricted a multimodel evaluation. For a fair intercomparison among models, three perturbed members, taken from initializations lagged by one day, were added to the NCEP ensemble. This procedure ensured that all models have at least seven ensemble members. Since the subseasonal time scale is beyond the weather prediction limit, a weekly time frame was employed to represent the subseasonal forecast range more adequately. Weekly precipitation was obtained considering four accumulation lead times: days 5-11 (week 1), 12-18 (week 2), 19-25 (week 3), and 26-32 (week 4).
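The accumulation windows above can be sketched as follows; the array layout and the `weekly_accumulations` helper are illustrative assumptions, not part of the actual S2S processing chain:

```python
import numpy as np

# Days 5-11 (week 1), 12-18 (week 2), 19-25 (week 3), 26-32 (week 4),
# using 1-based day indices as in the text.
def weekly_accumulations(daily, first_day=5, n_weeks=4):
    """Sum daily totals into consecutive 7-day lead-time windows."""
    weeks = []
    for w in range(n_weeks):
        start = first_day - 1 + 7 * w          # 0-based index of window start
        weeks.append(daily[..., start:start + 7].sum(axis=-1))
    return np.stack(weeks, axis=-1)

# Toy hindcast: 7 ensemble members, 32 forecast days, 1 mm/day everywhere.
daily = np.ones((7, 32))
wk = weekly_accumulations(daily)
print(wk.shape)        # (7, 4): each weekly total is 7 mm
```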

b. Observational dataset
Hindcasts were verified using data from the Global Precipitation Climatology Project (GPCP), version 1.2 (Huffman et al. 2001). Daily GPCP precipitation is produced by the National Aeronautics and Space Administration (NASA) by blending precipitation estimates from gauge stations and satellite measurements, and is sourced from the National Center for Atmospheric Research (NCAR). GPCP data were linearly interpolated to the 1.5° spatial resolution to match the models' regridded resolution made available in the S2S database, and used to calculate accumulated precipitation for the weekly periods defined in section 2a.

c. Forecast verification framework
Forecast verification is a process to evaluate the robustness of an ensemble prediction system, providing a guide for identifying its strengths and weaknesses when examining the joint distribution of forecasts and observations. A common forecast verification practice consists of assessing the attributes of deterministic and probabilistic forecast quality by computing metrics depending on forecast type (Coelho et al. 2019). Deterministic forecast verification metrics compare quantitative forecasts to observations (e.g., rainfall amounts in millimeters). The evaluation of deterministic forecasts is most often conducted by analyzing the ensemble mean, to verify the value of using a set of perturbed initial conditions rather than a single unperturbed forecast. Probabilistic forecast verification metrics compare forecast probabilities to observations (e.g., the probability of above-normal rainfall). Probabilities are usually examined in different categories and obtained by taking the proportion of ensemble members falling in ranges defined by predefined thresholds (e.g., the 33rd or 67th percentiles). Specifically, binary observations are used to assess probabilistic forecasts of dichotomous variables with two possible outcomes (e.g., rain or no-rain events).
A variety of deterministic and probabilistic forecast verification metrics were used to evaluate the attributes of forecast quality defined in Murphy (1993). In the definitions below, N denotes the sample size, F_i the forecast totals, and O_i the observation totals; the mean (unconditional) bias is ME = (1/N) Σ (F_i − O_i) (1).
- Association describes the linear relationship between deterministic forecasts and observations. Forecasts with good association are highly positively correlated with observations. The Pearson correlation coefficient [R; (2)] is a common metric of association, indicating the direction of deviations [R close to +1 (−1) indicates strong positive (negative) association]: R = Σ F′_i O′_i / [Σ (F′_i)² Σ (O′_i)²]^(1/2) (2), where F′_i denotes the forecast anomalies and O′_i the observed anomalies.
- Accuracy is the difference between forecasts and observations, providing the magnitude of forecast errors. Thus, the lower the difference, the better the accuracy. The mean square error [MSE; (3)], MSE = (1/N) Σ (F_i − O_i)², assesses deterministic errors, whereas the ranked probability score [RPS; (4)], RPS = (1/N) Σ_i Σ_z (P_iz − O_iz)², evaluates probabilistic errors for more than two probability categories, where K is the number of categories (z = 1, ..., K), P_iz is the cumulative forecast probability, and O_iz is the cumulative binary observation for occurrence (O_iz = 1) and nonoccurrence (O_iz = 0) of an event. The RPS is a generalized version of the Brier score (BS), which corresponds to the two-category case.
- Skill evaluates the accuracy of forecasts relative to some reference forecast, such as the observed climatology. The skill score [SS; (5)], SS = 1 − S_f/S_r, indicates forecasts more (less) skillful than the reference when it is positive (negative), where S_f is the score for the forecasts and S_r the score for the reference forecast. A perfect SS equals 1. The SS assesses deterministic and probabilistic skill using the MSE (3) and the RPS (4), resulting in the mean square skill score [MSSS = 1 − MSE_f/MSE_r; (6)] and the ranked probability skill score [RPSS = 1 − RPS_f/RPS_r; (7)], respectively, where MSE_f and MSE_r (RPS_f and RPS_r) are the MSEs (RPSs) for the forecasts and for a reference forecast. The RPSS is sensitive to ensemble size, and a negative bias is introduced for small ensembles (Müller et al. 2005). To overcome this sensitivity, we use a debiased (discrete) RPSS [RPSS_D = 1 − RPS_f/(RPS_r + D); (8)], derived for any ensemble size and probability category by adding a bias correction term D to the reference forecast (Weigel et al. 2007) rather than including a correction term obtained by randomly resampling from climatology (Müller et al. 2005). For K equiprobable categories, the correction term is D = (1/M)[(K² − 1)/6K], where M is the ensemble size.
- Discrimination is the ability of forecasts to discern between different observed outcomes. For dichotomous forecasts, it is the ability to distinguish between occurrence and nonoccurrence of events, for instance precipitation falling in a tercile category. The relative operating characteristic (ROC) diagram and the area under the curve (AUC) are metrics adopted for assessing discrimination, providing useful information for decision-makers. The ROC diagram for a given event is obtained by plotting the hit rate against the false alarm rate computed at different probability thresholds. The hit rate is the ratio between the number of correct forecasts of the event and the total number of occurrences of the event, whereas the false alarm rate is the ratio between the number of forecasts of the event that did not verify and the total number of nonoccurrences of the event.
The diagonal line in the ROC diagram is where the hit rate equals the false alarm rate and indicates no discrimination.
Better discrimination is found when the ROC curve lies above the diagonal line and close to the upper-left corner, indicating that the hit rate exceeds the false alarm rate. The AUC is computed from the ROC diagram by joining the points associated with each threshold to form a series of trapezoids and adding their areas. The AUC is interpreted as a score, indicating no (perfect) discrimination when equal to 0.5 (1) (Kharin and Zwiers 2003a).
- Reliability measures the conditional bias in forecast probabilities, indicating the extent of their over- or underconfidence. A forecasting system is reliable when, for all probability thresholds, the probabilistic outcomes are equal to the observed frequencies. For example, if a system is reliable, we should expect an event to occur 60% of the times the system issues a 60% probability of occurrence.
Resolution assesses the degree of variability in the observed frequencies at different forecast probabilities. Sharpness evaluates the ability of forecasts to predict extreme probabilities. The attributes diagram (AD) is a useful way to verify probabilistic forecasts, summarizing the ability of ensemble prediction systems to represent the attributes of reliability, resolution, and sharpness. The AD is constructed by plotting the observed frequency for different forecast probabilities, with stratification done by binning the data into probability thresholds. The diagonal line in the AD indicates perfect reliability, where the forecast probabilities equal the observed frequencies. The horizontal line represents the observed climatological frequency, indicating no resolution. The no-skill line lies at the midpoint between the perfect-reliability line and the observed climatological frequency. Probabilities falling into the area between the no-skill line and the vertical line at the climatological frequency contribute to increased skill, as demonstrated by the decomposition of the RPS/BS (Murphy 1972, 1973). Histograms provide information on the frequency of forecasts in each bin and the degree of sharpness.
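To make the construction of an attributes diagram concrete, the sketch below bins hypothetical forecast probabilities and computes the observed frequency and the forecast count (sharpness) per bin; the function name and the five-bin choice are illustrative assumptions:

```python
import numpy as np

def reliability_curve(probs, occurred, n_bins=5):
    """Observed frequency and forecast count per probability bin,
    i.e. the points plotted in an attributes diagram."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    obs_freq = np.full(n_bins, np.nan)
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        in_bin = idx == b
        counts[b] = in_bin.sum()
        if counts[b] > 0:
            obs_freq[b] = np.asarray(occurred)[in_bin].mean()
    return obs_freq, counts

probs = np.array([0.1, 0.1, 0.5, 0.5, 0.9, 0.9])   # forecast probabilities
occ = np.array([0, 0, 1, 0, 1, 1])                  # event occurred (1) or not
freq, n = reliability_curve(probs, occ)
print(freq)    # observed frequency per bin; NaN where no forecasts fall
```

Points on the diagonal of the diagram correspond to bins where `freq` equals the bin's forecast probability.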
Evaluation was performed over the African continent and adjacent regions to explore forecast quality not only over land, but also over oceanic areas where important atmospheric systems, such as the intertropical convergence zone (ITCZ), are located. To analyze the regional performance of the models, verification metrics were computed over four geographically selected African regions (Fig. 1), referred to as the West African Monsoon (WAM), Equatorial West Africa (EWA), Equatorial East Africa (EEA), and Southern Africa (SA) regions. These locations were chosen to represent different climate regions with particular rainy seasons (Zaitchik 2017).
For deterministic forecasts, ensemble mean anomalies were obtained after subtracting the ensemble mean climatology computed through a leave-one-out cross-validation method that excludes the verified year. This approach ensures that no information from a given forecast is used in the verification of that same forecast (e.g., Vitart 2017), providing independence between the forecasts and the verification subset to avoid unfair evaluation and minimize potential skill overestimation (Wilks 2006). For probabilistic forecasts, tercile categories (below-normal, near-normal, and above-normal) were analyzed, as they are frequently used in forecasting and provide a useful way to assess a model's ability to distinguish between dry, normal, and wet weeks. Tercile categories were defined using precipitation totals for each model ensemble member, again leaving one year out for cross-validation. The lower and upper terciles were estimated after pooling all model ensemble members together. Probabilities were obtained by computing the fraction of ensemble members in each tercile category. Ensemble mean anomalies and tercile probabilities were calculated depending on the start date and lead time. Observed anomalies and binary outcomes were calculated in the same way.
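A minimal sketch of the leave-one-year-out tercile procedure described above, assuming a small (years × members) array of weekly totals; the helper name and toy data are illustrative:

```python
import numpy as np

def tercile_probabilities(ens):
    """Leave-one-year-out tercile probabilities: the 33rd/67th percentile
    thresholds come from all pooled members of the remaining years, and the
    probability is the fraction of members falling in each category."""
    n_years, _ = ens.shape
    probs = np.empty((n_years, 3))
    for y in range(n_years):
        pool = np.delete(ens, y, axis=0).ravel()     # exclude verified year
        lo, hi = np.percentile(pool, [100 / 3, 200 / 3])
        m = ens[y]
        probs[y] = [(m < lo).mean(),                 # below-normal
                    ((m >= lo) & (m <= hi)).mean(),  # near-normal
                    (m > hi).mean()]                 # above-normal
    return probs

rng = np.random.default_rng(2)
ens = rng.gamma(2.0, 10.0, size=(12, 7))   # 12 years x 7 members of weekly totals
p = tercile_probabilities(ens)
print(p.shape, bool(np.allclose(p.sum(axis=1), 1.0)))   # (12, 3) True
```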
Verification metrics were calculated for each model and lead time using forecasts whose start dates fall within the following seasons over the common period of 1999-2010: December-January-February (DJF), March-April-May (MAM), June-July-August (JJA), and September-October-November (SON). While these seasons may differ slightly from localized rainy seasons, they represent the main wet seasons found across Africa and are suitable for an overall evaluation. For each model, 144 forecasts (12 starts per season over 12 years) were examined using the available ensemble members shown in Table 1. The statistical significance of correlations different from zero was analyzed using a two-sided Student's t test (Wilks 2006) at the 95% level. The effective sample size was calculated based on the lag-1 autocorrelation (Livezey and Chen 1983).
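The lag-1-based effective sample size and the t test can be sketched as below; this is one common form of the correction, and the synthetic data are illustrative:

```python
import numpy as np

def lag1_autocorr(a):
    """Sample lag-1 autocorrelation of a 1D series."""
    a = a - a.mean()
    return float(np.sum(a[:-1] * a[1:]) / np.sum(a * a))

def effective_n(x, y):
    """A common lag-1-based correction for serial dependence:
    N_eff = N * (1 - r1x*r1y) / (1 + r1x*r1y)."""
    r = lag1_autocorr(x) * lag1_autocorr(y)
    return len(x) * (1 - r) / (1 + r)

def t_statistic(R, n_eff):
    """t statistic for testing a correlation R against zero with N_eff points."""
    return R * np.sqrt((n_eff - 2) / (1 - R**2))

rng = np.random.default_rng(3)
x = rng.standard_normal(144)            # 144 forecasts per season, as above
y = 0.4 * x + rng.standard_normal(144)
n_eff = effective_n(x, y)
R = np.corrcoef(x, y)[0, 1]
print(round(n_eff), round(t_statistic(R, n_eff), 2))
```

For serially correlated series, `n_eff` drops below `N` and the t test becomes correspondingly stricter.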

1) DRIVERS' INDICES
ENSO and IOD indices were obtained, respectively, by averaging sea surface temperature (SST) anomalies in the Niño-3.4 region (5°S-5°N, 120°-170°W) (Bamston et al. 1997) and by computing the dipole mode index (DMI) as the difference of area-averaged SST anomalies between the western (10°S-10°N, 50°-70°E) and southeastern (10°S-0°, 90°-110°E) tropical Indian Ocean (Saji et al. 1999). The daily optimum interpolation SST version 2 (OISST.v2) of the National Oceanic and Atmospheric Administration [NOAA; Reynolds et al. (2007)] was used as the observational reference and SST hindcasts from the S2S database as the predicted fields. Observed and forecast weekly SST was obtained by averaging daily values over the four weeks defined in section 2a. SST anomalies were computed by removing the climatology from the total field using a cross-validation approach. Weekly ENSO and IOD indices were normalized by the corresponding standard deviation.
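A minimal sketch of the box-averaging step behind the DMI (and, analogously, the Niño-3.4 index), using a synthetic SST anomaly field; `box_mean` and the toy grid are illustrative assumptions, and area (cosine-latitude) weighting is omitted for brevity:

```python
import numpy as np

def box_mean(field, lat, lon, lat0, lat1, lon0, lon1):
    """Plain average of anomalies inside a latitude/longitude box."""
    la = (lat >= lat0) & (lat <= lat1)
    lo = (lon >= lon0) & (lon <= lon1)
    return field[:, la][:, :, lo].mean(axis=(1, 2))

# Toy weekly SST anomaly field on a 1.5-degree grid: (week, lat, lon).
rng = np.random.default_rng(4)
lat = np.arange(-88.5, 90.0, 1.5)
lon = np.arange(0.0, 360.0, 1.5)
sst = rng.standard_normal((52, lat.size, lon.size))

# DMI: west box (10S-10N, 50-70E) minus southeast box (10S-0, 90-110E),
# normalized by its own standard deviation (Saji et al. 1999).
dmi = (box_mean(sst, lat, lon, -10, 10, 50, 70)
       - box_mean(sst, lat, lon, -10, 0, 90, 110))
dmi = dmi / dmi.std()
print(round(float(dmi.std()), 6))    # 1.0 after normalization
```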
The real-time multivariate MJO [RMM; Wheeler and Hendon (2004)] index was calculated as in Gottschalck et al. (2010) and Vitart (2017), following the same approach employed to obtain the index made available in the S2S database. The RMM components (RMM1 and RMM2) were computed by projecting latitudinally averaged daily anomalies of zonal wind (850 and 200 hPa) and outgoing longwave radiation (OLR) at the top of the atmosphere onto the two dominant observed eigenvectors associated with the MJO. Zonal wind at 0000 UTC from the ERA-Interim reanalysis (Dee et al. 2011) and daily interpolated OLR from NOAA (Liebmann and Smith 1996) were used to calculate the observed index. Zonal wind and OLR from the S2S hindcasts were selected as the corresponding forecasts. Reanalysis and hindcasts were linearly interpolated from a horizontal resolution of 1.5° to 2.5°, matching the 144 longitudinal grid points of the observed OLR and eigenvectors. Zonal wind and OLR anomalies were calculated by subtracting the climatology from the total field using a cross-validation approach. Low-frequency signals within the verifying datasets were filtered by removing from each day the mean of the previous 120 days. The 120-day mean was subtracted from forecasts using a combination of observations and hindcasts, filling the missing days preceding the model's initialization with observed data. The anomalies were then normalized by their respective observed normalization factors as in Gottschalck et al. (2010). Last, the anomalies were projected onto the two leading eigenvectors and divided by the corresponding observed standard deviations calculated by Wheeler and Hendon (2004), generating the RMM1 and RMM2 time series. Observed and forecast weekly RMM components were computed following a similar approach to that applied for the weekly SST.
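The 120-day filtering step can be illustrated in isolation; the helper below is a sketch on a synthetic series, not the full RMM computation:

```python
import numpy as np

def remove_120day_mean(anom):
    """Subtract from each day the mean of the previous 120 days, filtering
    low-frequency (interannual) signals before the RMM projection."""
    out = np.full_like(anom, np.nan)       # first 120 days have no filter window
    for t in range(120, len(anom)):
        out[t] = anom[t] - anom[t - 120:t].mean()
    return out

x = np.arange(200.0)            # a pure linear trend in the anomalies
y = remove_120day_mean(x)
print(y[120], y[199])           # 60.5 60.5: the trend collapses to a constant offset
```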

2) DRIVERS' SIGNAL
To explore the ability of forecasts to capture the relationship between precipitation variability and different drivers, a simple linear regression analysis between weekly precipitation and the drivers' indices was performed using observations and hindcasts in weeks 1-4 for initializations within DJF, MAM, JJA, and SON. Over 18 years, 450 forecasts were used in DJF (25 starts), 468 in MAM/SON (26 starts), and 486 in JJA (27 starts). Modeled (observed) regression coefficients were obtained by regressing hindcast (GPCP) precipitation anomalies onto the forecast (observed) drivers' indices. Precipitation anomalies were computed as in section 2c. Since significant associations can exist between ENSO and the IOD (e.g., Zhang et al. 2015), a multiple linear regression approach was also employed to examine ENSO- and IOD-related rainfall variability simultaneously. Regression coefficients were scaled to one standard deviation of the index following Lo and Hendon (2000). A two-sided Student's t test (Allen 1997) was applied at the 95% level to evaluate the statistical significance of regression slopes different from zero. The effective sample size was determined as in section 2c.
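The per-grid-point regression, scaled to one standard deviation of the index, can be sketched as follows; the synthetic index and three-point "grid" are illustrative assumptions:

```python
import numpy as np

def regression_pattern(precip_anom, index):
    """Least-squares slope of precipitation anomalies on a driver index at
    each grid point, scaled to one standard deviation of the index by
    normalizing the index first."""
    idx = (index - index.mean()) / index.std()
    # slope = cov(index, precip) / var(index), per grid point
    return (precip_anom * idx[:, None]).mean(axis=0) / np.mean(idx**2)

rng = np.random.default_rng(5)
idx = rng.standard_normal(450)                  # e.g. 450 DJF forecasts
true_pattern = np.array([2.0, -1.0, 0.0])       # mm per std dev, 3 grid points
pr = idx[:, None] * true_pattern + 0.1 * rng.standard_normal((450, 3))
slopes = regression_pattern(pr, idx)
print(np.abs(slopes - true_pattern).max() < 0.2)   # pattern recovered
```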
Forecast quality was initially analyzed through the absolute difference between forecast and observed regression coefficients, to determine the models' ability to represent the drivers' teleconnections to African rainfall. Next, observed and modeled rainfall variations linearly dependent on the drivers were removed from the observed and predicted fields, respectively, to evaluate the association between observations and hindcasts after subtracting the ENSO-, IOD-, and MJO-related rainfall patterns. After removing the modeled driver-related precipitation variability from the hindcasts, the effect of adding the observed regression patterns (i.e., obtained by regressing GPCP precipitation anomalies onto the observed drivers' indices) to the hindcasts was also examined, to verify the quality of the calibrated forecasts. The regional averages of the absolute difference and of the correlation between observations and forecasts were analyzed over the four regions shown in Fig. 1. Significant correlations were obtained as in section 2c.
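The remove-and-restore logic described above can be sketched as below; `remove_signal`/`add_signal` are hypothetical helper names, and the synthetic pattern stands in for a driver-related regression map:

```python
import numpy as np

def remove_signal(anom, index, slope):
    """Subtract the part of the anomalies linearly dependent on the index."""
    return anom - index[:, None] * slope[None, :]

def add_signal(anom_resid, index, slope):
    """Calibration step: add a (e.g. observed) regression pattern, driven
    by the index, back onto residual anomalies."""
    return anom_resid + index[:, None] * slope[None, :]

rng = np.random.default_rng(6)
idx = rng.standard_normal(100)
slope = np.array([1.5, -0.5])                       # driver-related pattern
anom = idx[:, None] * slope + rng.standard_normal((100, 2))
resid = remove_signal(anom, idx, slope)

# Once the linear part is removed the residual decorrelates from the index,
# and adding the same pattern back restores the original field.
print(abs(np.corrcoef(resid[:, 0], idx)[0, 1]) < 0.3,
      bool(np.allclose(add_signal(resid, idx, slope), anom)))
```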

Forecast quality assessment
In this section, a subseasonal African precipitation forecast quality assessment for three S2S models (ECMWF, UKMO, and NCEP) is conducted for lead times from one to four weeks ahead, considering start dates in DJF, MAM, JJA, and SON during 1999-2010. For consistency between models, only results using seven ensemble members of each model are shown, as examining the full ECMWF ensemble indicated only a slight improvement. Although the first seven ECMWF ensemble members were selected for evaluation, results are similar if members are chosen at random. The below-normal category assessment overall shows similar performance to the above-normal category, whereas the assessment for the near-normal category indicates unskillful forecasts. For this reason, the probabilistic evaluation focuses on results for the above-normal category, with results for the other categories mentioned when necessary and made available in the online supplemental material.

Mean precipitation biases (Fig. 2) for ECMWF and NCEP are approximately constant throughout the weeks over most regions; however, UKMO shows a drying trend with lead time, particularly in DJF and JJA. In general, ECMWF has the lowest biases over land compared to the other models, with roughly similar spatial patterns to UKMO, except in DJF and SON when strong negative biases develop over EWA coastal regions in UKMO (Figs. 2a,d). NCEP generally has biases of the opposite sign to ECMWF and UKMO over East and south-southeastern Africa, with the overestimation (underestimation) for ECMWF and UKMO (NCEP) in these regions most notable during their wet seasons (SON and DJF, respectively). Models have deficiencies in representing precipitation near Mozambique and Madagascar in DJF, which could affect subseasonal prediction of tropical cyclones across the region (Kolstad 2019). Large positive biases seen over the equatorial Atlantic and Indian Oceans in MAM (Fig. 2b) are likely related to shortcomings in predicting the seasonal migration of the ITCZ (e.g., Shonk et al. 2019). All models show similar biases at weeks 1-2 over the Sahel in JJA (Fig. 2c), with some evidence of a meridional tripole structure, which is particularly zonally uniform in NCEP. The drying trend in UKMO leads to strong negative biases in the core of the WAM by weeks 3-4.

a. Deterministic verification
Linear correlation is used to evaluate the association between hindcasts and observed precipitation anomalies (Fig. 3). Positive correlations are strongest for week 1 and reduce with increasing lead time, with significant correlations mainly concentrated near the equator after two weeks lead (Figs. 3a,b,d), particularly for ECMWF. In JJA, high association is shown over West Africa near the Gulf of Guinea (GoG) for all models, with significant correlations up to week 4 (Fig. 3c).
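The association metric behind Fig. 3 reduces to the anomaly correlation of section 2c; a minimal sketch with synthetic "week 1" and "week 4" forecasts illustrates how correlations decay as the forecast signal weakens relative to noise:

```python
import numpy as np

def pearson_r(f_anom, o_anom):
    """Correlation between forecast and observed anomalies (the association
    metric R of section 2c)."""
    f = f_anom - f_anom.mean()
    o = o_anom - o_anom.mean()
    return float(np.sum(f * o) / np.sqrt(np.sum(f**2) * np.sum(o**2)))

rng = np.random.default_rng(0)
obs = rng.standard_normal(144)                      # 144 forecasts per season
fc_week1 = 0.8 * obs + 0.4 * rng.standard_normal(144)   # strong signal
fc_week4 = 0.2 * obs + 1.0 * rng.standard_normal(144)   # weak signal
print(pearson_r(fc_week1, obs) > pearson_r(fc_week4, obs))   # True
```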
Maps of the MSSS, obtained by relating the MSE between hindcasts and observed precipitation anomalies to the reference MSE, are shown in Fig. 4. Skill substantially decreases over most regions from week 1 to subsequent weeks. Skill is more pronounced over East Africa in DJF, MAM, and SON (Figs. 4a,b,d), showing, for example, positive scores up to week 4 during DJF for ECMWF. Skill in JJA is restricted to a region of West Africa near the equatorial Atlantic (Fig. 4c). Despite presenting large areas of positive correlation (Fig. 3), models show negative MSSS as a remarkable characteristic in all seasons, suggesting large errors in predicting precipitation anomalies, especially for UKMO and NCEP. This can be investigated by decomposing the MSSS into three squared components (Murphy 1988).
FIG. 4. As in Fig. 2, but for the mean square skill score (MSSS) between the hindcast ensemble mean and observed precipitation anomalies. A zero-anomaly forecast was adopted as the reference forecast.
The first term is R² [the square of the correlation in (2)], the second is the conditional bias, providing the forecast amplitude errors, and the third is the unconditional bias (1), which is zero when considering anomalies. The conditional bias is computed as [R − (S_f/S_o)]², where S_f and S_o are the standard deviations of the forecasts and observations, respectively. By expanding the conditional bias, the MSSS can be evaluated as 2R(S_f/S_o) − (S_f/S_o)². When either the correlation or the ratio of the standard deviations is null, there is no skill improvement compared to the reference forecast; the same holds for negative correlations. The ratio of the standard deviations indicates that models have stronger underestimation (S_f/S_o < 1) at longer leads compared to weeks 1-2 (Fig. 5b). Since the MSSS measures both the linear association and the relationship between the magnitudes of the forecast and observed anomalies, weak positive correlations in many regions (Fig. 5a) and/or underestimation of the magnitude of the anomalies lead to large negative MSSS.
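The identity between the MSSS and its decomposition can be verified numerically; the synthetic anomalies below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
o = rng.standard_normal(144)
f = 0.5 * o + 0.5 * rng.standard_normal(144)
f, o = f - f.mean(), o - o.mean()            # anomalies: zero unconditional bias

mse_f = np.mean((f - o) ** 2)
mse_r = np.mean(o ** 2)                      # zero-anomaly reference forecast
msss = 1.0 - mse_f / mse_r

R = np.sum(f * o) / np.sqrt(np.sum(f**2) * np.sum(o**2))
ratio = f.std() / o.std()                    # S_f / S_o
decomposed = 2 * R * ratio - ratio**2        # Murphy (1988) with zero bias

print(bool(np.isclose(msss, decomposed)))    # True: the identity holds
```

With `ratio < 1` (underestimated variance) and modest `R`, the expression `2*R*ratio - ratio**2` quickly turns negative, matching the large negative MSSS discussed above.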

b. Probabilistic verification
Probabilistic skill is assessed through the RPSS_D displayed in Fig. 6. When the small ensemble size is not accounted for, the RPSS is negative in most regions (Fig. S1 in the online supplemental material). Negative RPSS_D over climatologically dry regions indicates that the tercile distribution can be skewed when the lower tercile boundary is not well defined. RPSS_D is positive in most regions and lead times for ECMWF, whereas the other models have a more mixed signal, with particularly strong negative RPSS_D over some regions. While ECMWF shows positive skill over wide regions at all lead times, UKMO and NCEP generally have limited skill beyond week 1, although UKMO maintains relatively high skill over East Africa for starts in MAM and SON (Figs. 6b,d), and NCEP over West Africa in JJA (Fig. 6c). Counterintuitively, UKMO has its poorest skill near the Democratic Republic of the Congo (DRC) during weeks 1-2 compared to subsequent weeks, which is also evident to some extent in the MSSS assessment (Fig. 4) and deserves additional investigation.
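The debiased score used here can be sketched as follows; the RPS values are illustrative, while the correction term D follows the definition in section 2c:

```python
def rpss_debiased(rps_fc, rps_clim, M, K=3):
    """Debiased RPSS (Weigel et al. 2007): a correction D, depending only on
    ensemble size M and category count K, is added to the reference RPS."""
    D = (K**2 - 1) / (6.0 * K * M)
    return 1.0 - rps_fc / (rps_clim + D)

# With M = 7 members and terciles, D = 8/126, so a forecast with the same raw
# RPS as climatology still receives a slightly positive debiased score.
print(round(rpss_debiased(rps_fc=0.44, rps_clim=0.44, M=7), 3))   # 0.126
```

As `M` grows, D shrinks toward zero and RPSS_D converges to the ordinary RPSS.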
The ability of probabilistic forecasts to discriminate heavier precipitation events is shown in Fig. 7 through ROC diagrams for the above-normal category, using grid points over different African regions. Good discrimination is found when a high hit rate is combined with a low false alarm rate. For example, when above-normal rainfall over EEA in SON is forecast with a 60% probability (square marker in Fig. 7), forecasts of above-normal rainfall for week 1 result in hit rates of 40%, 53%, and 48% against false alarm rates of 16%, 20%, and 23% for ECMWF, UKMO, and NCEP, respectively. In contrast, the small differences between hit and false alarm rates for the same threshold indicate forecasts with limited value in week 4. Thus, ROC diagrams can support forecast users in triggering advisory action in the decision-making process.
The reduction in discrimination from weeks 1-2 to the following weeks (Fig. 7) is consistent with the reduction of forecast quality seen in other metrics. Discrimination is slightly better for ECMWF/UKMO than for NCEP, and for EEA than for other regions, particularly in weeks 1-2, with AUC scores around 0.7 in week 1. A ROC score of 0.7, for instance, indicates that 70% of forecasts have higher probabilities of falling in the above-normal category when above-normal precipitation occurs than when it does not. ROC scores near 0.5 indicate that the model cannot adequately distinguish between different outcomes, yielding essentially random classifications after two weeks lead for most regions.
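The hit-rate/false-alarm-rate construction behind the ROC curves and AUC can be sketched as below; the thresholds, synthetic probabilities, and base rate are illustrative assumptions:

```python
import numpy as np

def roc_auc(probs, occurred, thresholds=np.arange(0.1, 1.0, 0.1)):
    """Hit rate vs false alarm rate at each probability threshold, with the
    area under the ROC curve accumulated from trapezoids."""
    probs = np.asarray(probs)
    occurred = np.asarray(occurred, dtype=bool)
    hits, fals = [1.0], [1.0]                 # threshold 0: always warn
    for t in sorted(thresholds):
        warn = probs >= t
        hits.append(np.sum(warn & occurred) / occurred.sum())
        fals.append(np.sum(warn & ~occurred) / (~occurred).sum())
    hits.append(0.0)
    fals.append(0.0)                          # threshold 1: never warn
    return sum((fals[i] - fals[i + 1]) * (hits[i] + hits[i + 1]) / 2
               for i in range(len(hits) - 1))

rng = np.random.default_rng(1)
occ = rng.random(500) < 1 / 3                 # above-normal weeks, base rate 1/3
p = np.clip(1 / 3 + 0.35 * (occ - 1 / 3) + 0.15 * rng.standard_normal(500), 0, 1)
print(0.5 < roc_auc(p, occ) <= 1.0)           # True: skillful discrimination
```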
Figure 8 shows the AD for the above-normal category using grid points over the same regions analyzed in Fig. 7. Models have better reliability and resolution in weeks 1-2 than in weeks 3-4, as shown by colored lines closer to the solid diagonal line and farther from the horizontal line. Indeed, only ECMWF forecasts for week 1 fall into the zone of enhanced skill, notably in EEA and SA. In general, ECMWF has slightly better reliability and resolution than UKMO and NCEP in the two highest bins at weeks 1-2. For weeks 3-4, models show roughly similar features, though such comparable results are less apparent in the below-normal category for EEA and EWA (Fig. S4). Probabilistic forecasts can be marginally useful up to week 3 for most regions, and even to week 4 in the below-normal category. Such forecasts may have usefulness for decision-making as they are close to the no-skill line and the slope of the colored lines is still positive (Weisheimer and Palmer 2014).
Overconfidence is a noticeable feature in all weeks, with the curves falling below (above) the perfect-reliability line at high (low) probabilities: an event forecast with a probability of 70% is verified only about 50% of the time, for example, whereas an event forecast with a probability of 10% is verified about 20% of the time. Mean forecast probabilities are slightly higher than the mean observed frequency for the extreme tercile categories in the ECMWF and UKMO models (not shown). This difference is more pronounced for NCEP, corroborating the lowest reliability found among the models (Figs. 8 and S4). Reliability could be improved by employing calibration, especially after week 2 when it degrades. For the near-normal category, all models have lower mean forecast probabilities than the mean observed frequency (not shown) and no resolution (Fig. S5). The histograms show sharper forecasts in weeks 1-2 than at longer leads, with some cases presenting a U-shaped pattern that concentrates high frequencies in the highest and lowest bins. Sharpness drops with increasing lead time, with maximum frequencies appearing around the climatological frequency (0.2-0.4).
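The overconfidence signature described above can be reproduced with a small synthetic experiment. The sketch below (illustrative only; the probabilities and events are made up, not taken from the hindcasts) performs the core computation behind an attributes diagram: binning issued probabilities and comparing each bin's mean with the observed relative frequency.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic overconfident forecasts: issued probabilities deviate from
# climatology (1/3) twice as much as the true event probability does.
n = 20000
issued = rng.uniform(0, 1, n)
true_prob = np.clip(1 / 3 + 0.5 * (issued - 1 / 3), 0, 1)
events = rng.uniform(0, 1, n) < true_prob

bins = np.linspace(0, 1, 6)            # five probability bins, as in an AD
idx = np.digitize(issued, bins[1:-1])  # bin index 0..4 for each forecast
for b in range(5):
    sel = idx == b
    print(f"mean forecast {issued[sel].mean():.2f} -> "
          f"observed freq {events[sel].mean():.2f}")
```

High-probability bins verify less often than issued and low-probability bins verify more often, so the reliability curve is flatter than the diagonal, exactly the pattern calibration is meant to correct.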

Drivers' modulation of forecast quality
The previous two sections show that the models have their best subseasonal forecasting performance one to two weeks ahead, with ECMWF overall more skillful than UKMO and NCEP. Here, the link between weekly African precipitation forecast quality and important large-scale drivers (ENSO, IOD, and the MJO) is investigated using ECMWF hindcasts during 1997-2014. The observed characteristics of weekly African precipitation variability linearly associated with those drivers are illustrated by regressing observed rainfall anomalies onto weekly means of the observed drivers' indices. Only regression coefficients for week 1 based on starts in DJF, MAM, JJA, and SON are shown (Fig. 9). Weekly ENSO-related rainfall variability is more pronounced over East/Southeastern Africa in DJF than in other seasons (Fig. 9), with positive (negative) anomalies over East (Southeastern) Africa. Additionally, negative (positive) anomalies in West Africa/Sahel (East Africa) are associated with El Niño influence during JJA (SON). IOD generally starts developing in JJA and reaches its maturity in SON before dissipating around December (Cai et al. 2018). Therefore, the weak regression coefficients in DJF and MAM are likely not related to this driver. A positive relationship is verified between IOD and rainfall over the Sahel in JJA and East Africa in SON. The latter shows the most striking relations between IOD and rainfall, with increasing (decreasing) East African precipitation during positive (negative) phases of the driver. Because there is significant correlation between the ENSO and IOD indices (Table 2), the regression pattern for each driver likely includes some signal from the other; hence, a multiple linear regression is used when accounting for their combined effects in the subsequent analysis. Differences between simple and multiple regression patterns are most noticeable in SON, with the latter showing, in particular, a weaker positive precipitation signal associated with ENSO over East Africa (not shown). In DJF, RMM1/RMM2 are related to large precipitation variations over Southeastern Africa (Fig. 9). In MAM, large regression coefficients are slightly displaced to the north compared to DJF, showing significant associations between rainfall and different MJO phases in EWA (RMM1) and during the East African long rains (RMM2). The boreal summer (JJA) is characterized by the MJO influence on the WAM, highlighting strong rainfall variability near the GoG and in the westernmost countries, particularly for RMM1. The MJO-related rainfall variations on the East African short rains are verified over the central-southern (easternmost) region when regressing RMM2 (RMM1) with precipitation in SON.

FIG. 9. Simple linear regression between observed weekly precipitation anomalies and weekly means of the Niño-3.4, DMI, and RMM (RMM1 and RMM2 components) observed indices in week 1 for start dates in DJF, MAM, JJA, and SON over the 1997-2014 period. Regression coefficients statistically significant at the 95% level are stippled. Units are accumulated millimeters per week. Indices are normalized by their corresponding standard deviations. Gray shading as described in Fig. 2.
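The regression maps of Fig. 9 amount to a grid-point-wise least-squares fit of precipitation anomalies onto a normalized driver index. A schematic version of that computation (illustrative code with synthetic data, not the authors' implementation) could be:

```python
import numpy as np

def driver_regression(precip_anom, index):
    """Regression coefficient of weekly precipitation anomalies onto a
    driver index at each grid point (mm per week per std dev of the index).

    precip_anom: (time, lat, lon) anomalies; index: (time,) driver index.
    """
    x = (index - index.mean()) / index.std()   # normalize, as in Fig. 9
    y = precip_anom - precip_anom.mean(axis=0)
    # least-squares slope at every grid point in one vectorized step
    return np.tensordot(x, y, axes=(0, 0)) / np.sum(x * x)

# Synthetic check: a tiny grid whose rainfall responds at ~2 mm/week
# per standard deviation of the index, plus weak noise.
rng = np.random.default_rng(2)
idx = rng.normal(0, 1, 500)
field = 2.0 * idx[:, None, None] + rng.normal(0, 0.1, (500, 2, 2))
coef = driver_regression(field, idx)
print(coef.shape)
```

Stippling the significant coefficients, as in Fig. 9, would additionally require a t test on each grid point's slope against an effective sample size.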
Figure 10 shows the regional average of the absolute difference between the regression coefficients of hindcast precipitation anomalies with forecasted drivers' indices and the corresponding observed regression coefficients over the African regions (Fig. 1) during weeks 1-4. The largest differences are linked to RMM2 for most regions, except over SA in DJF. These differences are more pronounced in EEA and WAM, where precipitation variability is more closely related to RMM2 than in other regions (Fig. 9). Larger discrepancies are also verified for either RMM1 or DMI over the same regions compared to Niño-3.4, notwithstanding that rainfall anomalies over EEA in MAM are only weakly associated with RMM1 (Fig. 9) and the IOD is usually inactive then. Overall, the ENSO signal is not as well captured over SA as the other relevant drivers in DJF, especially after week 1. Moreover, absolute differences increase with lead time in EEA during SON, which may affect subseasonal short-rains predictions. When errors are considered relative to the observed regression patterns, i.e., dividing the absolute differences by the corresponding observed rainfall response to the driver, they are more balanced, except for DMI over EEA in DJF and RMM2 over EWA in MAM (not shown). Forecasted regression patterns suggest that the largest errors in Fig. 10 are related to the model's shortcomings in representing both the location and the amplitude of particular driver-related rainfall anomalies (Figs. S6 and S7).
The modulation of subseasonal African precipitation forecast quality by the strength of the drivers' teleconnections within the ECMWF model is investigated in Fig. 11. This association is assessed by the regional average of the correlation between observations and forecasts after removing the observed and modeled regression patterns, computed between the corresponding precipitation anomalies and drivers' indices, from the observed and predicted fields, respectively.

FIG. 10. Regional average of the absolute difference between the linear regression of the ECMWF hindcast ensemble mean precipitation anomalies with forecasted drivers' indices and the corresponding observed regression coefficients over African regions (Fig. 1) during weeks 1-4 for initializations in DJF, MAM, JJA, and SON over the 1997-2014 period. Observed (forecasted) regression coefficients were calculated using observed (forecasted) indices and observed (forecasted) precipitation anomalies. A multiple linear regression approach was employed to assess the ENSO and IOD signals on rainfall simultaneously (see text for details). Indices were normalized by their corresponding standard deviations. Units are accumulated millimeters per week.
The lowest correlations are verified when all driver-related regression patterns are sequentially subtracted, compared to when no removal is considered. This difference is more pronounced at longer leads and for particular regions, such as EEA in DJF/SON. The impact of removing both the ENSO and IOD signals is more noticeable in DJF and SON over EEA than in other seasons and regions, with larger ENSO-related (IOD-related) effects in the former (latter) season. Intriguingly, removing the IOD (or IOD+ENSO) signal in SON affects forecast association more than removing all drivers. This could be related to the fact that all indices are significantly correlated during the period under consideration (Table 2), which means the different removals may be subtracting the same signal in different ways. It is well known that ENSO and IOD are interannual modes of variability and that their patterns project onto the RMM index, in particular for ENSO (Wheeler and Hendon 2004). When computing the RMM index, subtracting the mean of the previous 120 days is supposed to remove low-frequency variations (see section 2d). However, such an approach will not necessarily be effective in the drivers' developing and decaying phases, as the last 120 days may not include their signatures. Thus, the correlations between the ENSO/IOD indices and the MJO index are not easy to interpret physically, and it may not be appropriate to perform a multiple linear regression including all indices without an underlying physical understanding. Additionally, the correlations between the RMM components are nonzero, likely because the two eigenvectors were calculated over all seasons and the correlations in individual seasons need not be null.
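The signal-removal experiment behind Fig. 11 can be mimicked at a single grid point: regress the observed and forecast anomalies onto their respective driver indices, subtract the fitted responses, and correlate the residuals. The following sketch uses synthetic ENSO-like data purely for illustration (the amplitudes and sample size are assumptions, not the study's values):

```python
import numpy as np

def remove_driver_signal(anom, index):
    """Subtract the least-squares linear response to a driver index
    from an anomaly time series at one grid point."""
    x = (index - index.mean()) / index.std()
    slope = np.sum(x * (anom - anom.mean())) / np.sum(x * x)
    return anom - slope * x

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# Synthetic region where observations and forecasts share an ENSO-like signal
rng = np.random.default_rng(3)
enso_obs = rng.normal(0, 1, 300)
enso_fc = enso_obs + rng.normal(0, 0.3, 300)    # imperfectly predicted driver
obs = 1.5 * enso_obs + rng.normal(0, 1, 300)
fc = 1.5 * enso_fc + rng.normal(0, 1, 300)

r_full = corr(obs, fc)
r_resid = corr(remove_driver_signal(obs, enso_obs),
               remove_driver_signal(fc, enso_fc))
print(r_full > r_resid)  # association drops once the shared signal is removed
```

The drop from `r_full` to `r_resid` quantifies how much of the forecast association rides on that driver's teleconnection, which is the comparison the colored bars in Fig. 11 make region by region.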
Forecast quality in MAM/JJA is more affected by subtracting the MJO-related rainfall variability than by the other drivers (Fig. 11). This is expected since ENSO and IOD are usually weak or inactive during those seasons. When the impacts of removing the MJO-related rainfall variability are assessed individually, the lowest correlations are found for all weeks after subtracting the RMM2 signal over EEA in MAM and WAM in JJA (not shown). In contrast, RMM1-related rainfall variability has a more significant association with forecast quality in the first two weeks over EWA in MAM (not shown). Although a different period was analyzed in section 3a (1999-2010), the correlations for ECMWF in Fig. 3 could be linked to the sources of subseasonal predictability examined here, with large associations in regions where those drivers have strong linear relationships with precipitation (Fig. 9).
To further explore the drivers' signals on forecast quality, Fig. 12 displays the regional average of the correlation between observations and forecasts after adding the corresponding observed regression patterns to the hindcasts, that is, replacing the modeled linear response to each driver with the observed response. The general picture is a clear improvement in forecast quality when all observed driver-related regression patterns are added to the hindcasts compared to the ''NO ADDITION'' case (i.e., using uncalibrated forecasts), especially in weeks 3-4. These enhanced associations mostly respond to the MJO-related rainfall variability, in particular owing to RMM2 signals (not shown), but there is also an improvement in association in response to ENSO and IOD over EEA in SON. This may indicate that better subseasonal predictions of specific MJO phases and their teleconnections could help improve the quality of weekly rainfall forecasts over most of the regions analyzed here.

FIG. 11. Regional average of the correlation between the ECMWF hindcast ensemble mean and observed precipitation anomalies over African regions (Fig. 1) during weeks 1-4 for initializations in DJF, MAM, JJA, and SON over the 1997-2014 period. Correlations were obtained after removing particular observed and forecasted regression patterns (colored bars), calculated between the corresponding precipitation anomalies and drivers' indices, from observations and hindcasts, respectively. A multiple linear regression approach was employed to assess the ENSO and IOD signals on rainfall simultaneously (see text for details). ''ALL'' denotes that the correlation was computed after sequentially subtracting all drivers' regression patterns. ''NO REMOVAL'' indicates that the correlation was obtained without removing any regression pattern. Hatches over the bars denote correlation coefficients that are not statistically significant at the 95% level.
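Replacing the modeled linear response to a driver with the observed response, as done for Fig. 12, reduces at one grid point to removing the forecast's fitted response and adding back the observed slope. A minimal synthetic sketch (illustrative amplitudes only, not the study's data) of why this improves the association:

```python
import numpy as np

def norm_slope(anom, index):
    """Least-squares slope of anomalies onto a normalized driver index,
    returned together with the normalized index."""
    x = (index - index.mean()) / index.std()
    return np.sum(x * (anom - anom.mean())) / np.sum(x * x), x

# Synthetic case: the model sees the right driver (here MJO-like)
# but badly underestimates its rainfall response.
rng = np.random.default_rng(4)
mjo = rng.normal(0, 1, 300)
obs = 2.0 * mjo + rng.normal(0, 1, 300)  # observed response: ~2 mm/week per std
fc = 0.5 * mjo + rng.normal(0, 1, 300)   # modeled response far too weak

slope_obs, x = norm_slope(obs, mjo)
slope_fc, _ = norm_slope(fc, mjo)
fc_cal = fc - slope_fc * x + slope_obs * x  # swap modeled response for observed

r_raw = np.corrcoef(obs, fc)[0, 1]
r_cal = np.corrcoef(obs, fc_cal)[0, 1]
print(r_cal > r_raw)  # True: correcting the driver response raises correlation
```

In practice the observed slope would be estimated from training data independent of the verification period; using the same sample for both, as here, is only for compactness.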

Summary and conclusions
This study has conducted an evaluation of the quality of subseasonal precipitation forecasts over Africa and examined its relationships with particular climate drivers. A comprehensive assessment of forecasts depends on how well models represent the attributes of forecast quality defined in Murphy (1993). We initially investigated weekly accumulated African precipitation forecast quality using hindcasts provided by three S2S models (ECMWF, UKMO, and NCEP) and precipitation from the GPCP dataset. Start dates within DJF, MAM, JJA, and SON were selected to assess forecasts from one to four weeks ahead during 1999-2010. Deterministic and probabilistic forecasts were evaluated employing a variety of metrics to provide a more detailed assessment. Then, weekly precipitation forecast quality was linked to key drivers (ENSO, IOD, and the MJO) by exploring the ECMWF model's ability to represent the drivers' signals on African precipitation and their contribution to forecast quality during 1997-2014.
The deterministic evaluation indicated significant correlations greater than 0.4 between hindcasts and observations for all models in weeks 1-2 over East Africa in DJF/MAM/SON and near the GoG in JJA. This corroborates the best MSSS findings, in which skill in weeks 1 and 2 was improved by up to 70% and 50% relative to the reference forecast, respectively. Further investigation of this correspondence was provided by decomposing the MSSS, revealing unskillful predictions linked to low forecast association and/or large underestimation of the predicted variance. Analysis of bias indicated a large overestimation (underestimation) in wet regions during particular rainy seasons for ECMWF and UKMO (NCEP), though over the WAM in JJA the models showed a similar bias pattern with a meridional tripole structure.
The evaluation of probabilistic forecasts showed large deficiencies in the near-normal category. Low forecast quality in this category has been related to the fact that such forecasts deviate very little from the tercile-based climatological probability (Kharin and Zwiers 2003b). The consequences of issuing poor-quality forecasts in the near-normal category can be very harmful for forecasters and users, leading, for example, to increased uncertainty in any tercile-based forecast information and reduced effectiveness of such information in decision-making. Thus, some operational forecasting centers assign the climatological probability to the near-normal category and issue outlooks for the most likely outer tercile category (Peng et al. 2012). Erroneous forecasts in the near-normal category indicate the need to review the scientific knowledge and develop improved methods of estimating probabilities (Kharin and Zwiers 2003b).
On the other hand, more skillful forecasts with roughly similar characteristics were identified in the outer tercile categories. These forecasts showed better discrimination over EEA than other regions, particularly in weeks 1-2. The AUC quantitatively summarizes the models' performance in discriminating extreme events. For example, ECMWF correctly predicted around 70% (65%) of above-normal forecasts when above-normal rainfall occurred in EEA during the first (second) week of forecasts. Nevertheless, this agreement reduced to less than 60% of forecasts in subsequent weeks, indicating forecasts with limited value to forecasters and decision-makers. Despite better reliability, resolution, and sharpness in weeks 1-2, with slightly enhanced skill for ECMWF over EEA/SA in week 1, overconfidence was verified in all weeks, with probabilities closer to the climatological distribution at longer lead times. Since models' probabilistic skill can be associated with other attributes, such as reliability and resolution, it is suggested that overconfidence increased forecasting errors, inducing more unskillful forecasts, especially beyond a two-week lead, as verified in the RPSSD assessments.

FIG. 12. As in Fig. 11, but for correlations obtained after adding particular observed regression patterns (colored bars) to the hindcasts analyzed in Fig. 11. ''ALL'' denotes that the correlation was computed after sequentially adding all drivers' regression patterns. Gray bars are equivalent to those in Fig. 11, with no removal or addition of any regression pattern to observations and hindcasts.
One aspect of forecast verification we have not addressed is the relation between the metrics analyzed and their practical implications for forecasting routines. For deterministic forecasts, forecasters would judge how skillful forecasts are by relating the MSSS to the correlation and the ratio of the forecasted to observed variances. Forecast quality would be determined by assessing the overall balance between those metrics, with high correlation and small variance errors indicating more skillful forecasts. For probabilistic forecasts, skillful outcomes could be identified by relating RPSSD to the overall balance between reliability and resolution. Forecasters would identify more skillful forecasts when the model's accuracy is high owing to more reliable forecasts and improved resolution. The AUC provides similar qualitative information to the resolution assessment (Toth et al. 2003).
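The link between the MSSS, the correlation, and the variance ratio invoked above follows the standard decomposition of the MSE skill score against a climatological reference. A small numerical check on synthetic data (illustrative only) confirms the identity term by term:

```python
import numpy as np

def msss_decomposition(fc, obs):
    """Standard decomposition of the MSE skill score versus climatology:
    MSSS = 2*r*(sf/so) - (sf/so)**2 - ((fbar - obar)/so)**2,
    so skill rises with correlation and falls with variance and mean errors."""
    r = np.corrcoef(fc, obs)[0, 1]
    ratio = fc.std() / obs.std()      # population (ddof=0) stds throughout
    bias = (fc.mean() - obs.mean()) / obs.std()
    msss = 2 * r * ratio - ratio**2 - bias**2
    return msss, r, ratio

rng = np.random.default_rng(5)
obs = rng.normal(0, 1, 400)
fc = 0.6 * obs + rng.normal(0, 0.4, 400)   # correlated but damped forecast
msss, r, ratio = msss_decomposition(fc, obs)

# Direct check against 1 - MSE/MSE_climatology (reference = observed mean)
direct = 1 - np.mean((fc - obs) ** 2) / np.mean((obs - obs.mean()) ** 2)
print(abs(msss - direct) < 1e-10)  # True: the decomposition is exact
```

This makes the trade-off explicit: a forecast with perfect correlation but a damped variance ratio, like the damped synthetic forecast here, still loses skill through the squared ratio term.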
When assessing the ability of the ECMWF model to represent particular climate drivers' signals on regional African rainfall, larger errors were found in capturing rainfall variations linearly related to the MJO RMM2 index over most regions compared to the other indices (RMM1, DMI, Niño-3.4). This suggests that the model does not properly reproduce the local impacts of the MJO, in particular those phases associated with RMM2 (i.e., 2 and 3; 6 and 7). Shortcomings in simulating driver-related rainfall variability could affect subseasonal predictions of important weather systems influencing African rainfall, such as the ITCZ and tropical cyclones.
To analyze weekly forecast quality linked to the strength of the drivers' teleconnections, regional correlations between observations and hindcasts were calculated after removing the corresponding driver-related rainfall regression patterns from observations and forecasts. When all drivers' signals were removed sequentially, results showed a significant reduction in association compared to when no subtraction was considered. The removal of regression patterns individually indicated that the MJO contribution to forecast quality was more dominant during seasons when ENSO and IOD are usually inactive, such as MAM/JJA. Although ENSO is expected to be correlated with EEA rainfall during the short-rains season (e.g., Hoell et al. 2014), enhanced associations were particularly linked to the IOD. A multiple linear regression analysis revealed that a large portion of the ENSO signal on EEA rainfall during SON can be attributed to the IOD.
It is worth noting that even though forecast quality was found to be closely related to ENSO and IOD during DJF and SON, respectively, the effect of calibrating forecasts by adding observed regression patterns to the hindcasts revealed improved forecast associations especially linked to the MJO. Despite these significant associations, the drivers analyzed could not account completely for the overall forecast quality. The fact that significant correlations remained after removing the ENSO and IOD effects indicates that forecast quality does not depend solely on these interannual modes of variability. Furthermore, while a significant contributor to forecast quality, the MJO is not the only important source of subseasonal predictability. This suggests a need to assess other drivers, including, but not limited to, SST variability over the GoG and soil moisture initialization. Nevertheless, our results support the view that the forecast quality of weekly rainfall over Africa is regime dependent, i.e., related to the major tropical sources of S2S predictability. Moreover, it is clear that improving the representation of these drivers, and their regional impacts, within the ECMWF model has the potential to deliver better subseasonal predictions for Africa.
This paper investigated single models when developing a weekly precipitation forecast verification framework for Africa. Combining forecasts from a multimodel perspective may help to improve the resolution and discrimination of predictions (e.g., Vigaud et al. 2018). Different calibration methods, such as model output statistics (e.g., Doss-Gollin et al. 2018), should be explored to identify the best practice for delivering more reliable forecasts to operational centers and applications communities. Although these techniques have not been adopted here, this comprehensive verification guide for forecasting weekly rainfall across Africa provides a valuable tool for forecasters and decision-makers to better understand the African regions and seasons with useful subseasonal skill. Furthermore, the results linking known S2S drivers with forecast quality have great potential to assist forecasters in better interpreting regime-dependent skill, which, if successfully communicated, can increase confidence in its appropriate use in decision-making across a range of sectors and societal applications.

FIG. 2. Mean error between hindcast and observed precipitation totals.

FIG. 3. As in Fig. 2, but for the Pearson correlation coefficient (R) between the hindcast ensemble mean and observed precipitation anomalies. Stipples indicate correlations statistically significant at the 95% level.
FIG. 5. Regional average of the (a) correlation (R) between the hindcast ensemble mean and observed precipitation anomalies and (b) ratio of the standard deviations of the hindcast ensemble mean and observations (SF/SO) over African regions (Fig. 1) for the ECMWF (green line), UKMO (red line), and NCEP (black line) models in weeks 1-4 for initializations during DJF, MAM, JJA, and SON over the 1999-2010 period. Circle markers in (a) denote correlation coefficients statistically significant at the 95% level.

FIG. 6. Discrete ranked probability skill score (RPSSD) between hindcast probabilities and binary observations obtained from precipitation totals in the tercile categories for the ECMWF, UKMO, and NCEP models in weeks 1-4 for initializations during (a) DJF, (b) MAM, (c) JJA, and (d) SON over the 1999-2010 period. Gray shading as described in Fig. 2. A climatological probability of 1/3 was used as the reference forecast.

TABLE 1. The main features of the three S2S operational models and their hindcasts.

TABLE 2. Correlations between the observed drivers' indices (Niño-3.4 for ENSO, DMI for IOD, RMM1, and RMM2) in week 1 for start dates in DJF, MAM, JJA, and SON over the 1997-2014 period. Correlations are roughly similar to those found in weeks 2-4. Correlation coefficients statistically significant at the 95% level, determined from a two-sided Student's t test, are shown in bold. The effective sample size was estimated as in section 2c.