Subseasonal probabilistic prediction of tropical cyclone (TC) genesis is investigated here using models from the Seasonal to Subseasonal (S2S) Prediction dataset. Forecasts are produced for basin-wide TC occurrence at weekly temporal resolution. Forecast skill is measured using the Brier skill score relative to a seasonal climatology that varies monthly through the TC season. Skill depends on models’ characteristics, lead time, and ensemble prediction design. Most models show skill for week 1 (days 1–7), the period when initialization is important. Among the six S2S models examined here, the European Centre for Medium-Range Weather Forecasts (ECMWF) model has the best performance, with skill in the Atlantic, western North Pacific, eastern North Pacific, and South Pacific at week 2. Similarly, the Australian Bureau of Meteorology (BoM) model is skillful in the western North Pacific, South Pacific, and across northern Australia at week 2. The Madden–Julian oscillation (MJO) modulates observed TC genesis, and there is a relationship, across models and lead times, between models’ skill scores and their ability to accurately represent the MJO and the MJO–TC relation. Additionally, a model’s TC climatology influences its performance in subseasonal prediction. The dependence of the skill score on the simulated climatology, MJO, and MJO–TC relationship, however, varies from one basin to another. Skill scores increase with the ensemble size, as found in previous weather and seasonal prediction studies.
The Madden–Julian oscillation (MJO; Madden and Julian 1972) modulates tropical cyclone (TC) activity globally. The probability of TC genesis is typically greater during or after a strong convective MJO phase than at other times (Camargo et al. 2009; Klotzbach 2014; Klotzbach and Oliver 2015a). In the Atlantic (Mo 2000; Maloney and Hartmann 2000b; Klotzbach 2010; Klotzbach and Oliver 2015b), the enhanced storm genesis occurs when a strong MJO is active in the Indian Ocean. In contrast, in the eastern North Pacific, enhanced TC genesis occurs when a strong MJO is active in the central and eastern North Pacific (Molinari et al. 1997; Maloney and Hartmann 2000a, 2001; Aiyyer and Molinari 2008). Similarly, in the western North Pacific (Nakazawa 1988; Liebmann et al. 1994; Sobel and Maloney 2000; Kim et al. 2008; Li and Zhou 2013), North (Nakazawa 1988; Liebmann et al. 1994; Kikuchi and Wang 2010; Krishnamohan et al. 2012) and South (Bessafi and Wheeler 2006; Ho et al. 2006) Indian Ocean, and South Pacific (Hall et al. 2001), the number of storms increases when the MJO is active in these basins. Additionally, typhoon tracks shift eastward when the convective MJO is active in the Indian Ocean and shift westward in the western Pacific (Kim et al. 2008). Rapidly intensifying storms are more frequent in the Atlantic when the MJO is active in the Indian Ocean (Klotzbach 2012). A strong active MJO increases local values of an empirical TC genesis index (Camargo et al. 2009) through systematic enhanced low-level absolute vorticity and increased midlevel relative humidity.
In these observational studies, it is often stated that accurate predictions of the MJO and knowledge of the MJO–TC relationship offer the potential for forecasts of the probability of TC genesis with a few weeks lead time. Regional statistical models for subseasonal TC prediction have in fact been developed (Leroy and Wheeler 2008; Slade and Maloney 2013) using MJO indices, as well as other environmental parameters. When an MJO index is added as one of the predictors, there is a significant, albeit small, improvement of skill at leads up to 3 weeks. For longer leads, the forecast skill is thought to be primarily from the climatological seasonal cycle and interannual variability. Reforecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF) also suggest that the accuracy of the MJO prediction has a significant impact on the predicted TC frequency (Vitart 2009). Compared to Southern Hemisphere TC statistical forecasts (Leroy and Wheeler 2008), the ECMWF model has greater skill in predicting TC occurrence at week 1, while the statistical model performs better for longer leads (Vitart et al. 2010). Furthermore, the ECMWF skill in predicting Atlantic hurricane activity is sensitive to the MJO phase and amplitude at the time of the model initialization (Belanger et al. 2010).
With the increasing demand for forecasts on the time scale between weather and seasonal–interannual—the so-called subseasonal time scale—an international effort was initiated to improve and develop various aspects of dynamical subseasonal predictions, including subseasonal TC forecasts. A key goal of these efforts is improved understanding of the factors that affect forecast prediction skill. The multimodel Seasonal to Subseasonal Prediction (S2S; Vitart et al. 2017) dataset, containing extensive reforecasts with lead times up to 60 days, is ideal for this task. In this study, we focus on the subseasonal prediction of TC genesis in the S2S reforecasts. While the ability of global models to simulate the MJO–TC modulation (Vitart 2009; Satoh et al. 2012; Kim et al. 2014; Murakami et al. 2015; Xiang et al. 2015) and the prediction skill of TC genesis prediction on subseasonal time scales have been analyzed (Belanger et al. 2010; Elsberry et al. 2011; Tsai et al. 2013; Elsberry et al. 2014; Nakano et al. 2015; Barnston et al. 2015; Yamaguchi et al. 2015; Li et al. 2016; Camp et al. 2018) for various models, this is the first comprehensive multimodel, multiyear analysis of reforecasts of TC genesis prediction on subseasonal time scales.
Here, we use the S2S data to construct probabilistic and deterministic reforecasts of basin-wide TC occurrence with weekly temporal resolution. The prediction skill is evaluated using the mean square error skill score and the Heidke skill score for deterministic forecasts, as well as the Brier skill score for probabilistic forecasts. Reforecasts, observations, skill scores, and other analysis methods are described in section 2. We then discuss TC climatology in the reforecasts in section 3 to define the tropical storm thresholds and seasonality for prediction skill evaluation. The simulated and observed MJO modulation of TC genesis is examined in section 4. Then, we analyze the prediction skill as well as the potential predictability in section 5. Connections between the skill scores and the model characteristics, the initialization, and the ensemble prediction system design are examined in section 6. Results are then summarized in section 7.
2. Data and methods
a. S2S reforecasts
Table 1 shows some basic characteristics of the S2S reforecasts used here. They are obtained from coupled, global general circulation models run by six operational centers: the Australian Bureau of Meteorology (BoM), the China Meteorological Administration (CMA), the ECMWF, the Japan Meteorological Agency (JMA), the Météo-France/Centre National de Recherches Météorologiques (MetFr), and the National Centers for Environmental Prediction (NCEP). The first ensemble member is the unperturbed control run. Note that because the designs of the ensemble prediction systems (specifically, the frequency of forecasts and ensemble size) differ among these agencies, the reforecasts are heterogeneous. We treat such differences in system design as additional factors contributing to prediction skill. Another heterogeneous feature is that the reforecast periods differ. While this might affect the comparison, we do not think it is likely to qualitatively change the relative skill of the forecast systems. Further details of the S2S dataset are described by Vitart et al. (2017). All the S2S reforecasts are archived on a 1.5° × 1.5° grid at daily resolution.
b. TCs in the S2S models and observations
To track TCs in the S2S models, we employ the tracker from Vitart and Stockdale (2001). The tracker defines a storm center at a local minimum sea level pressure where 1) a local vorticity maximum (>3.5 × 10⁻⁵ s⁻¹) at 850 hPa is nearby, 2) a local maximum in the vertically averaged temperature (warm core, >0.5°C) in between 250 and 500 hPa is within a distance (in any direction) equivalent to 2° latitude, 3) the two locations detected from criteria 1 and 2 above are within a distance equivalent to 8° latitude, and 4) a local maximum thickness between 1000 and 200 hPa can be identified within a distance equivalent to 2° latitude. Additionally, a detected storm must last at least 2 days to be included in our analysis.
In general, the criteria used in a tracker should vary with model resolution (Walsh et al. 2007; Camargo 2013). TC detection, however, is very sensitive to the input thresholds (Horn et al. 2014; Zarzycki and Ullrich 2017); changing criteria without a thorough investigation could potentially introduce artifacts into the results. Furthermore, all the S2S data are archived on a common grid. Therefore, in this study, we use the same criteria (as described above) for all models. A potential impact of interpolating the atmospheric fields from high-resolution model output to a low-resolution common grid is that it might reduce the strength of the warm core, the vorticity, and the pressure minimum. The vorticity might be most strongly affected. Because the criteria used here were set for a low-resolution model, the weakening of these fields is not expected to affect the number of detected TCs. The S2S TC tracks contain daily values of maximum sustained winds and storm locations.
Observations of tropical cyclone tracks are derived from the HURDAT2 dataset, produced by the National Hurricane Center (NHC; Landsea and Franklin 2013) and from the Joint Typhoon Warning Center (JTWC; Chu et al. 2002). Both best-track datasets include 1-min maximum sustained wind, minimum sea level pressure (not used in this study), and storm location every 6 h.
Following the conventional definitions, the TC basins are the Atlantic (ATL), northern Indian Ocean (NI), western North Pacific (WNP), eastern North Pacific (ENP), southern Indian Ocean (SIN, 0°–90°E), northern Australia (AUS, 90°–160°E), and southern Pacific (SPC, east of 160°E).
c. MJO definition
d. Skill scores
A skill score is an index that measures the model prediction skill relative to a reference value. Three skill scores are used here: mean-square error (MSE) skill score (MSESS), Heidke skill score (HSS), and Brier skill score (BSS). MSESS and HSS are for evaluating deterministic forecasts while BSS is for probabilistic predictions.
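MSESS is applied to the predicted storm numbers and is defined as
$$\mathrm{MSESS} = 1 - \frac{\mathrm{MSE}}{\mathrm{MSE}_{\mathrm{ref}}}, \tag{1}$$
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(f_i - o_i\right)^2, \tag{2}$$
where N is the total number of forecasts, $f_i$ is the predicted genesis number for the ith forecast, and $o_i$ is the ith observation. MSEref is the MSE of a reference based on observed climatology. MSESS larger than 0 means the model has higher skill than the climatological reference.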
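As a minimal sketch of the MSESS calculation, assuming paired sequences of predicted and observed weekly storm counts, the score can be computed as follows. The function names and the sample values are illustrative only and are not taken from the processing code used in this study.

```python
def mse(forecasts, observations):
    """Mean-square error between paired forecasts and observations."""
    if len(forecasts) != len(observations) or not forecasts:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / len(forecasts)


def msess(forecasts, observations, reference):
    """MSESS = 1 - MSE / MSE_ref; positive values beat the reference."""
    mse_ref = mse(reference, observations)
    if mse_ref == 0:
        raise ValueError("reference forecast is perfect; MSESS is undefined")
    return 1.0 - mse(forecasts, observations) / mse_ref


# Hypothetical weekly storm counts (illustrative values, not from the paper).
observed = [0, 2, 1, 0, 3]
predicted = [0, 1, 1, 1, 2]
# Assumed reference: a constant climatological count equal to the observed mean.
climatology = [1.2] * len(observed)
score = msess(predicted, observed, climatology)
```

A forecast identical to the reference yields a score of 0, and a perfect forecast yields 1.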
HSS compares the proportion of correct categorical forecasts to that which would be expected by random forecasts that are statistically independent of the observations. We use two categories here: 0 for no genesis and 1 for one or more storms forming during the forecast period. The ratio of the number of correct forecasts to that of all forecasts, commonly called percent correct (PC), is defined as
$$\mathrm{PC} = \frac{a + d}{a + b + c + d}, \tag{3}$$
where a represents the frequency of observed geneses that are correctly forecast, b is the frequency of false alarms, c represents the observed geneses that are not forecast, and d are cases that were neither forecast nor occurred.
The marginal probability of a 1 forecast is $(a + b)/(a + b + c + d)$ and that for a 1 observation is $(a + c)/(a + b + c + d)$. Thus, the probability of a correct 1 forecast by chance is $(a + b)(a + c)/(a + b + c + d)^2$. Similarly, the probability of having a correct 0 forecast by chance is $(b + d)(c + d)/(a + b + c + d)^2$. Thus, the probability E of a correct forecast due to chance is
$$E = \frac{(a + b)(a + c) + (b + d)(c + d)}{(a + b + c + d)^2}. \tag{4}$$
HSS is therefore defined as
$$\mathrm{HSS} = \frac{\mathrm{PC} - E}{1 - E}. \tag{5}$$
The HSS is 1 if all forecasts are correct (i.e., when PC equals 1) and is 0 if the model has no skill (i.e., PC equals E).
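The Heidke skill score described above can be sketched directly from the four contingency-table counts. This is an illustrative implementation under the two-category definition given in the text; the function name and the sample counts are hypothetical.

```python
def heidke_skill_score(a, b, c, d):
    """HSS from 2x2 contingency counts.

    a: genesis forecast and observed (hits)
    b: genesis forecast but not observed (false alarms)
    c: genesis observed but not forecast (misses)
    d: genesis neither forecast nor observed (correct negatives)
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("empty contingency table")
    percent_correct = (a + d) / n
    # Probability of a correct forecast by chance, from the marginal totals.
    chance = ((a + b) * (a + c) + (b + d) * (c + d)) / n ** 2
    if chance == 1:
        raise ValueError("only one category occurs; HSS is undefined")
    return (percent_correct - chance) / (1 - chance)


# Hypothetical counts for 100 weekly forecasts (illustrative only).
example_hss = heidke_skill_score(a=30, b=10, c=20, d=40)
```

With these counts, PC is 0.7 and the chance term is 0.5, so the forecasts recover 40% of the possible improvement over chance.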
BSS is used to assess the skill of a probabilistic forecast of basin-wide TC occurrence relative to a climatological forecast:
$$\mathrm{BSS} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{\mathrm{ref}}}. \tag{6}$$
The Brier score (BS) is defined as
$$\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - o_i\right)^2, \tag{7}$$
where N is the total number of forecasts, $o_i$ is the ith observation, and $p_i$ is the predicted probability of TC occurrence for the ith forecast, defined as
$$p_i = \frac{1}{M}\sum_{j=1}^{M} f_{ij}. \tag{8}$$
In Eq. (8), M is the number of ensemble members and $f_{ij}$ is the genesis prediction from the jth ensemble member for the ith forecast. Both $f_{ij}$ and $o_i$ are 0 for no genesis and are 1 for one or more occurrences of storm genesis during the forecast period. Thus, BS is the mean-square probability forecast error. The BSref is similar to BS but for a reference forecast based on the observed climatology. In this study, two climatologies are used. One is the seasonally varying climatology at a monthly time resolution, while the other is a constant, annual mean climatology. When a model is skillful compared to the climatology, the BSS is positive.
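The probabilistic score can be sketched in the same way. The example below assumes that each forecast is stored as a list of binary member predictions (1 for one or more geneses in the week, 0 otherwise) and that the reference is a list of climatological probabilities; all names and values are illustrative assumptions.

```python
def ensemble_probability(member_flags):
    """Fraction of ensemble members that predict at least one genesis."""
    if not member_flags:
        raise ValueError("ensemble has no members")
    return sum(member_flags) / len(member_flags)


def brier_score(probabilities, outcomes):
    """Mean-square error of probability forecasts against 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not outcomes:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def brier_skill_score(probabilities, outcomes, reference_probabilities):
    """BSS = 1 - BS / BS_ref; positive values beat the reference."""
    bs_ref = brier_score(reference_probabilities, outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is perfect; BSS is undefined")
    return 1.0 - brier_score(probabilities, outcomes) / bs_ref


# Four hypothetical weekly forecasts from a four-member ensemble.
ensemble = [[1, 1, 0, 1], [0, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 0]]
outcomes = [1, 0, 1, 0]
probabilities = [ensemble_probability(members) for members in ensemble]
# Assumed climatological probability of 0.5 for every week (illustrative).
bss = brier_skill_score(probabilities, outcomes, [0.5] * len(outcomes))
```

Swapping the constant reference list for month-dependent probabilities gives the seasonally varying variant of the score.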
e. Candy plot analysis
To analyze the dependence of TC genesis on MJO phases, the probability density function (PDF) of storm genesis is calculated in each TC basin and binned by MJO phase. To identify favorable and unfavorable MJO phases, values of the TC number in each week are randomly shuffled in time throughout the entire period to obtain PDFs independent of the MJO. The favorable MJO phases are then defined when the unshuffled PDF is larger than the 90th percentile of the 4000 PDFs obtained from randomly swapping the data, and the unfavorable MJO phases are defined when the unshuffled PDF is less than the 10th percentile of the randomized PDFs. The PDFs are then organized by the longitude along the Y axis from ATL to ENP and by the MJO phases along the X axis, like a sheet of candies.
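The shuffling test described above can be sketched as follows, assuming one record per week consisting of an MJO phase (1–8) and a storm count. The nearest-rank percentile rule, the label names, and the synthetic data are assumptions made for illustration; the text does not specify these details.

```python
import random


def phase_pdf(phases, counts, n_phases=8):
    """Fraction of all storms that formed in each MJO phase (1..n_phases)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no storms in the sample")
    per_phase = [0.0] * n_phases
    for phase, count in zip(phases, counts):
        per_phase[phase - 1] += count
    return [value / total for value in per_phase]


def classify_phases(phases, counts, n_shuffles=4000, seed=0, n_phases=8):
    """Label each phase by comparing its PDF value with shuffled PDFs."""
    rng = random.Random(seed)
    observed = phase_pdf(phases, counts, n_phases)
    shuffled = list(counts)
    samples = [[] for _ in range(n_phases)]
    for _ in range(n_shuffles):
        # Shuffling the weekly counts in time breaks any link to the MJO.
        rng.shuffle(shuffled)
        for k, value in enumerate(phase_pdf(phases, shuffled, n_phases)):
            samples[k].append(value)
    # Nearest-rank indices for the 10th and 90th percentiles (an assumption).
    low_index = int(0.10 * (n_shuffles - 1))
    high_index = int(0.90 * (n_shuffles - 1))
    labels = []
    for k in range(n_phases):
        ranked = sorted(samples[k])
        if observed[k] > ranked[high_index]:
            labels.append("favorable")
        elif observed[k] < ranked[low_index]:
            labels.append("unfavorable")
        else:
            labels.append("neutral")
    return labels


# Synthetic example: 400 weeks in which storms form only during phase 2.
phases = [p for _ in range(50) for p in range(1, 9)]
counts = [1 if p == 2 else 0 for p in phases]
labels = classify_phases(phases, counts, n_shuffles=500, seed=1)
```

In this synthetic case phase 2 is labeled favorable and every other phase unfavorable, because the shuffled PDFs cluster near one-eighth per phase.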
The candy plot analysis is conducted for observations and the six S2S models. The comparison of the global pattern (consisting of PDFs from all TC basins) between each S2S model and the observed pattern is then quantified using $R^2$ (i.e., the fraction of the variance of the observed PDFs that is predicted by the model PDFs). In this analysis, we include only storms when the magnitude of the MJO index is larger than one standard deviation. The fractions of storms we used are 60% in the observations, and roughly 60%, 50%, 40%, 52%, 65%, and 52% in BoM, CMA, ECMWF, JMA, MetFr, and NCEP reforecasts, respectively.
3. Tropical cyclone climatology
a. Intensity and tropical storm threshold
Since their horizontal resolutions are inadequate to represent the TC inner-core structure, the global models used here are not able to simulate the highest observed TC intensities. Another factor that impacts the simulated intensities as represented in the S2S archive is that the model outputs are instantaneously archived on a 1.5° × 1.5° grid in the S2S database every 24 h. As a result, the cumulative distribution function (CDF) of TCs’ lifetime maximum intensity (LMI) shows that the median LMI for the observed storms is 50 kt (1 kt = 0.51 m s⁻¹) while it is in the range of 25–35 kt for the S2S models, except for BoM, which has a median LMI of 40 kt (Fig. 1). The BoM model is able to simulate stronger storms than other S2S models that have higher horizontal resolutions. This could be due to its physical parameterizations, dynamical cores, or both. Multiple studies with other global climate models have noted that both factors are important, so that the maximum TC intensities simulated by a model are not a simple function of that model’s horizontal resolution (Vitart et al. 2001; Murakami et al. 2012; Zhao et al. 2012; Reed et al. 2015; Duvel et al. 2017; Kim et al. 2018). Although there is a significant low bias in the S2S TC intensities, we can categorize storms using quantile analysis (Camargo and Barnston 2009). For example, the observed tropical storm (TS) wind speed threshold is 34 kt, which in the observed LMI distribution corresponds to the 18th percentile (gray line in Fig. 1). Thus, we define the tropical storm threshold as the 18th percentile of the LMI CDF in each model. These are 34, 23, 24, 24, 27, and 26 kt, respectively, for the BoM, CMA, ECMWF, JMA, MetFr, and NCEP models. In this study, we only consider TCs that reach the tropical storm threshold thus defined in observations and the model reforecasts.
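The quantile mapping of the tropical storm threshold can be illustrated with a short sketch: find the percentile rank of 34 kt in the observed LMI sample, then read off the model LMI at that same rank. The nearest-rank rule and the synthetic LMI samples below are assumptions for illustration and do not reproduce the values in Fig. 1.

```python
def percentile_rank(values, threshold):
    """Percentage of values that fall strictly below `threshold`."""
    if not values:
        raise ValueError("empty sample")
    return 100 * sum(v < threshold for v in values) / len(values)


def value_at_percentile(values, pct):
    """Nearest-rank value at percentile `pct` (0-100) of the sample."""
    if not values:
        raise ValueError("empty sample")
    ranked = sorted(values)
    index = int(pct / 100 * (len(ranked) - 1) + 0.5)
    return ranked[index]


# Synthetic LMI samples in knots (hypothetical; the model sample is biased low).
observed_lmi = list(range(20, 120))
model_lmi = list(range(10, 110))

# Percentile rank of the observed 34-kt tropical storm threshold.
rank = percentile_rank(observed_lmi, 34)
# Model wind speed at the same rank, used as the model-specific threshold.
model_threshold = value_at_percentile(model_lmi, rank)
```

Because the mapping works on ranks rather than absolute wind speeds, a uniform low bias in model intensity shifts the threshold by the same amount.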
b. Genesis and TC season definition
Genesis time is defined here as the time of the first point recorded on each forecast track. TCs that exist prior to the model initialization time have already undergone genesis. Nevertheless, for purposes of model evaluation, we refer to the first recording time (usually day 1, or t = 24 h) of the preexisting storms as their genesis time. The reason for including preexisting storms in our analysis will be discussed in section 5. With the exception of week 1—when there is a higher TC occurrence because of the preexisting storms—the forecast genesis climatology does not change much with lead time. Therefore, while we only show here the genesis climatology for week 2 forecasts (Fig. 2), our results are also valid at longer lead times.
Globally, the ECMWF model generates 20% more TCs than observed, a statistically significant difference, while CMA, MetFr, and NCEP have genesis rates 140%, 65%, and 80% higher than observed, respectively. In contrast, the BoM and JMA models generate 35% and 45% fewer TCs than are present in the observed climatology. Low-resolution models often have unrealistically high TC genesis rates in the subtropics, when storms are detected and tracked using algorithms with model-dependent thresholds (Camargo 2013). However, the difference maps between simulated and observed genesis counts (Figs. 2b–g) suggest that this is not the case for the S2S models, since the errors in the subtropics are much smaller than those in the tropical belt (30°S–30°N).
Regionally, the strongest observed local maxima of the TC genesis rate occur in the ENP and WNP (Fig. 2a). In the three Southern Hemisphere basins (SIN, AUS, and SPC), the observed storms form in an elongated area around 15°S. In general, the S2S models are able to capture these local maxima (not shown), and the ECMWF model has the smallest regional biases, followed by the BoM model. The JMA model underestimates the rate of TC genesis everywhere, while CMA, MetFr, and NCEP overestimate it. In the individual basins, the models that have the smallest mean bias (in number of storms per year) are the MetFr model (−0.01) for the ATL, the BoM (0.22) for the NI, the ECMWF (0.11) for the WNP, the MetFr (−0.24) for the ENP, the JMA (−0.3) for the SIN, the BoM (−0.6) for the AUS, and both the ECMWF (0.33) and BoM (0.36) for the SPC. The CMA model has the largest positive bias in the Pacific Ocean (Fig. 2c), with more than 1 storm per year per grid (4° × 4°) between 4° and 12°N, compared to the observed climatological mean of less than 0.2 storms per year per grid (Fig. 2a).
Despite these biases in the total TC counts and genesis spatial distribution, the S2S models represent the annual cycle of TC genesis reasonably well (Fig. 3). We define regionally varying TC seasons that consist of the months with genesis rates higher than 5% of the annual genesis rate in each region. Using this definition, the TC seasons in some models are slightly different than in the observations. For example, the observed hurricane season in the ENP is defined as May–October, but it is from July to December in the BoM model. The simulated TC seasons in the ECMWF model best match those from the observations.
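The season definition above amounts to a simple threshold on each month's share of the annual genesis total. A minimal sketch, assuming a 12-element list of climatological monthly genesis counts for one basin (the values below are hypothetical):

```python
def tc_season_months(monthly_genesis, min_share=0.05):
    """Months (1-12) whose genesis count exceeds `min_share` of the annual total."""
    if len(monthly_genesis) != 12:
        raise ValueError("expected one value per calendar month")
    total = sum(monthly_genesis)
    if total == 0:
        raise ValueError("no genesis events in the climatology")
    return [
        month
        for month, count in enumerate(monthly_genesis, start=1)
        if count / total > min_share
    ]


# Hypothetical mean genesis counts per month, January to December.
climatology = [0, 0, 0, 0, 0.5, 2, 4, 6, 7, 4, 1.5, 0]
season = tc_season_months(climatology)
```

For this synthetic climatology the function returns June through November; May falls below the 5% share and is excluded.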
Although there are differences between simulated and observed TC seasons, our goal is to have skillful TC predictions during the observed TC seasons. Therefore, the observed TC seasons are used in our prediction skill evaluation below. Using this definition, the TC seasons are June–November in the ATL, May–October in the ENP, May–December in the WNP, April–June and September–December in the NI, October–April in the SIN, and November–April in the AUS and SPC.
4. MJO–TC modulation
Next, we examine whether the S2S models are capable of simulating the observed MJO–TC modulation. Specifically, we refer to the spatial distribution of genesis as a function of the MJO phase, such as the anomalous fields in Figs. 4 and 5, and the basin-wide PDFs in Fig. 6. Our focus is on the week 2 reforecasts, when all S2S models show skill in predicting the MJO (Vitart 2017).
Global climate models are able to simulate the observed dependence of TC genesis on the MJO (Nakazawa 1988; Liebmann et al. 1994; Mo 2000; Maloney and Hartmann 2000a,b; Kim et al. 2008; Li and Zhou 2013; Krishnamohan et al. 2012; Bessafi and Wheeler 2006). High horizontal resolution is often cited as a necessary condition to capture the MJO–TC modulation (e.g., Zhang 2013; Camargo and Wing 2016). However, the necessary resolution is not precisely defined. For example, the high horizontal resolution Zhang (2013) and Camargo and Wing (2016) referred to varies from 50 km for the ECMWF (Vitart 2009) and the Geophysical Fluid Dynamics Laboratory (GFDL) High Resolution Atmospheric Model (HiRAM; Jiang et al. 2012) to 14 km for the Japanese Nonhydrostatic Icosahedral Atmospheric Model (NICAM; Oouchi et al. 2009). Furthermore, good representations of convection and microphysics are key elements for simulating the MJO well in global models (Kim et al. 2012, 2014; Holloway et al. 2013; Kang et al. 2016). Changing the horizontal resolution (with the same physical packages for the same model) does not necessarily improve the MJO simulation (Jia et al. 2008; Holloway et al. 2013; Hung et al. 2013). Horizontal resolution might be important primarily through its influence on the models’ ability to simulate TCs and their interaction with the ambient environment (Kim et al. 2018).
The S2S models have horizontal grid spacings from 0.25° to 2°. All of them are able to capture, at least qualitatively, the observed eastward propagation of TC genesis anomalies in the Southern Hemisphere with increasing MJO phase (Fig. 4). While the observed eastward-propagating signal is weaker in the Northern Hemisphere (Fig. 5), we can still see positive TC genesis anomalies propagating from the NI to the ATL (i.e., from MJO phases 2–3 to 8–1). The Northern Hemisphere eastward propagation is stronger in most of the S2S models than in the observations. The observed positive anomalies in the ENP for MJO phases 6 and 7 are too strong and expand toward the WNP for the ECMWF and MetFr models. Similarly, the WNP anomalies for MJO phases 8–1 are overpredicted and expand toward the ENP for the BoM, CMA, ECMWF, and NCEP models. The JMA model is not able to capture the MJO eastward propagation in the Northern Hemisphere.
To further identify the MJO phases that are favorable for TC genesis in individual basins, we perform a candy plot analysis (section 2), which shows the storm genesis rate binned by MJO phase in each TC basin. In the observations (Fig. 6a), 31% of observed ATL hurricanes form when the MJO convection center is in the Indian Ocean (MJO phases 2–3). Similarly, in the SIN, a higher rate of genesis (50% of all storms) occurs during MJO phases 2–3. The favorable MJO phases for TC occurrence are 3–5 for the NI, 3–4 for the AUS, 5–6 for the WNP, 7–8 for the SPC, and 7 and 8–1 for the ENP. The favorable MJO phases are listed by basin with increasing longitudes so that they line up from the bottom-left corner to the top-right corner in Fig. 6a.
The favorable MJO phases in the S2S models also show a bottom-left to top-right trend, with the exception of the JMA model. The ECMWF model (with horizontal resolutions of 0.25°–0.5°) best simulates the observed MJO–TC pattern (Fig. 6b), and explains 64% ($R^2 = 0.64$) of the observed variance. The MetFr, NCEP, CMA, and BoM models (in order of model resolution from 0.7° to 1°) explain 42%, 50%, 41%, and 47% of the variance, respectively. The JMA model does not capture the upward trend of the MJO–TC pattern, because MJO phases 5 and 6 occur much more frequently in the model than in the observations, especially during the Northern Hemisphere TC seasons (Fig. 6e). As a result, despite having 0.5° horizontal grid spacing, the JMA model explains only 23% of the observed MJO–TC relationship. Our results suggest that while model resolution plays an important role in simulating TCs, it is probably not the most important factor for simulating the observed modulation of TC genesis by the MJO.
5. Genesis forecast skill
On the scale of weather forecasting (2–5 days), deterministic prediction of TC occurrence is often used. Beyond 5 days, probabilistic forecasts from ensemble systems are used because they provide information on the uncertainty of the forecasts (Elsberry et al. 2014). Furthermore, the prediction skill from the mean of an ensemble might be higher than that of a single “deterministic” run even if the latter has higher resolution. In this section, we will show the models’ skill in predicting TC genesis in both deterministic and probabilistic forecasts. Our focus is the probabilistic prediction skill, and we will also discuss its potential predictability.
a. Deterministic prediction: MSESS and HSS
The deterministic TC occurrence prediction is defined using the ensemble mean forecasts here. The deterministic prediction skill is quantified using MSESS and HSS (section 2). HSS measures the S2S models’ ability to forecast storm occurrence (regardless of number) while MSESS evaluates not only occurrence but also the number of the storms.
Compared with random predictions, the S2S models are more skillful at predicting storm occurrence within a basin at most leads (Fig. 7). For week 1 occurrence prediction, the ECMWF model has the highest values of the HSS in all basins except the ATL, where the NCEP model has a higher HSS. After week 1, the values of HSS in most of the models drop significantly, and it is hard to distinguish among them. Note that HSS compares the ratio of correct forecasts from the S2S models to that expected by chance, without taking into account the climatology. The values of HSS alone do not tell if the S2S models are more skillful than climatology.
Therefore, we create a no-skill reference forecast that shows the skill score from knowing only the seasonal climatology (dashed line in Fig. 7). The no-skill reference is calculated by verifying the S2S predictions against observations that are shuffled by year. For example, the S2S predictions from 2005 are evaluated using observations from a randomly selected year. Doing so, we keep the seasonality in the shuffled observations but remove the year-to-year dependence. With a few exceptions, the HSS values of the no-skill references are positive. This is because the S2S models simulate the observed genesis seasonality reasonably well, as we discussed earlier in section 3. In the cases when the value of the model HSS is close to the no-skill reference, the climatology contributes most of the prediction skill. In Fig. 7, the solid lines merge with the dashed lines for most of the models after week 2, except for the ECMWF model. For the BoM and CMA models in the Atlantic basin, even the week 1 prediction skill is largely due to knowing the seasonal climatology. In the ECMWF model, the HSS is larger in all basins than the corresponding no-skill reference, indicating that there are additional factors that contribute to the deterministic prediction skill.
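The year-shuffling procedure behind the no-skill reference can be sketched as below. The sketch assumes that forecasts and observations are stored per year as equally long weekly sequences aligned by calendar week, that the donor year is drawn with replacement, and that the result is averaged over repeated draws; these details, the function names, and the sample data are assumptions and are not specified in the text.

```python
import random


def percent_correct(forecasts, observations):
    """Fraction of categorical (0/1) forecasts that match the observations."""
    hits = sum(f == o for f, o in zip(forecasts, observations))
    return hits / len(forecasts)


def no_skill_reference(forecasts_by_year, observations_by_year, score,
                       n_trials=100, seed=0):
    """Mean score when each forecast year is verified against a random year."""
    rng = random.Random(seed)
    years = sorted(forecasts_by_year)
    scores = []
    for _ in range(n_trials):
        forecasts, observations = [], []
        for year in years:
            # Keep the calendar position but draw the observed year at random,
            # which retains seasonality and removes year-to-year dependence.
            donor = rng.choice(years)
            forecasts.extend(forecasts_by_year[year])
            observations.extend(observations_by_year[donor])
        scores.append(score(forecasts, observations))
    return sum(scores) / len(scores)


# Hypothetical data: the same seasonal cycle of weekly occurrence every year.
seasonal_cycle = [0, 0, 1, 1, 0, 0]
forecasts_by_year = {year: list(seasonal_cycle) for year in (2001, 2002, 2003)}
observations_by_year = {year: list(seasonal_cycle) for year in (2001, 2002, 2003)}
reference = no_skill_reference(forecasts_by_year, observations_by_year,
                               percent_correct)
```

In this synthetic case all of the skill comes from the seasonal cycle, so the shuffled reference scores as high as the unshuffled forecast would. Any categorical score, including HSS, can be passed in place of `percent_correct`.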
Analyses using MSESS suggest that the S2S models are not skillful at predicting TC frequency except for week 1 forecasts from the ECMWF model in the NI, AUS, WNP, and ENP and from the BoM model in the SPC and AUS (not shown).
b. Probabilistic prediction: BSS
Next, we investigate the performance of the S2S models in predicting the probability of weekly TC occurrence. Two different Brier skill scores are calculated here. The first (called BSS_c, where “c” stands for “constant”; dashed lines in Fig. 8) compares the Brier score of the forecast to that of a constant, the observed annual mean climatology. Positive values of BSS_c mean that the model is more skillful than a constant climatological prediction. We consider storms with geneses in all months, and with this score, forecasts receive credit for correctly matching the annual cycle in TC genesis frequency. This is consistent with standard practice in the verification of short-term weather forecasts, in which the total values of meteorological variables, as opposed to anomalies from a seasonal climatology, are verified against observations.
The second Brier skill score (called BSS without a subscript here; solid lines in Fig. 8) compares the Brier score of the forecasts to that of an observed monthly varying climatology. In this case, only storms that formed during the observed TC seasons are considered (as defined in section 3). This is more typical in seasonal predictions and provides a stricter measure for evaluating TC genesis prediction skill. S2S models achieve positive BSS values when they capture deviations from the observed seasonality. A positive BSS in this case means that the model is more skillful than the monthly varying climatological prediction.
The values of BSS_c are often positive, indicating that S2S models are more skillful than an annually constant forecast in most of the TC basins. The BSS_c values are also often noticeably greater than those of BSS because the reference forecast used for BSS_c is less skillful than that of BSS. Forecasts from the CMA and those for the NI are exceptions. In the CMA forecasts, the values of BSS_c are much closer to those of BSS in SIN, AUS, and WNP. This is because the CMA model’s climatology is poor, with too many storms forming during the observed off season (Fig. 3). The differences between BSS_c and BSS in the NI are much smaller than in other basins, and there are no positive BSS_c values after week 1. In other words, none of the S2S models has more skill in predicting the TC season than an annually constant forecast after week 1 in the NI. In contrast, JMA, ECMWF, NCEP, and MetFr all have positive BSS_c in the ATL up to week 5, although ECMWF is the only model that has a positive BSS after week 1.
The BSS values for all models drop significantly from week 1 to week 2 (Fig. 8), similar to results shown for HSS (Fig. 7). This large drop is connected to our genesis definition, which counts storms that already exist at the time of model initialization and therefore leads to a high association between forecasts and observations (and hence high BSS) at week 1. Without including preexisting TCs,¹ the week 1 BSS values (the triangle markers in Fig. 8) are close to, but mostly still slightly higher than, those at week 2. By keeping the preexisting storms, we acknowledge that initialization is one of the factors contributing to a dynamical model’s prediction skill. Most S2S models are skillful at week 1 in most basins, with the exception of the NI. Low or even negative week 1 BSS values in a basin are related to poor model initialization in those regions. From weeks 2 to 5, most models’ BSS values level off, with the forecast errors saturating at week 2. This is consistent with the fact that the genesis climatology does not vary significantly after week 2. In some cases, such as for the ECMWF model in the SPC and the NCEP model in the NI, the model error continues to grow and therefore the BSS values decrease with increasing lead time. The MetFr’s BSS values in the SIN and NI fluctuate, with relatively higher values during weeks 3 and 5 than in weeks 2 and 4. We speculate that these fluctuations are not meaningful but result from inadequate sample size.
Based on the BSS evaluation, the ECMWF model has skill up to week 5 in predicting TC occurrence in the ATL and WNP, up to week 2 in the SPC and ENP, but has no skill in the SIN, NI, and AUS after week 1. The BoM model has positive skill in the WNP up to week 5 and in the SPC up to week 2. The MetFr model is skillful up to week 2 in the WNP. The CMA, JMA, and NCEP models have no skill after week 1. Compared to the existing basin-wide statistical models (Leroy and Wheeler 2008; Slade and Maloney 2013),² the ECMWF, BoM, and MetFr models have comparable prediction skill. At week 1, the BSSs from these multiple logistic regression models are 0.13 and 0.17 in the ATL and ENP (Slade and Maloney 2013), 0.09 in the SIN, 0.06–0.08 near AUS, and 0.045 in the SPC (Leroy and Wheeler 2008). The highest S2S BSS values (from ECMWF) at week 1 are 0.7 in the ATL, WNP, and ENP; 0.35 in the AUS; 0.25 in the SPC; 0.25 in the SIN; and 0.06 in the NI. (Without considering the preexisting storms, they are 0.127, 0.36, and 0.27 in the ATL, WNP, and ENP, respectively; 0.126 in the AUS; 0.07 in the SIN and NI; and 0.01 in the SPC.) At week 2, the statistical models have BSS values of 0.11 and 0.16 in the ATL and ENP (Slade and Maloney 2013), 0.07 in the SIN, 0.05–0.07 near AUS, and 0.001 in the SPC (Leroy and Wheeler 2008). The highest S2S BSS values at the same lead time are 0.15 in the ATL, 0.19 in the WNP, 0.105 in the ENP (from ECMWF), 0.056 in the SPC, and 0.08 in the AUS (from BoM). None of the S2S models has positive skill in the SIN and NI at week 2. From week 3, the BSS values from the statistical models are overall better than those from the S2S models.
c. Potential predictability
The BSS values are low for weeks 2–5 in all global basins, even for those models that are skillful (with BSS above zero in Fig. 8). This raises the question of what the upper limits of subseasonal TC genesis prediction are in these S2S models. We estimate these limits by computing potential predictability (Buizza 1997).3 For each of the S2S models, reforecasts from one of the ensemble members are treated as a fake “observation,” against which predictions from the rest of the members are verified using the Brier skill score (the same score used for calculating the actual skill). This process is repeated for each ensemble member, and the resulting skill scores are then averaged. Replacing the observations with a model forecast renders the model “perfect” in that the representation of the atmosphere in the forecast and in the target being forecast are then identical, without any systematic biases. The sources of error that remain are the uncertainties in the initial conditions and the unpredictable noise within the model.
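The leave-one-member-out procedure described above can be sketched as follows. This is a minimal illustration, not the code used in this study: the function names, the 0/1 encoding of weekly basin-wide TC occurrence, and the use of the ensemble fraction as the forecast probability are assumptions made for the example, and the reference climatology is simply passed in by the caller.

```python
from statistics import mean

def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return mean((p - o) ** 2 for p, o in zip(probs, outcomes))

def brier_skill_score(probs, outcomes, clim_probs):
    """BSS = 1 - BS_forecast / BS_reference; positive values beat the reference."""
    return 1.0 - brier_score(probs, outcomes) / brier_score(clim_probs, outcomes)

def potential_bss(members, clim_probs):
    """Perfect-model skill: each member in turn plays the role of the observation.

    members[m][t] is 1 if member m has a TC in the basin in forecast t, else 0
    (an assumed encoding for this sketch).
    """
    scores = []
    for k, pseudo_obs in enumerate(members):
        others = [m for i, m in enumerate(members) if i != k]
        # Forecast probability = fraction of the remaining members with a TC.
        probs = [mean(column) for column in zip(*others)]
        scores.append(brier_skill_score(probs, pseudo_obs, clim_probs))
    # Average the skill over all choices of the pseudo-observation.
    return mean(scores)
```

Calling `potential_bss(members, clim_probs)` with the same reference probabilities used for the actual BSS gives a value that can be compared directly with the actual skill; the reference Brier score must be nonzero for the ratio to be defined.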
Commonly, but not always (Kumar et al. 2014), the potential skill is larger than the actual skill. With perfect observational data, sufficient ensemble spread, and unbiased models, a positive difference between the potential and actual skill levels can be interpreted as an indication of room for improvement in the models. In the imperfect situation, with possible errors in the observational data, insufficient ensemble spread, and models with systematic biases, a positive difference can instead be a consequence of these deficiencies. Observational errors lead to artificially low actual skill, while insufficient ensemble spread and a biased model can result in artificially high potential skill (Wheeler et al. 2017).
The potential skill score used here is called BSS_p, where the “p” stands for “potential.” As an example, Fig. 9 shows the BSS_p results (dashed lines) of the BoM, ECMWF, and NCEP models in the ATL and SIN. The S2S models perform relatively better in the ATL than in other basins while the SIN is one of the more challenging basins. Similar to BSS, the values of BSS_p (dashed lines) drop significantly after week 1 and level off from weeks 2 to 5. The BSS_p values are positive everywhere for all S2S models at all leads (not shown).
The differences between the BSS_p and the BSS (the gap between the dashed and the solid lines in Fig. 9) vary with lead time, basin, and model. They are usually smallest at week 1, indicating a positive contribution from the initialization. The differences are larger in the NCEP model than in the ECMWF and BoM models. In addition to having low actual skill, the NCEP model has only four members, which might result in insufficient ensemble spread and therefore artificially high BSS_p. In the three Southern Hemisphere TC basins (figures for SPC and AUS are not shown), the BoM model’s actual skill is close to the respective potential skill. The values of BSS are also close to those of BSS_p in the ECMWF model in the ATL. Across models, the positive differences between BSS_p and BSS are largest in the CMA and smallest in the ECMWF (not shown). Like NCEP, the CMA model has both low actual skill and only four ensemble members. While there is no easy way to definitively explain the gap between potential and actual skill levels, large differences suggest deficiencies in the model, the ensemble system, or the data.
6. Discussion of the source of probabilistic prediction predictability
The ability of global models to represent TC genesis depends on the model characteristics, including the dynamical core (Reed et al. 2015; Vitart et al. 2001; Kim et al. 2018), the physical parameterizations (Reed and Jablonowski 2011), and the model resolution (Kajikawa et al. 2016). These characteristics are responsible for how well the genesis climatology, the MJO, and the interaction between the MJO and TCs (or interactions between two weather systems in general) are represented in the model. A good TC forecast (at least on weather scales) strongly relies on the model initialization (i.e., the data assimilation scheme). Additionally, the skill of the ensemble prediction system is sensitive to the design of the forecast, such as the ensemble size and the range of the model spread—topics that have been broadly studied within the context of weather (Wilson et al. 1999; Richardson 2001), seasonal (Brankovic and Palmer 1997; Deque 1997; Kumar et al. 2001; Kumar and Chen 2015), and decadal (Sienz et al. 2016) predictions. In this section, we discuss the impact of the model characteristics, initialization, and the ensemble size on the subseasonal genesis probabilistic prediction skill.
a. BSS and climatology, MJO, and MJO–TC relationship
To examine how a model’s TC genesis prediction skill is influenced by its representation of the observed TC climatology, the MJO, and the MJO–TC relationship, we compute the correlation, across lead times and models, between the BSS values (from Fig. 8) and three verification indices: 1) the correlation of the basin-wide, monthly genesis frequency between simulations and observations, for the TC climatology; 2) the bivariate correlation of RMM indices, for rating the models’ performance on the MJO [same as those shown in Vitart (2017)]; and 3) the fraction of variance of the observed MJO–TC relationship explained by the models, from the candy plot analysis (Fig. 6). The correlations are calculated using data from weeks 1 to 5, as well as from weeks 2 to 5 only, with the latter excluding the influence of the initialization on these measures.
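As an illustration of this calculation, the sketch below pools (model, lead week) pairs and computes the Pearson correlation between BSS and one verification index, once for all weeks and once with week 1 excluded. The record layout, the function names, and every number in `records` are placeholders invented for the example; they are not values from this study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_from_week(records, first_week):
    """Correlate BSS with an index, pooling models and leads >= first_week.

    Each record is (model, week, bss, index); this layout is an assumption.
    """
    kept = [(bss, idx) for _, week, bss, idx in records if week >= first_week]
    bss_values, index_values = zip(*kept)
    return pearson_r(bss_values, index_values)

# Placeholder records for two hypothetical models (not real results).
records = [
    ("model_a", 1, 0.60, 0.90), ("model_a", 2, 0.10, 0.70), ("model_a", 3, 0.05, 0.65),
    ("model_b", 1, 0.40, 0.80), ("model_b", 2, -0.05, 0.50), ("model_b", 3, -0.10, 0.55),
]
r_all_weeks = correlation_from_week(records, first_week=1)
r_without_week1 = correlation_from_week(records, first_week=2)
```

Comparing the two coefficients shows how much of the pooled correlation comes from the shared drop in both quantities between week 1 and later leads.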
Figure 10 shows the scatterplots of BSS and these three verification indices in the ATL. The BSS values are positively correlated with the verification indices for the TC climatology, the MJO, and the MJO–TC relationship from weeks 1 to 5. The positive correlation can be partially attributed to the dependence of both quantities on the lead time; the week 1 BSS and the measure of week 1 genesis climatology (or of MJO and of the MJO–TC relationship) are both higher than those at week 2. Using data from weeks 2 to 5, the positive correlations remain but the correlation coefficients become smaller, especially the one between BSS and the index for the MJO simulation (Fig. 10b). Results from Fig. 10 suggest that the ATL BSS from weeks 2 to 5 is a consequence of the models’ representation of the MJO–TC relationship more than it is a consequence of the relationship between the TC climatology and the MJO. The correlation between BSS and the index for the MJO–TC relationship is above the 90% significance level.
Similar analyses are conducted for other basins, and there is no consistent dependence of BSS on the three indices examined (Fig. 11). In the SIN and NI, the BSS values are positively correlated with the models’ performance in simulating the MJO. The BSSs in AUS and WNP are positively correlated with all three verification indices. The correlation with TC climatology and the MJO–TC relationship is equally strong in the WNP, while the correlation with TC climatology is strongest in the ENP. The correlations between BSS and the verification indices for the climatology and the MJO–TC relationship in the WNP and ENP are above the 90% significance level. Note that correlation does not necessarily imply causality. Furthermore, BSS and the three verification indices are not independent of each other, even for those that are above the 90% significance level. Results from Fig. 11 merely show their dependency. There might be a causal link, or both quantities might be affected by factors that are not discussed here.
b. BSS and initial MJO magnitude
Belanger et al. (2010) found that subseasonal Atlantic hurricane prediction in the ECMWF model depends on the MJO magnitude at the time of model initialization. Following their work, we bin the reforecasts by MJO magnitude at the initial time. Our results for the ECMWF model in the ATL (grayish lines in Fig. 9) are consistent with those of Belanger et al. (2010): BSS increases with increasing MJO magnitude. We do not, however, find such a relationship in other basins. There is no consistent dependence of BSS on the initial MJO magnitude across basins for the BoM and NCEP models (Fig. 9), or for the other S2S models (not shown).
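The binning step can be sketched as below. The MJO magnitude is taken as the RMM amplitude, sqrt(RMM1² + RMM2²); the bin edges, the dictionary keys, and the function names are hypothetical choices for this example and are not the thresholds used in this study or in Belanger et al. (2010).

```python
from math import hypot
from statistics import mean

def brier_skill_score(probs, outcomes, clim_probs):
    """BSS = 1 - BS_forecast / BS_reference."""
    bs = mean((p - o) ** 2 for p, o in zip(probs, outcomes))
    bs_ref = mean((c - o) ** 2 for c, o in zip(clim_probs, outcomes))
    return 1.0 - bs / bs_ref

def bss_by_mjo_bin(forecasts, edges=(1.0, 2.0)):
    """Group reforecasts by initial RMM amplitude and score each group.

    Each forecast is a dict with keys rmm1, rmm2 (at initialization),
    prob (forecast probability), obs (0/1), and clim (reference probability).
    The edges are assumed bin boundaries; bin 0 holds the weakest MJO cases.
    """
    groups = {}
    for f in forecasts:
        amplitude = hypot(f["rmm1"], f["rmm2"])
        label = sum(amplitude >= e for e in edges)  # number of edges reached
        groups.setdefault(label, []).append(f)
    return {
        label: brier_skill_score(
            [f["prob"] for f in group],
            [f["obs"] for f in group],
            [f["clim"] for f in group],
        )
        for label, group in sorted(groups.items())
    }
```

A BSS that rises from bin 0 to the highest bin would correspond to the behavior reported for the ECMWF model in the ATL; bins with few reforecasts give noisy scores and should be interpreted with care.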
c. BSS and ensemble size
Next, we consider the impact of the ensemble size on the forecast skill of subseasonal TC predictions. In particular, we are interested in exploring whether the low skill scores of the NCEP model are due to its small ensemble size (four members) in the S2S reforecasts. (In the climatology and candy plot analyses, the NCEP system’s performance is as good as that of the BoM and MetFr models, but its skill scores are negative after week 1.) To examine this question, we first reduce the number of ensemble members for all S2S models to four in the BSS calculation. We focus our analysis on the BoM, ECMWF, and MetFr models at weeks 2 and 5, as the CMA and JMA models have only four and five ensemble members, respectively. As expected, the ECMWF, BoM, and MetFr BSS values drop to below zero when only four ensemble members are used in the calculation (Fig. 12). The ECMWF system is still more skillful than NCEP with four ensemble members, as is the BoM model in the basins where it is skillful with all 33 ensemble members.
We further calculate the BSS values with increasing ensemble size and find that the BoM model reaches a saturation point with roughly 15 ensemble members; that is, further increases in the ensemble size do not benefit the genesis prediction skill. The ECMWF and MetFr models, with 11 and 15 ensemble members, respectively, seem to be close to their saturation points as well, although their BSS values are not flat yet. In other words, the forecast strategy of the ECMWF and MetFr is probably more efficient than that chosen by BoM. Thus, we can expect that the NCEP, JMA, and CMA models will have better skill scores using larger ensemble sizes. This is particularly true in the case of the NCEP model, as the JMA and CMA models have larger biases in their simulated TC genesis climatologies.
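The ensemble-size experiment can be illustrated with the sketch below, which recomputes the BSS from subsets of the available members. How members were selected in this study is not stated, so the random subsampling, the number of draws, the fixed seed, and all names here are assumptions for the example only.

```python
import random
from statistics import mean

def brier_skill_score(probs, outcomes, clim_probs):
    """BSS = 1 - BS_forecast / BS_reference."""
    bs = mean((p - o) ** 2 for p, o in zip(probs, outcomes))
    bs_ref = mean((c - o) ** 2 for c, o in zip(clim_probs, outcomes))
    return 1.0 - bs / bs_ref

def bss_vs_ensemble_size(members, obs, clim_probs, sizes, n_draws=50, seed=0):
    """Average BSS over random member subsets of each requested size.

    members[m][t] is 1 if member m has a TC in forecast t, else 0;
    obs[t] is the observed 0/1 occurrence (assumed encodings).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    result = {}
    for n in sizes:
        scores = []
        for _ in range(n_draws):
            subset = rng.sample(members, n)  # draw n members without replacement
            # Forecast probability = fraction of the subset with a TC.
            probs = [mean(column) for column in zip(*subset)]
            scores.append(brier_skill_score(probs, obs, clim_probs))
        result[n] = mean(scores)
    return result
```

Plotting the returned values against the ensemble size would show where the curve flattens, which is the saturation behavior discussed above.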
The subseasonal prediction skill of TC genesis forecasts, at the basin level, is examined in this study using reforecasts from six different ensemble prediction systems. Skill scores are calculated for deterministic and probabilistic predictions. We compute the potential predictability as well. Most forecasts are skillful for week 1 (days 1–7), the period when initialization is important. The prediction skill drops significantly for weeks 2–5, when the models’ performance is associated with the models’ ability to simulate TC genesis climatology, the MJO, and the interaction between the MJO and TCs.
Deterministically, the S2S models are skillful at predicting basin-wide TC occurrence at all lead times (from weeks 1 to 5), but not skillful at predicting the genesis frequency. From weeks 2 to 5, the Brier skill scores (BSSs) of the probabilistic predictions from all models in the ATL, WNP, and ENP are found to be positively related to how well the global MJO–TC relationship is captured by the model. The BSS is positively related to the models’ performance in simulating the MJO in the ATL, SIN, NI, AUS, WNP, and ENP, and to the accuracy of the simulated TC climatology in the ATL, AUS, WNP, and ENP.
Among the six models, the ECMWF model delivers the best performance in reproducing the observed TC genesis climatology and is skillful in forecasting TC genesis up to week 5 in the ATL and WNP, week 2 in the SPC and ENP, and week 1 for the SIN, NI, and AUS. The BoM system has positive skill up to week 4 in the WNP, week 2 in the SPC and AUS, and week 1 in the ATL and SIN. The MetFr model has skill in the WNP up to week 2, and week 1 in the other basins, except the NI. The CMA, JMA, and NCEP models show no skill in predicting TC genesis from weeks 2 to 5. Among the TC basins, subseasonal TC predictions in the Indian Ocean and southern oceans show the least skill, as most of the S2S models have less skill than the monthly climatological probabilities after week 1. In contrast, more S2S models have positive skill in the North Atlantic and North Pacific basins after week 1.
From weeks 2 to 5, the BSSs in all basins are either close to zero (having little skill compared to the climatological forecast) or below zero (having no skill), indicating the difficulty of subseasonal-scale prediction with the current generation of models. The comparison between actual and potential skill suggests that the S2S models may not have yet reached their limits in predicting TC occurrence in all basins, though some models are close to that mark in some basins. The values of the BSSs are close to their respective potential skill levels in the ECMWF model in the ATL and in the three Southern Hemisphere TC basins in the BoM. The BSSs in NCEP and CMA are most distant from their potential values, though this might be due to their low actual skill as well as the insufficient number of ensemble members, which can result in artificially high potential skill.
While current skill scores are still low, the forecasts are produced directly without any bias correction. As noted by Vitart et al. (2010), the skill of a model can be extended by a few weeks by bias correction through postprocessing techniques based on past hindcast performance. Furthermore, one can use derived parameters, such as the genesis potential index, rather than the direct output from TC detection, which might yield useful results. Even with no bias correction, the most skillful models (ECMWF and BoM) have skill comparable to (or slightly higher than) that of the existing regional statistical models at weeks 1 and 2. Note that our results may not reflect the maximum potential skill of some models, such as CMA, JMA, and NCEP, since their S2S reforecast ensembles are small (four or five members).
The research was supported by NOAA S2S Project NA16OAR4310079. Discussions with Drs. Shuguang Wang and Daniel Stern during the course of this study are appreciated. We thank Dr. Carl Schreck and an anonymous reviewer for their comments and constructive suggestions.
1 A preexisting TC is defined when the storm is identified by the tracker on day 1 with intensity greater than the respective TS threshold and when there is an observed storm greater than 34 kt (the TS threshold for observations) within 500-km distance of the simulated storm.
2 While the mathematical formulas in the statistical models are similar from one basin to another, the predictors and their sensitivities vary. Leroy and Wheeler (2008) focused on the southern oceans and used two MJO indices, the ENSO SST index, the Indo-Pacific SST, and the regional TC seasonal climatology. Slade and Maloney (2013), who focused on the Atlantic and east Pacific basins, used MJO and ENSO indices, as well as a regional genesis climatology. For the Atlantic basin, an additional predictor representing the variability of SST in the main development region is used. We do not distinguish between the different statistical models in our discussions but refer the interested reader to those studies.
3 The potential predictability discussed here is model dependent, not the intrinsic potential predictability of TC genesis.