## Abstract

Probabilistic tropical cyclone (TC) occurrence at lead times of weeks 1–4 in the Subseasonal to Seasonal (S2S) dataset is examined here. Forecasts are defined over 15° in latitude × 20° in longitude regions, and prediction skill is measured using the Brier skill score relative to climatological reference forecasts. Two types of reference forecasts are used: a seasonally constant one and a seasonally varying one, with the latter used for forecasts of anomalies from the seasonal climatology. Models from the European Centre for Medium-Range Weather Forecasts (ECMWF), the Australian Bureau of Meteorology, and Météo-France/Centre National de Recherche Météorologiques have skill in predicting TC occurrence four weeks in advance. In contrast, only the ECMWF model is skillful in predicting the anomaly of TC occurrence beyond one week. Errors in genesis prediction largely limit the models’ skill in predicting TC occurrence. Three calibration techniques are explored here: removing the mean genesis forecast bias, removing the mean occurrence forecast bias, and a linear regression method. The linear regression method performs best and guarantees a higher skill score when applied to the in-sample dataset. However, when applied to out-of-sample data, especially in areas where the TC sample size is small, it may reduce the models’ prediction skill. Generally speaking, the S2S models are more skillful in predicting TC occurrence during favorable Madden–Julian oscillation phases. Last, we also report the skill of accumulated cyclone energy predictions using the ranked probability skill score.

## 1. Introduction

Tropical cyclone (TC) predictions are evaluated differently at different time scales. Short-term (weather prediction time scale) track and intensity forecasts are usually verified against best track records at the same time via mean absolute error (e.g., DeMaria et al. 2014). Seasonal storm predictions, on the other hand, are often verified over a basin using correlations of observed and forecast TC counts or accumulated cyclone energy (ACE; e.g., Chen and Lin 2013). Only recently have global weather prediction systems started to generate forecasts at subseasonal time scales (Vitart et al. 2010). Therefore, there are no widely accepted standards for verifying and evaluating subseasonal TC predictions (Camargo et al. 2019). Similarly to short-term weather predictions, Elsberry et al. (2011) and Tsai et al. (2013) verified subseasonal predictions from the European Centre for Medium-Range Weather Forecasts (ECMWF) by comparing the forecast and observed TCs at times and locations at which the storms were very close to each other. Yamaguchi et al. (2015) defined forecasts of weekly storm occurrences over 0.5° × 0.5° grids. Vitart et al. (2010), Camp et al. (2018), and Gregory et al. (2019) examined weekly storm occurrence over 15° in latitude × 20° in longitude boxes with 7.5° and 10° buffer ranges. Others, such as Li et al. (2016), Lee et al. (2018) and Gao et al. (2019) considered basinwide TC activity.

Verification methods are, on one hand, limited by the skill of the forecasts, and on the other hand, they reflect, implicitly, what information is expected from the forecasts. One guiding principle in designing verifications is to consider the desired socioeconomic value of the forecasts. For example, which kind of information would be useful for disaster preparedness with two to three weeks lead time? This information could be used, for example, to plan the distribution and storage of emergency supplies or deploy emergency personnel (Vitart and Robertson 2018). Forecasts of basinwide TC activity clearly do not provide the ideal type of forecast information at these time scales as they do not provide the kind of regional information that is essential for regional disaster preparedness. Conversely, due to the limitations of current prediction systems, it is not reasonable to expect reliable forecasts of the exact time, location or intensity of landfalling TCs weeks in advance. The verification method used by Vitart et al. (2010), Camp et al. (2018), and Gregory et al. (2019) is therefore a reasonable compromise, since it balances the capability of current weather prediction systems with the needs of the user on subseasonal time scales.

Many studies have shown that forecasts of TC position and genesis can have skill beyond 10 days. Elsberry et al. (2011) and Tsai et al. (2013) found that the ECMWF ensembles were able to predict most of the named typhoons’ tracks out to 4 weeks in advance in the 2009 and 2010 Northwestern Pacific typhoon seasons, although there was a 50% false alarm rate. Vitart et al. (2010) showed that a calibration that removes the mean forecast bias could increase the ECMWF’s track prediction skill in the Southern Hemisphere TC basins from 2 to 4 weeks. Similar results are found in two recent papers (Camp et al. 2018; Gregory et al. 2019), which evaluated reforecasts and real-time forecasts of the Australian Bureau of Meteorology seasonal forecasting system (ACCESS-S1) over the Southern Oceans. In the subseasonal to seasonal (S2S) dataset (see section 2), Lee et al. (2018) showed that reforecasts run by six operational centers can predict genesis weeks in advance.

TCs have a strong climatological seasonal cycle, and subseasonal variability of TCs is defined as the anomaly (fluctuation) that deviates from that cycle. Thus, accurately predicting TCs at subseasonal time scales requires models to forecast both the seasonal cycle and anomalies. Generally speaking, global models can predict the seasonal cycle reasonably well because they are good at simulating the low-frequency large-scale atmospheric and oceanic patterns. These large-scale patterns contribute to the predictability of the TC seasonal cycle (Camargo and Barnston 2009; Zhan et al. 2012). The main source of predictability for subseasonal TC variability, on the other hand, is the Madden–Julian oscillation (MJO). Models tend to be more skillful both when the MJO signal is strong during the initial forecast time (e.g., Belanger et al. 2010), and when the MJO is in phases that are favorable to TCs in the basin at the forecast verification time (e.g., Jiang et al. 2012). Tropical waves, such as Kelvin waves and African easterly waves, also influence TC genesis on subseasonal scales (e.g., Ventrice et al. 2012a,b; Schreck 2015). The models’ ability to forecast the large-scale environmental patterns associated with El Niño–Southern Oscillation, the Atlantic meridional mode (e.g., Belanger et al. 2010; Li et al. 2016), as well as extratropical–tropical interactions (Zhang and Wang 2019) influence subseasonal TC predictability as well.

The promising results mentioned above (Vitart et al. 2010; Camp et al. 2018; Gregory et al. 2019; Lee et al. 2018) are based on verifications that credit models for capturing the seasonal cycle and the subseasonal variability. That is to say, forecasts are evaluated against seasonally constant climatological forecasts as a reference. To understand if the S2S models have skill at predicting genesis anomalies, Lee et al. (2018) further used seasonally varying climatological forecasts as a reference (no credit for capturing the seasonal cycle), and showed that the ECMWF model is the only one that has skill in predicting genesis anomalies at 2–3 weeks lead time in most TC basins. Vitart et al. (2010) also discuss the ECMWF model’s prediction skill in Southern Hemisphere TC basins in comparison with seasonally varying climatological forecasts.

The present study is a continuation of Lee et al. (2018), which evaluated the S2S models’ performance in predicting basinwide TC formation. In contrast to Lee et al. (2018), we focus here on 1) the S2S models’ performance in predicting regional TC occurrence (i.e., genesis and subsequent locations) and ACE; 2) applying various calibration methods, including the one used in Camp et al. (2018), to the forecasts and discussing their impact; and 3) investigating the dependence of the prediction skill on the MJO as characterized by two MJO indices, namely the real-time multivariate MJO index (RMM; Wheeler and Hendon 2004) and the real-time outgoing longwave radiation (OLR) MJO index (ROMI; Kiladis et al. 2014). Data and methods for model evaluation are described in section 2. The models’ performance in predicting storm occurrence is presented in section 3, followed by a discussion of the calibration schemes in section 4. We report the dependence of model skill on the MJO in section 5 and the models’ performance in predicting ACE in section 6, followed by the conclusions in section 7.

## 2. Methods

### a. The S2S dataset and observations

We consider the same S2S reforecasts as in Lee et al. (2018), based on coupled, global general circulation models run by six operational centers: the Australian Bureau of Meteorology (BoM), the China Meteorological Administration (CMA), the ECMWF, the Japan Meteorological Agency (JMA), the Météo-France/Centre National de Recherche Météorologiques (MetFr), and the National Centers for Environmental Prediction (NCEP). Basic characteristics of these six reforecasts are shown in Table 1 and further details of the S2S dataset are described in Vitart et al. (2017).

TCs in the S2S models are tracked daily using the methodology of Vitart and Stockdale (2001). The tracker defines a storm center at a local minimum in sea level pressure where 1) a local vorticity maximum (>3.5 × 10^{−5} s^{−1}) at 850 hPa is nearby, 2) a local maximum in the vertically averaged temperature (warm core, >0.5°C) between 250 and 500 hPa is within a distance (in any direction) equivalent to 2° latitude, 3) the two locations detected from criteria 1 and 2 are within a distance equivalent to 8° latitude, and 4) a local maximum thickness between 1000 and 200 hPa can be identified within a distance equivalent to 2° latitude. Additionally, a detected storm must last at least two days to be included in our analysis. The same criteria apply to TCs in all ocean basins.

Observations of tropical cyclone tracks are from the HURDAT2, produced by the National Hurricane Center (Landsea and Franklin 2013), and from the Joint Typhoon Warning Center (Chu et al. 2002). Both best track datasets include 1-min maximum sustained wind, minimum sea level pressure (not used in this study), and storm location every 6 h. Following the conventional definitions (Fig. 1), the TC basins are the following: Atlantic (ATL), northern Indian Ocean (NI), western North Pacific (WNP), eastern North Pacific (ENP), southern Indian Ocean (SIN, 0°–90°E), Australia (AUS, 90°–160°E), and southern Pacific (SPC, east of 160°E). For each basin, we only use forecasts that are initialized during their respective TC seasons: May–November for ATL and WNP, May–October for ENP, April–June and September–November for NI, November–April for SIN and AUS, and December–April for SPC.

### b. Defining forecasts

Following Camp et al. (2018), we subdivide the global TC basins into 20° in longitude × 15° in latitude boxes (centers are labeled by circles in Fig. 1). Each box overlaps with its neighboring boxes by 10° and 7.5° in the longitude and latitude directions, respectively. A box on the border between two basins belongs to the basin on its east and/or north side. Thus, the 20° × 15° boxes centered at the equator belong to the Northern Hemisphere basins. Then, we define occurrence forecasts by the fraction of all ensemble members that contain a TC (ensemble frequency) in each box for each of the six models. Similarly, we also define the ACE forecast by the fraction of ensemble members that have weekly ACE exceeding specified thresholds (section 2d) over each box.

Forecasts are evaluated at daily time resolution with a weekly (7 day) window, starting from day 4. In other words, prediction skill at day 4 contains forecasts from day 1 to day 7, prediction skill at day 5 includes forecasts from day 2 to day 8, and so on. Sometimes we also use “week” to describe the forecasts, such that “week 1 forecasts” refers to forecasts containing data from days 1 to 7, “week 2 forecasts” are forecasts from days 8 to 14, and so on. As an example, Figs. 2a and 2b show week-2 occurrence forecasts (in dots) and the gridded occurrence forecasts (in shading) from an ECMWF forecast initialized on 20 August 2005. The observed storm occurrence and ACE are calculated following the same procedure as described above. For convenience, we refer to each of these 20° × 15° boxes as a “region,” and thus “regional” refers to the analyses done over individual boxes.
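As a minimal sketch of the ensemble-frequency definition above (array shapes and the random stand-in occurrence flags are hypothetical, not S2S data):

```python
import numpy as np

# Hypothetical shapes: 51 ensemble members (e.g., ECMWF) and 53 boxes (e.g., ATL).
rng = np.random.default_rng(0)
n_members, n_boxes = 51, 53
# Whether each member produces a TC in each 20° x 15° box during the 7-day window.
member_has_tc = rng.random((n_members, n_boxes)) < 0.1

# The occurrence forecast is the fraction of members containing a TC
# ("ensemble frequency") in each box.
p_occurrence = member_has_tc.mean(axis=0)
```

The same reduction, applied to weekly ACE exceedances instead of occurrence flags, gives the ACE forecast probabilities.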

### c. Defining the MJO

Two real-time MJO indices are considered. The first one is the RMM, which is calculated using intraseasonal zonal winds at 200 and 850 hPa and observed OLR (Wheeler and Hendon 2004; Gottschalck et al. 2010; Vitart 2017). The second MJO index is ROMI, an OLR-based index, calculated from observed intraseasonal OLR anomalies (Kiladis et al. 2014). Wang et al. (2018) showed that ROMI better represents northward propagation of the boreal summer intraseasonal oscillation than RMM.

### d. Skill scores

#### 1) Brier skill score

The Brier skill score (BSS) is used to assess the skill of a probabilistic forecast of TC occurrence relative to a climatological forecast. The Brier score (BS) is defined as

$$\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - o_i\right)^2, \qquad (1)$$

where *N* is the total number of forecasts and *o*_{i} is the *i*th observation. The term *p*_{i} is the predicted probability of TC occurrence for the *i*th forecast, defined as

$$p_i = \frac{1}{M}\sum_{j=1}^{M}P_{i,j}, \qquad (2)$$

where *M* is the number of ensemble members and *P*_{i,j} is the TC occurrence prediction from the *j*th ensemble member for the *i*th forecast. The terms *P*_{i,j} and *o*_{i} are 0 for no storm and 1 for one or more storm occurrences during the forecast period. Thus, the BS is the mean squared probability forecast error. When analyzing the models’ performance over individual 20° × 15° regions, *N* in Eq. (1) is the number of forecasts used. When evaluating models’ performance in a basin, *N* is the product of the number of forecasts used and the number of regions in that basin. For example, for evaluating the ECMWF model in the Atlantic basin, *N* is 64 554, which consists of 1218 forecasts across 53 regions. Note that the forecast number, 1218, is different from the one (2058) listed in Table 1, because we only use data during the Atlantic hurricane season.

The BS_{ref} is similar to the BS, but for a reference forecast based on the observed climatology, and the skill score is BSS = 1 − BS/BS_{ref}. The observed climatology is calculated using observations over the same period and at the same temporal resolution as the S2S model data. In this study, two climatologies are used: the first is the seasonally varying climatology at monthly time resolution; the second is a constant, seasonal-mean climatology. When a model is more skillful than the climatological reference, the BSS is positive. For convenience, we hereafter refer to the BSS for the monthly varying climatology as BSS_{m}, and the BSS for the seasonal-mean, constant climatology as BSS_{c}. BSS_{c} can be interpreted as the model’s skill in predicting the absolute TC occurrence, including seasonality. BSS_{m}, on the other hand, evaluates the model’s ability to predict the anomalies in TC activity that deviate from the seasonal cycle. The values of BSS_{m} are lower than those of BSS_{c} because its reference forecast (the monthly varying mean) is more informative.
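The BS and BSS computations described above can be sketched as follows; the toy forecast and observation vectors are hypothetical:

```python
import numpy as np

def brier_score(p, o):
    # Mean squared probability forecast error, Eq. (1).
    p, o = np.asarray(p, float), np.asarray(o, float)
    return float(np.mean((p - o) ** 2))

def brier_skill_score(p, o, p_ref):
    # BSS = 1 - BS/BS_ref; positive when the forecast beats the reference.
    return 1.0 - brier_score(p, o) / brier_score(p_ref, o)

# Hypothetical toy case: a sharp forecast vs. a seasonal-mean (constant)
# climatological reference equal to the observed base rate.
o = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0], float)
p = np.array([0.8, 0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1])
p_clim = np.full_like(o, o.mean())
bss_c = brier_skill_score(p, o, p_clim)
```

Passing a monthly varying climatological probability vector as `p_ref` instead of the constant `p_clim` gives the BSS_{m} variant.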

#### 2) Ranked probability skill score

To verify ACE predictions (section 6), we use the ranked probability skill score (RPSS). The RPSS is a squared-error score for categorical forecasts. The cumulative forecasts *P*_{c}, cumulative observations *O*_{c}, and the ranked probability score (RPS) are defined as

$$P_c = \sum_{j=1}^{c} p_j, \qquad O_c = \sum_{j=1}^{c} o_j, \qquad \mathrm{RPS} = \sum_{c=1}^{C}\left(P_c - O_c\right)^2,$$

where *C* is the number of forecast categories and *p*_{j} is the forecast probability of the storm intensity falling in the *j*th category. The observed probability *o*_{j} is 1 if the observations fall in the *j*th category and 0 otherwise. The RPS is the sum of the squared differences between the cumulative probabilities *P*_{c} and *O*_{c}. The RPS is oriented so that smaller values indicate better forecasts; a correct forecast with no uncertainty has an RPS of 0. Similar to the BSS, the RPSS compares the average RPS to that of a reference forecast:

$$\mathrm{RPSS} = 1 - \frac{\langle \mathrm{RPS} \rangle}{\langle \mathrm{RPS}_{\mathrm{ref}} \rangle}.$$

We again have two reference forecasts: the first uses the seasonal-mean climatology, and the second uses the monthly varying seasonal climatology. They are referred to as RPSS_{c} and RPSS_{m}, respectively. The RPSS is sensitive to the definitions of the forecast categories. Because TCs are rare events, more than 95% of the observations have ACE of 0, and the categories should not be equally spaced. Here, we define six categories: the first category is for ACE = 0, and the remaining five have lower bounds at the 0th, 20th, 40th, 60th, and 80th percentiles of the observed distribution of nonzero ACE.
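A minimal sketch of the RPS and the ACE categorization; the nonzero-ACE values used to set the category bounds are hypothetical:

```python
import numpy as np

def rps(p_cat, o_cat):
    # Ranked probability score: squared differences of cumulative probabilities.
    return float(np.sum((np.cumsum(p_cat) - np.cumsum(o_cat)) ** 2))

# Hypothetical nonzero weekly ACE values used to set the category bounds:
# category 0 is ACE == 0; categories 1-5 are bounded by the 20th/40th/60th/80th
# percentiles of the nonzero-ACE distribution.
nonzero_ace = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0, 6.0, 8.0])
edges = np.percentile(nonzero_ace, [20, 40, 60, 80])

def ace_category(ace):
    return 0 if ace == 0 else 1 + int(np.searchsorted(edges, ace, side="right"))

def one_hot(cat, n_cat=6):
    v = np.zeros(n_cat)
    v[cat] = 1.0
    return v
```

Because the RPS accumulates cumulative-probability errors, a forecast that misses the observed category by several bins is penalized more than a near miss.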

## 3. TC occurrence prediction

TC occurrence predictions are evaluated here from both regional and basinwide perspectives. From a basinwide perspective, the ECMWF model is skillful in predicting TC occurrence (BSS_{c}) in all TC basins up to 4 weeks in advance (Fig. 3). The BoM and MetFr models also have positive BSS_{c} at weeks 1–4 in most TC basins. The JMA model is skillful up to 10 days in all TC basins except the NI. In terms of predicting seasonal anomalies (BSS_{m}), the ECMWF model is skillful up to 2–3 weeks in the WNP, ENP, SIN, and SPC, and 1–2 weeks in the ATL and AUS. Other S2S models have limited skill: the BoM model has positive BSS_{m} in the SIN and SPC at weeks 1–2, the MetFr model is skillful in the SIN and AUS at week 1, and the JMA model is skillful in the SIN and SPC at week 1. The CMA and NCEP models do not have skill in predicting TC occurrence globally. The basinwide prediction skill scores shown in Fig. 3 do not always reflect the models’ performance on the regional scale. For example, while the ECMWF model is skillful in predicting TC occurrence at weeks 1–2 globally, Fig. 4a shows that the model has negative BSS_{c} in parts of the AUS basin (Timor Sea, Arafura Sea, Banda Sea). Similarly, the ECMWF model has no skill in predicting TC activity over the Arabian Sea at week 2, but it has an overall positive BSS_{c} in the NI. In contrast, the model is not skillful in predicting the TC occurrence anomaly in the NI as a whole, but it is skillful in the Bay of Bengal (Fig. 4b).

The TC occurrence prediction skill scores in the S2S models are qualitatively consistent with those for genesis prediction shown in Lee et al. (2018); both suggest that the ECMWF is the most skillful model and can predict storm activity anomalies with respect to the monthly climatology up to 2–3 weeks in advance. This similarity is not surprising, as the prevailing circulation associated with the genesis location may influence the subsequent track pattern. Still, it is interesting to know how a model’s occurrence prediction skill is limited by its genesis prediction skill. To address this question, we conduct an additional BSS analysis using only forecast storms that form within 500 km and ±3 days of the observed TC genesis locations. We keep cases in which the observed genesis is captured by at least one ensemble member. In other words, we are looking at the BSS conditioned on the genesis having occurred correctly in at least one of the ensemble members in the forecast (BSS_{m|TC}). One can also think of BSS_{m|TC} as a measure of occurrence forecast skill with the genesis element removed.

Using the ECMWF forecasts, Fig. 5 shows that the positive BSS_{m|TC} values (gray lines) persist much longer than the positive BSS_{m} values (black lines). In the NI and the three southern basins, BSS_{m|TC} is positive from weeks 1 to 4, while BSS_{m} is positive only up to week 2. The increase in prediction skill is smaller (from a few days to one week) in the WNP, ENP, and ATL. It is well known that TCs are steered by the ambient large-scale flow (Dong and Neumann 1986) and that storm motion forecasts depend on skillful prediction of the environmental wind field (Galarneau and Davis 2013). While the S2S models’ performance on steering flow has not yet been examined in the literature (to the best of our knowledge), the difference between the BSS_{m} and BSS_{m|TC} values implies that the ECMWF model may be able to predict the steering flow weeks in advance. One interpretation of Fig. 5 is that the biggest challenge for subseasonal storm occurrence prediction is to forecast genesis well. Vitart and Robertson (2018) also noted that if a model can predict genesis correctly, there is potential for skillful prediction of the subsequent track even at long lead times, at least for long-lived storms. In practice, however, we will not be able to identify in advance which genesis (and subsequent track) predictions are reliable.

## 4. Calibration

Next, we discuss whether the occurrence prediction skill, particularly as measured by BSS_{m}, can be further improved through a postprocessing calibration. Three techniques are explored here: removing the mean genesis bias, removing the mean occurrence bias, and a linear regression method. In principle, the calibration parameters should be developed using a subset of the entire dataset, known as the “training” or “in-sample” data, and evaluated with the remainder of the dataset, known as the “testing” or “out-of-sample” data. Here, we apply a calibration method to the whole dataset and examine the impact of the method on the in-sample dataset. If the results are promising, we then test the method by separating the dataset into in-sample and out-of-sample groups. As shown in this section, we only conduct an out-of-sample evaluation for the linear regression method.

### a. Removing the mean genesis bias

The BSS_{m|TC} results suggest that there is potential to improve the models’ occurrence prediction skill by removing the mean genesis bias—that is, by correcting the mean forecast genesis rate to match the observed one:

$$p_{i|\mathrm{gen}} = r_{\mathrm{gen}}\,p_i, \qquad r_{\mathrm{gen}} = \frac{\sum_{i=1}^{N} o_{i,\mathrm{gen}}}{\sum_{i=1}^{N} p_{i,\mathrm{gen}}}. \qquad (8)$$

Here, the genesis rate is defined as the number of genesis events per day, and the mean genesis bias is the ratio *r*_{gen} between the observed genesis rate $\sum_{i=1}^{N} o_{i,\mathrm{gen}}$ and the model-simulated rate $\sum_{i=1}^{N} p_{i,\mathrm{gen}}$ over each region. This ratio, which is a function of lead time and region, is multiplied by the forecast occurrence probability to obtain the calibrated occurrence probability *p*_{i|gen}. The modified forecasts are then used for calculating the Brier skill score for anomalies:

$$\mathrm{BSS}_{m|\mathrm{gen}} = 1 - \frac{\mathrm{BS}_{m|\mathrm{gen}}}{\mathrm{BS}_{\mathrm{ref}}}, \qquad (11)$$

where BS_{m|gen} is the Brier score of the calibrated forecasts and BS_{ref} uses the monthly varying climatological reference. Equation (11) is the BSS conditioned on the same genesis rate. Compared to BSS_{m} (black lines in Fig. 5), BSS_{m|gen} (green dashed lines in Fig. 5) remains positive in the NI and AUS for almost a week longer. In other words, in these two basins the mean genesis bias reduces the ECMWF model’s occurrence prediction skill by one week. BSS_{m|gen} and BSS_{m} are closer in the WNP, SIN, and SPC than in other basins. In the ENP and ATL, BSS_{m|gen} values are even smaller than BSS_{m}.
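A sketch of this genesis-bias calibration, with hypothetical observed and forecast genesis rates for one region; the clipping to [0, 1] is our addition to keep the scaled values valid probabilities:

```python
import numpy as np

# Hypothetical per-forecast genesis observations (events/day) and the
# corresponding ensemble-mean genesis forecasts over one region.
obs_genesis = np.array([1, 0, 0, 1, 0, 0, 0, 1], float)
fcst_genesis = np.array([0.9, 0.2, 0.1, 0.8, 0.3, 0.2, 0.1, 0.5])

# r_gen: observed-to-forecast genesis ratio, Eq. (8); in practice it is a
# function of lead time and region.
r_gen = obs_genesis.sum() / fcst_genesis.sum()

# Scale the occurrence forecasts by r_gen.
p = np.array([0.5, 0.2, 0.1, 0.6, 0.2, 0.1, 0.1, 0.3])
p_cal = np.clip(r_gen * p, 0.0, 1.0)
```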

### b. Removing the mean occurrence bias

Another common approach for calibrating occurrence forecasts is to remove the mean occurrence bias (e.g., Vitart et al. 2010; Camp et al. 2018). Similar to Eq. (8), the calibrated probability *p*_{i|mean} is derived by multiplying the forecast probability by a ratio, but now it is the ratio *r*_{mean} of the mean observed probability to the mean forecast probability:

$$p_{i|\mathrm{mean}} = r_{\mathrm{mean}}\,p_i, \qquad r_{\mathrm{mean}} = \frac{\overline{o}}{\overline{p}}.$$

The ratio *r*_{mean} is also a function of lead times and regions. We follow Camp et al. (2018) and restrict *r*_{mean} to values between 0.5 and 2. For example, an *r*_{mean} value of 3 is changed to 2, and an *r*_{mean} value of 0.02 is changed to 0.5. This restriction avoids unreasonably large *p*_{i|mean} in areas where the sample size (of TCs) in the forecasts is too small, and avoids forcing the model to predict very small or zero probabilities in regions where the observed TC sample size is small. As mentioned in the introduction, removing the mean occurrence bias increases the ACCESS-S1’s occurrence prediction skill from week 2 to week 5 (Camp et al. 2018; Gregory et al. 2019). Spatial maps of BSS_{m|mean} from ECMWF week-2 forecasts are used to show the impact of this calibration method. The ECMWF week-2 BSS_{m|mean} has positive values in the NI, ENP, SIN, AUS, and SPC (Fig. 6a). When compared to BSS_{m} (Fig. 4b), the calibrated score (BSS_{m|mean}) increases the prediction skill in the Bay of Bengal, western SIN, AUS, and SPC (Fig. 6b). On the basinwide scale, BSS_{m|mean} (green solid lines in Fig. 5) improves the skill of predicting NI, SIN, and AUS storms at all lead times (relative to BSS_{m}) but degrades the skill of predicting WNP, ENP, and ATL storms. In the SPC, it has a positive impact on BSS_{m} before day-10 lead time but a negative impact afterward.
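The mean-occurrence-bias calibration with the 0.5–2 restriction can be sketched as follows (all numbers hypothetical):

```python
import numpy as np

# Hypothetical observed occurrences and forecast probabilities for one region.
o = np.array([0, 1, 0, 0, 1, 0, 0, 0], float)
p = np.array([0.05, 0.10, 0.05, 0.05, 0.10, 0.05, 0.05, 0.05])

# r_mean = mean(observed) / mean(forecast), restricted to [0.5, 2] to avoid
# extreme scalings where the TC sample size is small.
r_mean = float(np.clip(o.mean() / p.mean(), 0.5, 2.0))
p_cal = np.clip(r_mean * p, 0.0, 1.0)
```

Here the raw ratio (0.25/0.0625 = 4) exceeds the cap, so the calibration applies the restricted value of 2 instead.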

The results above show that removing the mean occurrence bias does not always have a positive impact on the forecast. This is consistent with Camargo et al. (2019), who showed that this calibration method improves the ACCESS-S1 Southern Hemisphere skill scores for long leads in 2017–18 but degrades the skill in 2018–19. Because this calibration method has been used in several studies, we conduct further analysis to understand how it works. First, we decompose Eqs. (1) and (2) following Murphy and Winkler (1992) and Murphy (1988):

$$\mathrm{BS} = \left(\mu_p - \mu_o\right)^2 + \sigma_p^2 + \sigma_o^2 - 2\gamma_{p,o}\,\sigma_p\sigma_o,$$

where *σ*^{2} is the variance, *μ* is the mean, *γ* is the correlation coefficient, and $\overline{(\;)}$ and ⟨ ⟩ represent averaging over *N* forecasts. The skill score BSS can then be rewritten as

$$\mathrm{BSS} = \gamma_{p,o}^2 - \left(\gamma_{p,o} - \frac{\sigma_p}{\sigma_o}\right)^2 - \left(\frac{\mu_p - \mu_o}{\sigma_o}\right)^2,$$

in which the three terms on the right-hand side represent the potential skill (correlation), the conditional bias, and the unconditional bias (Bradley et al. 2008). To gain higher values of BSS (better prediction skill), a calibration scheme needs to increase the correlation between forecasts and observations and/or reduce the conditional and unconditional biases. Removing the mean occurrence bias reduces the unconditional bias to zero. However, it also changes the value of *σ*_{p} and therefore does not guarantee a smaller conditional bias. Consequently, Eq. (8) could potentially result in lower values of BSS.
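The decomposition can be checked numerically. The following sketch uses synthetic forecasts and verifies that the direct BSS (with the in-sample mean of the observations as the constant reference, so BS_ref equals the observed variance) matches the potential-skill/conditional-bias/unconditional-bias form:

```python
import numpy as np

# Synthetic forecasts and binary observations standing in for real data.
rng = np.random.default_rng(1)
o = (rng.random(500) < 0.2).astype(float)
p = np.clip(0.6 * o + 0.1 + 0.05 * rng.standard_normal(500), 0.0, 1.0)

mu_p, mu_o = p.mean(), o.mean()
sig_p, sig_o = p.std(), o.std()            # population (ddof=0) moments
gamma = float(np.corrcoef(p, o)[0, 1])

bss_direct = 1.0 - np.mean((p - o) ** 2) / np.mean((mu_o - o) ** 2)
bss_decomp = (gamma ** 2                           # potential skill
              - (gamma - sig_p / sig_o) ** 2       # conditional bias
              - ((mu_p - mu_o) / sig_o) ** 2)      # unconditional bias
```

Note the identity requires population (ddof = 0) moments and a reference forecast equal to the in-sample observed mean.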

When will the calibrated forecast yield higher values of BSS_{m}? To obtain the necessary condition for increasing the BSS, we compare BS and BS_{m|mean} (BS_{m|mean} should be smaller than BS) and obtain the following:

$$\left(r_{\mathrm{mean}} - 1\right)\left[\left(r_{\mathrm{mean}} + 1\right)\overline{p^2} - 2\,\overline{po}\right] < 0.$$

When a model has a positive mean bias, the ratio *r*_{mean} between the mean observed probability and the mean modeled probability has to be smaller than the threshold $(2\,\overline{po}/\overline{p^2}) - 1$. On the other hand, when the model is biased low, *r*_{mean} needs to be larger than the threshold. Figures 7a and 7b show the spatial distributions of *r*_{mean} and the threshold. The color bars in both figures are designed such that, for the calibration method to have a positive impact, the regions that are red (blue) in Fig. 7a need to be redder (bluer) in Fig. 7b. The comparison is shown in Fig. 7c, in which regions where the ECMWF TC occurrence prediction skill can be improved by the calibration method are labeled in red and those where it cannot are labeled in blue. The red and blue areas in Fig. 7c are similar to the reddish and bluish areas in Fig. 6b. Figure 7c also suggests that removing the mean occurrence bias seems to work better when the model mean occurrence forecast is biased low (gray dots in Fig. 7). While not shown here, the blue–red pattern in Fig. 7c is model dependent. The impact of restricting *r*_{mean} to 0.5–2 on the calibrated forecast skill score is not investigated here but is an interesting question that should be further explored.

### c. Linear regression method

Removing the mean occurrence bias does not always work because it corrects only the mean probabilistic forecast error, not the mean squared probability forecast error, which is what the BSS measures. While one can argue that it would be better to use the mean error as an evaluation metric instead, the BSS is a conventional metric for evaluating the performance of probabilistic forecasts. Therefore, we explore a linear regression–based technique (van den Dool et al. 2017) that minimizes the mean square error. In this approach, the calibrated probabilistic forecast is

$$p_{i|\mathrm{linear}} = a\,p_i + b, \qquad (17)$$

where *a* [*a* = *γ*_{p,o}(*σ*_{o}/*σ*_{p})] is the regression coefficient and *b* [*b* = *μ*_{o} − *aμ*_{p}] is the intercept. Note that *p*_{i|linear} may be negative or greater than 1 even though a forecast probability must lie between 0 and 1. In this study, we set *p*_{i|linear} to 0 when it is negative and to 1 when it is greater than 1. For the in-sample data, Eq. (17) removes the unconditional bias and minimizes the conditional bias; the resulting Brier skill score is therefore the potential skill $\gamma_{p,o}^2$. Figure 6c shows that the week-2 BSS_{m|linear} for the ECMWF model is positive everywhere except the North Atlantic; the ECMWF’s week-2 forecasts of the TC occurrence anomaly in the North Atlantic are negatively correlated with observations. The differences between BSS_{m|linear} and BSS_{m} (Fig. 6d), as expected, show that Eq. (17) improves the ECMWF model’s prediction skill globally. At the basin scale, BSS_{m|linear} also outperforms BSS_{m} (compare the pink lines to the black lines in Fig. 5).
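A sketch of the regression calibration on synthetic training data; the clipping step follows the text, and the intercept formula b = mean(o) − a·mean(p) follows from least squares given the stated slope:

```python
import numpy as np

# Synthetic training data standing in for a region's forecasts/observations.
rng = np.random.default_rng(2)
o = (rng.random(300) < 0.2).astype(float)
p = np.clip(0.5 * o + 0.15 + 0.1 * rng.standard_normal(300), 0.0, 1.0)

gamma = float(np.corrcoef(p, o)[0, 1])
a = gamma * o.std() / p.std()      # regression coefficient, gamma * sigma_o / sigma_p
b = o.mean() - a * p.mean()        # least-squares intercept

# Calibrated forecast, Eq. (17), with values clipped back into [0, 1].
p_cal = np.clip(a * p + b, 0.0, 1.0)
```

By construction, the uncalibrated regression output has the same mean as the observations, which is why the unconditional bias vanishes in sample.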

We further examine the impact of applying Eq. (17) to out-of-sample data. To do so, the first two-thirds of the ECMWF forecasts (from 1995 to 2009) are used as training data and the remaining one-third (from 2010 to 2015) as testing data. When applied to out-of-sample data, Eq. (17) does not guarantee higher prediction skill scores (Figs. 6e,f). This is especially true in regions where the training data are insufficient to capture the statistics of the model’s forecast errors, so that the derived *a* and *b* do not minimize the mean square error of the testing data. In the central North Pacific and parts of the North Atlantic, BSS_{m|linear,out} is smaller than BSS_{m}. At the basin scale, BSS_{m|linear,out} (red lines in Fig. 5) still improves the ECMWF week-2 occurrence prediction skill, although the improvement is small in the WNP and SIN. The basinwide BSS_{m|linear,out} values for all models are shown in Fig. 8. Compared to Fig. 3, applying Eq. (17) seems to improve the S2S models’ occurrence prediction skill in all basins. The improvement is especially evident in the SIN, where all six S2S models are skillful at week 1, with the ECMWF, BoM, MetFr, and JMA models also skillful at week 2. A more sophisticated way to minimize the mean square error would be to use logistic regression, which will be explored in future work.

The three calibration techniques explored here suggest that calibrating subseasonal, probabilistic TC predictions is not straightforward. A method that works for in-sample data may not work for out-of-sample data, especially at regional scales. Further effort is necessary to develop a comprehensive calibration method.

## 5. Dependence of occurrence prediction skill on the MJO

As discussed in the Introduction, the predictability of subseasonal TC activity is commonly related to the MJO phase and amplitude (e.g., Belanger et al. 2010; Jiang et al. 2012). To systematically assess the dependence of the S2S models’ prediction skill on the MJO, we compare the lag relationships of TC occurrence and Brier skill scores to the MJO phases defined by RMM and ROMI (section 2c). To make sure the relationships are not contaminated by the calibration methods, we use the original BSS_{c} and BSS_{m} here.

We start by examining the observed MJO–TC genesis relationship from these two indices using the candy-plot analysis (Lee et al. 2018), a two-dimensional histogram of genesis probability as a function of MJO phases and basins. In Fig. 9, the TC basins are arranged so that the convectively active MJO phases (with black circles) are aligned diagonally. The probability of genesis in convectively active (favorable) MJO phases is higher (red colors) than in suppressed phases (blue colors). The ROMI candy diagram shows more dark red and dark blue circles than the RMM candy diagram does, indicating that ROMI is sharper and better represents the MJO’s modulating influence on TC genesis. The favorable MJO phases defined by ROMI are shifted to the east by one phase in the WNP, SPC, and ENP, compared to those defined by RMM. The lag analysis between TC occurrence and the MJO (Fig. 10) shows the eastward shift of the favorable MJO phases from RMM to ROMI as well. This shift may be related to the fact that RMM mostly represents the MJO circulation (Straub 2013; Ventrice et al. 2013), while ROMI represents the MJO convection (Kiladis et al. 2014). Another possibility is a shift in the geographic locations of the MJO phases defined using ROMI compared with those defined using RMM. However, Kiladis et al. (2014) showed that the maximum correlation between OMI (the non-real-time version of ROMI) and RMM occurs at lags from −2 to 4 days, and thus these two indices do represent MJO phases with similar (though not exactly the same) geographic locations.

While not perfect, the candy-plot analyses (Fig. 11) suggest that the S2S models capture the shifts of the favorable MJO phases. Except in the JMA model, the pattern correlations between simulated and observed MJO–TC relationships are higher when the MJO is defined by RMM than by ROMI. This indicates that the S2S models better simulate the influence of the MJO wind signal on TC frequency than that of the MJO convection signal. The CMA and MetFr models are the two extreme cases: their simulations of the ROMI-defined MJO–TC relationship yield correlations with observations of only 11% and 5%, while for RMM the correlation coefficients are 41% and 42%, respectively.

Next, we analyze the contribution of the MJO to S2S models’ prediction skill by grouping the forecasts by MJO phase. Using BSS_{c} as an example, first we calculate the difference of BSS_{c|mjo} (i.e., the BSS_{c} conditioned on the MJO phase) and BSS_{c}: *δ*BSS_{c|mjo} = BSS_{c|mjo} − BSS_{c}. Positive *δ*BSS_{c|mjo} means that forecasts initialized at the MJO phase (denoted mjo) contribute positively to BSS_{c}, which is calculated using the full dataset. Then, we use lag analysis to examine the MJO–BSS_{c} relationship.
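As a minimal sketch of this bookkeeping (with hypothetical array names: `p` holds forecast probabilities, `o` binary occurrence outcomes, `p_ref` the climatological reference probabilities, and `phase` the MJO phase at initialization):

```python
import numpy as np

def brier_skill_score(p, o, p_ref):
    """Brier skill score of forecast probabilities p against binary
    outcomes o, relative to a reference (climatological) forecast p_ref."""
    bs = np.mean((p - o) ** 2)
    bs_ref = np.mean((p_ref - o) ** 2)
    return 1.0 - bs / bs_ref

def delta_bss_by_phase(p, o, p_ref, phase, target_phase):
    """delta BSS for forecasts initialized in a given MJO phase:
    the BSS conditioned on that phase minus the BSS over the full dataset."""
    full = brier_skill_score(p, o, p_ref)
    m = (phase == target_phase)
    conditional = brier_skill_score(p[m], o[m], p_ref[m])
    return conditional - full
```

A positive return value means forecasts initialized in that phase contribute positively to the overall BSS, as described above.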

Figure 12 shows that positive *δ*BSS_{c|mjo} (red shading) is in phase with positive TC activity anomalies (black contours) in the ECMWF simulations when the MJO is defined by ROMI. Similar results are found when the MJO is defined by RMM (not shown). In other words, the ECMWF model is more skillful in predicting total TC occurrence during favorable MJO phases than during unfavorable ones. The pattern correlation coefficients between the MJO–TC and MJO–BSS_{c} relationships in the seven TC basins from the six S2S models are shown in Table 2. In most cases, the S2S models have positive correlation coefficients, meaning that they likely have better skill in predicting total TC occurrence during favorable MJO phases. Exceptions include the BoM model in the ENP and ATL when the MJO is defined by RMM, and the CMA model in the ENP and ATL when the MJO is defined by ROMI. The relationships between MJO–TC and MJO–BSS_{c} are significant only in a few TC basins in the JMA and NCEP models. In contrast, the relationships between MJO–BSS_{m} and MJO–TC in the ECMWF model are not as strongly in phase (Fig. 13). For the ECMWF model, the pattern correlation coefficients are still positive in most TC basins (Table 3), except in the ENP and SPC when the MJO is defined by ROMI. In the BoM model, the MJO–BSS_{m} relationship is negatively related to the MJO–TC relationship, indicating that the BoM model has better skill in predicting the anomaly of TC occurrence during suppressed phases than during active ones.

While the impacts of the MJO phase on the prediction skill (whether BSS_{c} or BSS_{m}) vary by basin and by model, Tables 2 and 3 suggest that favorable MJO phases are associated with better skill for predicting total TC occurrence. Favorable MJO phases are associated with better BSS_{m} in the ECMWF and CMA models in most TC basins, but not in the other models. It is not clear to us why there is no general relationship between favorable MJO phases and BSS_{m}, since the MJO is associated with subseasonal TC variability. Causal connections between the MJO phases and BSS_{c} and BSS_{m} are left for future research.

## 6. ACE prediction

Next, we briefly discuss the S2S models’ performance in predicting ACE. As mentioned in section 2, the ACE forecasts are analyzed using RPSS_{c} and RPSS_{m} (section 2d). Because of their coarse horizontal grid spacing, most S2S models are unable to simulate either the TC core structure or the occurrence of the most intense TCs. In the case of the ECMWF model, another reason for the low intensity values is that TC occurrence was derived on a 1.5° grid, which is coarser than the original model grid (0.5°). The strongest TC winds generated by the S2S models are around 50 kt (1 kt ≈ 0.51 m s^{−1}) (Lee et al. 2018), except in the BoM model (60–70 kt), which has 2° horizontal resolution. The BoM model winds are higher than expected, however, as a model with 2° horizontal resolution should not be able to generate storms with such strong winds (Davis 2018).
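For reference, the ranked probability skill score for one categorical forecast can be sketched as follows; this is a minimal illustration assuming arrays of category probabilities for the forecast and the climatological reference plus the observed category index (names are ours, not from the paper):

```python
import numpy as np

def ranked_probability_skill_score(p_fcst, p_clim, obs_cat):
    """RPSS for one categorical forecast: the ranked probability score
    (cumulative squared error across ordered categories) of the forecast,
    relative to a climatological reference forecast."""
    o = np.zeros_like(p_fcst)
    o[obs_cat] = 1.0  # observed category as a one-hot vector
    rps = np.sum((np.cumsum(p_fcst) - np.cumsum(o)) ** 2)
    rps_ref = np.sum((np.cumsum(p_clim) - np.cumsum(o)) ** 2)
    return 1.0 - rps / rps_ref
```

In practice the RPS is averaged over all forecast–verification pairs before forming the skill score; positive values indicate the forecast beats the climatological reference.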

To correct the low-intensity bias in the S2S models, we apply quantile matching, similar to that in Camargo and Barnston (2009). One could instead categorize the predicted and observed ACE into six categories using their respective thresholds; here we adjust the forecast intensities before calculating ACE, so that the observed thresholds are used for all models. Results from the RPSS_{c} analyses (Fig. 14) suggest that the ECMWF model is skillful in predicting regional TC intensity in all basins at all leads. The BoM and MetFr models are skillful in most TC basins. The prediction skill scores of the NCEP and CMA models are the lowest among the six S2S models, though CMA has positive RPSS_{c} values up to 4 weeks in the SIN. ECMWF has skill in predicting the ACE anomaly (RPSS_{m}): in the WNP and SIN, the model is skillful up to 2 weeks, while in other basins only at week 1. In the same way that a model’s occurrence prediction skill is influenced by its ability to capture genesis, the S2S models’ skill in predicting ACE is influenced by their ability to capture the observed genesis and occurrence. Isolating such impacts is left for a future study, as is the calibration of ACE.
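A minimal sketch of empirical quantile matching in this spirit (not the paper’s exact implementation; the function name, the empirical-quantile estimator, and the interpolation choice are our assumptions):

```python
import numpy as np

def quantile_match(forecast, model_climo, obs_climo):
    """Map each forecast intensity to the observed intensity at the same
    quantile of the model's own climatology (empirical quantile matching,
    in the spirit of Camargo and Barnston 2009)."""
    model_sorted = np.sort(model_climo)
    obs_sorted = np.sort(obs_climo)
    # empirical quantile of each forecast value in the model climatology
    q = np.searchsorted(model_sorted, forecast, side="right") / len(model_sorted)
    q = np.clip(q, 0.0, 1.0)
    # observed intensity at the same quantile
    return np.quantile(obs_sorted, q)
```

After this remapping, the observed ACE category thresholds can be applied uniformly to all models.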

## 7. Conclusions

The subseasonal (week 1–4) prediction skill of probabilistic forecasts of TC occurrence (genesis with subsequent daily position) and accumulated cyclone energy (ACE), at both basin and regional spatial scales, is examined using reforecasts from the BoM, CMA, ECMWF, JMA, MetFr, and NCEP in the S2S dataset. We use the Brier skill score (BSS) for evaluating the TC occurrence predictions, and the ranked probability skill score (RPSS) for ACE. Both quantities are evaluated over 15° in latitude × 20° in longitude regions (Fig. 1). The forecasts are defined as skillful when they outperform the climatological forecasts, defined by either the seasonal mean constant climatology (BSS_{c} and RPSS_{c}) or the monthly varying climatology (BSS_{m} and RPSS_{m}). Thus, BSS_{c} and RPSS_{c} evaluate the models’ ability to forecast the observed TC activity, including its seasonality, while BSS_{m} and RPSS_{m} consider only the deviation of TC activity from that seasonality. Additionally, we investigate how the occurrence prediction skill is affected by imperfect genesis predictions and how various calibration schemes impact a model’s prediction skill. We also systematically examine the dependence of the S2S models’ prediction skills on MJO phase.

Among the six models examined here, the ECMWF model has the best performance (Fig. 3). It is skillful in predicting TC occurrence up to 4 weeks in all TC basins, except in the NI, where the model is skillful up to week 3. The model is also skillful in predicting the TC occurrence anomaly 2–3 weeks in advance. Following the ECMWF model are the MetFr and BoM models, which are skillful in predicting TC activity 4 weeks in advance in most TC basins; they are not skillful in predicting the TC occurrence anomaly, however. The JMA model is skillful in predicting storm occurrence 2 weeks in advance, while the CMA and NCEP models have no skill in predicting either TC occurrence or its anomaly in any TC basin or at any lead. The prediction skills of the CMA and NCEP models may be limited by their small ensemble sizes, as discussed in Lee et al. (2018). In addition to the different ensemble sizes, the S2S data periods also differ, which may affect the S2S models’ performance as well. By examining the BSS conditioned on the same TCs (no genesis errors), we showed that the most challenging task in subseasonal occurrence prediction is to forecast genesis correctly (Fig. 5). In the case of the ECMWF model, correct genesis predictions can improve prediction skill (for the TC occurrence anomaly) from 2 to 4 weeks. The S2S models’ performance for ACE prediction (Fig. 14) follows their performance for occurrence prediction, since storm frequency largely determines ACE. The ECMWF, MetFr, and BoM models skillfully predict ACE up to 3–4 weeks. The ECMWF model is the only one that is skillful in predicting the ACE anomaly 2 weeks in advance, however.

Calibration of the mean probabilistic forecast error has been used to improve TC occurrence predictions (e.g., Camp et al. 2018; Gregory et al. 2019). Here we showed that while calibrating the mean bias can reduce the unconditional bias component of the BSS, it does not always reduce the conditional bias [Fig. 6 and Eqs. (13) and (14)]. As a result, this calibration method may lead to lower BSS values (i.e., worse skill). To know whether a calibration of the mean probabilistic forecast error benefits the BSS evaluation, one can compare the ratio between the mean forecast probability $\overline{p}$ and the mean observed probability $\overline{o}$ to the threshold $(2\overline{po}/\overline{p^2}) - 1$ [Eqs. (15) and (16)]. The prediction skill of models with large mean biases, such as CMA and NCEP, can be significantly improved with this calibration method. To calibrate the mean square probabilistic forecast error, the quantity that the BSS measures, we used the linear regression approach proposed by van den Dool et al. (2017). For the in-sample dataset, the linear regression method improves the S2S models’ prediction skill globally. For the out-of-sample datasets, this method can improve the models’ skill everywhere except in areas where the TC sample size is too small.
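As a numerical illustration of such a diagnostic, the sketch below computes the ratio $\overline{p}/\overline{o}$, the threshold $2\overline{po}/\overline{p^2} - 1$, and the Brier score before and after a multiplicative mean-bias rescaling. The exact calibration form and the direction of the inequality follow Eqs. (15) and (16) of the paper, which are not reproduced here, so this sketch is illustrative only (names and the rescaling form are our assumptions):

```python
import numpy as np

def mean_bias_calibration_check(p, o):
    """Diagnostics for mean-bias calibration of probabilistic forecasts.

    Returns the ratio p_bar/o_bar, the threshold 2*mean(p*o)/mean(p**2) - 1
    discussed in the text, and the Brier score before and after rescaling
    the forecast probabilities by o_bar/p_bar (clipped to [0, 1])."""
    p_bar, o_bar = p.mean(), o.mean()
    ratio = p_bar / o_bar
    threshold = 2.0 * np.mean(p * o) / np.mean(p ** 2) - 1.0
    p_cal = np.clip(p * (o_bar / p_bar), 0.0, 1.0)  # assumed rescaling form
    bs_raw = np.mean((p - o) ** 2)
    bs_cal = np.mean((p_cal - o) ** 2)
    return ratio, threshold, bs_raw, bs_cal
```

In this strongly overforecast example the rescaling lowers the Brier score; whether it does so in general is exactly what the ratio-versus-threshold comparison is meant to predict.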

Next, the dependence of the S2S models’ TC forecast skill on the MJO is examined using both RMM and ROMI. The S2S models’ prediction skill in TC occurrence (including the seasonality) is positively related to the favorable MJO phases (Table 2). The relationship between MJO phases and the models’ prediction skill for the TC occurrence deviation from the seasonality varies by model and basin (Table 3). This finding is consistent with our previous work on genesis anomaly prediction (Lee et al. 2018), which showed that there is no clear relationship between the MJO and genesis prediction skill. An unexpected result is that the ROMI-defined favorable MJO phases are shifted eastward compared to those defined by RMM (Fig. 9). To the best of our knowledge, the literature does not yet offer a satisfying explanation for this shift.

Based on our findings and those in Lee et al. (2018), the ECMWF model is the most skillful ensemble prediction system for subseasonal TC genesis, occurrence, and ACE forecasts in the S2S dataset, followed by the BoM and MetFr models. Even in these models, however, the skill in predicting the anomaly of TC activity from the seasonal climatology remains low. Genesis prediction is the key bottleneck causing this low prediction skill. Our results highlight the importance of improving our fundamental understanding of TC genesis in order to obtain more skillful subseasonal TC predictions. Calibrating subseasonal probabilistic TC predictions is not easy, but a comprehensive calibration method can substantially increase a model’s prediction skill and should be further explored in the future. It should be mentioned that this study and Lee et al. (2018) present the prediction skill derived directly from the reforecasts in the S2S dataset. Our results may not reflect the latest prediction skill of the operational centers mentioned here, which may have improved since the collection of the S2S dataset. Also, the reforecasts in the S2S dataset have small ensemble sizes, except for BoM, and both the BSS and RPSS penalize small ensemble sizes. This negative impact may be even more significant for NCEP and CMA because both models have only four members in the S2S dataset. Variants of the RPSS and BSS that account for ensemble size (Weigel et al. 2007) may be used in the future to estimate model skill for an infinite ensemble.

## Acknowledgments

We thank the three anonymous reviewers for their thorough reviews. The research was supported by NOAA S2S Projects NA16OAR4310079 and NA16OAR4310076.

*Data Availability Statement:* S2S data and S2S TC tracks are available to the research community at http://s2sprediction.net. Best track data for the North Atlantic and eastern North Pacific are available at https://www.nhc.noaa.gov/data/#hurdat, and those for the Southern Hemisphere, north Indian Ocean, and western North Pacific are archived at https://www.metoc.navy.mil/jtwc/jtwc.html?best-tracks.


## Footnotes

*Publisher's Note:* This article was revised on 24 April 2020 to replace Fig. 8, which was missing its axis labels when originally published.