## 1. Introduction

The limited predictability of the atmospheric flow at medium range has led operational meteorological centers [e.g., the European Centre for Medium-Range Weather Forecasts (ECMWF), the National Centers for Environmental Prediction (NCEP), and the Meteorological Service of Canada (MSC)] to develop ensemble prediction systems (EPSs). One of the main purposes of an EPS is to quantify the different sources of uncertainty associated with the numerical weather forecast (i.e., the approximate modeling of the atmospheric dynamics and the limited precision of the initial conditions). Each center has developed, and continues to develop, its own procedure for the analysis cycle, the numerical model, and the perturbation method applied to build its ensemble. Over the past 15 years, the different ensemble approaches have generally extended weather predictability up to a 7-day range (e.g., Buizza et al. 2005). One alternative for improving the ensemble prediction and extending the range of weather predictability is to build a multimodel ensemble, or multiensemble, that includes these different analyses, models, and perturbation methods. The multiensemble approach seems to provide benefits for climate and seasonal prediction (Palmer et al. 2004). Following these successes, the multiensemble approach has been introduced in medium-range prediction, particularly through the research framework of The Observing System Research and Predictability Experiment (THORPEX) Interactive Grand Global Ensemble (TIGGE), where different centers exchange their ensemble data to build a global multiensemble (e.g., Park et al. 2008; Johnson and Swinbank 2009; Hagedorn et al. 2010, manuscript submitted to *Mon. Wea. Rev.*).

The North American Ensemble Forecast System (NAEFS) is a multiensemble built with the MSC and the NCEP operational ensembles. A previous study (Candille 2009) has shown the substantial gain obtained by simply combining these two raw EPSs, without any postprocessing corrections. The quasi-perfect dispersion of the MSC EPS combined with the better probabilistic resolution of the NCEP EPS provides improvements in terms of bias, dispersion, global skill [continuous ranked probability score (CRPS); Stanski et al. 1989], reliability, and resolution [probabilistic attributes described in Toth et al. (2003)]. But it is justified to wonder whether the good match—improvements both in terms of reliability and resolution—obtained with the NAEFS is only due to the combination of different biases [biases of opposite signs, as observed in Figs. 3 and 6 in Candille (2009)], that is, the combination of different forecast errors coming from the individual deficiency of each system. In the NAEFS operational process, the same method of bias correction “on the fly” (Cui et al. 2008), where the historical bias is updated at each time step, is applied on each center’s EPS. The purpose of the present study is not to evaluate the bias correction method, which is operationally performed, but to investigate the impact and the relevance of this bias correction in the specific NAEFS context. Section 2 introduces the dataset and the verification package used to evaluate and compare the different EPSs using the tools described in Candille et al. (2007). It also presents the bias correction methodology as applied in the operational NAEFS process. Section 3 briefly shows the basic results about the raw multiensemble experiment, as shown in Candille (2009), but over the present verification period. Section 4 investigates the effects of the bias correction on each EPS component (NCEP and MSC) and on the multiensemble NAEFS. 
Also, the improvements due to the bias correction method and the raw multiensemble approach are compared in this section. Finally, a discussion and some concluding remarks are presented in section 5.

## 2. Dataset and methodology

### a. Systems and dataset

The MSC and NCEP ensembles are combined in the NAEFS multiensemble. The main characteristics of the MSC EPS (Charron et al. 2010) are the multiparameterization approach, the stochastic perturbation of the tendencies produced by the model physics, and the definition of the initial conditions by an ensemble Kalman filter. On the other hand, the NCEP system (Toth and Kalnay 1997) is built by dynamically perturbing the initial conditions (Wei et al. 2008). The NAEFS multiensemble is thus the unweighted combination of the MSC and NCEP ensembles, both of which will sometimes be referred to as the EPS components of NAEFS.

In this study, all the EPSs are evaluated and compared over the 2008 fall period, from 15 October to 17 December. To minimize temporal autocorrelation of the diagnoses (Candille et al. 2007), the verifications are only performed every 36 h [i.e., two consecutive verification times are separated by 36 h (15 October at 0000 UTC, 16 October at 1200 UTC, 18 October at 0000 UTC, …)]. For the considered period, that limitation gives 43 verification cases for each forecast range (from 24 to 384 h). The diagnoses are performed against observations from a global network of 636 radiosondes (Elms 2003) that are quality controlled and selected by the MSC operational four-dimensional variational data assimilation (4D-Var) scheme. Three variables (temperature, geopotential height, and zonal wind) are considered at four pressure levels (250, 500, 850, and 925 hPa). Note that the meridional wind component is not considered here because all its diagnoses are strictly similar to the zonal wind ones. In addition to global diagnoses, results for regional areas are also introduced in this study: the Northern and Southern Hemispheres (northern and southern extratropics poleward of 20°N and 20°S, respectively) and the tropics (between 20°N and 20°S). Four different verification datasets are thus defined and are respectively denoted “global,” “NH,” “SH,” and “tropics.”
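The 36-h sampling described above can be checked with a short sketch. The function name `verification_times` is hypothetical, and the end of the period is assumed to be 17 December at 0000 UTC (the text only gives the date).

```python
from datetime import datetime, timedelta


def verification_times(start, end, step_hours=36):
    """Return all verification times from start to end (inclusive), step_hours apart."""
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += timedelta(hours=step_hours)
    return times


# Period bounds from the text; the 0000 UTC end hour is an assumption.
times = verification_times(datetime(2008, 10, 15, 0), datetime(2008, 12, 17, 0))
print(len(times))          # number of verification cases per forecast range
print(times[1], times[2])  # second and third verification times
```

Under these assumptions the schedule contains 43 verification times, in agreement with the number of cases quoted in the text.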

### b. Verification methodology

The CRPS is decomposed into a reliability part and a potential part (Hersbach 2000):

$$\mathrm{CRPS} = \mathrm{Reli} + \mathrm{CRPS}_{\mathrm{pot}}. \tag{1}$$

CRPS$_{\mathrm{pot}}$ (also called sharpness or statistical variability) measures the potential skill for a calibrated EPS [i.e., when the reliability (Reli) is null]. Provided that the system is reliable, the potential skill depends on the EPS’s ability to produce predicted ensembles sufficiently different from the climatological distribution (i.e., ensembles with small spread compared to the climatology). In fact, the variability of the predicted ensemble spreads is a determining factor of the system quality. The reliability is also decomposed into bias and dispersion by considering the reduced centered random variable (RCRV):

$$y = \frac{o - m}{\sqrt{\sigma_o^2 + \sigma^2}}, \tag{2}$$

where ($m$, $\sigma$) are the mean and the spread of the predicted ensembles and ($o$, $\sigma_o$) are the observation and its associated standard error (i.e., the observational error is taken into account in this diagnosis). The bias is the mean of $y$ over the verification dataset $a$,

$$b = \frac{1}{M_a}\sum_{i=1}^{M_a} y_i, \tag{3}$$

and $b$ is actually the forecast error of the ensemble mean normalized by the combination of the ensemble spread and the observational error. The latter represents the characteristic size of the uncertainties considering both forecasts and observations. The dispersion is the standard deviation of $y$,

$$d = \sqrt{\frac{1}{M_a - 1}\sum_{i=1}^{M_a} \left(y_i - b\right)^2}, \tag{4}$$

where $M_a$ is the number of realizations of the dataset $a$; $d$ measures the agreement between the forecast error and the forecast/observation uncertainties. The diagnoses associated with the two first moments of the RCRV and the reliability component of the CRPS provide similar information about the reliability attribute of an EPS. In the following, the diagnoses based on Reli [Eq. (1)] are not mentioned.
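As an illustration of the two RCRV moments, the following minimal sketch computes the normalized bias and the dispersion for a handful of cases. It assumes that the RCRV is the ensemble-mean error divided by the square root of the summed squared ensemble spread and observational error, and that the dispersion is the sample standard deviation; the function names and the numerical values are hypothetical.

```python
import math
import statistics


def rcrv(obs, ens_mean, ens_spread, obs_error):
    """Reduced centered random variable for one forecast/observation pair."""
    return (obs - ens_mean) / math.sqrt(ens_spread**2 + obs_error**2)


def rcrv_moments(cases):
    """Return (bias, dispersion) of the RCRV over a list of cases.

    Each case is a tuple (obs, ens_mean, ens_spread, obs_error).
    """
    y = [rcrv(*case) for case in cases]
    bias = statistics.fmean(y)        # first moment: normalized bias
    dispersion = statistics.stdev(y)  # second moment: sample standard deviation
    return bias, dispersion


# Hypothetical cases: the combined uncertainty sqrt(0.6^2 + 0.8^2) equals 1.
cases = [
    (1.0, 0.0, 0.6, 0.8),
    (-1.0, 0.0, 0.6, 0.8),
    (2.0, 1.0, 0.6, 0.8),
    (0.0, 1.0, 0.6, 0.8),
]
b, d = rcrv_moments(cases)
print(b, d)
```

A perfectly reliable system would give a bias of 0 and a dispersion of 1; a dispersion above 1 indicates underdispersion and a value below 1 indicates overdispersion.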

… CRPS$_{\mathrm{pot}}$, $b$, and $d$. For the comparisons between EPSs A and B, four Δ-functions are defined [Eqs. (5)–(8)]: $\Delta^{c}$ for the CRPS, $\Delta^{p}$ for CRPS$_{\mathrm{pot}}$, $\Delta^{b}$ for the bias, and $\Delta^{d}$ for the dispersion. Absolute values and specific functions are used in Eqs. (7) and (8), respectively, to equitably penalize positive and negative biases, and over- and underdispersion (Candille 2009). For each Δ-function, positive values mean that EPS B is better than EPS A. Note that only the CRPS Δ-function [Eq. (5)] is linear with respect to the verification dataset; that is, for two distinct datasets $a_1$ and $a_2$, a linear Δ-function satisfies

$$\Delta(a_1 \cup a_2) = w_1\,\Delta(a_1) + w_2\,\Delta(a_2), \tag{9}$$

where $w_{1,2}$ is the weight associated with the dataset $a_{1,2}$. This is due to the fact that the CRPS calculation depends on two coefficients $\alpha$ and $\beta$, which are linearly built [see Eqs. (28) and (29) from Hersbach 2000]. On the other hand, the CRPS partition and CRPS$_{\mathrm{pot}}$ of course depend on $\alpha$ and $\beta$, but also on the ratio $\alpha/(\alpha + \beta)$ [see Eqs. (36) and (37) from Hersbach 2000]. The resolution score is thus no longer linear, and by extension neither is $\Delta^{p}$. The bias $b$ is linearly built [Eq. (3)], but because of the absolute value, $\Delta^{b}$ is not [Eq. (7)]. Finally, $\Delta^{d}$ is not linear because of its definition [Eq. (8)] and, obviously, the nonlinearity of the variance [Eq. (4)]. To summarize, only

$$\Delta^{c}(a_1 \cup a_2) = w_1\,\Delta^{c}(a_1) + w_2\,\Delta^{c}(a_2) \tag{10}$$

is mathematically true.
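The contrast between a linear and a nonlinear comparison function can be illustrated numerically. The sketch below is only indicative: it assumes that the CRPS over a dataset is the mean of per-case CRPS values and that the bias comparison is a difference of absolute mean RCRV values, which is a guess at the form of the bias Δ-function rather than its published definition. All function names and values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def delta_crps(crps_a, crps_b):
    """CRPS difference between EPS A and EPS B (positive: B is better)."""
    return mean(crps_a) - mean(crps_b)


def delta_bias(y_a, y_b):
    """Assumed bias comparison: difference of the absolute mean RCRV values."""
    return abs(mean(y_a)) - abs(mean(y_b))


# Hypothetical per-case values for two datasets of equal size (w1 = w2 = 0.5).
crps_a1, crps_b1 = [1.0, 3.0], [1.0, 1.0]
crps_a2, crps_b2 = [2.0, 2.0], [3.0, 1.0]
y_a1, y_b1 = [1.0, 1.0], [0.5, 0.5]
y_a2, y_b2 = [-1.0, -1.0], [0.5, 0.5]
w1 = w2 = 0.5

# CRPS: the pooled difference equals the weighted sum of the regional differences.
pooled_crps = delta_crps(crps_a1 + crps_a2, crps_b1 + crps_b2)
weighted_crps = w1 * delta_crps(crps_a1, crps_b1) + w2 * delta_crps(crps_a2, crps_b2)

# Bias: opposite-sign biases of EPS A cancel in the pooled dataset, so the
# pooled difference no longer matches the weighted sum.
pooled_bias = delta_bias(y_a1 + y_a2, y_b1 + y_b2)
weighted_bias = w1 * delta_bias(y_a1, y_b1) + w2 * delta_bias(y_a2, y_b2)

print(pooled_crps, weighted_crps)
print(pooled_bias, weighted_bias)
```

In this toy case EPS B looks better than EPS A on each dataset taken separately in terms of bias, yet worse on the pooled dataset, which is the kind of apparent contradiction a nonlinear comparison function can produce between regional and global diagnoses.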

… CRPS$_{\mathrm{pot}}$, and is denoted $I^{c}$ and $I^{p}$, respectively [Eq. (11)]. This is the simple algebraic mean of the significant relative differences; that is, the sums are equally weighted and the “no change” category is ignored by choosing $w_{i,j} = 0$, averaged over the forecast ranges $i$ and the pressure levels $j$ for each variable, each measure, and each regional area. For the two first moments of the RCRV, the raw measures $b$ and $d$ can already be considered relative differences (%) compared to the perfectly reliable EPS (i.e., $b = 0$ and $d = 1$), and the index is defined accordingly [Eq. (12)]. The index $I^{c}$ associated with the CRPS for the global area is chosen to summarize the general improvement (or degradation) of a system compared to another. In the following, unless explicitly specified, the terms global or globally refer to diagnoses over the global area for the three variables (temperature, geopotential height, and zonal wind). In a similar way, the terms global performance or global gain refer to the CRPS over the global area for the three variables. In the following, all the percentages (unless otherwise specified) are the indices [Eqs. (11) and (12)] associated with the presented figures.

### c. Bias correction methodology

… $b = 0$ and $d = 1$ [Eqs. (3) and (4)], so that the two first moments of the predicted ensembles are corrected. These two rigorous methods seem to be expensive in terms of computational time. As an alternative, in the TIGGE context, Park et al. (2008) apply a bias correction computed over a short training period (60 days) before the verification day. In the NAEFS operational context, which is considered in the current paper, the bias correction is computed on the fly by a method described in Cui et al. (2008). This is a correction of the first moment (bias) of the ensemble, a concept also used in Johnson and Swinbank (2009) in a TIGGE study. For each variable, the bias correction is applied independently to every 6-h time step of the forecast. A historical bias field $B$ is available for each forecast range $\delta t$, and is replaced by an updated bias field $B'$ once per day. The update is made once the analysis $A(t)$, corresponding to the valid time $t$ of the forecast $F(t - \delta t)$, is available. For each forecast range, the error $F(t - \delta t) - A(t)$ represents the current bias, and it is incorporated into the historical bias field with a weight $w$ of 2% operationally:

$$B' = (1 - w)\,B + w\,\left[F(t - \delta t) - A(t)\right].$$

When a forecast is issued, the most recent historical bias is subtracted from the forecasted value to get the bias corrected forecast. By the nature of the algorithm, the weight of a past forecast in the bias field decreases exponentially with the days elapsed since its assimilation into the historical bias. Each center computes the bias correction against its own analysis. Note that only the variables whose errors are assumed to be normally distributed, such as temperature and wind vector components, are bias corrected. In this study, the bias corrected systems are denoted as MSC-bc, NCEP-bc, and NAEFS-bc, where the latter is the simple combination of the two bias corrected EPS components. Neither particular weights on the EPS components nor variance adjustments have been investigated in this study. According to Johnson and Swinbank (2009), these methods seem to have only weak impacts on the multiensemble performance.
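The on-the-fly update can be sketched as a decaying average. This is a minimal illustration, not the operational code: it assumes the updated bias is the historical bias weighted by 1 − w plus the latest forecast-minus-analysis error weighted by w, applied independently at each grid point. The function names and the two-point "fields" are hypothetical.

```python
def update_bias(historical_bias, forecast, analysis, weight=0.02):
    """Blend the latest forecast error into the historical bias, point by point."""
    return [
        (1.0 - weight) * b + weight * (f - a)
        for b, f, a in zip(historical_bias, forecast, analysis)
    ]


def correct_forecast(forecast, historical_bias):
    """Subtract the most recent historical bias from a newly issued forecast."""
    return [f - b for f, b in zip(forecast, historical_bias)]


# Hypothetical two-point field: the forecast is always 1.0 too warm,
# and the historical bias starts from zero.
bias = [0.0, 0.0]
for _ in range(100):  # one update per day
    bias = update_bias(bias, forecast=[11.0, 21.0], analysis=[10.0, 20.0])

corrected = correct_forecast([11.0, 21.0], bias)
print(bias, corrected)
```

With a constant error e, the bias estimate after n daily updates is e(1 − 0.98ⁿ), so under these assumptions a newly appearing systematic error is only partly removed during the first weeks; this is the exponential memory mentioned in the text.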

## 3. Basic results

No bias correction is considered in this section. In Candille (2009), a detailed comparison between the MSC and NCEP systems has been presented as well as the positive impact of the multiensemble NAEFS—an unweighted combination of the two EPS components—on the quality of the ensemble forecast. Here, a summary of the probabilistic characteristics of each EPS component is presented. Also, the gain obtained with the multiensemble NAEFS is mentioned. No figure or table is shown in this section because the results are consistent with the previous detailed study, despite different verification periods: summer 2007 for Candille (2009) and fall 2008 for the present study.

Considering the three variables validated in this study, the NCEP system has a large normalized bias [Eq. (3)] compared to the MSC system: generally 20% larger (in absolute value) and, in particular, 50% larger in the tropics. Also, the NCEP EPS is substantially underdispersive while the MSC EPS is almost perfectly dispersive. The NCEP predicted ensemble spread only explains about two-thirds of the forecast error, whereas the MSC predicted ensemble spread represents more than 90% of that error, except in the tropics, where the MSC system is clearly overdispersive, especially for the temperature and the geopotential height, for which the ensemble spread is around 25% too large. This overdispersion seems to be due to the use of four different convective schemes, which increases the spread in the tropics where the convection is the strongest. This large overdispersion in the tropics leads to a lack of resolution (or sharpness; i.e., potential skill). In terms of potential skill, the NCEP system performs 11.4% better than the MSC system in the tropics and 3.1% better globally. The better resolution of the NCEP system does not compensate for its higher global bias and lack of spread compared to the MSC system. So, in terms of global skill (i.e., CRPS), the MSC EPS performs 6.6% better than the NCEP system.

By mixing these two EPSs, the resolution should be improved by an increase of the variability of the predicted ensembles, while the reliability should at least not be degraded. In fact, the NAEFS multiensemble improves the ensemble performance compared to the EPS components in terms of global skill, both in reliability (essentially the dispersion part) and in resolution. As expected, the combination of different forecast errors increases the variability of the predicted ensembles. The surprising improvement of the reliability could be explained by the different characteristics of the predicted ensembles produced by the two component systems. First, in most cases, biases of opposite signs are observed and are then balanced in the multiensemble. This tends to reduce the normalized bias (i.e., mean error) and to improve the dispersion by covering a larger spectrum of the forecast errors. But many cases also exist where the different biases are not balanced. In these situations, the improvement of the dispersion could be due to the fact that the MSC spread is of the same order of magnitude as the distance between the ensemble means of the MSC and NCEP systems and that the NCEP spread is half as large (not shown). In other words, the NAEFS and the MSC spreads are quite similar. So, even if no general conclusion can be drawn for the normalized bias, the combination of different biases improves the coverage of the forecast error spectrum.

The NAEFS provides 5.6% and 9.5% global gains compared to the MSC and NCEP systems, respectively. This represents a predictability gain of one or two forecast days. As shown in Candille (2009), the NAEFS improvements are not only due to the increasing ensemble size *N*, from 20 members to 40, but also to the intrinsic advantages of mixing dynamical perturbations of the initial conditions (NCEP) and the multiparameterization approach (MSC). As suggested in Park et al. (2008) and Hagedorn et al. (2008), the improved performance of the multiensemble also comes from the fact that the skill difference between the two EPS components is not too large. Actually, the only situation where the global skill is degraded by the NAEFS, compared to each individual system, is for the geopotential height in the tropics. In that particular case, the NCEP system performs almost 50% worse than the MSC system (while this difference is only 6.6% considering the three variables over the global area), and then the NAEFS degrades the CRPS by 9% with respect to the MSC EPS. In all the other cases (areas and variables), the CRPS differences between the two EPS components do not exceed 15%.

These comparisons seem to show that the multiensemble advantages might mainly be due to the combination of different biases. The next section investigates the effect of the bias correction used in the NAEFS protocol on the EPS components and the multiensemble.

## 4. Bias correction comparisons

This section presents the impacts of the bias correction on the EPS components (NCEP/MSC) and on the multiensemble NAEFS, but also a comparison between the bias correction method (i.e., the bias corrected single ensembles) and the multiensemble approach (i.e., the raw multiensemble).

### a. Impacts on NCEP EPS

In the NCEP system, only one dynamical model is used. Therefore, to correct the bias of this system, only one bias estimate has to be obtained and applied. Thus, the forecast error (*o* − *m*) is reduced by applying this correction while the uncertainty $\sqrt{\sigma_o^2 + \sigma^2}$ is unchanged, so the RCRV (*y*) associated with this bias corrected system (NCEP-bc) is smaller than for the raw NCEP system. Hence, in addition to an expected reduction of the normalized bias [Eq. (3)], the dispersion [Eq. (4)] should also be improved. Figure 1 shows the qualitative comparison, in terms of RCRV characteristics, for the temperature between the NCEP and NCEP-bc systems. For each forecast range, each pressure level, and each regional area (background box for the global area, upper and lower parts of the sphere for the Northern and Southern Hemispheres, respectively, and middle part of the sphere for the tropics), the impact of the bias correction on the NCEP EPS is categorized into three classes, based on the Δ-functions [Eqs. (5)–(8)], as introduced in section 2b: significant improvement (dark gray), significant degradation (light gray), and no significant change (blank). Except in a few cases in the high levels for the first forecast ranges, the normalized bias is significantly improved. This numerically represents a global reduction of 21.5% of the bias [see Eq. (12)]. The impact on the dispersion is not so clear. Globally, and surprisingly, no significant change is observed: the global reduction of the underdispersion is only equal to 1.2%. But a larger positive impact can be noticed in the low levels in the Southern Hemisphere (+6.9%). This weak impact on the dispersion tends to show that the magnitude of the bias correction is small compared to that of the NCEP EPS underdispersion.

Table 1 shows the efficiency, depending on the regional areas, of the bias correction over the three variables (temperature, geopotential height, and zonal wind). The correction is more effective in the tropics and the Southern Hemisphere than in the Northern Hemisphere. This phenomenon could come from the differences between the regional variabilities of the atmospheric flow during the verification period, or most likely from the fact that the initial bias is larger in the tropics and the Southern Hemisphere. Table 1 also shows that the large underdispersion of the NCEP system is only slightly adjusted by the bias correction method used here. As mentioned in Hamill and Whitaker (2007), an explicit calibration of the second moment of the probability distributions defined by the predicted ensembles may be necessary.

Since the spread of the predicted ensembles is not modified, no appreciable change is expected in the resolution, and thus the change in the CRPS should only come from the reliability (see Fig. 1). Figure 2 shows the impact of the bias correction for the temperature on the global skill (CRPS) and the potential skill (resolution). That correction improves the global performance of the system (+3.2%), while no clear tendency is observed for the resolution. A slight degradation is even observed (−0.4%) because of a small but negative impact in the tropics. Why the bias correction can degrade the resolution in a few cases remains unclear. Finally, it seems that both bias and underdispersion reductions lead to a better global skill (+2.6% over the three variables) while the potential skill is approximately constant (+0.8%). Another noticeable fact is that the impact on the zonal wind is small compared to the temperature and the geopotential height (not shown): only +0.2% for the CRPS and +3.6% for the reduction of the bias. This is simply because the bias for the wind is already small in the raw system.

Looking at details in Figs. 1 and 2, some boxes seem to show contradictory results. For example, considering the bias (Fig. 1, top) of the temperature at 925 hPa for the first forecast range (T925 *D* + 1), a global improvement is observed while no change is noticed in the hemispheres and a degradation is observed in the tropics. This is due to the fact that the diagnoses for each regional area are independently performed, and as mentioned in section 2b, except for the CRPS, the comparison functions are not linear [Eqs. (6)–(8)]. Moreover, for each independent diagnosis, the random draw of the bootstrap procedure is different. Thus, even for the CRPS, seemingly contradictory results may be observed, because the 5%–95% confidence intervals do not have linear behavior. Table 2 shows an explicit numerical example of this point for the case T500 *D* + 6 from Fig. 2 (top). Equation (10) is obviously verified. But the improvement observed for the global area is not significant, even though it is in the Northern Hemisphere.
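The bootstrap procedure itself is not detailed in this section. As a hedged illustration only, the sketch below uses a generic percentile bootstrap on per-case score differences and classifies the result with a 5%–95% interval; the function names, the number of resamples, and the sample values are assumptions, not the paper's verification package.

```python
import random


def bootstrap_interval(diffs, n_resamples=1000, lower=0.05, upper=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-case score differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample the cases with replacement and store the resampled mean.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(lower * (n_resamples - 1))]
    hi = means[int(upper * (n_resamples - 1))]
    return lo, hi


def classify(diffs, **kwargs):
    """Label a comparison according to whether the interval excludes zero."""
    lo, hi = bootstrap_interval(diffs, **kwargs)
    if lo > 0.0:
        return "improvement"
    if hi < 0.0:
        return "degradation"
    return "no change"


# Hypothetical per-case differences (positive: the second system is better).
print(classify([0.2, 0.1, 0.3, 0.4]))
```

Because each regional area is resampled independently, a difference can be classified as significant in one hemisphere and as "no change" for the global area even when the underlying score is linear.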

### b. Impacts on MSC EPS

In the MSC system, the 20 members, which use 20 different combinations of parameterizations with the same dynamical core model, are individually bias corrected. Thus, each bias corrected member tends to converge toward the same reference value (i.e., the most recent bias corrected field estimation), and the members become closer to one another. In addition to a reduction of the forecast error (*o* − *m*) [actually around 1% of the root-mean-square error (RMSE); not shown], this process also leads to a reduction of the spread of the predicted ensemble: 2.1% for the temperature, 1.2% for the geopotential height, and almost no change for the zonal wind (not shown). This thus leads to a reduction of the uncertainty $\sqrt{\sigma_o^2 + \sigma^2}$. The behavior of the RCRV (*y*) and its first two moments is therefore not foreseeable in this situation. Figure 3 (top) shows for the temperature that the normalized bias is globally improved, except for the higher level. The correction leads to a global reduction of 5.2% of the bias. On the other hand, the dispersion (Fig. 3, bottom) is globally degraded (−1.2%). In the situations where the MSC EPS is slightly underdispersive, the correction tends to increase that underdispersion. But, in the tropics, the bias correction reduces the large overdispersion and has a large positive impact on the score (+10.1%).

Also, a weak but positive impact is observed (not shown) on the global skill (+1.1% over the three variables) and the potential skill (+0.7%). Note that the reduction of the overdispersion in the tropics, providing a more reliable ensemble by reducing the ensemble spread, leads to a marked improvement of the resolution in this area (+4.6%).

Finally, when the two bias corrected systems are compared (not shown), the normalized bias differences over the three variables (i.e., the distance between the two normalized biases) are greatly reduced from 20% to 7% (still an advantage for the MSC system). This reduction is even larger in the tropics where these differences in scores go from 50% to 10% (not shown). For the other scores—dispersion, CRPS, and resolution—these differences are comparable to the ones noticed for the raw systems. For instance, the CRPS advantage for the MSC EPS is slightly reduced from 6.6% to 4.7% (not shown). This means that, despite the application of the bias correction, the main typical differences between the two EPS components remain the same.

In conclusion, the bias correction is obviously effective in reducing the normalized bias, especially when the ensemble spread is not modified in the process (NCEP system). Otherwise, its impact on the dispersion is weak, except in the overdispersion situation observed in the tropics for the MSC system. In this case, the correction also leads to a larger improvement of the potential skill. This is the only situation where that correction has a substantial impact on the global or the potential skill of the EPSs.

### c. Impacts on NAEFS

In the NAEFS context, the simple combination of the raw EPS components improves the global and the potential skills, but also the reliability, especially the dispersion part. The latter result is a bit surprising, considering the large underdispersion of the NCEP system and the almost perfect dispersion of the MSC system. As mentioned in section 3, this suggests that the NAEFS dispersion improvement could come from the combination of the two different biases (of opposite signs or not). By extension, the combination of different biases could explain the success in the skill of the multiensemble. The raw multiensemble and the multiensemble combining the bias corrected EPS components, called the bias corrected multiensemble and denoted NAEFS-bc, are compared to quantify the role of the combination of different biases in the multiensemble’s success.

First, the characteristics of the multiensembles’ (NAEFS and NAEFS-bc) spread are presented. The bias correction results in a noticeable reduction of the distance between the ensemble means of the two EPS components: 20% for the temperature, 8% for the geopotential height, and only 1% for the zonal wind (not shown). On the other hand, the ensemble spreads remain nearly constant [i.e., no change for the NCEP EPS by construction and less than 2% reduction for the MSC system (not shown)]. These two facts result in a decrease of the NAEFS-bc spread: 4% and 2% for the temperature and the geopotential, respectively, and less than 1% for the zonal wind (not shown).
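The link between the distance separating the component ensemble means and the multiensemble spread follows from a standard identity: when two ensembles of equal size are merged, the pooled (population) variance is the average of the two variances plus the square of half the distance between the two means. The sketch below checks this identity on hypothetical members; it illustrates the mechanism only and does not reproduce the reported percentages.

```python
import statistics


def pooled_variance(mean_1, var_1, mean_2, var_2):
    """Population variance of two equal-size ensembles merged into one."""
    return 0.5 * (var_1 + var_2) + (0.5 * (mean_1 - mean_2)) ** 2


# Hypothetical members of two component ensembles of equal size.
ens_1 = [0.0, 2.0, 4.0, 6.0]
ens_2 = [5.0, 6.0, 7.0, 8.0]

direct = statistics.pvariance(ens_1 + ens_2)
formula = pooled_variance(
    statistics.fmean(ens_1), statistics.pvariance(ens_1),
    statistics.fmean(ens_2), statistics.pvariance(ens_2),
)
print(direct, formula)

# With unchanged component spreads, bringing the means closer together
# reduces the spread of the merged ensemble.
print(pooled_variance(0.0, 1.0, 1.0, 1.0), pooled_variance(0.0, 1.0, 0.8, 1.0))
```

With component spreads left nearly unchanged, any reduction of the distance between the ensemble means therefore lowers the multiensemble spread, which is the behavior described for NAEFS-bc.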

Figure 4 shows the impact of the bias correction on the two first moments of the RCRV for the multiensemble NAEFS (temperature). The normalized bias [Eq. (3)] is generally reduced (+5.1%), while the dispersion [Eq. (4)] is globally degraded (−3.6%), because of the decreasing multiensemble spread in an underdispersive situation (i.e., everywhere but in the tropics). Note that similar results are observed for the geopotential and that there is no significant change for the zonal wind. The dispersion degradation shows that once the bias differences are partially removed, the NAEFS multiensemble no longer represents the full spectrum of the forecast errors so well. This means that the difference of the biases between the NCEP and MSC systems characterizes one of the aspects of the multiensemble success. But, on the other hand, the reduction of the normalized bias in NAEFS-bc shows that the combination of forecast errors increases the mean error (i.e., the normalized bias). This means that the different biases are not fully balanced and their combination can also cause a degradation of the quality of the ensemble system. In the tropics, as in the MSC case, the reduction of the NAEFS spread adjusts the overdispersion and the dispersion is improved by 9.7% and 13.4% for the temperature and the geopotential, respectively.

Figure 5 shows the effects of the bias correction on the global and potential skills of the multiensemble (temperature). Generally, the changes, while statistically significant, are very small. The improvements only represent 0.3% and 0.2% for the CRPS and the resolution, respectively. Except for the higher level (250 hPa), the improvements are essentially observed for the first four forecast days. These are equal to 1.3% for the CRPS and 1.1% for the resolution while these changes only represent 0.2% and less than 0.1%, respectively, over the other ranges. Despite a partial removal of the bias differences, the correction does not affect the statistical variability or the global skill of the multiensemble. The intrinsic quality of the NAEFS does not only depend on the combination of the different biases, but also on the variability of the spreads of the EPS components. In the tropics, the correction of the overdispersion leads to a positive impact on the resolution (+4.1%). Over the three variables, the global skill improvement is 0.6% with a peak at 7.5% in the tropics.

Finally, the bias correction has a weak impact on the multiensemble, except in the tropics because of the large MSC overdispersion in temperature. The global skill, the resolution, and consequently the global reliability [see Eq. (1)] are not changed despite the fact that the bias correction largely reduces the distance between the ensemble means and partially removes the difference between the biases of the EPS components. The success of the multiensemble NAEFS does not only come from the combination of different forecast errors.

### d. Bias correction versus multiensemble

To obtain fair comparisons between the bias correction method and the combined system approach, a raw reduced multiensemble NAEFS, with 20 members in total, is considered in this subsection. This EPS (denoted NAEFS redux) is built by mixing 10 members of the MSC system and 10 members of the NCEP one, which are randomly drawn at each realization of the prediction process (Candille 2009). In this subsection, graphics summarizing the relative differences—expressed in percentage [see Eqs. (11) and (12)]—for each variable and each regional area are presented. Positive values correspond to an advantage of the multiensemble approach. Figure 6 (top and bottom, respectively) shows the relative differences between each bias corrected EPS component and NAEFS redux in terms of normalized bias and dispersion. Globally, the bias correction method has a greater positive impact on the normalized bias than the combination of the forecast errors. This impact is particularly large in the tropics and Southern Hemisphere for both center systems. But in a few cases, the multiensemble approach provides a better bias correction. This is observed in the Northern Hemisphere for the three variables in the NCEP case and for the temperature and the geopotential in the MSC case. Also, for the zonal wind in the tropics, the multiensemble presents a reduced bias compared to the bias corrected EPSs. For the dispersion, the advantage of the multiensemble is outstanding in the NCEP case (between 25% and 30%) compared to the differences observed in the MSC case. This is not really surprising considering the fact that the bias correction barely reduces the underdispersion of the NCEP system (see section 4a) and slightly modifies the dispersion of the MSC EPS. Once again, the large overdispersion, especially for the temperature, of the MSC system in the tropics is noticeable in these results. The bias correction better adjusts this defect than the multiensemble in the MSC case. 
Also, in the NCEP case, the bias correction provides a better dispersion for the temperature in the tropics, which shows how strongly the large overdispersion of the MSC system dominates the multiensemble.
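The construction of the reduced multiensemble can be sketched as a random draw without replacement from each component. The member labels and the function name below are hypothetical; only the sizes (10 members out of 20 from each center, redrawn at each realization) come from the text.

```python
import random


def draw_reduced_multiensemble(msc_members, ncep_members, n_per_center=10, rng=None):
    """Randomly pick n_per_center members from each component and merge them."""
    rng = rng or random.Random()
    return rng.sample(msc_members, n_per_center) + rng.sample(ncep_members, n_per_center)


# Hypothetical member labels standing in for the forecast fields.
msc = [f"msc_{i:02d}" for i in range(20)]
ncep = [f"ncep_{i:02d}" for i in range(20)]

# A new draw would be made for every realization; the seed is fixed here
# only to make the example repeatable.
redux = draw_reduced_multiensemble(msc, ncep, rng=random.Random(42))
print(len(redux))
```

Keeping the reduced multiensemble at 20 members separates the effect of mixing the two systems from the effect of simply doubling the ensemble size.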

Figure 7 (top and bottom, respectively) shows the same comparisons in terms of global (CRPS) and potential (resolution) skills. Except for the tropical case, the multiensemble provides better performances than the bias corrected EPSs: +3.2% and +6% (global CRPS) compared to the bias corrected MSC and NCEP systems, respectively. Note that these improvements are even larger when the full raw multiensemble (i.e., with 40 members in total) is compared to the bias corrected EPS components MSC-bc and NCEP-bc: +5% and +7.6%, respectively (not shown). This increase is approximately of the same order of magnitude as the CRPS change when the ensemble size increases from 20 to 40 members (Candille 2009).

Despite a worse normalized bias in the sensitive areas (tropics and Southern Hemisphere), the multiensemble approach performs better than the bias corrected systems. In the NAEFS framework, the raw combination of the two systems appears to be the best ensemble option because it requires no postprocessing, even though the bias correction on the fly computed in that process is not very expensive. The main characteristics of the improvements over the EPS components that result from the raw multiensemble (section 3) remain similar when the bias corrected EPS components and multiensemble are considered (see Table 3).

These results confirm the weak differences between the NAEFS and the NAEFS-bc noticed in section 4c. The small amplitude of the bias correction impact on the multiensemble is even more obvious when it is presented (Fig. 8) in the same way as the other comparisons of this subsection. Except in the critical area of the tropics, the differences between the NAEFS and the NAEFS-bc are barely noticeable in the global and potential skills compared to the gains observed in Fig. 7. For the tropics, it has been shown in sections 3, 4b, and 4c that the improvements observed in the NAEFS multiensemble are entirely due to the fact that the large overdispersion of the MSC EPS is efficiently reduced by the bias correction method. This means that in this multiensemble context, the efficiency of the bias correction is only noticeable when it is able to correct a large deficiency from one of the EPS components (here, the MSC overdispersion in the tropics).

## 5. Summary and concluding remarks

Previous studies of raw ensemble prediction systems and multiensembles have shown the substantial benefit of the multiensemble approach. One can wonder whether these advantages are due to the cancellation of opposing biases (i.e., different forecast errors coming from different model errors). In the operational NAEFS context, a bias correction “on the fly” is applied to the NCEP model and to the 20 parameterizations of the MSC core model to remove these forecast errors. That correction is effective on the single EPSs: the normalized bias is reduced, as expected, and the global skill (CRPS) and the potential skill (resolution) are also improved. This is especially noticeable in the tropics, where the lack of skill and the mean error are greater than in the Northern and Southern Hemispheres.

The weak differences observed in the general case between the raw and the bias corrected NAEFS multiensembles mean that the main “ingredients” for probabilistic gains are already present in the combination of the uncorrected EPSs. The multiensemble success is basically due to the reduction of the mean error (by the cancellation of the opposing biases), to the larger coverage of the forecast errors, and especially to the increased variability of the predicted ensembles. In the NAEFS context, the bias correction partially removes the differences between the forecast errors from each EPS component. This leads to a reduction of the mean error, but also to a reduction of the forecast error coverage, without affecting the global skill and the statistical variability of the multiensemble system (a slight improvement is even noticed). This shows three points:

- The cancellation of the opposing biases in the NAEFS is not systematic. The reduction of the mean error is not optimal by simply mixing different biases, even if they are of opposite signs in most of the cases.
- The distance between the different biases in the NAEFS actually provides an adequate coverage of the forecast errors (i.e., a better dispersion).
- The bias correction substantially decreases the distances between the forecast errors from each EPS component. But the statistical variability of the predicted ensembles in the NAEFS is not changed, meaning the combination of the different ensemble spreads essentially explains the variability of the NAEFS multiensemble.
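
The first two points can be illustrated with a toy calculation. When two sub-ensembles of equal size are pooled, the pooled mean error is the average of the two biases, so the cancellation is complete only if the biases are equal and opposite, and the pooled variance is the average of the two variances plus a quarter of the squared distance between the biases. The member errors below are hypothetical values chosen for illustration, and `pooled_stats` is a name introduced for this sketch; it is not a diagnostic used in the study.

```python
import numpy as np

def pooled_stats(errors_a, errors_b):
    """Mean and population variance of the pooled member errors."""
    pooled = np.concatenate([errors_a, errors_b])
    return pooled.mean(), pooled.var()

# Hypothetical member errors (forecast minus observation) for two centers
# with biases of opposite signs but different amplitudes.
errors_a = np.array([0.6, 1.0, 1.4])    # bias +1.0
errors_b = np.array([-1.0, -0.5, 0.0])  # bias -0.5

mean_pooled, var_pooled = pooled_stats(errors_a, errors_b)

# Identity for equal-size groups: average variance plus the contribution
# of the distance between the two biases.
bias_gap = errors_a.mean() - errors_b.mean()
var_identity = 0.5 * (errors_a.var() + errors_b.var()) + 0.25 * bias_gap**2

print(mean_pooled)               # residual bias after partial cancellation
print(var_pooled, var_identity)  # the two values agree
```

In this toy case the residual bias is 0.25 rather than zero, and most of the pooled variance comes from the distance between the two biases, which is the "coverage" term that a bias correction reduces.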

The improvements obtained by the bias correction on the fly and by the system upgrades of each center can be compared, even if the purposes of the two processes are different: the calibration can be seen as a statistical adaptation that resolves phenomena that cannot be resolved by the model upgrade. The efficiency of the bias correction on the MSC system in the current study is therefore compared to the improvement due to the last major MSC EPS upgrade (Charron et al. 2010), as observed in the previous study (Candille 2009). The two verification periods are different (summer 2007 and fall 2008), but the two studies are statistically comparable in terms of the number of realizations of the MSC EPS. Globally, the bias correction improves the global skill (CRPS) by 1.1% and 0.6% for the MSC system and the NAEFS, respectively, while the summer 2007 MSC upgrade improved the CRPS by 4.6% and 2.2%, respectively (not shown). That major upgrade thus provided a gain about 4 times greater than the gain provided by the bias correction applied in the operational NAEFS process. The upgrade improvement is very similar to the one due to the raw multiensemble NAEFS (globally 4.3% in CRPS compared to the MSC EPS).

Finally, it has been shown that the bias correction, as performed in the NAEFS operational process, generally has a weak impact on the EPS performance compared to the multiensemble approach. The bias correction does not fundamentally change the quality of the multiensemble but is able to patch a lack of skill of the system when a large deficiency exists, as for the MSC overdispersion in the tropics. The bias correction can then be seen as protection against these kinds of deficiencies. The weak impact on the general quality of the EPSs does not mean that this particular calibration is useless, but it may be obsolete in this multiensemble context, since postprocessing methods that target the dispersion characteristics could be investigated in the NAEFS context. On the other hand, the multiensemble can be seen as a free calibration since almost no postprocessing is needed, which is a clear advantage in operational frameworks. Installing a reliable data transfer of model outputs remains a challenge, although the related technical problems are becoming less of an issue with technological advances. In these particular studies of the NAEFS [the current one and Candille (2009)], the multiensemble gains are even comparable to those of a major MSC system upgrade, which may be the result of a few years of research and development work.

## Acknowledgments

The authors thank Renate Hagedorn and Peter Houtekamer for fruitful exchanges. They also thank the anonymous reviewers for their constructive comments and suggestions.

## REFERENCES

Buizza, R., P. L. Houtekamer, Z. Toth, G. Pellerin, M. Wei, and Y. Zhu, 2005: A comparison of the ECMWF, MSC, and NCEP global ensemble prediction systems. *Mon. Wea. Rev.*, **133**, 1076–1097.

Candille, G., 2009: The multiensemble approach: The NAEFS example. *Mon. Wea. Rev.*, **137**, 1655–1665.

Candille, G., C. Côté, P. L. Houtekamer, and G. Pellerin, 2007: Verification of an ensemble prediction system against observations. *Mon. Wea. Rev.*, **135**, 2688–2699.

Charron, M., G. Pellerin, L. Spacek, P. L. Houtekamer, N. Gagnon, H. L. Mitchell, and L. Michelin, 2010: Toward random sampling of model error in the Canadian ensemble prediction system. *Mon. Wea. Rev.*, **138**, 1877–1901.

Cui, B., Z. Toth, Y. Zhu, and D. Hou, 2008: Statistical downscaling approach and its application. Preprints, *19th Conf. on Probability and Statistics*, New Orleans, LA, Amer. Meteor. Soc., 11.2.

Efron, B., and R. Tibshirani, 1993: *An Introduction to the Bootstrap*. Chapman & Hall, 436 pp.

Elms, J., 2003: WMO catalogue of radiosondes and upper-air wind systems in use by members in 2002 and compatibility of radiosonde geopotential measurements for period from 1998 to 2001. Instruments and Observing Methods, Rep. 80, TD 1197, World Meteorological Organization, 12–21.

Hagedorn, R., T. M. Hamill, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part I: Two-meter temperatures. *Mon. Wea. Rev.*, **136**, 2608–2619.

Hamill, T. M., and J. S. Whitaker, 2007: Ensemble calibration of 500-hPa geopotential height and 850-hPa and 2-m temperatures using reforecasts. *Mon. Wea. Rev.*, **135**, 3273–3280.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. *Wea. Forecasting*, **15**, 559–570.

Johnson, C., and R. Swinbank, 2009: Medium-range multimodel ensemble combination and calibration. *Quart. J. Roy. Meteor. Soc.*, **135**, 777–794.

Palmer, T. N., and Coauthors, 2004: Development of a European multimodel ensemble for seasonal-to-interannual prediction (DEMETER). *Bull. Amer. Meteor. Soc.*, **85**, 853–872.

Park, Y-Y., R. Buizza, and M. Leutbecher, 2008: TIGGE: Preliminary results on comparing and combining ensembles. *Quart. J. Roy. Meteor. Soc.*, **134**, 2029–2050.

Stanski, H. R., L. J. Wilson, and W. R. Burrows, 1989: Survey of common verification in meteorology. World Weather Watch Rep. 8, TD 358, World Meteorological Organization, Geneva, Switzerland, 114 pp.

Talagrand, O., R. Vautard, and B. Strauss, 1997: Evaluation of probabilistic prediction systems. *Proc. Workshop on Predictability*, Reading, Berkshire, United Kingdom, ECMWF, 1–26.

Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. *Mon. Wea. Rev.*, **125**, 3297–3319.

Toth, Z., O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and ensemble forecasts. *Forecast Verification: A Practitioner’s Guide in Atmospheric Science*, I. T. Jolliffe and D. B. Stephenson, Eds., John Wiley & Sons, 137–163.

Wei, M., Z. Toth, R. Wobus, and Y. Zhu, 2008: Initial perturbations based on the ensemble transform (ET) technique in the NCEP global operational forecast system. *Tellus*, **60A**, 62–79.

Summary of the impacts [Eq. (12)] of the bias correction on the normalized bias [Eq. (3)] and the dispersion [Eq. (4)] of the NCEP EPS depending on the regional area.

Impacts of the bias correction on the NCEP EPS. CRPS for temperature at 500 hPa and forecast range 6. Here, nb is the number of verifications for each regional area, Δ^{c} is the comparison function from Eq. (5), and Binf and Bsup are, respectively, the inferior and the superior limits of the 5%–95% confidence interval associated with Δ^{c} (reminder: the difference is significant when the two limits have the same sign).