## Abstract

This study experimented with a unified scheme of stochastic physics and bias correction within a regional ensemble model [Global and Regional Assimilation and Prediction System–Regional Ensemble Prediction System (GRAPES-REPS)]. It is intended to improve ensemble prediction skill by reducing both random and systematic errors at the same time. Three experiments were performed on top of GRAPES-REPS. The first experiment adds only the stochastic physics. The second experiment adds only the bias correction scheme. The third experiment adds both the stochastic physics and bias correction. The experimental period is one month from 1 to 31 July 2015 over the China domain. Using 850-hPa temperature as an example, the study reveals the following: 1) the stochastic physics can effectively increase the ensemble spread, while the bias correction cannot. Therefore, ensemble averaging of the stochastic physics runs can reduce more random error than the bias correction runs. 2) Bias correction can significantly reduce systematic error, while the stochastic physics cannot. As a result, the bias correction greatly improved the quality of ensemble mean forecasts but the stochastic physics did not. 3) The unified scheme can greatly reduce both random and systematic errors at the same time and performed the best of the three experiments. These results were further confirmed by verification of the ensemble mean, spread, and probabilistic forecasts of many other atmospheric fields for both upper air and the surface, including precipitation. Based on this study, we recommend that operational numerical weather prediction centers adopt this unified scheme approach in ensemble models to achieve the best forecasts.

## 1. Introduction

Ensemble prediction has become one of the most important components of numerical weather prediction (NWP) (Buizza et al. 2018). Many methods have been proposed for perturbing initial conditions and models (e.g., Tracton and Kalnay 1993; Houtekamer et al. 1996; Chen et al. 2002; Shutts 2005; Ma et al. 2008, 2015; Berner et al. 2009; Kazuo et al. 2012; Ollinaho et al. 2017; Feng et al. 2014, 2018). For a complete review of current ensemble methods, readers are referred to Du et al. (2018). All NWP modeling systems produce two kinds of forecast errors: random and systematic errors. The ensemble methods are designed to deal with random but not the systematic errors (Du 2007). Since systematic error (bias) will inversely impact the quality of ensemble forecasts and the capability of truly assessing an ensemble prediction system (EPS; Wang et al. 2018), it would be ideal if model bias can also be reduced or removed in an EPS. Currently, the removal of model’s systematic bias is usually achieved by statistical methods and done separately and independently as a post processing step following model integration (e.g., Gneiting et al. 2005; Raftery et al. 2005; Monache et al. 2006; Bakhshaii and Stull 2009; Cui et al. 2006, 2012; Du and Zhou 2011; Li et al. 2011). In other words, the treatment of random and systematic errors cannot be done at the same time using a single unified scheme within an ensemble model in current NWP, which prevents bias-corrected fields from being used in some downstream applications (see the discussions of the next paragraph and section 3c).

Very recently, we introduced a three-dimensional wholesale-like dynamical method able to remove systematic model bias for all variables at the same time, as an integral part of model integration (Chen et al. 2019, manuscript submitted to *Quart. J. Roy. Meteor. Soc*., hereafter C19). As we discussed in C19, the advantage of this dynamical bias correction approach is threefold: 1) convenience (the two steps of “model integration and postprocessing” become one step “model integration”), 2) improvement in forecast products derived from multiple variables (due to consistency among the variables), and 3) the capability of initializing a downstream model (including two-way nested regional modeling) with dynamically consistent bias-corrected fields. This approach also makes it possible to deal with both random and systematic errors in a unified scheme during model integration. In this study, we will design and test such a unified scheme by applying the method of C19 and a stochastic physics scheme at the same time to an EPS, aiming to reduce the systematic and random errors simultaneously. To better understand how the unified scheme works, we will compare the performances of each of the components involved. Specifically, the stochastic perturbed parameterization tendency component (SPPT) alone will first be implemented on top of a base EPS (control experiment) to examine how it impacts the performance of the EPS. Then the bias correction component alone will be implemented. Finally the two components will be implemented together as a unified scheme (i.e., the bias correction will be performed before the SPPT is performed).^{1} The rest of this paper is organized as follows. Section 2 describes the model, experiment design, and data. The results are presented in section 3. A summary is given in section 4.

## 2. Model, experiment design, and data

### a. The base model (GRAPES) and the control EPS (GRAPES-REPS)

The base model in this study is a regional version of the Global and Regional Assimilation and Prediction System (GRAPES), which is developed at the Numerical Weather Prediction Center of the China Meteorological Administration (CMA; Chen et al. 2008). The main features of GRAPES include a fully compressible dynamical core with nonhydrostatic approximation, a semi-implicit and semi-Lagrangian scheme for time integration, and a height-based terrain following coordinate. The model physics includes Rapid Radiative Transfer Model (RRTM) longwave radiation (Mlawer et al. 1997), Dudhia shortwave radiation (Dudhia 1989), WSM-6 microphysics (Hong and Lim 2006), Noah land surface model (Mahrt and Ek 1984), the MRF PBL scheme (Hong and Pan 1996), and Monin–Obukhov surface layer scheme (Noilhan and Planton 1989). Model analysis is produced by a three-dimensional variation data assimilation scheme (Zhuang et al. 2014).

The control experiment (CTL) is the GRAPES-based regional EPS (GRAPES-REPS), which has been running operationally at CMA since August 2014 (Zhang et al. 2014). It has 15 members (1 control and 14 perturbed members) covering the China domain (15°–64.35°N, 70°–145.15°E). The horizontal resolution is 15 km with 51 vertical levels (model top is 10 hPa). It runs twice a day (initialized at 0000 and 1200 UTC) out to 72 h of forecast length with 6-hourly output (model integration time step is 60 s). The lateral boundary conditions (LBCs) and initial conditions (ICs) of the GRAPES-REPS members are diversely provided (directly downscaled) from the different members of the GRAPES global EPS (Ma et al. 2008) that also runs operationally at CMA. The initial condition perturbations of the GRAPES global EPS are generated by the breeding vector (Toth and Kalnay 1997). GRAPES_REPS applies a multiphysics approach (Stensrud et al. 2000; Du et al. 2003) for its physics perturbation through a combination of two boundary parameterization and four convective cumulus parameterization schemes (Table 1). The GRAPES-REPS will be used as a reference (CTL) for the three other experiments (Table 2) in the comparisons.

### b. Stochastic physics experiment (GRAPES-REPS-SPPT)

The first experiment (Exp. 1) is the stochastically perturbed parameterization tendency (SPPT). SPPT is a commonly used approach for perturbing model physics (e.g., Buizza et al. 1999, 2018; Christensen et al. 2015). Yuan et al. (2016) applied the SPPT (Charron et al. 2010) to the GRAPES-REPS as follows:

Equation (1) is the model integration equation. Here, *A* is a state variable of ensemble member *j* at integration time *t*, where *j* = 0, 1, 2, ..., 14, and *j* = 0 represents the control forecast; $B^l\u2061(\theta 0,0)$ is the dynamic tendency, and *P* is for the physics tendency.

In the SPPT scheme, the physics tendency is multiplied by a random field as follows:

where *R*_{j}(*λ*, *ϕ*, *t*) gives the random noise for member *j*. Here, *R*_{j}(*λ*, *ϕ*, *t*) is described by the first-order Markov chain with spherical harmonics expansion, and has a time–space correlated continuous and smooth horizontal structure, defined as

In Eq. (3), $r\xaf$ is the mean of *R*(*λ*, *ϕ*, *t*) and *α*_{l,m}(*t*) is the spectral coefficient as a function of *t*. The longitude and latitude are represented by *λ* and *ϕ*, respectively. The spherical harmonics is expressed by *Y*_{l,m}, where *l* and *m* indicate the total horizontal and latitudinal wavenumbers, respectively. The term *L* is the spatial truncation scale (i.e., wavenumber) of the random field [*L* = 28 in this study, adopted from Charron et al. (2010)]. The evolution of *α*_{l,m}(*t*) is obtained by the first-order Markov chain:

In Eq. (4), Δ*t* is the model integration time step (60s in this case). The term *τ* is the time scale of temporal correlation for the random field [*τ* = 6 h in this study; refer to Yuan et al. (2016)]. The term *r*_{l,m}(*t*) is a Gaussian distribution with a mean of 0 and variance of 1. Here, *σ* is the standard deviation of *R*(*λ*,*ϕ*, *t*), which is also a Gaussian distribution (i.e., the uncertainty distribution of a particular parameter) [*σ* = 0.27 in this study, adopted from Charron et al. (2010)].

As is described in Li et al. (2008), one often needs to keep a perturbed parameter within some specified bounds. Thus, a stretching or dilation transformation $\chi \u2061(R,r\xaf)$ is applied to *R*(*λ*,*ϕ*, *t*) to obtain a new random field *R*′(*λ*, *ϕ*, *t*),

which can ensure that the new random field lies within specified bounds and has the ability to modify its probability density function. The dilation transformation is

Here, $r\xaf=\u2061(Rmax\u2032+Rmin\u2032)/2$, and $Rmax\u2032$and $Rmin\u2032$ are the upper and lower thresholds for *R*′(*λ*, *ϕ*, *t*), respectively. In this case, $Rmax\u2032$and $Rmin\u2032$ are set to 1.8 and 0.2, respectively, following Yuan et al. (2016). Here, *β* is a constant equal to −1.27 [as in Li et al. (2008)]. Thus, the SPPT scheme is implemented by multiplying the physics tendency term *P* by *R*′(*λ*, *ϕ*, *t*) as follows:

This is applied to four model state variables (zonal wind *u*, meridional wind *υ*, potential temperature *θ*, and specific humidity *q*) for the GRAPES model.

### c. Bias correction experiment (GRAPES-REPS-LTBC)

The second experiment (Exp. 2) is the linear tendency bias correction (LTBC). This method subtracts a linear bias tendency directly in the model tendency equation at each time step during model integration (C19):

The $B^l\u2061(Sj,t)$ is the linear bias tendency estimated from past forecasts:

Linear regression is used to estimate the bias increment $b^$ over a time window Δ (in hours) by linearly fitting all bias values within the window. The *δt* is the model integration time step (in seconds). As discussed in C19, for the feasibility of operational implementation as well as the similarity in biases for all members, the control member’s bias is used for the 14 perturbed members (i.e., *j* = 0). For a detailed description of the LTBC and its test results, readers are referred to C19 where three different experiments were carried out. For the reader’s convenience, an example of the estimation of the linear bias tendency as well as a performance comparison to a statistical bias correction method is provided in the appendix. Since the first experiment of C19 is the simplest with a good performance, we will use that same experimental setup in this study. Specifically, the bias correction step is only enforced in the potential temperature field *θ* in the model (i.e., *S* = *θ*) due to its dominance in the GRAPES-EPS bias error (the detailed discussions about the cause and the impact can be found in C19. The bias linear tendency of potential temperature is estimated from a 72-h window [with 13 (total) 6-hourly bias values in it; see the appendix] and is used for all time steps (i.e., *t* = 0) throughout the model integration [i.e., $Bl^\u2061(Sj,t)=B^l\u2061(\theta 0,0)$]. Therefore, the second experiment can be described by

The other state variables are the same as in the CTL EPS.

### d. Experiment combining stochastic physics and bias correction (GRAPES-REPS-LTBC-SPPT)

The third experiment (Exp. 3) is the combination of Exp. 1 and Exp. 2 to test the combined effects of stochastic physics and bias correction on EPS performance. Specifically, the bias correction is performed before the stochastic physics step:

where $B^l\u2061(Sj,t)=Bl^\u2061(\theta 0,0)$ for *θ* and $B^l\u2061(Sj,t)=0$ for other model state variables (i.e., the other state variables are the same as Exp. 1). Table 2 is a summary of the control EPS and three experiments performed in this study.

### e. Data

The forecast experiments were carried out over the period of a month (1–31 July 2015). The bias tendency is calculated from the prior bias. The prior bias was approximated by the average forecast error of the most recently available past 10-day forecasts. For example, the average errors of the 19–28 June forecasts are regarded as the prior biases for the 1 July forecast (6-hourly outputs during a 72-h forecast length). As mentioned above, only the control member’s bias is calculated and used for the 14 perturbed members. The reason for using a 10-day period to estimate bias is that it is not too short to miss the main features of systematic error, and it is not too long to completely filter out flow-dependent error. It should be beneficial to retain some recent flow-dependent bias information in the bias tendency, given that model bias is regime dependent (Du and DiMego 2008).

Since an analysis is the optimal estimation of truth a model can produce, the goal of the grid-to-grid model bias correction is to mimic its own model analysis, although the analysis itself may contain bias with respect to an observation (Privé et al. 2013).^{2} Therefore, the GRAPES analysis is used as truth for verifying the upper air temperature, geopotential height, wind (*u* and *υ*), and surface temperature and wind (2-m temperature and 10-m *u* and *υ*). The CMA Multisource merged Precipitation Analysis System version 2.1 (CMPAS-V2.1) (Pan et al. 2015) is used as truth for verifying precipitation. All the verification results presented in section 3 are averaged over these 31 days and the model domain (15°–64.35°N, 70°–145.15°E). Only the verification of 0000 UTC cycle forecasts is presented since the result from the 1200 UTC cycle is similar.

The following metrics are employed in the verification: root-mean-square error (RMSE), systematic error (bias), random error, spread, spread–skill relationship (consistency, defined as the ratio of RMSE to spread), continuous ranked probability score (CRPS), reliability curve, and outlier for basic variables; and Brier score (BS) and area of relative operating characteristics (AROC) for precipitation. For a review of ensemble verification, readers are referred to Jolliffe and Stephenson (2003) and Du and Zhou (2017).

## 3. Results

### a. Changes in ensemble spread and forecast error components

For the purpose of demonstration, 850-hPa temperature is used here. Figure 1 shows the improvements to the ensemble mean forecasts and ensemble spread for the three experiments (the blue curve) compared to the control EPS (the red curve). It clearly shows that the stochastic physics SPPT boosted the ensemble spread but only slightly improved the ensemble mean forecast accuracy (Fig. 1a). In contrast, the bias correction LTBC greatly reduced the ensemble mean forecast RMSE but made little change in the ensemble spread (Fig. 1b). When the stochastic physics and bias correction were combined, the unified scheme took the advantages from the both methods and noticeably improved both the ensemble mean forecast accuracy and ensemble diversity (Fig. 1c). This makes the ensemble spread and the ensemble mean forecast error about 20%–40% closer to each other than in the control EPS (cf. the control EPS and Exp. 3 in Fig. 1c): an improvement in the spread–skill relationship although the underdispersion is still apparent.^{3} The spread increase (by the SPPT and the unified scheme) and the RMSE reduction (by the LTBC and the unified scheme) over the control EPS are statistically significant at the 99.99% level (Student’s *t* test).

Figure 2 further analyzed the detailed aspects of the error reduction. The top panel (Figs. 2a1–a3) is the total error (RMSE) of the ensemble mean forecast from the three experiments (the blue curve) compared to that of the control EPS (the red curve), the middle panel (Figs. 2b1–b3) is the same but is for the systematic error (bias), and the bottom panel (Figs. 2c1–c3) is for the random error reduction. Since the ensemble averaging is a nonlinear filtering process to cancel random errors, the random error reduction shown in Figs. 2c1–c3 is defined as the difference between the average error of individual members and the error of the ensemble mean forecast. Figure 2 shows that the systematic error was reduced very little by the SPPT (Fig. 2b1) but was significantly (99.99%) reduced by the LTBC (Fig. 2b2). As for the random error, due to the increased diversity of ensemble members in the SPPT run, as seen in Fig. 1, the random error was reduced about 40% more by the SPPT (~0.035) than the LTBC (~0.025) (cf. the dashed curves of Fig. 2c1 and Fig. 2c2) averaged over 6–72 h. This difference is even more significant (almost double, from 0.015 to 0.03) when the forecast length is shorter (6–36 h). After combining the SPPT and LTBC, the unified scheme can significantly (99.99%) reduce both the systematic and random errors (Figs. 2b3 and 2c3), which leads to the largest reduction in the total error of the ensemble mean forecast among the three experiments (Fig. 2a3).

To verify the above result related to the random error, we employed a different independent approach to calculate the random error [Eq. (12)]. Equation (12) is a commonly used relationship among the total (RMSE), systematic (bias), and random errors in physics (Morris and Langari 2016):

For each of the four EPS runs (control EPS and Exps. 1–3), we can calculate the total error and systematic error of each member and the ensemble mean forecast. Then, the random error can be derived from Eq. (12) for each member and the ensemble mean. The difference between the averaged random error of individual members and the random error of the ensemble mean forecast reflects the random error reduction through the ensemble averaging process for a given EPS. Figure 3 shows the random error reduction of the four EPSs. The result confirms that the random error reduction through ensemble averaging was significantly larger with the stochastic physics (Exp. 1) compared to the control EPS (Ctr-EPS), while it remained similar to the control EPS with the bias correction (cf. Exp. 2 and Ctr-EPS). The unified scheme (Exp. 3) performed the best (slightly better than Exp. 1).

### b. Forecast improvements

In this section we will examine how the above error reductions translate into improvement of the probabilistic forecasts of various meteorological fields.

#### 1) Probabilistic forecasts of temperature, height, moisture, and wind

The CRPS is commonly used to measure the closeness between the forecasted and observed cumulative probability distributions. The closer a forecasted probability is to the observation (either 0 or 1), the better it is. Therefore, the CRPS becomes better as the values become smaller, with a perfect score of 0. Figure 4 compares the CRPSs of the three experiments and the control EPS for temperature, geopotential height, specific humidity, and wind at the 850-hPa level as well as for two surface fields (10-m wind and 2-m temperature) over 6–72-h forecasts. For all variables, the CRPS can be slightly improved (i.e., reduced) by Exp. 1 (SPPT) and greatly improved by Exp. 2 (LTBC). This is because the center of the probabilistic distribution has been correctly shifted toward the observation after the removal of systematic error in the LTBC run. The CRPS can be further improved on the top of the LTBC run by Exp. 3 (the unified scheme). The statistical significance of the improvement over the control EPS by the unified scheme exceeds the 90% level for all variables and forecast hours (most actually reached the 99.99% level). In other words, the unified scheme can significantly improve the quality of the probabilistic forecasts of all fields.

In addition to the closeness to the observation, another important property of a probabilistic forecast is the reliability. The reliability diagram graphically measures how well the predicted probabilities (the horizontal axis) of an event match its observed frequency (the vertical axis). For a perfectly reliable probabilistic forecast, the predicted probability is equal to the observed frequency, which puts the reliability curve along the 45° diagonal line. Deviation from the diagonal line suggests that the predicted probability is not reliable. If the reliability curve lies below (above) the diagonal line, the probability is overestimated (underestimated) due to either a too small (large) spread of the ensemble members or forecast biases. Figure 5 is the reliability diagram at the forecast length of 72 h for the same six variables as in Fig. 4. The probability thresholds used in Fig. 5 are, respectively, exceeding 2 K (850-hPa *T*), 8 gpm (850-hPa *H*), 0.0002 (850-hPa *q*), 2 m s^{−1} (850-hPa *V*), 1 m s^{−1} (10-m *U*), and 2 K (2-m *T*) over climatology. For example, Fig. 5a shows that the forecast probability of 850-hPa temperature exceeding 2 K over climatology is much larger than the observed frequency (i.e., overestimated) due to both an underdispersion in spread (Fig. 1) and a warm bias in the forecast (Fig. 2). The SPPT (the black curve) did not improve reliability, but the LTBC (the green curve) and the unified scheme (the blue curve) significantly (at the 95% level) improved the reliability compared to the control EPS (the red curve). Similar results were seen for the 2-m temperature (Fig. 5f) and 850-hPa geopotential height (Fig. 5b). However, the improvements from all the three experiments were not significant for the 850-hPa moisture (Fig. 5c) and wind (Fig. 5d) fields as well as the surface wind (Fig. 5e).

#### 2) Probabilistic forecasts of precipitation

The CRPS is for continuous forecasts (such as a continuous range of rainfall). For single-category forecasts (such as rainfall exceeding 50 mm), the CRPS can be simplified to the Brier score. BS measures the difference (error) between the forecast probability and the observed probability (0 or 1). A smaller BS indicates a better forecast (a perfect score is 0.0). Figure 6 shows the BS of the three experiments and the control EPS for the probabilistic forecasts of 24-h-accumulated precipitation exceeding 0.1, 10, 25, 50, and 100 mm for forecast lengths of 24, 48, and 72 h. Since there is almost no skill for the 100-mm category rainfall prediction by the GRAPES-EPS, it will be excluded in the following discussion. There is no significant difference (less than 20% level) between the three experiments and the control EPS at the 24-h forecast (Fig. 6a). There is a moderately significant improvement (40% level) by the unified scheme over the control EPS for the 10- and 25-mm categories at the 48-h forecast (Fig. 6b). The moderately significant improvement (40% level) can also be seen for the 10-, 25- and 50-mm categories at the 72-h forecast (Fig. 6c).

The relative operating characteristic (ROC) measures the combined effect of the false alarm rate (FAR; the horizontal axis ranges from 0 to 1) and probability of detection (POD, the vertical axis ranges from 0 to 1) for a binary forecast. A good forecast system needs to maximize POD and minimize FAR. The area under the ROC curve (AUR or AROC) is often calculated to determine if a forecast has skill or not. It has no skill when AROC is less than 0.5 (FAR > POD). A perfect AROC is 1 (100% POD and 0% FAR). Figure 7 is the same as Fig. 6 but for the AROC. There is a significant improvement (90% level) of the unified scheme over the control EPS for the 50-mm rainfall category at the 24-h forecast (Fig. 7a). A moderately significant improvement (75% level) for the 10- and 25-mm rainfall and a significant degradation (90% level) for the 50-mm rainfall were seen at the 48-h forecast (Fig. 7b). All rainfall categories (10, 25, and 50 mm) were significantly improved (60%–98% level) at the 72-h forecast (Fig. 7c).

From the above results, we can see that the improvement in probabilistic precipitation forecasts from the unified scheme generally increased as the forecast length increased. However, compared to the other atmospheric variables, precipitation was much less impacted by the SPPT, LTBC, and the unified schemes. The GRAPES-REPS has very little skill in predicting the rainfall category exceeding 100 mm day^{−1}.

#### 3) Operational implementation assessment scorecard

For an operational implementation decision, a scorecard is normally used, which verifies a more complete list of fields with more measurement scores. Figure 8 shows the scorecard used for this implementation, to assess if the upgrade system (the unified scheme) can consistently outperform the current system (the control EPS). In this scorecard, six scores are computed for some isobaric fields including geopotential height *H*, temperature *T*, zonal wind *U*, and meridional wind *V* at 200-, 500-, 700-, 850-, and 1000-hPa levels, as well as some near-surface fields such as 2-m temperature (T2m), 10-m wind (U10m, V10m), and light, moderate, and heavy precipitation at 24-, 48-, and 72-h forecast lead times. These six measurement scores are selected to portray a full picture of the EPS quality for ensemble mean, forecast uncertainty and probabilistic forecasts. For nonprecipitation fields the verification metrics are RMSE, consistency (RMSE/spread), CRPS, and outlier; for precipitation AROC and BS are employed.

There are a total of 294 categories in Fig. 8. The green color indicates an improvement, the red a degradation, and the gray is neutral (similar performance) for the unified scheme compared to the control EPS. The statistical significance level associated with an improvement or degradation is also marked on the figure. We can see that the improvement rate is 62.6% (184/294), the neutral rate is 35.4% (104/294), and the degradation rate is 2.0% (6/294). All the improvements are statistically significant. Consistent with the verification results in sections 3b(1) and 3b(2), more improvements are seen in the height and temperature fields, and less improvement in the moisture field including precipitation and upper-air wind. The improvements in surface wind and temperature are also apparent. Evidently, the overall improvement of the unified scheme over the control EPS is overwhelming, which leads to a guaranteed approval of its operational implementation.

### c. A side experiment

Although the theme of this study is the unified scheme to deal with the random and systematic errors at the same time in an ensemble model, one might be interested in learning how this unified dynamical approach compares to the current two-step approach (dynamic and statistics) in performance. To shed light on this, we included a comparison between the dynamical bias correction method and a statistical postprocessing method (see Fig. A2 in the appendix). The result shows that the dynamical method outperformed the statistical method by about 64%. Given that the SPPT part is the same in both approaches, the unified scheme should also outperform the two-step approach (i.e., the SPPT in the ensemble model integration followed by a statistical postprocessing of model bias). Additionally, the statistical postprocessing method also has severe limitations. For example, only a limited number of selected variables can be processed, and each variable is processed independently with no dynamical constraints among the processed variables. These deficiencies will cause dynamical inconsistency among variables and prevent the bias corrected variables from being used in certain applications, such as initializing a downstream model. In other words, performance is not the sole motivation of our study but, more importantly, the feasibility of a certain NWP downstream application is.

## 4. Summary

NWP models have both random and systematic errors. Ensemble perturbation methods deal with random errors, while statistical bias correction methods deal with systematic errors. With current NWP technology, these two types of error cannot be dealt with together during model integration. Very recently, we introduced a three-dimensional wholesale-like dynamical method to remove a model’s systematic bias in all variables during model integration (C19). This new model-integration-based bias correction approach has made it possible for a unified scheme to deal with both random and systematic errors at the same time as the model integrates. This study designed and tested such a unified scheme with a regional ensemble model, GRAPES-REPS, aiming to maximize the improvement in ensemble prediction skill by reducing both random and systematic errors at the same time in an operational environment.

Three experiments were performed on top of the control EPS (GRAPES-REPS). The first experiment used only the stochastic physics SPPT (GRAPES-REPS-SPPT). The second experiment used only the bias correction scheme LTBC (GRAPES-REPS-LTBC). The third experiment used both the LTBC and SPPT (GRAPES-REPS-LTBC-SPPT). The experimental period is 1–31 July 2015 (0000 UTC cycle) over the China domain. The averaged result of these 31 days showed that

The stochastic physics SPPT can effectively increase ensemble spread, while the bias correction LTBC had little impact. As a result, ensemble averaging of the SPPT run can reduce random error more than the LTBC.

The stochastic physics SPPT had little impact on the systematic error, while the bias correction LTBC can significantly reduce it. As a result, the SPPT only slightly improved the accuracy of ensemble mean forecasts but the LTBC greatly reduced the ensemble mean forecast error.

By combining the stochastic physics SPPT and the bias correction LTBC into a unified scheme, both random and systematic errors can be significantly reduced at the same time. As a result, the unified scheme performed the best among the three experiments.

The CRPS scores show that the unified scheme can significantly increase the accuracy of probabilistic forecasts for all six selected atmospheric variables (temperature, height, moisture, and wind at the 850-hPa level, 10-m wind, and 2-m temperature). Besides improved accuracy, the reliability of the probabilistic forecasts was also significantly increased for the 850-hPa temperature, 850-hPa height, and 2-m temperature but remained about the same for moisture and wind.

Relative to other atmospheric variables, the improvement in probabilistic precipitation forecasts was much less with the unified scheme. However, the improvement seemed to increase as the forecast length increases. For example, there was no significant improvement in BS for all rainfall categories at the 24-h forecast; there was a moderately significant improvement for two categories (10 and 25 mm) at the 48-h forecast and for three categories (10, 25, and 50 mm) at the 72-h forecast. This trend is also observed in terms of AROC: there was a significant improvement for one rainfall category at 24 h and for two categories at the 48-h forecast, while three rainfall categories were significantly improved at the 72-h forecast.

Finally a scorecard was used to assess the suitability of the unified scheme for an operational upgrade. The result showed that the unified scheme can consistently outperform the control EPS. Out of the 294 categories contained in the scorecard, the improvement rate is 62.6% (184/294), the neutral rate is 35.4% (104/294), and the degradation rate is negligibly low at 2.0% (6/294). All the improvements are statistically significant.

Further improvements to both ensemble perturbation and bias correction techniques are obviously needed to improve moisture (including precipitation) and wind forecasts in a future study. Based on this study, we encourage the NWP community to adopt this unified scheme approach in their EPSs to achieve the best forecasts in operations.

## Acknowledgments

The readability of the manuscript has been checked by Ms. Mary Hart of NCEP. The authors thank the three reviewers (Dr. Huiling Yuan, Dr. Jie Feng, and an anonymous reviewer) as well as the editor Dr. Zhaoxia Pu for their constructive suggestions to improve our initial manuscript. This work is sponsored by National Science and Technology Major Project of Ministry of Science and Technology of China (2018YFC1507405), National Key Technology Research and Development Program of the Ministry of Science and Technology of China (2015BAC03B01), and Natural Science Foundation of China (NSFC) (91437113).

### APPENDIX

#### Estimation of the Linear Bias Tendency and a Comparison to a Statistical Bias Correction Method

For example, in our case of a 72-h model integration initialized on 1 July 2015, the bias was approximated by the average error of the past 10 days of forecasts during the period of 19–28 June 2015 at each forecast hour. To retain some recent season and flow-dependent bias information, a moving 10-day (but not longer) time period is used to estimate the bias error. Once the bias error is available, the linear bias tendency $B^l\u2061(Sj,t)$ can be estimated from the variation of bias error with forecast time over a desired time window Δ. The red curve in Fig. A1 is the estimated bias error of potential temperature varying with forecast hour (0–72 h at 6-h intervals) at a model grid point. By applying a linear regression to the 13 bias values over this 72-h window, a linear fit of the bias tendency with time can be obtained (blue line in Fig. A1). Using the bias increment $b^$ over the time window Δ (72 h in this case), the linear bias tendency $B^l\u2061(Sj,t)$ over a time step *δt* (in seconds, *δt* = 60 s in this case) can be obtained as the following:

which is the same as Eq. (9) in section 2c. The linear bias tendency obtained is then used for all time steps within this time window Δ. By repeating the above steps on every model grid point at all model levels, a three-dimensional $B^l\u2061(Sj,t)$ can be obtained at every model integration time step for a model state variable (potential temperature in this case). Note that the selection of a shorter time window Δ may include more time-evolution information on the bias. However, C19 compared a 3-h window with the 72-h window and found a mixed result (improvement in some fields and degradation in other fields). C19 also compared a multistate-variable-based (potential temperature, wind, and pressure) experiment with a single-state-variable-based (potential temperature only) experiment and found no obvious benefit. As a matter of fact, we found that the single potential temperature–based method using the 72-h window performed the best overall. More research on this subject is currently under the way to understand all these findings.

To get an idea about the relative performance of this new bias correction method, we compared it with the Kalman filter–based (or decaying average) statistical bias correction method that is currently used in operations at both CMA and NCEP (Cui et al. 2012; Du and Zhou 2011). The result from this statistical method over the same time period (1–10 July 2015) for the 500-hPa temperature is shown by the blue curve in Fig. A2. Apparently, the performance of the new method (the brown curve) is significantly better (at the 99.9% level) than the current operational statistical method. For example, at a 72-h forecast, the domain-averaged bias error reduction is 33.3% for the statistical method and 54.7% for the new method, which is ~64% greater reduction achieved by the new method than by the statistical method. This is a very encouraging result.

## REFERENCES

*Forecast Verification: A Practitioner’s Guide in Atmospheric Science*. Wiley Press, 543 pp.

## Footnotes

© 2019 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

^{1}

To maximize the benefit of the stochastic physics, the SPPT should be carried out in an environment with as little error as possible. Therefore, the bias correction is performed prior to the SPPT.

^{2}

Note: Correcting an analysis-like forecast to the observation at a site is a task of downscaling. The difference between the grid-to-grid bias correction and grid-to-point downscaling is often not clearly distinguished by many. These are two different steps in an NWP operation.

^{3}

Besides the insufficient sampling of model and initial condition uncertainty, model bias plays a significant role in contributing to the underdispersive nature of the GRAPES-REPS. For an in-depth analysis of how model bias adversely impacts the spread–skill relationship of an ensemble, readers are referred to Wang et al. (2018).