Weather forecasting skill has been improved over recent years owing to advances in the representation of physical processes by numerical weather prediction (NWP) models, observational systems, data assimilation and postprocessing, new computational capability, and effective communications and training. There is an area that has received less attention so far but can bring significant improvement to weather forecasting—the calibration of NWP models, a process in which model parameters are tuned using certain mathematical methods to minimize the difference between predictions and observations. Model calibration of the NWP models is difficult because 1) there are a formidable number of model parameters and meteorological variables to tune, and 2) a typical NWP model is very expensive to run, and conventional model calibration methods require many model runs (up to tens of thousands) or cannot handle the high dimensionality of NWP models. This study demonstrates that a newly developed automatic model calibration platform can overcome these difficulties and improve weather forecasting through parameter optimization. We illustrate how this is done with a case study involving 5-day weather forecasting during the summer monsoon in the greater Beijing region using the Weather Research and Forecasting Model. The keys to automatic model calibration are to use global sensitivity analysis to screen out the most important parameters influencing model performance and to employ surrogate models to reduce the need for a large number of model runs. Through several optimization and validation studies, we have shown that automatic model calibration can improve precipitation and temperature forecasting significantly according to a number of performance measures.
Automatic model calibration improves numerical weather forecasting by tuning the numerical weather prediction model parameters to match model predictions with observations.
Weather forecasting skill has been improving steadily over the years. The improvement is due mostly to advances in the representation of physical processes by numerical weather prediction (NWP) models, observational systems, analyses techniques and forecasting methods (e.g., data assimilation, statistical postprocessing, and ensemble forecasting), new computational capability, and effective communications and training. There is an area that has received less attention so far but can bring significant improvement to weather forecasting—the calibration of NWP models. Model calibration refers to a process in which model parameters are tuned to match model predictions with corresponding observations. This process may be done using a manual, unsystematic “trial and error” approach, or using an optimization algorithm to tune model parameters automatically to minimize the difference between model predictions and observations (Duan et al. 2006). Automatic model calibration is a common practice in many fields including hydrology, biology, communications, and finance. It focuses on reducing errors resulting from the specification of model parameters. This is different from other popular approaches. For example, a common approach is to reduce model structural error by developing better physical parameterization schemes (Stensrud 2007). Multimodel ensemble approaches have been employed to account for model structural uncertainty (Krishnamurti et al. 1999). Data assimilation methods are commonly used to reduce errors in model initial conditions (Kalnay 2003).
There are several reasons that the automatic calibration of NWP models is not practiced as widely as in other fields. First, a typical NWP model has many parameters (from tens to hundreds), which appear in model equations as constants or exponents. Parameter values may vary according to local conditions or climate regimes and are usually not measurable at the scale of application. The large number of parameters leads to a “curse of dimensionality” problem and makes optimization processes intractable. Second, NWP models simulate many meteorological variables, including precipitation, air temperature, atmospheric pressure, humidity, and wind speed. Model calibration must ensure that the simulation of all key meteorological variables is satisfactory. This requires model calibration be done via a multiobjective approach, which further increases the complexity of model calibration. Third, conventional automatic model calibration approaches require many model runs (up to tens of thousands) to obtain optimal parameter values. Since many CPUs are required to generate a multiday forecast for a limited domain, the extraordinary computational demand makes automatic model calibration very challenging.
The utility of automatic calibration of NWP models has been suggested by a number of researchers (Bennett et al. 1996; Evensen et al. 1998; Duane and Hacker 2007; Aksoy 2015). Various approaches have been attempted to optimize NWP model parameters, from traditional search algorithms such as downhill simplex method (Severijns and Hazeleger 2005), genetic algorithm (Yu et al. 2013; Ihshaish et al. 2012), and simulated annealing (Jackson et al. 2004), to adaptive sequential data assimilation (Gong et al. 1998; Mu et al. 2002; Ruiz et al. 2013). Traditional optimization methods are either not able to handle the high dimensionality of NWP models or are impractical because they require too many model runs. On the other hand, sequential data assimilation methods treat model parameters as time-varying properties of the system. Those methods can be easily implemented in an existing data assimilation framework as they treat model parameters as extended state variables and are appropriate for estimating time-varying parameters such as leaf area index, surface roughness, and albedos. However, most parameters are formulated as constant properties of the system. Even if they are time varying, those model parameters and state variables do not vary on the same time scale. A two-stage filtering is needed to estimate model parameters and model state variables separately (Santitissadeekorn and Jones 2015; Vrugt et al. 2005). More recently, methods specially designed for parameter estimation of large complex system models like NWP have shown promise in improving the forecasting skill of NWP and climate models (Bellprat et al. 2012; Johnson et al. 2015; Neelin et al. 2010).
This paper demonstrates one such automatic model calibration platform for optimizing the parameters of NWP models. The keys of this model calibration platform are 1) to reduce the number of tunable parameters to a tractable level (e.g., from the current tens to hundreds to 15 or less) and 2) to develop an optimization algorithm that requires a relatively small number of model runs (only a few hundred at most). To implement these keys, a number of mathematical techniques can be employed, including 1) using a design-of-experiment (DoE) approach to judiciously sample model parameter sets within their physical variability ranges, and then using global sensitivity analysis (GSA) methods to identify the parameters that have the most impact on model forecasts; 2) instead of optimizing model parameters directly by running the NWP model repeatedly, constructing a surrogate model (also called statistical emulator, or metamodel) to represent the error response surface of the dynamical NWP model using a finite small number of model runs; and 3) using a multiobjective optimization approach to find the optimal parameters of the surrogate model and then using them to approximate the optimal parameters of the NWP model. In the next section, we provide a brief description of the platform. In the sections thereafter, we illustrate the usefulness of the automatic calibration platform through a case study involving 5-day forecasting of summer precipitation and surface air temperature in the greater Beijing area using the Weather and Research Forecasting (WRF) Model.
A model calibration platform called Uncertainty Quantification Python Laboratory (UQ-PyL) has been developed, which has integrated different kinds of uncertainty quantification (UQ) methods, including various DoE, GSA, surrogate modeling, and optimization methods. It is written in Python language (PyL) and can run on all common operating systems such as Windows, Linux, and MacOS, with a graphical user interface for selecting and executing various commands. The different functions of UQ-PyL have been documented in Wang et al. (2016) and some specific UQ methods are described in several publications (Gan et al. 2014; Li et al. 2013; Wang et al. 2014; Gong et al. 2016). For example, how to use different GSA methods to reduce the dimensionality of complex dynamical models like the Common Land Model (CoLM) and the WRF Model has been demonstrated by Li et al. (2013), Di et al. (2015), and Quan et al. (2016). The idea behind parameter dimensionality reduction is to use sensitivity analysis to screen out the most important parameters that exert significant influence on model predictions (Guo et al. 2014). Once they are found, those parameters can then be optimized to maximize the model performance measures (Gong et al. 2015).
UQ-PyL also includes tools for constructing surrogate models and for conducting surrogate-modeling-based optimization. In Wang et al. (2014), an adaptive surrogate-modeling-based optimization (ASMO) method was described, which allows for effective and efficient searches of optimal parameters of large complex models using a low number of model runs. The goal of optimization is to find the minimum of an error response surface in the multiparameter space. ASMO is based on the premise that the optimal solution of a dynamical model can be approximated by the optimal solution of a surrogate model, which is constructed in two steps. The first step uses a DoE approach to create random parameter samples that are evenly distributed within the physical variability ranges of the parameters and then construct an error response surface using these parameter samples. The second step is to refine this response surface iteratively by using an adaptive sampling strategy that places more parameter samples in the promising parameter space based on information already gained on the existing response surface. Once this iterative process converges, the final response surface is treated as the surrogate model. The optimal solution of this surrogate model should approximate the optimal solution of the dynamical model.
NUMERICAL CASE STUDY.
We demonstrate how UQ-PyL can be used to calibrate the WRF Model through a case study. Our goal is to optimize the WRF Model parameters that are most important to the forecasting of precipitation P and surface air temperature T over the greater Beijing area during the summer monsoon. Nine 5-day precipitation events from the June–August (JJA) period between 2008 and 2010 are used for model calibration and 15 additional nonoverlapping 5-day precipitation events from 2005 to 2010 are chosen for validation of the optimization results (see Figs. ES1 and ES2). Those nine events were chosen for model calibration because they captured most of the heavy storm events over those years. The JJA 3-month period is chosen for calibration because over 70% of the annual precipitation in the area occurs during this period. Furthermore, storm events during this period are often associated with severe urban flooding that has resulted in tremendous economic and even human losses.
WRF Model, version 3.3, available from the WRF web portal (www2.mmm.ucar.edu/wrf/users/download/get_source.html), is used in the study. The setup for the WRF Model is described in Di et al. (2015) and the specific parameterization schemes used in this study are shown in Table 1. The WRF Model is run with a two-level nested domain setup over the greater Beijing area (Fig. 1), with the grid resolution of the outer domain (marked as “d01”) at 27 km and the inner domain (marked as “d02”) at 9 km. The model performance is evaluated over the d02 domain. The lateral and initial conditions needed to run the WRF Model are set using the NCEP Reanalysis data (Kistler et al. 2001). Based on the sensitivity analysis results of Di et al. (2015) and Quan et al. (2016), 9 parameters (see Table 2) selected out of a list of 23 tunable parameters (see the entire list of parameters in the supplemental materials; Table ES1; http://dx.doi.org/10.1175/BAMS-D-15-00104.2) were identified as important to P and T forecasting over the area. Subsequent surrogate modeling is therefore constructed using those nine parameters as independent variables. We employed the ASMO method to optimize those parameters. Four optimization runs were conducted with the ASMO method. The first two optimization runs were to minimize the normalized mean absolute errors (MAEs) of the P and T forecasts, respectively, while the third optimization run aimed to minimize the equally weighted normalized MAEs of P and T forecasts. The fourth optimization run was done using the ASMO method to maximum the weighted threat score (TS) of P forecasts for different storm categories.
The normalized MAE (NMAE), used in the first two optimization runs as the objective function (also called cost function), is computed based on the ratio of mean absolute difference between model prediction and observation over all grid points and forecasted time intervals using the given and default parameters; that is,
where , θ, and θ* Θ are the given and default parameter vectors, is the forecasted daily value at grid i and time t given θ, is the observation, N is the number of grid cells in domain d02, and M is the total number of forecasted time intervals. If θ = θ*, F(θ) = 1. If F(θ) < 1, it implies that θ would produce better forecasts than θ*. For optimization run 3, we used a weighted multiobjective function suggested by Liu et al. (2004) as we are concerned with minimizing the NMAE values of both P and T forecasts. The specific formulation of the objective function F'(θ) for this run is as follows:
where wj is the weight for variable j, j = 1 denotes P, and j = 2 denotes T, respectively. We assigned equal weights to both P and T (i.e., w1 = w2 = 0.5).
For optimization run 4, we used TS [also known as critical success index (CSI)] as the objective function. We compuate TS, which measures the fraction of forecast events that are correctly predicted based on observations, as follows:
where na is the grid counts of hits (i.e., both forecast and observation fall in prescribed threshold ranges), nb is the grid counts of false alarms (i.e., the forecast falls in the threshold ranges, while observation does not), and nc is the grid counts of misses (i.e., the forecast falls outside the threshold ranges, while observation falls in). For a 6-h precipitation event [mm (6 h)−1], the threshold values (mm) are set for six categories of precipitation events: light rain, moderate rain, heavy rain, storm, heavy storm, and severe storm are [0.1, 1), [1, 5), [5, 10), [10, 25), [10, 50), [50, ∞) mm, respectively. A higher TS score indicates a better forecast, with a perfect score being 1. For the calibration events chosen for this case study, we only have the first four categories. In the optimization run, we used a weighted TS score, with a weight of 0.25 assigned to those four categories of storms.
Besides NMAE and TS scores, we also computed a combination score named SAL (Wernli et al. 2008) to evaluate the forecast performance, where S, A, and L stand for the forecast performance in terms of structure, amplitude, and location of the storm, respectively. A large value for S with a range of [−2, 2] implies the forecast storm area is too broad or too flat compared to observation; a small value means the forecast is too peaked and a value of 0 indicates being perfect. Amplitude A, with a range of [−2, 2], measures the average bias in precipitation amount, with 0 being perfect, a positive value indicating overforecast, and a negative value indicating underforecast. Location L, with a range of [0, 2], measures the displacement of forecast storm center from observed storm center, with 0 being perfect. The exact formulas for computing those scores are given in supplemental materials.
For all optimization runs, the size of the initial set of parameter samples used to construct the initial surrogate model was set to 100, based on the suggestion by Wang et al. (2014). According to the ASMO algorithm, from 45 to over 130 additional parameter sets were sampled adaptively to obtain the final surrogate model and optimal parameter set for the first three optimization runs (see Fig. 2). Figure 3 compares the optimized NMAE values against those corresponding to the control run (i.e., the run made using the default parameters) for the nine individual 5-day calibration events as well as for all calibration events combined. The results clearly indicate that optimization has resulted in significant improvement in the NMAE values. The improvement for all events combined ranges from 14.34% for the third optimization run to 18.16% for the second optimization run. The improvement is more evident when the NMAE of a single variable (e.g., P or T) is optimized than when the weighted NMAE of both P and T forecasts is optimized. From Table 3 we note that the optimized NMAE values for the individual variables yield their respective best values (i.e., optimization run 1 yields the best value for P, while optimization run 2 for T) and the optimized weighted NMAE value of both P and T corresponds to the compromised values (i.e., they are better than default but worse than the individually optimized values).
When examining the relative improvement for the individual calibration events, we see improvement in most events for all three optimization runs. However, there are a few events in which the NMAE values are not improved by optimization. But those events tend to correspond to the events whose NMAE values are relatively small compared to that of other events. Figures 4 and 5 show the scatterplots and the coefficients of determination R2 between the 3-h T and P forecasts and corresponding observations for all grid points during the calibrated events using the default parameters and three sets of optimized parameters. For T forecasts, the R2 values of all three optimization runs have improved over that of the default parameters, with the second run (which optimizes T forecasts) having the highest R2. For P forecasts, the R2 values for optimization run 1 and run 3 have improved over that of the default parameters, but run 2 has degraded R2 from that of default value. This tells us that optimizing T forecasts may not result in improved P forecasts.
We also examined the optimization results for different lead times and found that the improvement is consistent for all lead times. Figure 6 shows the relative improvement of the P forecasts for different lead times for all calibration events, which ranges from 10.87% for day 1 to 24% for day 2. We further examined if the optimal parameters from run 3 would lead to improved forecasts for other meteorological variables such as 2-m relative humidity, surface air pressure, 10-m wind speed, and downward surface shortwave radiation. Figure 7 confirmed that the optimal parameters for P and T forecasts have indeed improved the forecasts of those variables, with improvement rate of 1.71% for surface air pressure to 27% for wind speeds. Figure 8 presents the SAL score for the calibration events for run 3 and shows that the optimized parameters lead to better SAL scores compared to the default parameters. The improvement due to optimized parameters is the most obvious for structure S and amplitude A, implying the storm-area coverage and magnitude are better, and slight for storm location L prediction. The large value for A for the default parameters is consistent with the finding that P forecasts using the default parameters overestimate observed P (see Fig. ES3, which compares the daily averaged P forecasts against observations over the entire nine events for the optimized and default parameters).
We have computed the performance measures for the individual validation events as well as for all validation events combined. The improvement in NMAE values, as well as for other performance measures such as R2, the SAL score, and the NMAE values for different lead times are similar to or slightly worse than those for the calibration events (see Figs. ES5–ES9 for detailed results).
Optimizing the WRF Model parameters using NMAE as the objective function has resulted in better performance according to a number of performance measures shown previously. However, it does not always lead to consistent improvement in the TS values (see Fig ES10, which shows the TS values of the P forecasts using the optimized parameters from optimization run 1). Since TS is a key performance measure emphasized by many operational meteorologists, we conducted an additional optimization run by using the weighted TS for different storm categories as the objective function. The optimization results indicate that the weighted TS was improved by 9.5% after 162 model runs (see Fig. ES11, which shows the convergence of optimization run 4). Figure 9 shows the TS scores for individual calibration events as well as for all calibration events combined for the four categories of storms, with the TS scores for the combined events varying from −1.09% for light rain to 23.88% for heavy rain. The decreased TS score for light rain is because a weighted TS is used as the objective function. Even though the overall TS would improve, a TS for an individual category may not. This actually indicated a problem when a single objective function is used for model calibration. Many studies have suggested that a truly multiobjective model calibration approach can be useful, in which Pareto optimal parameter sets are identified (Gong et al. 2016). In those Pareto optimal sets, some parameter sets correspond to optimal forecasting performance for light rain events, others being optimal for moderate or heavy rain events. None of the objective functions in the Pareto optimal sets can be improved in value without degrading some of the other objective values (Miettinen 1999). The optimal parameters obtained using TS as the objective function were validated using independent validation datasets. We found that TS for all validation events combined have been improved, even though TS for individual storms may not (see Fig. ES12).
We examined the differences in the optimized parameter values for all optimization runs and compared them against the default values (Fig. 10). Direct correspondence between the parameter values and model performance measure is not obvious, because P and T forecasts are the aggregated results of complicated, highly nonlinear interactions between different physical processes such as ascent or descent of moist air, turbulent exchanges of water and energy fluxes between land surface and atmosphere, and horizontal advection of momentum, mass, and energy. From Fig. 8 (also Figs. ES3 and ES4), there is an apparent overestimation of both P and T over the greater Beijing area when default parameters are used. The changes in most parameters tend to show the effect of depressing P and lowering T. For example, a large value for P12 (the scattering tuning parameter) means a higher scattering of solar radiation, leading to reduced shortwave radiation reaching the surface and thus decreasing evaporation from the ground, and ultimately depressed P and T. The changes in P3 (the multiplier for downdraft mass flux rate) and P5 (the starting height of downdraft above the updraft source layer) have similar effects. When their values increase, the downdraft flux becomes stronger, suppressing further development of updraft convection and depressing precipitation. Meanwhile, more evaporation from condensed water occurs during the downdraft, cooling the atmosphere temperature, thus reducing T. For P16 (the multiplier for the saturated soil water content), the optimized values for P and T are conflicting. When P16 value increases, soil moisture content increases and surface evapotranspiration is enhanced, thus inducing stronger P. On the other hand, when P16 value decreases, the incidence of P becomes weak and higher T results. Parameter P21 (the profile shape exponent for calculating momentum diffusivity coefficient) reflects the mixing intensity of turbulent eddies in planetary boundary layer. When its value decreases, turbulence diffusivity intensity is weakened, making the upward transfer of heat and water vapor from the ground surface slow down. So the formation of convection is more difficult, and P is thus reduced. The slower thermal eddy diffusivity also restrains the increase of T. For parameters P4 (multiplier of entrainment mass flux rate) and P8 (scaling factor applied to ice fall velocity), lower values for them lead to less favorable conditions for formation of rain and thus lead to reduced P. A larger value for P10 (the maximum value for the cloud ice diameter) has similar effect as the lower P8. When parameter P20 (the critical Richardson number for boundary layer) decreases, the planetary boundary layer height is depressed and thermal eddy diffusivity is decreased, leading to lower T. Of course, it is impossible to explain all the changes in parameter values completely because of nonlinear interactions among them.
CONSIDERATIONS FOR PRACTICAL APPLICATIONS.
We have demonstrated the potential of automatic model calibration as a new way to improve numerical weather forecasting. This study has been intended to provide practical guidance to operational NWP modelers. In practice, several considerations must be taken before using automatic model calibration methods. First, the model setup in this study is not exactly as in the operational settings for the greater Beijing area by the Beijing Institute of Urban Meteorology Research, which runs the WRF Model over a larger domain with three-level nested grids and the inner grid resolution at 1 km, compared to a two-level nested domain and a 9-km resolution for the inner domain in this study. The operational model is also initialized with a data assimilation system that provides more realistic initial conditions, which should lead to further improvement because of improved initial conditions for the model. If the optimization methodology presented here is applied to the operational setting, we need to recalibrate the WRF Model and the final optimal parameter values may differ from the ones we obtained in this study. Another important note is that the tunable parameters we selected for this study may not include all impactful parameters on the forecasts, owing to the fact we are not totally familiar with all of the schemes in the WRF Model. Any WRF modeler who wants to optimize the WRF Model parameters should make full use of her knowledge about the model in identifying the tunable parameters.
Second, we choose nine summer storm events over a 3-yr period to ensure that the optimized model parameters are reasonably robust. We found that the improvement in calibration events is consistent when we computed a number of performance measures (i.e., NMAE, TS, R2, the SAL scores, and NMAE for different lead times). More interestingly, the performance measures for other meteorological variables are also improved when P and T forecasts are optimized. The validation results confirm that the optimized parameters have also resulted in improved performance measures for the 16 individual validation events as well as for all validation events combined. When conducting model calibration for specific applications, one always needs to perform those validation studies to ensure the effectiveness of the optimization results. Further, even though the parameters obtained in this study were found to be optimal for forecasting summer storms, they may not be optimal for other types of storm events such as the winter storms. It may be arguable whether model calibration based on only nine calibration events is sufficient in practice, especially for the rare events. A more comprehensive study using a large sample of events may be necessary in the future to assess the impact of sampling error on optimal parameter estimates.
Third, the performance metrics used in calibration are very important. In this study we used NMAE and TS as the objective function and they resulted in different model parameters. We also used a number of performance measures such as the scatterplots, R2, and the SAL scores to verify the forecasts. In practice, the evaluation of model performance is multifaceted and the choice of objective functions must reflect the values of the forecast practitioner. It is possible that the optimal parameters for one metric may conflict with that for another metric. One should consider a multiobjective approach that weighs different objective functions subjectively to reflect one’s preference for a particular performance metric. Another approach may be used to search for a Pareto optimal set, in which all parameter sets are noninferior compared to another set according to at least one performance metric. With this set of Pareto optimal parameters, one can create an ensemble of forecasts using different parameters.
The optimization approach presented here is different from the sequential data assimilation approaches, which update model parameters as new data become available. The approach is also more efficient and effective than conventional optimization approaches (e.g., downhill simplex, simulated annealing, and genetic algorithm) as we relied on techniques such as parameter screening and surrogate modeling to achieve computational saving. We hope this study will stimulate more development in using automatic model calibration methods to optimize NWP model parameters.
This research was partially supported by the Natural Science Foundation of China (41375139), the Ministry of Science and Technology of China (2015DFA20870),and the Beijing Municipal Science and Technology Commission (Z141100003614052 and Z151100002115045). We gratefully acknowledge the insightful comments from the editor and three anonymous reviewers whose advices have greatly enhanced the readability of this paper.
A supplement to this article is available online (10.1175/BAMS-D-15-00104.2)