## 1. Introduction

Despite significant progress in recent years in both model physics and numerics, the skill of dynamical models is still limited in many respects. A major issue is the bias of model forecasts with respect to observations (OBS). A number of sources may contribute to model bias, such as model numerics, model physics, and forecast methodology (Mass et al. 2002, 2003; Wu et al. 2005; Dee 2005; Delle Monache et al. 2006). The problem of bias in forecasts is well recognized, and a number of postprocessing techniques have been developed, such as model output statistics (MOS; Glahn and Lowry 1972; Neilley and Hanson 2004) and the objective consensus forecasting (OCF) system (Woodcock and Engel 2005). Recently, the National Weather Service (NWS) developed a gridded MOS system that, like conventional MOS, reduces systematic bias (Glahn and Ruth 2003; Dallavalle and Glahn 2005). The bias due to numerics and forecast methodology may arise primarily from the projection of model data on a given horizontal and vertical grid to point (station) observations. This part of the bias may be expected to be somewhat systematic in nature, arising as it does from the adopted grid and interpolation methodology. Steed and Mass (2004) experimented with several spatial techniques for applying bias correction to temperature forecasts from a mesoscale model. Eckel and Mass (2005) applied bias correction to fifth-generation Pennsylvania State University–National Center for Atmospheric Research (NCAR) Mesoscale Model (MM5) forecast grids used in an ensemble forecasting system before calculating ensemble means and probabilistic guidance. While the systematic part of the bias is relatively easy to handle (Abramowitz et al. 2007; Mass et al. 2008), forecasts generally also show random errors that depend on a host of factors, including location and time of day.
Conceptually, the dependence on location can arise from inadequate representation of special features of the surface or orography; similarly, location-specific error as a function of hour (of the forecast day) and season may arise because small-scale synoptic systems and their interaction with the background state are not well represented in the model, whether because of coarse resolution, inappropriate parameterization, or both. For many high-impact applications, however, forecast accuracy has to be evaluated at the station scale, especially over urban areas. Similarly, investigating the effects of climate change on quantities such as vulnerability requires projections at the local scale. It is necessary, therefore, to develop and evaluate methodologies to improve station-scale forecasts.

While generic improvement in model skill requires parallel and comprehensive development of the model and of forecast methodology, one way of achieving skill in station-scale forecasts without intensive recalibration of the model is to implement an objective debiasing. Besides, increasing the horizontal resolution does not necessarily improve the quality of the forecast indefinitely (Mass et al. 2002). Given that bias in model simulations often depends on geographical location and that, as shown subsequently, raw forecasts (RF) with MM5 over India have significant bias, generic improvements such as refined parameterization schemes may not provide an efficient route to improved station-scale forecasts. Another procedure that can contribute to bias in the forecast is downscaling. While mesoscale models today can support horizontal grid spacing down to a few kilometers or less, downscaling of model forecasts to arrive at station-scale values will remain a necessary step. Although there has been effort and progress in developing improved procedures for downscaling, it is possible to treat downscaling as part of objective debiasing.

The objective of the present work is to examine a method of generating station-scale forecasts from raw forecasts from a mesoscale model (MM5). We consider 12 locations over India (Fig. 1), representing urban locations in different geographical conditions. To evaluate realizable skill, we use the forecasts for the first 10 days of each month to calibrate the debiasing parameters and then apply the calibrated parameters to the remaining days of that particular month, without using in-sample data. However, to examine potential skill, we also evaluate the skill with calibration using all the days of the month. In addition, a simple running-mean error removal (RMER) is considered as a null hypothesis.
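The calibration protocol described above can be sketched as follows (an illustrative sketch in Python; the function name and the list-based representation are ours, not the authors' implementation):

```python
def split_month(daily_records):
    """Split one month of (forecast, observation) records into a
    calibration sample (first 10 days) and an out-of-sample set
    (remaining days), following the realizable-skill protocol."""
    train = daily_records[:10]   # used to calibrate the debiasing parameters
    test = daily_records[10:]    # remaining 20 (or 21) days, evaluated out of sample
    return train, test
```

For the potential-skill estimate (DF-P), by contrast, the full month is used both to calibrate and to evaluate.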

In section 2 we describe the model configuration and the methodology, including the principle and the algorithm of objective debiasing. The results are presented in section 3; section 4 contains our discussion and conclusions.

## 2. Forecast configuration and methodology

The basic 24-h (raw) forecasts were generated using a mesoscale model, with initial and boundary data described below.

### a. Model configuration and design of the experiments

The 24-h forecasts are generated using the mesoscale model MM5, version 3, a nonhydrostatic model with extensive documentation and validation (Dudhia 1993) that is designed to simulate or predict mesoscale atmospheric circulation. The model allows options for the parameterization of various processes, such as cumulus convection, the planetary boundary layer (PBL), and radiative forcing, and supports multiple nests with varying horizontal grid spacing; the details of the model configuration chosen for this study are given in Table 1. All simulations in this study were carried out with a single domain covering the 12 locations (Fig. 1) as interior points; the integration time step (in seconds) was chosen on the basis of the 3 × *dx* criterion, where *dx* is the grid size.

Relaxation boundary conditions (Dudhia et al. 2010) were used in all the simulations: the outer row and column were specified by time-dependent values, whereas the next four points were relaxed toward the boundary values with a relaxation constant decreasing linearly away from the boundary. Global tropospheric analysis data at 1° × 1° resolution (available online at http://dss.ucar.edu/datasets/ds083.2/data) from NCAR were used to initialize the model. Terrestrial data include terrain elevation (30 min), land use [U.S. Geological Survey (USGS) 24-category, 30 min], and vegetation fraction (10 min). The terrain and vegetation fraction datasets are available online from the University Corporation for Atmospheric Research (UCAR).

The initial conditions were extracted from global fields available on a 1° × 1° grid at a 6-hourly interval from the National Centers for Environmental Prediction Global Forecast System final global gridded analysis (FNL). For each forecast the model was integrated from the initial field at 0000 UTC of the previous day; thus, a total of 123 forecasts were generated for 1 May–31 August 2009. The above methodology was applied to 12 locations spread over different terrains in India; the abbreviation subsequently used for each station is given in Fig. 1.

### b. Evaluation parameters

For an objective comparison of the forecasts, we consider a number of evaluation parameters described below.

Let $T_F(n, i)$ and $T_O(n, i)$ represent the predicted and the observed temperature, respectively, at the $i$th observation hour for day $n$; here $i = 1, \dots, 6$ indexes the hours at which synoptic station observations are available from the India Meteorological Department (IMD): 0600, 0900, 1200, 1500, 1800, and 2100 UTC. The absolute error for day $n$ is

$$ e_d(n) = \frac{1}{6} \sum_{i=1}^{6} \left| T_F(n, i) - T_O(n, i) \right|, $$

and the mean absolute error over the forecast period is

$$ \mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} e_d(n), $$

where $N$ is the number of days. The skill score is

$$ \mathrm{SS} = 1 - \frac{\sum_{i=1}^{N} (T_i - O_i)^2}{\sum_{i=1}^{N} (O_i - \bar{O})^2}, $$

where $N$ is the number of forecasts, $T_i$ is the $i$th forecast, $O_i$ is the corresponding observed value, and $\bar{O}$ is the mean of the observed values.
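The evaluation parameters of this section can be written compactly in code (a minimal sketch; the function names are ours, and the skill score is taken in its standard form of one minus the ratio of the forecast squared error to the squared error of the observed mean):

```python
def daily_abs_error(tf, to):
    """Absolute error e_d(n) for one day: mean absolute difference
    between forecast and observation over the observation hours."""
    return sum(abs(f - o) for f, o in zip(tf, to)) / len(tf)

def mean_abs_error(daily_errors):
    """MAE: mean of the daily absolute errors over N days."""
    return sum(daily_errors) / len(daily_errors)

def skill_score(forecasts, observed):
    """SS = 1 - sum (T_i - O_i)^2 / sum (O_i - Obar)^2,
    where Obar is the mean of the observed values."""
    obar = sum(observed) / len(observed)
    num = sum((t - o) ** 2 for t, o in zip(forecasts, observed))
    den = sum((o - obar) ** 2 for o in observed)
    return 1.0 - num / den
```

A perfect forecast gives SS = 1, while a forecast no better than the observed mean gives SS = 0; raw forecasts with large bias can thus score negative.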

### c. Objective debiasing

The debiased forecast of temperature is obtained from the raw forecast through a nonlinear correction of the form

$$ T_F(n, i) = T_R(n, i) \left[ 1 + \alpha + \beta\, T_R(n, i) \right], $$

where $T_R(n, i)$ is the raw forecast of temperature for day $n$ and observation hour $i$ in each month for location $j$. Here $\alpha$ is a dimensionless constant that is a function of location and month, and $\beta$ is expressed as the inverse of temperature. It was found that a linear debiasing ($\beta = 0$), or a nonlinear debiasing with $\alpha$ and $\beta$ constant (i.e., $\beta$ independent of the diurnal cycle), did not result in an appreciable improvement. In particular, the systematic biases were found to be functions of the month and the hour of the day as well as of location. The parameter $\alpha$ was therefore considered a function of the month, while $\beta$ was considered a function of the period of the day (Table 2). The optimum values of $\alpha$ and $\beta$ were obtained through a search procedure to minimize $|\mathbf{X}_F - \mathbf{X}_O|$, where $\mathbf{X}_F$ and $\mathbf{X}_O$ represent the forecast and the observed values, respectively. In this procedure a range of values of $\alpha$ and $\beta$ is scanned, with sufficiently small intervals, to arrive at the optimum values characterized by the lowest $|\mathbf{X}_F - \mathbf{X}_O|$ for the training sample.

In the first case, optimum values of $\alpha$ and $\beta$ were obtained using all the days in each month (May–August 2009) for each station (Table 2). As this procedure (referred to as potential debiasing) uses in-sample data, the skill assessed is not strictly realizable, and we shall refer to it as potential skill; the forecasts with this debiasing are subsequently referred to as DF-P. The skill with DF-P is essentially indicative of the maximum skill attainable with the procedure, or of the enhancement in skill likely if larger training samples were available. For assessing realizable skill (the corresponding forecasts are referred to as DF-R) without using in-sample data, the debiasing parameters $\alpha$ and $\beta$ were calibrated using the first 10 days of each month; these calibrated parameters were then applied for the bias correction of the remaining 20 (or 21) days of that month. It should be noted, however, that the mean of the observed values $\bar{O}$ used in the skill score is computed from the corresponding sample.
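The search procedure can be sketched as a brute-force scan over a grid of candidate parameter values (an illustrative sketch; the correction form $T_R(1 + \alpha + \beta T_R)$ and the candidate intervals are assumptions consistent with the stated dimensions of the parameters, not the authors' exact implementation):

```python
import itertools

def debias(t_raw, alpha, beta):
    # Nonlinear correction: alpha is dimensionless, beta has units of
    # inverse temperature, so both terms yield a temperature correction.
    return t_raw * (1.0 + alpha + beta * t_raw)

def calibrate(raw, obs, alphas, betas):
    """Pick the (alpha, beta) pair minimizing the total absolute error
    |X_F - X_O| over the training sample."""
    best = None
    for a, b in itertools.product(alphas, betas):
        cost = sum(abs(debias(t, a, b) - o) for t, o in zip(raw, obs))
        if best is None or cost < best[0]:
            best = (cost, a, b)
    return best[1], best[2]
```

With sufficiently fine candidate grids this recovers the global minimum over the scanned range, at a cost linear in the number of candidate pairs.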

For the running-mean error removal (RMER), the debiased forecast of daily averaged temperature is

$$ T_F(n) = T_R(n) - E, $$

where $T_R(n)$ is the raw forecast of daily averaged temperature for day $n$ and $E$ is the mean error at each station, calculated from the forecasts of the previous 7 days:

$$ E = \frac{1}{7} \sum_{k=1}^{7} \left[ T_R(n-k) - T_O(n-k) \right]. $$
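The RMER null hypothesis amounts to subtracting the mean signed error of the previous week from the current raw forecast (a minimal sketch; the function name is ours):

```python
def rmer_forecast(raw_history, obs_history, raw_today, window=7):
    """Running-mean error removal: subtract the mean forecast error of
    the previous `window` days from today's raw forecast."""
    errors = [f - o for f, o in zip(raw_history[-window:], obs_history[-window:])]
    return raw_today - sum(errors) / len(errors)
```

Unlike the objective debiasing, this requires no calibration period beyond the trailing window, which is why it serves as a natural baseline.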

## 3. Results

### a. Average diurnal cycle and bias

The model simulations were first tested against the (monthly) average diurnal cycle for May–August at each location (Figs. 2–5). The debiased forecasts capture the diurnal cycle well for all four months, from the steep cycles over locations such as Amritsar (AMT) and Ahmedabad (AHM) to the rather flat cycles over coastal locations such as Chennai and Mumbai. The numbers against the forecasts in each panel of Figs. 2–5 give the average error (first row) and the correlation (second row) with respect to observations. The raw forecasts (hollow circles, Figs. 2–5) in general show larger errors; the correlation coefficients between observations and the debiased forecasts are larger than those for the raw forecasts and are generally significant at the 99% confidence level for the degrees of freedom involved.

Stationwise and monthwise distributions of bias (Fig. 6) highlight the need to calibrate the debiasing parameters for each station and each month separately. A scrutiny of the average bias for each of the months shows a generally systematic bias for the raw forecast (Fig. 6). For all the months, May–August, the bias in the raw forecast is large and generally negative for the 12 stations. The number of stations with bias exceeding 1°C is 4 and 5 for May and June, respectively, and 2 and 3 for July and August, respectively. In contrast, the bias of the debiased forecasts (DF-R) is more evenly distributed and much smaller, generally less than 0.5°C and rarely exceeding 1°C.

### b. Daily average temperature

The percentage of days (out of 123 days for RF and DF-P, and 83 days for DF-R, from May to August 2009) for which the error in the forecast is between −1° and +1°C is found (Fig. 7) to be generally below 50% for the raw forecasts, while for the debiased forecast (DF-P) this number is 80%; for DF-R, the skill (76%) is below that of DF-P (as expected) but significantly higher than that of the raw forecast. In terms of the skill score for daily averaged temperature (Fig. 8), the debiased forecasts not only show higher skill than the raw forecasts but the skill is significant for all 12 stations and all four months, except for a few cases. In terms of the monthly averaged error in daily averaged temperature, nonlinear debiasing can potentially reduce the error by more than 50% for all the stations (Fig. 9) for DF-P; the improvement for DF-R is only slightly less (Fig. 9). However, while the average error in daily averaged temperature is often close to 1°C for May and June, it stays close to 2°C even after debiasing for several stations.
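The "percentage of days within ±1°C" statistic quoted here is straightforward to compute (a minimal sketch; the function name and default band are ours, following the error bin used in the text):

```python
def pct_within_band(forecasts, observations, lo=-1.0, hi=1.0):
    """Percentage of days whose forecast error (forecast minus observation)
    falls in the bin [lo, hi] degrees Celsius."""
    errors = [f - o for f, o in zip(forecasts, observations)]
    hits = sum(1 for e in errors if lo <= e <= hi)
    return 100.0 * hits / len(errors)
```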

### c. Minimum and maximum temperature

Accurate forecasts of minimum and maximum temperatures are critical for many applications, such as the assessment of heating and cooling requirements. The dramatic improvement due to nonlinear debiasing can also be seen from a comparison of the minimum and maximum daily temperatures (Figs. 10 and 11, respectively) from the three forecasts against observations. The minimum temperature at the 12 locations during May–August 2009 ranges from less than 20°C [Bengaluru (BNG)] to nearly 35°C over Amritsar (Fig. 8); the corresponding range for the maximum temperature is from about 30°C over Bengaluru to 45°C (Amritsar and Bhubaneswar) (Fig. 9). The skill scores for minimum and maximum temperatures once again demonstrate the effectiveness of nonlinear debiasing (Figs. 12 and 13). The improvement in skill in forecasting the daily minimum and maximum temperatures is far more significant than that in the daily averaged temperature; the monthly average error is generally close to or less than 1°C for all the stations (Figs. 14 and 15).

A summary of skill averaged over the 12 stations for each of the four months (Table 3) shows the raw forecasts to have essentially zero or *negative* skill scores in all cases. The realizable skill (DF-R), while generally lower than the potential skill (DF-P) as expected, is significant for all four months. The average errors in daily average temperature and in minimum and maximum daily temperatures are generally less than 1°C for both DF-P and DF-R. In terms of the percentage of days for which the daily average error is between −1° and +1°C, the debiased forecasts are far superior to the raw forecasts for all the months: whereas nearly 76% (80%) of days for DF-R (DF-P) lie within −1° to +1°C when averaged over all four months, the corresponding percentage for the raw forecast is only 39%.

A comparison of the skill of the present method with that of a number of other debiasing methods [the Eta Model (ETA), renamed the North American Mesoscale (NAM) model in 2005; model output statistics with ETA (ETAMOS); the Kalman filter with ETA (ETAKF); and a 7-day running-mean bias removal with ETA (ETA7DBR)] (Maini et al. 2003; Cheng and Steenburgh 2007) shows (Table 4) the present method to have generally better skill. In particular, both in terms of mean absolute error (MAE) and in the percentage of cases with absolute error less than 1°C, the present method gives significantly better results. However, establishing the superiority of the method over others in an objective and quantitative manner requires a comparison for a fixed set of raw forecasts generated using the same model configuration and the same events; this would require a very different set of experiments, beyond the scope of the present work. It is also possible that the degree of improvement due to debiasing will depend on the model configuration and on the horizontal as well as vertical resolution; although this issue is particularly important for operational applications, it is unlikely to change our conclusions qualitatively. While higher resolution may improve the raw forecasts, the importance of objective debiasing will remain as long as the raw forecasts are not bias free.

A comparison of the present method with a 7-day RMER applied to the raw forecasts shows the objective debiasing to have consistently superior performance (in terms of skill score, average absolute error, and percentage of days in the error bin of −1° to +1°C for daily average temperature) for all four months (Table 5). It is interesting to note that, in terms of the percentage of days for which the daily averaged temperature is in the error bin of −1° to +1°C (last column, Table 5), RMER shows much larger variation among the four months than either RF or DF-R; however, RMER is superior to the raw forecast in all cases.

## 4. Discussion and conclusions

Station-scale forecasts are necessary for many applications related to health (such as vector-borne disease), agriculture (such as germination potential), and industry (such as power requirements), in which the diurnal cycle of temperature plays a critical role. Such station-scale forecasts from dynamical models necessarily have to be obtained through a procedure of downscaling. Similarly, while typical climate simulations generate fields averaged over thousands of square kilometers, many applications require meteorological fields at the local scale. The main objective of the present work was to assess an objective debiasing methodology for obtaining significant skill in 24-h forecasts of the diurnal cycle of surface temperature. Objective nonlinear debiasing not only improves forecast skill over the raw forecast but results in significant skill. The consistent performance of the method under different conditions makes it an attractive tool for forecasting urban and location-specific weather. An advantage of the present method is that it can be easily adapted to new model configurations and extended to more locations.

As mentioned earlier, the methodology of debiasing explored here does not improve forecast skill in a generic sense; thus, the skill achieved for specific locations does not necessarily reflect skill over the domain as a whole. It is conceivable to generate fields of debiasing parameters on a grid if sufficient observations are available; this possibility, which requires considerable effort, will be explored in a separate work. It needs to be emphasized that the single-nest model configuration with 23 vertical levels adopted in this study is not necessarily optimal; thus, especially for operational applications, more optimal model configurations may yield higher skill. The present experiments were carried out for 24-h forecasts; an important future direction of research is to evaluate the procedure at longer lead times. Similarly, the skill of the methodology for the winter months needs to be evaluated in subsequent work.

While more than 83 days of forecast for each of the 12 stations provide a sizable sample for skill evaluation, the present study examines skill for a single year. Thus, the important question of interannual variability in the station variables, and hence the stability of the debiasing parameters over a period of time (years), is not addressed in this work. This issue is important for the actual implementation of the method; however, we have deferred it for an independent study. In particular, it will be necessary to examine the effectiveness of the objective debiasing for a number of years based on the calibration of debiasing parameters for any given year. The true forecast potential of the methodology can only be judged when it is applied to other years with the same debiasing parameters for the month and the stations. While this is a computationally expensive proposition, it needs to be explored to establish the methodology on a firmer footing.

## Acknowledgments

This work was supported by a research project “Integrated Analysis for Impact, Mitigations and Sustainability” from CSIR (PPD), India.

## REFERENCES

Abramowitz, G., A. Pitman, H. Gupta, E. Kowalczyk, and Y. Wang, 2007: Systematic bias in land surface models. *J. Hydrometeor.*, **8**, 989–1001.

Cheng, W. Y. Y., and W. J. Steenburgh, 2007: Strengths and weaknesses of MOS, running-mean bias removal, and Kalman filter techniques for improving model forecasts over the western United States. *Wea. Forecasting*, **22**, 1304–1318.

Dallavalle, J. P., and B. Glahn, 2005: Toward a gridded MOS system. Preprints, *21st Conf. on Weather Analysis and Forecasting/17th Conf. on Numerical Weather Prediction*, Washington, DC, Amer. Meteor. Soc., 13B.2. [Available online at http://ams.confex.com/ams/WAFNWP34BC/techprogram/paper_94998.htm.]

Dee, D. P., 2005: Bias and data assimilation. *Quart. J. Roy. Meteor. Soc.*, **131**, 3323–3343.

Delle Monache, L., T. Nipen, X. Deng, Y. Zhou, and R. Stull, 2006: Ozone ensemble forecasts: 2. A Kalman filter predictor bias correction. *J. Geophys. Res.*, **111**, D05308, doi:10.1029/2005JD006311.

Dudhia, J., 1993: A nonhydrostatic version of the Penn State–NCAR Mesoscale Model: Validation tests and simulation of an Atlantic cyclone and cold front. *Mon. Wea. Rev.*, **121**, 1493–1513.

Dudhia, J., D. Gill, K. Manning, W. Wang, and C. Bruyere, cited 2010: PSU/NCAR mesoscale modeling system tutorial class notes and user's guide (MM5 modeling system version 3). [Available online at http://www.mmm.ucar.edu/mm5/documents/tutorial-v3-notes-pdf/.]

Eckel, F. A., and C. F. Mass, 2005: Aspects of effective mesoscale, short-range ensemble forecasting. *Wea. Forecasting*, **20**, 328–350.

Glahn, H. R., and D. A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. *J. Appl. Meteor.*, **11**, 1203–1211.

Glahn, H. R., and D. P. Ruth, 2003: The new digital forecast database of the National Weather Service. *Bull. Amer. Meteor. Soc.*, **84**, 195–201.

Maini, P., A. Kumar, L. S. Rathore, and S. V. Singh, 2003: Forecasting maximum and minimum temperatures by statistical interpretation of numerical weather prediction model output. *Wea. Forecasting*, **18**, 938–952.

Mass, C. F., D. Ovens, K. Westrick, and B. A. Colle, 2002: Does increasing horizontal resolution produce more skillful forecasts? *Bull. Amer. Meteor. Soc.*, **83**, 407–430.

Mass, C. F., and Coauthors, 2003: Regional environmental prediction over the Pacific Northwest. *Bull. Amer. Meteor. Soc.*, **84**, 1353–1366.

Mass, C. F., J. Baars, G. Wedam, E. Grimit, and R. Steed, 2008: Removal of systematic model bias on a model grid. *Wea. Forecasting*, **23**, 438–459.

Neilley, P., and K. A. Hanson, 2004: Are model output statistics still needed? Preprints, *20th Conf. on Weather Analysis and Forecasting/16th Conf. on Numerical Weather Prediction*, Seattle, WA, Amer. Meteor. Soc., 6.4. [Available online at http://ams.confex.com/ams/84Annual/techprogram/paper_73333.htm.]

Steed, R. C., and C. F. Mass, cited 2010: Bias removal on a mesoscale forecast grid. [Available online at http://www.mmm.ucar.edu/mm5/workshop/ws04/Session2/Steed.Rick.pdf.]

Stewart, T. R., and P. Reagan-Cirincione, 1991: Coefficients for debiasing forecasts. *Mon. Wea. Rev.*, **119**, 2047–2051.

Woodcock, F., and C. Engel, 2005: Operational consensus forecasts. *Wea. Forecasting*, **20**, 101–111.

Wu, W., A. H. Lynch, and A. Rivers, 2005: Estimating the uncertainty in a regional climate model related to initial and lateral boundary conditions. *J. Climate*, **18**, 917–933.

Table 1. Model configuration.

Table 2. Debiasing coefficients for the nonlinear debiasing forecast. Here, *α* is a dimensionless constant that is a function of location and month, and *β* is expressed as the inverse of temperature and is calculated over three time intervals (0400–0900, 1000–1500, and 1600–2100 UTC) for each day for stations AHM, AMT, and BNG. Similar values of *α* (between 0.01 and 0.2) and *β* (between −0.01 and 0.01) have been used for the other stations.

Table 3. Average skill over all 12 stations for the RF, DF-P, and DF-R forecasts for May–August 2009.

Table 4. A comparison of the performance of MM5-RF and MM5-DF-R with other methods in terms of MAE, bias error (BE), and RMSE in maximum (*T*_{max}) and minimum (*T*_{min}) temperature. For MM5-RF and MM5-DF-R, all parameters are calculated from 123 and 83 days, respectively, for May–August 2009; NCMRWF = National Centre for Medium Range Weather Forecasting.

Table 5. A comparison of skill measures, including the four-month average, for RF, 7-day RMER, and DF-R for May–August 2009.