## 1. Introduction

Regional operational numerical weather prediction (NWP) systems such as the Weather Research and Forecasting (WRF) Model and the fifth-generation Pennsylvania State University–National Center for Atmospheric Research Mesoscale Model (MM5) provide forecasters and local weather consumers with high-resolution gridded mesoscale forecasts on a regular basis. However, various substantial systematic errors, or biases, which are present in all numerical weather prediction systems, significantly affect the quality of the forecast grids and reduce the potential benefit of the gridded forecast traces (Mass 2003). With the increasing availability of mesoscale forecast data to the public for such relatively new applications as weather risk management and digital warning systems, simple bias correction methods applicable to gridpoint data become an essential and crucial step for the operational benefit of mesoscale forecasts (Stensrud and Skindlov 1996).

A standard approach for bias correction is the statistical postprocessing of model outputs. There exist two main objective methods for output postprocessing; these are known as perfect prog and model output statistics (MOS) (see Glahn and Lowry 1972; Glahn 1985; Wilks 1995, and references therein for a detailed overview). The perfect-prog approach develops equations based on the relationships between observed weather variables, which are then applied to model output. The MOS approach develops a statistical linear relationship between actual observations and model output. In contrast to perfect prog, MOS may utilize valuable information from model output variables that are not easily observed. Typically, the MOS approach requires a developmental data archive of significant length (up to 48 months) from an unchanged model and needs to be modified when the model has been significantly updated. Compared to perfect prog, the MOS approach is considered to be more suitable for the needs of modern weather forecasting. Today, MOS, in its numerous variations, is the most popular postprocessing technique in NWP. In particular, various modifications of MOS have recently been suggested that do not require a relatively long database of stable model runs, for example, the updateable model output statistics (UMOS) of Wilson and Vallee (2002) and the model output calibration of Mao et al. (1999). The performance analysis of short-range ensemble prediction of surface temperature by Stensrud and Yussouf (2003) shows the advantages of a simple bias-corrected ensemble mean, which does not require a long data archive, over the nested grid model (NGM) MOS. More recently, novel nonlinear statistical postprocessing methods have been investigated, including the neural network–based methods of Kuligowski and Barros (1998), Applequist et al. (2002), Marzban (2003), and Yuval and Hsieh (2003); the Bayesian methods of Nott et al. 
(2001); and the generalized additive models (GAMs) of Vislocky and Fritsch (1995). However, each of the above-mentioned methods still relies on the existence of historical bias data at the points of interest, that is, sites at which the bias correction should be performed. If one verifies against actual observations, then these methods are not directly applicable to the problem of grid-based bias correction because automated weather observing sites rarely coincide with the necessary grid points. If one verifies against some reanalysis data, then the above-mentioned methods show high sensitivity to the model changes. In addition, the reanalysis-based bias removal methods are limited with respect to the model resolution. [Throughout the paper, “reanalysis” refers to the concept introduced by Kalnay et al. (1996).]

Our analysis has been motivated by the problem of a mesoscale grid-based correction of systematic bias with verification with respect to observing sites (Mass et al. 2003; Mass and Kuo 1998; Stensrud and Skindlov 1996). However, the proposed approach is equally applicable to grid-based bias correction with verification with respect to reanalysis while potentially being robust to model changes. Our main focus is on methods that are free from the assumption of the existence of any historical observations or bias data at a point of interest (while still requiring some historical information from stations in the vicinity of a point of interest).

The presented results extend the approaches of C. Mass (2004, personal communication), Wedam et al. (2005), Gel et al. (2004), and Tebaldi (2002). We consider the comparative analysis of three methods for the bias correction of gridpoint data. The first technique is the local observation based (LOB) method for systematic bias removal, which does not require the existence of any previous history of observations at the site of interest such as a grid point. In the absence of historical observations, meteorological information is obtained from observing stations in the neighborhood (i.e., vicinity) of a given grid point. We define a cluster of neighbor stations on the basis of proximity, land use information, and terrain height. The bias at a site of interest (e.g., grid point) might then be considered as a function of the weighted information on the past biases observed in the cluster of neighbor stations during a certain period of time. This function might be subsequently used to predict the current and future biases at the site of interest. Using the historical observations at the neighbor stations, this approach allows us to correct the forecast bias at an arbitrary grid point, which, in practice, rarely coincides with the observing stations.

The second method, based on ideas of generalized additive models (Vislocky and Fritsch 1995), may be viewed as an extension of the MOS approach. The novelty of the method is that it incorporates nonlinear relationships between the predictand, that is, the bias at a grid point, and spatiotemporal predictors, that is, longitude, latitude, terrain height, land use information, and time, by utilizing modern regression techniques such as the classification and regression trees (CARTs; Breiman et al. 1984; Fielding 1999; Burrows et al. 1995; Spark and Connor 2004) and the alternative conditional expectation (ACE) (Breiman and Friedman 1985; Buja and Kass 1985; Hastie and Tibshirani 1990). The stepwise model selection based on the Akaike information criterion (AIC) is used to determine the final list of statistically significant predictors (Weisberg 1985). Compared to MOS, the main strength of the CART–ACE approach is its ability to correct the bias at sites with no prior history of observations, for example, gridpoint data.

Finally, the third considered method is the natural combination of the LOB and the CART–ACE methods in which the information provided by the LOB method is interpreted as an extra predictor in the linear regression model.

Section 2 is devoted to the description of the statistical methods. Section 3 illustrates the application of the discussed approaches to the bias correction of 48-h MM5 forecasts of surface temperature in the Pacific Northwest. The proposed methods are compared with each other and the grid-based “obs-based” method of Wedam et al. (2005) in terms of mean absolute error (MAE). In addition, section 3 reports the results on the “improve to hurt” statistics of the proposed techniques, that is, the number of cases when the bias has been removed or added at a site of interest. The paper concludes with a discussion in section 4.

## 2. Methods

### a. The LOB method

The goal of this paper is to estimate the statistical relationship between the current forecast bias at a grid point and historical biases at the neighbor observing sites, for example, local meteorological stations. In particular, the LOB method consists of measuring the biases during a *T*-day period prior to the forecast initialization over the neighbor stations that are geographically close to the grid point of interest, are close in terms of elevation and land use, and possibly have similar climatology and other characteristics. To correct the current bias at a grid point, we calculate the weighted bias across the neighbors over the *T* previous days and subtract the resulting quantity from the current forecast at the grid point of interest.
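As an illustration, a minimal sketch of the LOB correction might look as follows. The station records, distance threshold, and plain-mean aggregation here are hypothetical placeholders (distances are assumed precomputed); the neighbor criteria mirror the ones just described.

```python
def select_neighbors(grid_pt, stations, radius_km=120.0, dz_m=200.0):
    """Keep stations that are close geographically, similar in elevation,
    and in the same land use group as the grid point of interest."""
    return [s for s in stations
            if s["dist_km"] <= radius_km
            and abs(s["elev_m"] - grid_pt["elev_m"]) <= dz_m
            and s["lu_group"] == grid_pt["lu_group"]]

def lob_correct(forecast, grid_pt, stations):
    """Subtract the mean neighbor bias (forecast minus observation,
    pooled over the past T days) from the current forecast."""
    nbrs = select_neighbors(grid_pt, stations)
    if not nbrs:
        return forecast  # no neighbor information: leave forecast unchanged
    mean_bias = (sum(b for s in nbrs for b in s["past_biases"])
                 / sum(len(s["past_biases"]) for s in nbrs))
    return forecast - mean_bias

grid_pt = {"elev_m": 150.0, "lu_group": "II"}
stations = [
    {"dist_km": 40.0,  "elev_m": 100.0, "lu_group": "II", "past_biases": [1.0, 2.0]},
    {"dist_km": 90.0,  "elev_m": 250.0, "lu_group": "II", "past_biases": [2.0]},
    {"dist_km": 300.0, "elev_m": 140.0, "lu_group": "II", "past_biases": [9.0]},  # too far
]
print(lob_correct(20.0, grid_pt, stations))
```

Here the third station is excluded by the distance criterion, so the correction uses only the two valid neighbors.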

Potentially, the LOB method will require a less extensive archive of observational data at the neighbors than the traditional MOS-based methods. In fact, a richer spatial representation, that is, a greater number of neighbors, with a shorter time span may still provide sufficient information to estimate a statistically robust model. In addition, the LOB method potentially avoids difficulties associated with seasonal and diurnal effects and the spatial inhomogeneity of forecast data while providing a simple way to correct for bias at the grid points.

The optimal choice of the time window *T* can be assessed empirically using various historic information sources. To choose an appropriate geographical size of the neighborhood, we suggest utilizing the semivariogram approach.

For the estimated biases *w*, we construct the following empirical semivariogram:

$$\hat{\gamma}(d) = \frac{1}{2\,|N(d)|} \sum_{(s_1,\, s_2) \in N(d)} \left[ w(s_1) - w(s_2) \right]^2,$$

where *s*_{1} and *s*_{2} are the coordinates of the data points at which the biases *w*(*s*_{1}) and *w*(*s*_{2}) have been estimated, and *N*(*d*) denotes the set of station pairs separated by (approximately) the distance *d*. Here, we assume that our data are weakly stationary and isotropic. In particular, weak stationarity implies that the covariance between two locations *s*_{1} and *s*_{2} depends only on the distance between those two locations, *d*(*s*_{1}, *s*_{2}); that is, cov(*s*_{1}, *s*_{2}) is a function of only *d*(*s*_{1}, *s*_{2}) but not of *s*_{1} and *s*_{2}. Hence, cov(*s*_{1}, *s*_{2}) = cov[*d*(*s*_{1}, *s*_{2})]. Isotropy assumes that the covariance is independent of direction; that is, cov(*s*_{1}, *s*_{2}) depends only on |*d*(*s*_{1}, *s*_{2})|. Thus, under those conditions, we can construct the following exponential semivariogram model:

$$\gamma(d) = \rho + \sigma^2 \left( 1 - e^{-d/r} \right).$$

In geostatistical terminology, the parameter *r*, or range, is the distance at which the observations are (almost) uncorrelated; *ρ* is called the nugget effect and is usually interpreted as the measurement error variance or small-scale variability; *σ*^{2}, or a sill, denotes the variance of the bias *w* (without a measurement error); *ρ* + *σ*^{2} is the marginal variance of *w*. Hence, the estimate of the range *r* that can be obtained by the nonlinear least squares (NLS) or the maximum likelihood (ML) methods can provide an insight into the feasible size of the neighborhood. (In the statistical literature, the ML estimation method is typically preferred over the NLS method because ML provides more accurate estimates of the parameters although it is more computationally expensive than NLS.) In general, the assumption of isotropy can be elaborated upon and a directional variogram can be considered instead. Elaboration of the weak stationarity assumption is a challenging statistical problem of spatiotemporal modeling and can be approached, for example, by a Bayesian methodology.
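As a sketch of the NLS estimation, the snippet below fits one common parameterization of the exponential model, γ(d) = ρ + σ²(1 − e^{−d/r}), to synthetic noise-free semivariogram values; for a fixed range the nugget and sill follow from ordinary least squares, so only the range needs a one-dimensional search.

```python
import numpy as np

def exp_variogram(d, nugget, sill, rng):
    # common exponential semivariogram: nugget + sill * (1 - exp(-d / range))
    return nugget + sill * (1.0 - np.exp(-d / rng))

# synthetic empirical semivariogram values (hypothetical bias data)
rng_true, nugget_true, sill_true = 120.0, 0.5, 2.0
d = np.linspace(5.0, 400.0, 80)
gamma_emp = exp_variogram(d, nugget_true, sill_true, rng_true)

# crude NLS by grid search over the range parameter; for a fixed range r,
# nugget and sill come from least squares on the design [1, 1 - exp(-d/r)]
best = None
for r in np.arange(10.0, 400.0, 1.0):
    X = np.column_stack([np.ones_like(d), 1.0 - np.exp(-d / r)])
    coef = np.linalg.lstsq(X, gamma_emp, rcond=None)[0]
    sse = np.sum((X @ coef - gamma_emp) ** 2)
    if best is None or sse < best[0]:
        best = (sse, r, coef)

sse, r_hat, (nugget_hat, sill_hat) = best
print(round(float(r_hat)), round(float(nugget_hat), 2), round(float(sill_hat), 2))
```

With noise-free synthetic values the search recovers the generating range, nugget, and sill exactly; with real empirical semivariogram values the fit would of course be approximate.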

The similarity of the land use information of the grid point and potential neighbor stations is the essential criterion for the selection of the neighbors (C. Mass 2004, personal communication; Wedam et al. 2005). Since not all land use categories have a sufficient number of representatives, we combine them into groups based on their physical properties and a statistical analysis of the estimated bias.

Wedam et al. (2005) propose an obs-based method that also takes into account regime changes. In particular, in addition to choosing neighbors based on proximity, land use, and elevation, they suggest taking into account only prior dates on which neighbors had weather forecasts similar to those at the site of interest on any chosen day. The obs-based method suggests going back in time far enough to get approximately four to seven matches, which are then simply averaged across approximately five neighbors. However, questions remain in regard to the obs-based method: how far one should go geographically to define the neighbors; why and how to choose a predefined number of neighbors, for example, five neighbors; and how to deal with a shortage of matches and neighbors within a reasonable time window. The neighborhood estimated from the semivariogram will generally supply a richer spatial bias representation while providing a more solid statistical justification for the choice of neighbors. (Obtaining a richer spatial representation implies mining more diverse information on the bias from various spatial locations while still making use of only relevant sources of information. The variogram method cannot be considered a single optimal recipe for choosing the size of a neighborhood but can be viewed as a theoretically justified rule of thumb for selecting a radius of vicinity.)

Intuitively, the more recent information provided by the neighbor stations should have a greater impact on the current bias at a grid point and, hence, should be appropriately weighted. To choose a weighting scheme, we perform a time series analysis of the bias on any given Julian day aggregated over the available neighbor stations. We consider two weighting schemes: simple averaging, or no weighting—that is, the information is simply aggregated over the selected time window *T*; and exponential smoothing—that is, the information from *t* days beforehand is weighted as *e*^{−*t*}. Each weighting scheme is evaluated empirically in our case study.
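The two weighting schemes can be sketched as follows. The bias series is hypothetical, and normalizing the exponential weights e^{−t} to sum to one is our assumption about the implementation:

```python
import numpy as np

def weights(T, scheme="SA"):
    """Weights for biases observed t = 1..T days before initialization.
    SA: simple averaging (equal weights); EW: exponential smoothing, w_t ~ e^{-t}."""
    t = np.arange(1, T + 1, dtype=float)
    w = np.ones(T) if scheme == "SA" else np.exp(-t)
    return w / w.sum()  # normalize so the weights sum to one

past_biases = np.array([2.0, 1.0, 0.0, -1.0, 3.0])  # t = 1 is the most recent day
for scheme in ("SA", "EW"):
    w = weights(len(past_biases), scheme)
    print(scheme, round(float(w @ past_biases), 3))
```

Under EW the most recent bias dominates the weighted estimate, whereas SA treats all days in the window equally.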

In general, the LOB method can be extended to also include diurnal information, climatology, a station type, the “improve to hurt” ratios, etc. However, for now in order to illustrate the main statistical methodology, we restrict our analysis to a simplified version of the daily information and select neighbors in terms of the following criteria:

- geographical proximity, for example, as estimated by the empirical semivariogram method,
- the same land use information group, and
- similar terrain height.

The issue of what a typical (meteorological) observation point represents and how the observed values at this point are related to the forecasted values at the grid cell is an established scientific problem that is known in statistics as a change of support problem (COSP). In general, COSP refers to situations in which the data are acquired at one scale of resolution while it is necessary to provide an inference about what would be expected at a different scale. In the current framework, our focus is to utilize observations at irregularly located meteorological stations in order to draw some inference about the bias-corrected forecast at a 12-km grid scale. To illustrate the problems that arise from COSP and representativeness issues, consider a situation where we utilize the raw land use information at 1-km resolution to get the predominant land use type for the 12-km grid cell. Let a given 12-km grid cell contain 60% of land use category A and 40% of land use category B. Hence, the assigned land use type would be A. Now suppose that we have two meteorological stations equidistant from the grid cell center. One station belongs to the A type and the other belongs to the B type. Suppose that the observed temperature at the A type station is 25°C and at the B type station it is 20°C. Hence, the correct value for the grid cell is 0.6 × 25°C + 0.4 × 20°C = 23°C. Let the forecasted value for the grid cell be 27°C. In the currently proposed method, that is, if no information about the land use distribution for a grid cell is available, we shall use only the A type station as it belongs to the dominant group. Therefore, the bias adjustment will be 27°C − 25°C = 2°C, which implies that the "corrected" forecast at the grid cell will be 27°C − 2°C = 25°C and will not coincide with the correct value of 23°C. 
In contrast, if we also utilize the B type station, the bias adjustment will be 0.6(27°C − 25°C) + 0.4(27°C − 20°C) = 4°C, which will lead to the correct value of 27°C − 4°C = 23°C at the grid cell. Hence, knowledge of the distribution of the land use information is essential for operational grid-based bias removal. Due to the lack of such information, it is not included in the model training in the current paper. However, if such information is known, it can be relatively easily incorporated into the proposed methods without requiring any major methodological updates, for example, as an extra predictor in CART–ACE and/or to optimally select a valid group of neighbors, and should then necessarily be taken into account. Now suppose that the B type station does not provide the most recent observations. The questions that arise are whether the correct value for the grid cell should be calculated as 0.6 · 25°C, as 25°C, or whether the missing information about the B type station should first be inferred from other sources and then incorporated into the analysis. The other questions are what happens if the two stations are not equidistant from the center of the grid cell, whether the center of the grid cell should be used as the reference location for the grid cell, and whether there exist some other variables affecting the bias at the grid cell that are not yet taken into account. Finally, all of these questions are tightly related to the issue of how we plan to verify our results, that is, versus observations at the meteorological stations, versus the reanalysis, or versus a combination of the two. Thus, the representativeness issue is a challenging problem that in many situations does not have a precise solution but can be approximated with varying degrees of accuracy and complexity. For more discussion on COSP and the related representativeness issue, see Cressie (1993), Pielke et al. (2000), and Christy (2002), and references therein.
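The arithmetic of the worked grid-cell example above can be checked directly:

```python
# worked COSP example from the text: 60% land use A, 40% land use B
obs_A, obs_B, forecast = 25.0, 20.0, 27.0

truth = 0.6 * obs_A + 0.4 * obs_B           # 23.0, the "correct" cell value
adj_dominant = forecast - obs_A              # 2.0, using only the A-type station
adj_weighted = 0.6 * (forecast - obs_A) + 0.4 * (forecast - obs_B)  # 4.0

print(round(forecast - adj_dominant, 6))     # dominant-type adjustment misses the target
print(round(forecast - adj_weighted, 6))     # land-use-weighted adjustment recovers it
```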

In addition, we also consider a robust modification of the LOB method (RLOB) that is less sensitive to outliers (anomalies) in the historical data on the observed biases at the neighbors. One can potentially avoid the usage of obvious outliers, that is, anomalous data, in the bias removal process by detecting and omitting those extreme observations. Since biases are approximately normally distributed, we can construct a two-sided 95% confidence interval (CI) of all of the observed biases at the chosen neighbors over the time window *T*. To use standard normal statistics, we standardize the biases by subtracting the mean and dividing by the median absolute deviation (MAD) from the median, which is a robust measure of the standard deviation. Subsequently, we make use only of observations that fall within the obtained 95% CI. This approach is adaptive in time and space and to model changes, since the MAD, the mean, and, hence, the 95% CI are re-estimated for every neighborhood over the time window *T*; it thereby avoids the usage of anomalous data in the bias removal process.
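A sketch of this screening step is below. The bias values are synthetic, and the ~1.4826 consistency factor that makes the MAD estimate a standard deviation is our assumption; the text does not specify whether it is applied.

```python
import numpy as np

def rlob_filter(biases, z_crit=1.96):
    """Drop biases outside a two-sided 95% CI, standardized as in RLOB:
    subtract the mean and divide by the median absolute deviation (MAD).
    The 1.4826 factor scales MAD to estimate the standard deviation under
    normality (an assumption about the implementation)."""
    b = np.asarray(biases, dtype=float)
    mad = 1.4826 * np.median(np.abs(b - np.median(b)))
    z = (b - b.mean()) / mad
    return b[np.abs(z) <= z_crit]

biases = [-1.5, -0.5, 0.0, 0.5, 1.5, 1.0, -1.0, 8.0]  # 8.0 K is an anomaly
kept = rlob_filter(biases)
print(kept)
```

The anomalous 8.0 K bias falls outside the 95% CI and is excluded, while the remaining values pass through.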

### b. The CART–ACE method

The idea behind the CART–ACE method is to construct a general linear spatiotemporal model of bias, which may then be applied to any site and, in particular, to any grid location, at any point in time, independently of the existence of an observation archive at the location. We divide the modeling procedure into three steps: the analysis of the spatial variation of bias, the analysis of the temporal variation of bias, and finally the integration of the first two steps into a spatiotemporal model of bias.

At the first stage, we consider the linear regression model

$$Y = \beta_0 + \sum_j \beta_j X_j + \varepsilon, \qquad (2.3)$$

where *Y* is the temporally aggregated bias and the predictors *X*_{j} are spatial variables, for example, longitude, latitude, terrain height, land use information, and possibly their combinations.

The general idea of CART–ACE consists of utilizing two methods of modern regression to predict *Y*: the classification and regression trees (CARTs; see Breiman et al. 1984; Fielding 1999; Burrows et al. 1995; Spark and Connor 2004) and the alternative conditional expectation (ACE; Breiman and Friedman 1985; Buja and Kass 1985; Hastie and Tibshirani 1990). The CART method splits latitude, longitude, and elevation into new binary predictors; in other words, CART divides the whole domain into relatively homogeneous local subdomains, and observations in each subdomain are treated similarly. The ACE method makes use of spline functions of latitude, longitude, and elevation as predictors in the model. Such an approach increases the predictive power of the resulting additive model. Both CART and ACE are nonparametric methods of modern regression and aim to capture nonlinearity in bias, which is not possible to model using classical linear regression. A more detailed description of CART and ACE is given below.

The aim of the CART method is to optimally split *X*_{j} and provide a set of new binary predictors. Rather than attempt an explicit global linear model for prediction or interpretation, tree-based models seek to bifurcate the data, recursively, at critical points of the predictor variables in order to divide the data ultimately into groups that are as homogeneous as possible within and as heterogeneous as possible between. The results often lead to insights that other data analysis methods tend not to yield. Constructing trees may also be seen as a type of variable selection. Possible interactions between variables are handled automatically, as are (to a large extent) monotonic transformations of the *Y* and *X*_{j} variables. These issues reduce to deciding which variables to divide on and how to achieve each split. In a regression tree, each terminal node gives a predicted value, and the prediction is constant over each cell of the partition induced by the leaves of the tree. Using the regression tree allows us to obtain a maximum of *n* binary predictors, corresponding to the *n* terminal nodes of the tree.

The relationship between *Y* and *X*_{j} is likely to be described by nontrivial nonlinear functions. The ACE algorithm smoothly transforms *X*_{j} in order to maximize the correlation between the response and the transformed predictors. In particular, the model defined by (2.3) may be replaced by the linear additive model

$$Y = \beta_0 + \sum_j f_j(X_j) + \varepsilon. \qquad (2.4)$$

Here, *f*_{j} are some smooth nonlinear functions. Since the *f*_{j} are rather general, the new model (2.4) obviously subsumes the old model (2.3).
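To make the splitting idea concrete, the following sketch finds a single CART-style split on synthetic elevation/bias data (a one-split toy version, not the full recursive algorithm) and turns it into a binary predictor:

```python
import numpy as np

def best_split(x, y):
    """One CART step: the threshold on x that minimizes the within-group
    sum of squared errors of y, i.e. the most homogeneous binary split."""
    best = (np.inf, None)
    for thr in np.unique(x)[:-1]:
        left, right = y[x <= thr], y[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, thr)
    return best[1]

# hypothetical data: the bias jumps for stations above ~500 m elevation
elev = np.array([100., 200., 300., 400., 600., 700., 800., 900.])
bias = np.array([0.4, 0.5, 0.6, 0.5, 2.1, 2.0, 2.2, 1.9])

thr = best_split(elev, bias)
indicator = (elev > thr).astype(int)  # new binary predictor for the regression
print(thr, indicator)
```

The full CART algorithm applies this search recursively within each subdomain and over all predictors, yielding one binary indicator per terminal node.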

The overall goal of ACE is to "linearize" the relationship between the predictand *Y* and predictors *X*_{j}. In particular, ACE finds transformations of predictors, *f*_{j}, that make the relationship between *Y* and the "new" transformed predictors *f*_{j}(*X*_{j}) as linear as possible and automatically selects smooth functions *f*_{j} that maximize the correlation between the new predictors *f*_{j}(*X*_{j}) and *Y*. ACE can be viewed as a nonparametric generalization of polynomial fitting methods in linear regression; that is, the optimal functions *f*_{j} suggested by ACE could be splines. The degree of smoothness for the functions *f*_{j} is chosen automatically. Since ACE is a nonparametric technique, the transformations *f*_{j} are typically not available in an analytic form. Therefore, we utilize a graphical output of ACE. In particular, ACE produces scatterplots between the "old" predictors *X*_{j} and the new transformed predictors *f*_{j}(*X*_{j}). Such scatterplots may indicate possible nonlinearities in the model and suggest an appropriate approximate parametric transformation of the predictors. For example, nonlinearities can be graphically approximated by *m* piecewise linear functions. [For a more detailed description and application of the ACE methodology, see Hastie and Tibshirani (1990).]

At the second stage, we model the temporal variability of bias. We calculate the mean bias, that is, the mean difference between the forecasts and observations on any given Julian day, aggregated over the available model domain. In general, an extended analysis on the selection of the spatial aggregation domain can be performed using partitioning by CART. Typically, the best predictors of the temporal variability of bias are a set of *k* circular functions with different periods, that is, sin–cos of Julian days. This procedure of our method can be extended further by incorporating the diurnal information. However, for simplicity we focus only on daily biases.

Combining the two stages described above provides *n* + *m* + *k* potential predictors. However, not all of the obtained predictors will be statistically significant. At the final stage of our algorithm, we apply stepwise variable selection based on the Akaike information criterion (AIC; Weisberg 1985). For the regression model with *L* predictors, AIC is defined as

$$\mathrm{AIC} = N \log\!\left( \sigma^2_L / N \right) + 2L,$$

where *σ*^{2}_{L} is the sum of the squares of the regression residuals and *N* is the number of observations. AIC is a goodness-of-fit measure for an estimated statistical model that rewards higher model accuracy and penalizes an increasing number of model parameters, thus discouraging overfitting. The model with the lower AIC is preferred. Retaining only the statistically significant predictors results in a general spatiotemporal linear regression model that is directly applicable to estimating (interpolating) bias at an arbitrary grid point within the boundaries of the considered model domain. Upon some verification analysis, the obtained spatiotemporal model can potentially be used for predicting (extrapolating) bias outside of the considered model domain.

## 3. Applications

### a. Data

Daily 48-h MM5 forecasts of surface temperature (T2) on a 12-km grid, initialized at 0000 UTC from the Aviation Model (AVN) of the National Weather Service’s National Centers for Environmental Prediction (NWS/NCEP), are gathered during the period from 3 January through 31 December of 2001 and 2002 over the Pacific Northwest. The data consist of observations and bilinearly interpolated forecasts of surface temperature (T2) as well as latitude, longitude, elevation, and land use categories at irregularly spaced meteorological stations. The data archive is operated and maintained as a part of the Pacific Northwest regional numerical prediction effort^{1} (Mass et al. 2003).

There are 24 land use categories within the MM5’s coarse resolution of 12 km, of which only 14 categories are represented in this dataset (see Table 1). Detailed land use information for this domain has been derived from the 1-km U.S. Geological Survey (USGS) digital database, with some subjective modification from other data sources. Since not every land use category has enough representatives to support a reliable analysis, we divide the land use information into four groups according to their physical properties and the mean error pattern: group I includes land use category 1; group II includes categories 2, 3, 5, 6, 7, 8, and 10; group III includes categories 11, 14, 15, 19, and 21; and finally group IV consists of category 16. The meteorological data from the same land use group are treated identically. Generally speaking, an acceptable observation site needs to have raw land use values within the land use group of the grid point of interest. Hence, if the default MM5 initialization software is used and the raw land use information (1-km resolution) is utilized to get the predominant land use type for the 12-km grid cell, it implies that for a given 12-km grid cell containing 60% of category 7 “grassland” (group II) and 40% of category 16 “water” (group IV), the assigned land use type would be “grassland,” or “group II.” Thus, acceptable observation sites need to have raw 1-km land use values within group II. However, if extra information is available, for example, that a given 12-km grid cell contains 60% of category 7 grassland (group II) and 40% of category 16 water (group IV), then it is possible to include the sites from the water land use group IV with the corresponding weights, that is, weights relative to the fraction of water in the grid cell or alternative weights obtained from model training using historical biases.

To estimate the semivariogram in the LOB method and to train the model of the CART–ACE method, we use the observed biases at 100 (randomly selected) stations across the Pacific Northwest. We illustrate and test the proposed methods by application of bias removal at 40 randomly selected stations across the Pacific Northwest in 2002. For verification purposes, the above-mentioned 40 stations are not part of the training set of 100 stations from 2001.

### b. Results for the LOB method

The first step in the LOB method is to estimate the reasonable size of a neighborhood. Figure 1 shows the empirical semivariogram of the biases observed at 100 stations from 3 January to 31 December 2001. The estimate of the range *r* obtained by the ML method is approximately 120 km, which implies that the stations located in the circle of 120-km radius are somewhat correlated while an extension of the neighborhood over 120 km will not generally provide any improvements. (For the considered variogram, both the ML and NLS methods have provided similar estimates of the range *r*.) In terms of elevation, we empirically use the range of ±200 m.

Therefore, we define the neighborhood as all stations that are located within the 120-km radius relative to the site of interest; are within ±200 m of the difference between the terrain height at the site of interest, that is, the 12-km grid point, and the terrain height at the observation site; and belong to the same land use information group.

Figure 2 shows the MAE over the period from 3 January to 31 December 2001 measured for the LOB and RLOB methods as well as for the obs-based method of Wedam et al. (2005) with the simple averaging (SA) and exponential weighting (EW) schemes. Based on the LOB-SA and obs-based curves in Fig. 2, the optimal time window *T* for the simple averaging scheme is approximately 2 weeks. Extending the time window by an additional 2 weeks does not provide any significant improvement and in fact deteriorates the results of the obs-based method.

Table 2 presents the obtained verification results for MAE, the mean, and the standard deviation of the “raw” and “corrected” biases along with the improve-to-hurt ratios for more than 2 K and the relative changes with respect to raw MAE in percent for the overall period from 3 January to 31 December 2002 and for each quarter of 2002. The improve-to-hurt ratios are defined as the fraction of cases, calculated for all stations over the considered period, when the improvement in the absolute value of bias is more than 2 K versus the fraction of cases when the deterioration in the absolute value of bias is more than 2 K.^{2} Among the neighbor-based methods, the superior MAE reduction of 14.8% has been provided by the RLOB with the EW scheme (RLOB-EW). The second-best result, an MAE reduction of 12.4%, is provided by the robust LOB with the SA scheme (RLOB-SA), which is followed by a 10.7% reduction using the obs-based method with the EW scheme. The smallest MAE reduction of 10.1% has been shown by the original obs-based method of Wedam et al. (2005). In terms of variance and the mean of the bias, all neighbor methods perform similarly. Similar results are obtained for each quarter of 2002, with the best reduction of MAE seen in July–September 2002 and the worst reduction in January–March 2002.

Finally, the best improve-to-hurt ratio of 3.1 for the overall period of 2002 in Table 2 has been shown by the original obs-based method, followed by 3.0 for RLOB with the EW scheme. The obs-based method with the EW scheme and the RLOB method with the SA scheme show similar results of 2.4 and 2.3, respectively.

### c. Results for the CART–ACE method

As the first step of the CART–ACE algorithm, we focus on the spatial variation of bias. We aggregate the bias over the first 3 months of 2001 across 100 stations in the Pacific Northwest. This time window has been chosen empirically in order to minimize random fluctuations.

Here, lon, lat, and alt denote longitude, latitude, and terrain height, respectively. Since the AIC penalizes the inclusion of predictors that are not sufficiently important for model fitting, some of the intermediate predictors are dropped and only the eight most significant predictors remain in the final regression model. The first three of the binary predictors correspond to the ACE transformations of the terrain height and longitude, and the rest of the predictors correspond to the six CART nodes.

To model the temporal variability of bias, we average it daily across the 100 stations and fit a linear regression with four pairs of the circular functions cos(2*πjd*/365) and sin(2*πjd*/365) for *j* = 1, 2, 3, and 4, where *d* is the Julian day of the year. All of these circular predictors are found to be statistically significant.
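A minimal sketch of this harmonic regression, fit by ordinary least squares on synthetic daily bias data with illustrative coefficients (not the fitted values from the paper):

```python
import numpy as np

def harmonic_design(days, n_harmonics=4):
    """Design matrix with an intercept and n_harmonics pairs of
    annual cos/sin harmonics, as in the temporal bias model."""
    days = np.asarray(days, dtype=float)
    cols = [np.ones_like(days)]
    for j in range(1, n_harmonics + 1):
        cols.append(np.cos(2 * np.pi * j * days / 365.0))
        cols.append(np.sin(2 * np.pi * j * days / 365.0))
    return np.column_stack(cols)

# Synthetic daily station-averaged bias with an annual cycle:
rng = np.random.default_rng(0)
d = np.arange(1, 366)
true = -1.0 + 0.8 * np.cos(2 * np.pi * d / 365) + 0.3 * np.sin(4 * np.pi * d / 365)
y = true + 0.1 * rng.standard_normal(d.size)

X = harmonic_design(d)                          # 365 x 9 (intercept + 4 pairs)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)    # ordinary least squares fit
fitted = X @ coef
```

The significance of each harmonic pair could then be checked with the usual *t* and *F* tests of linear regression.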

Next, we combine eight spatial binary predictors, four pairs of circular temporal predictors, and five binary predictors corresponding to the land use groups. This leads to a spatiotemporal model for bias that may be applied at every site of interest at every point in time.
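How such a combined design matrix could be assembled is sketched below; the column counts (8 spatial binaries, 4 harmonic pairs, 6 land use groups with a dropped reference category) follow the text, while the random values are purely illustrative:

```python
import numpy as np

n = 10  # site-day rows in this toy example
rng = np.random.default_rng(1)

# 8 binary spatial predictors (ACE transformations and CART nodes):
spatial = rng.integers(0, 2, size=(n, 8)).astype(float)

# 4 pairs of circular temporal predictors from the Julian day:
d = rng.integers(1, 366, size=n).astype(float)
temporal = np.column_stack(
    [f(2 * np.pi * j * d / 365.0) for j in range(1, 5) for f in (np.cos, np.sin)]
)

# 5 binary predictors for 6 land use groups (reference category dropped
# to avoid collinearity with the intercept):
landuse = np.eye(6)[rng.integers(0, 6, size=n)][:, 1:]

X = np.hstack([np.ones((n, 1)), spatial, temporal, landuse])
# X has 1 + 8 + 8 + 5 = 22 columns: intercept plus all predictors.
```

Because the spatial and land use predictors are defined from site coordinates and land use maps, the fitted model can be evaluated at any site and any day, which is what makes the spatiotemporal correction applicable away from observing stations.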

As a natural extension of the CART–ACE method, we combine it with the LOB technique. In particular, we use the RLOB-EW method as an extra predictor in our spatiotemporal CART–ACE model.

Table 2 presents the results of the regression-based methods for 40 randomly selected stations for the whole year of 2002 and for each quarter of 2002. The largest MAE reduction, 20.5%, is shown by the combined CART–ACE-RLOB method, followed by a 17.1% MAE reduction provided by the “raw” CART–ACE method. The third-best result, 14.8%, is provided by the RLOB-EW method. Analogous performance is seen in each quarter of 2002. In general, the neighbor-based methods show noticeably poorer results in reducing the MAE. In terms of the mean and variance of the “corrected” bias, the regression-based methods show results similar to those of the neighbor-based approaches. Finally, the raw CART–ACE method shows the best improve-to-hurt ratio among all of the proposed methods, 3.5, followed by 3.1 for the original obs-based method of Wedam et al. (2005) and 3.0 for the RLOB-EW method.

Figure 5 shows the spatial distribution of biases on a particular day and presents an application of the proposed methods to the correction of bias in the 48-h MM5 forecast of surface temperature (initialized on 10 September 2002) at 0000 central standard time (CST). There is a noticeable underestimation of the surface temperature over most of the domain, with the exception of a few areas of overestimation in the western part of Montana and along the border of Idaho and Nevada (see Fig. 5a). The obs-based method slightly corrects this overall negative bias and provides noticeable improvement in central Oregon, although the temperature is still substantially underestimated over most of the domain (see Fig. 5b). In Fig. 5c the RLOB-EW method substantially improves the situation; that is, only a few spots of negative bias and some spots of positive bias remain. Finally, the combination of the CART–ACE and RLOB methods (CART–ACE-RLOB) provides a noticeably more uniform map of bias with only a few notable areas of over- or underestimated temperatures (see Fig. 5d).

## 4. Discussion

This paper has discussed two main approaches for the gridded bias correction in mesoscale numerical weather prediction: the LOB method, or the local neighbor approach, and the CART–ACE method, or the regression approach. The LOB method is based on a temporally weighted spatial composition of recent observations over a “neighborhood” of weather observing sites while taking into account land use categorization. The CART–ACE method is based on the spatiotemporal analysis of bias using modern statistical nonparametric regression techniques such as alternative conditional expectation (ACE) and regression trees. Both approaches can be applied to any site of interest without any history of bias measurement at that particular site.

Although in general the CART–ACE approach shows an appreciable overall improvement in MAE reduction compared to the LOB approach, the regression-based methods show little or no advantage in terms of the mean, the variance, and the improve-to-hurt ratio. The superiority of the CART–ACE approach in MAE reduction stems mainly from removing bias at stations with no neighbors. A disadvantage of the CART–ACE method is that it requires a longer training time window *T*; for example, in the case of the Pacific Northwest, we need about three training months for the CART–ACE method compared to two training weeks for the LOB method. A final, and not least, disadvantage of the CART–ACE approach is that the method is not fully automatic; that is, at one step, one needs to provide a reasonable piecewise-linear approximation to the ACE and CART transformations, which may require some manual analysis. Thus, the overall conclusion is to use the CART–ACE method mostly for sparse areas with no neighbors within a reasonable proximity, or for locally run projects that do not require fully automatic bias removal under a very high data burden.

The LOB method proposed here is based on a combination of three elements: exponential weighting in time, a semivariogram-defined spatial representation of neighbors, and a robust modification (robustification) against anomalies (i.e., elimination of outliers). The analysis of the LOB method and the comparison with the obs-based method of Wedam et al. (2005) show the superiority of the exponential weighting (EW) scheme in time over the simple averaging (SA) scheme. In addition, the EW scheme requires a training time that is half as long, that is, about 1 week versus 2 weeks for the SA scheme (see Fig. 2). The richer spatial representation, with the neighborhood defined by the semivariogram, provides a greater reduction of the MAE than does the obs-based method with its subjective choice of neighbors. This enrichment of the spatial representation, acting as an additional source of information, may also be intuitively interpreted as an “increase” of the sample size and as a stabilizing factor against random fluctuations in the data. Finally, the robust modification of the LOB method with respect to anomalies (outliers) proves to be a very valuable asset, since the best results for MAE reduction are given by the robust version of the LOB (RLOB) method, which falls slightly behind the regime-adaptive obs-based method only in terms of the improve-to-hurt statistics for the set of eight stations. To take regime changes into account, the natural approach is to combine RLOB with the idea of Wedam et al. (2005) and use only dates with similar parameter values. However, an open question remains: how similar is similar? One can use a more statistically justified framework, that is, run a correlation test and use only dates for which the correlation coefficient is greater than a certain selected threshold.
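The three elements can be combined in a sketch like the following; the median/MAD outlier screen used here is one common robustification and stands in for the paper's exact rule, and `alpha` and `z` are hypothetical settings:

```python
import numpy as np

def robust_ew_neighbor_bias(neighbor_biases, alpha=0.8, z=3.5):
    """Robust exponentially weighted neighbor bias estimate.

    neighbor_biases: 2D array, rows = neighboring stations,
    columns = days (oldest first).  Outlying values are screened
    with a median/MAD rule before the exponentially weighted
    average; this screen is one common robustification, not
    necessarily the paper's exact rule.
    """
    b = np.asarray(neighbor_biases, dtype=float)
    med = np.median(b)
    mad = np.median(np.abs(b - med)) + 1e-12
    keep = np.abs(b - med) / (1.4826 * mad) < z   # mask of non-outliers
    lags = np.arange(b.shape[1] - 1, -1, -1)      # most recent day: lag 0
    w = np.broadcast_to(alpha ** lags, b.shape)   # exponential time weights
    return float(np.sum(w * b * keep) / np.sum(w * keep))

# Three neighbors over three days; 25.0 mimics a sensor anomaly:
nb = np.array([[-1.0, -1.2, -0.9],
               [-1.1, -0.8, -1.0],
               [-1.0, 25.0, -1.1]])
```

With the anomalous value screened out, the estimate stays near the true cold bias of about -1 K instead of being dragged toward the outlier.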
In conclusion, a natural future extension of the neighbor approach is to combine the following components of the proposed RLOB method and of the obs-based method of Wedam et al. (2005) into one method: the robust modification toward outliers (anomalies), exponential weighting in time, the semivariogram-defined neighborhood, and the regime-adaptive scheme.

The forecast bias is a nonstationary process in time and space. Hence, the optimal definition of “neighborhood” may vary from one geographical area to another and between seasons. Therefore, another possibility for further development of the LOB approach may be an analysis of the optimal definition of neighborhoods. For example, one can divide a spatial domain into local subdomains using the regression tree and proceed with a detailed study in each local subdomain.

Another extension of the LOB method is the application of the variogram method only to stations of similar land use type and elevation, which should extend correlations much further and significantly improve the method. However, utilizing only stations of similar land use type and elevation leads to nonsimple connectivity of the data, which implies that the Euclidean metric is no longer suitable. For example, suppose that we restrict locations to a selected land use type, say “evergreen forest,” and then apply the Euclidean metric between sites belonging to the evergreen forest category. Some areas of evergreen forest may be partially surrounded by other land use areas, for example, by “water bodies,” or may even be isolated from each other. Mathematically speaking, this leads to nonsimple connectivity of the selected land use type (evergreen forest); that is, for some pairs of evergreen forest locations there exists no line segment connecting them that lies entirely within evergreen forest. Hence, the Euclidean distance is no longer a good indicator of similarity between two locations, and a variogram based on Euclidean distances will be unreliable. A possible solution is to use the city-block, or Manhattan, distance instead of the Euclidean metric. However, this approach poses a challenging mathematical problem and constitutes a topic for an independent study in statistics. Some applications of city-block distances have recently appeared in the statistical literature but have not yet been thoroughly investigated (see Krivoruchko and Gribov 2004).
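The difference between the two metrics is easy to illustrate; the coordinates below are hypothetical projected positions (in km) of two forest sites:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two sites (projected coordinates)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def manhattan(p, q):
    """City-block (Manhattan) distance: sum of coordinate differences."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Two hypothetical evergreen-forest sites separated by a water body:
a, b = (0.0, 0.0), (3.0, 4.0)
# euclidean(a, b) -> 5.0, manhattan(a, b) -> 7.0
```

A variogram built on city-block distances therefore assigns larger separations to the same pair of sites, and the Manhattan distance is always at least as large as the Euclidean one.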

In addition, it should be noted that although only the case of daily data is considered here, the LOB approach may be extended further to incorporate diurnal information; the type of station, that is, the agency to which the neighboring station belongs; climatology; etc. For example, there are a few possible approaches for incorporating diurnal information into the modeling process. Depending on the data under consideration, the diurnal information may be coded as an extra binary predictor, as a circular predictor, or as a product of binary and circular predictors in the CART–ACE method. If the diurnal variation does not significantly depend on the seasonal variation, which is unlikely to be the case, and data are observed, for example, only twice per day, then one can include just a binary predictor in the CART–ACE method. If data are observed more frequently but there is still no strong correlation between the diurnal and seasonal variations, one can include an extra circular diurnal predictor in the model. Finally, if there exists a significant correlation between the various temporal variations, one should include a product of circular diurnal and seasonal predictors. The significance of each predictor can be assessed through routine statistical methods of linear regression, such as *t* and *F* tests and various variable selection criteria. The question of diurnal information for the LOB method may be resolved by incorporating time-varying weighting schemes for different hourly periods, in which more recent observations still receive a higher weight, that is, a greater impact, than older observations, but the coefficients of those weighting schemes depend on the diurnal period under consideration. Estimation of the optimal time-varying coefficients can be performed by time series analysis of historical diurnal bias and, from a more general perspective, by methods of Bayesian statistics.
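One way such a time-varying weighting scheme might look is sketched below; the day/night split and the per-period decay constants are hypothetical choices, not taken from the paper:

```python
import numpy as np

def diurnal_ew_bias(biases, hours, alpha_by_period):
    """Exponentially weighted bias with a decay rate that depends on
    the diurnal period of each past observation.

    biases, hours: past observations (oldest first) and their
    hour-of-day; alpha_by_period maps a period label ('day'/'night'
    here, a hypothetical split) to a decay constant.
    """
    b = np.asarray(biases, dtype=float)
    lags = np.arange(b.size - 1, -1, -1)          # most recent: lag 0
    alphas = np.array([alpha_by_period['day' if 6 <= h < 18 else 'night']
                       for h in hours])
    w = alphas ** lags                            # period-dependent decay
    return float(np.sum(w * b) / np.sum(w))

# Nighttime observations decay faster (alpha = 0.7) than daytime ones:
est = diurnal_ew_bias([-2.0, -1.0, -1.5, -0.5],
                      hours=[0, 6, 12, 18],
                      alpha_by_period={'day': 0.9, 'night': 0.7})
```

Fitting the decay constants themselves, rather than fixing them, is where the time series or Bayesian analysis mentioned above would enter.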

In general, all of this additional information may be incorporated into an adaptive weighting of neighbors, which can generally be fully automatic. This may be implemented, for example, using Bayesian networks. Such an approach is similar to data assimilation methods, which may be interpreted as a combination of various sources of information (Daley 1991).

## Acknowledgments

The data have been kindly provided by the research group of Professor Clifford Mass in the Department of Atmospheric Sciences at the University of Washington. The author is grateful to C. Mass, Z. Toth, R. Steed, and M. Albright for very helpful discussions and comments and for providing the data. The author also wishes to thank A. Raftery and T. Gneiting for very stimulating discussions, suggestions, and comments, which are hard to overestimate. This research was supported in part by the Natural Sciences and Engineering Research Council of Canada. This work was made possible by the facilities of the Shared Hierarchical Academic Research Computing Network (SHARCNET; see http://www.sharcnet.ca).

## REFERENCES

Applequist, S., G. E. Gahrs, and R. L. Pfeffer, 2002: Comparison of methodologies for probabilistic quantitative precipitation forecasting. *Wea. Forecasting*, **17**, 783–799.

Breiman, L., and J. H. Friedman, 1985: Estimating optimal transformations for multiple regression and correlation (with discussion). *J. Amer. Stat. Assoc.*, **80**, 580–619.

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone, 1984: *Classification and Regression Trees*. Wadsworth and Brooks/Cole, 358 pp.

Buja, A., and R. E. Kass, 1985: Estimating optimal transformations for multiple regression and correlation: Comment. *J. Amer. Stat. Assoc.*, **80**, 602–607.

Burrows, W. R., M. Benjamin, S. Beauchamp, E. R. Lord, D. McCollor, and B. Thompson, 1995: CART decision-tree statistical analysis and prediction of summer season maximum surface ozone for Vancouver, Montreal, and Atlantic regions of Canada. *J. Appl. Meteor.*, **34**, 1848–1862.

Christy, J. R., 2002: When was the hottest summer? A state climatologist struggles for an answer. *Bull. Amer. Meteor. Soc.*, **83**, 723–734.

Cressie, N., 1993: *Statistics for Spatial Data*. Wiley, 928 pp.

Daley, R., 1991: *Atmospheric Data Analysis*. Cambridge University Press, 457 pp.

Fielding, A. H., 1999: *Ecological Applications of Machine Learning Methods*. Kluwer Academic, 280 pp.

Gel, Y., A. E. Raftery, and T. Gneiting, 2004: Combining global and local grid-based bias correction for mesoscale numerical weather prediction models. Preprints, *17th Conf. on Probability and Statistics in the Atmospheric Sciences*, Seattle, WA, Amer. Meteor. Soc., CD-ROM, 1.9.

Glahn, H. R., 1985: Statistical weather forecasting. *Probability, Statistics, and Decision Making in the Atmospheric Sciences*, A. R. Murphy and R. W. Katz, Eds., Westview Press, 289–335.

Glahn, H. R., and D. A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. *J. Appl. Meteor.*, **11**, 1203–1211.

Hastie, T. J., and R. J. Tibshirani, 1990: *Generalized Additive Models*. CRC Press, 352 pp.

Kalnay, E., and Coauthors, 1996: The NCEP/NCAR 40-Year Reanalysis Project. *Bull. Amer. Meteor. Soc.*, **77**, 437–471.

Krivoruchko, K., and A. Gribov, 2004: Geostatistical interpolation and simulation with non-Euclidean distances. *GeoENV IV—Geostatistics for Environmental Applications*, X. Sanchez-Villa, J. Carrera, and J. J. Gomez-Hernandez, Eds., Kluwer Academic, 331–342.

Kuligowski, R. J., and A. P. Barros, 1998: Localized precipitation forecasts from a numerical weather prediction model using artificial neural networks. *Wea. Forecasting*, **13**, 1194–1204.

Mao, Q., R. T. McNider, S. F. Mueller, and H-M. H. Juang, 1999: An optimal model output calibration algorithm suitable for objective temperature forecasting. *Wea. Forecasting*, **14**, 190–202.

Marzban, C., 2003: Neural networks for postprocessing model output: ARPS. *Mon. Wea. Rev.*, **131**, 1103–1111.

Mass, C. F., 2003: IFPS and the future of the National Weather Service. *Wea. Forecasting*, **18**, 75–79.

Mass, C. F., and Y. H. Kuo, 1998: Regional real-time numerical weather prediction: Current status and future potential. *Bull. Amer. Meteor. Soc.*, **79**, 253–263.

Mass, C. F., and Coauthors, 2003: Regional environmental prediction over the Pacific Northwest. *Bull. Amer. Meteor. Soc.*, **84**, 1353–1366.

Nott, D. J., W. T. M. Dunsmuir, R. Kohn, and F. Woodcock, 2001: Statistical correction of a deterministic weather prediction model. *J. Amer. Stat. Assoc.*, **96**, 794–804.

Pielke, R. A., Sr., T. Stohlgren, W. Parton, N. Doesken, J. Moeny, L. Schell, and K. Redmond, 2000: Spatial representativeness of temperature measurements from a single site. *Bull. Amer. Meteor. Soc.*, **81**, 826–830.

Spark, E., and G. J. Connor, 2004: Wind forecasting for the sailing events at the Sydney 2000 Olympic and Paralympic Games. *Wea. Forecasting*, **19**, 181–199.

Stensrud, D. J., and J. A. Skindlov, 1996: Gridpoint predictions of high temperature from a mesoscale model. *Wea. Forecasting*, **11**, 103–110.

Stensrud, D. J., and N. Yussouf, 2003: Short-range ensemble predictions of 2-m temperature and dewpoint temperature over New England. *Mon. Wea. Rev.*, **131**, 2510–2524.

Tebaldi, C., 2002: Looking far back vs. looking around enough: Operational weather forecasting by spatial composition of recent observations. Preprints, *16th Conf. on Probability and Statistics in the Atmospheric Sciences*, Orlando, FL, Amer. Meteor. Soc., CD-ROM, 3.8.

Vislocky, R. L., and J. M. Fritsch, 1995: Generalized additive models versus linear regression in generating probabilistic MOS forecasts of aviation weather parameters. *Wea. Forecasting*, **10**, 669–680.

Wedam, G., C. F. Mass, and R. Steed, 2005: Grid-based bias removal of surface parameters. *Pacific Northwest Weather Workshop*, Seattle, WA, University of Washington and NWS. [Available online at http://www.wrh.noaa.gov/sew/WorkShop_05/session_3/Wedam.pdf.]

Weisberg, S., 1985: *Applied Linear Regression*. Wiley, 324 pp.

Wilks, D. S., 1995: *Statistical Methods in the Atmospheric Sciences*. Academic Press, 467 pp.

Wilson, L. J., and M. Vallee, 2002: The Canadian Updateable Model Output Statistics (UMOS) system: Design and development tests. *Wea. Forecasting*, **17**, 206–222.

Yuval, and W. H. Hsieh, 2003: An adaptive nonlinear MOS scheme for precipitation forecasts using neural networks. *Wea. Forecasting*, **18**, 303–310.

Summary of surface types and land use characteristics.

MAE, mean, std dev, and improve-to-hurt ratios (>2 K) of the biases of the MM5 surface temperature forecasts (K) before and after bias removal, and relative changes with respect to the “raw” MAE (%), at 40 randomly selected stations over the Pacific Northwest from 3 Jan to 31 Dec 2002. The time window *T* is 2 weeks. The RLOB-CART–ACE method utilizes the EW scheme. Boldface indicates calculations performed for the whole year (January–December 2002).

^{1}

Data have been kindly provided by the research group of C. Mass of the Department of Atmospheric Sciences at the University of Washington. Detailed information about the Pacific Northwest prediction effort and the associated data archive can be found online (www.atmos.washington.edu/mm5rt/info.html and www.atmos.washington.edu/marka/pnw.html, respectively).

^{2}

Mathematically, bias (*b*) can be defined as *b* = fcst − obs, where fcst and obs are the forecast and observed quantities, respectively. Hence, the bias mean and std dev over a space domain *S* and a time period *T* are defined as *b̄* = (1/*N*_{ST}) Σ_{s∈S} Σ_{t∈T} *b*_{st} and std_{b} = [(1/(*N*_{ST} − 1)) Σ_{s∈S} Σ_{t∈T} (*b*_{st} − *b̄*)²]^{1/2}, where *N*_{ST} is the number of points across *S* and *T*. MAE is defined as (1/*N*_{ST}) Σ_{s∈S} Σ_{t∈T} |*b*_{st}|. Finally, the improve-to-hurt ratio for more than 2 K is defined as the number of cases with |*b*| − |*b*_{corrected}| > 2 K divided by the number of cases with |*b*| − |*b*_{corrected}| < −2 K over *S* and *T*, where *b*_{corrected} is the “corrected” bias.