## Abstract

Some climate datasets are incomplete at certain places and times. A novel technique called the point estimation model of Biased Sentinel Hospitals-based Area Disease Estimation (P-BSHADE) is introduced to interpolate missing data in temperature datasets. Effectiveness of the technique was empirically evaluated in terms of an annual temperature dataset from 1950 to 2000 in China. The P-BSHADE technique uses a weighted summation of observed stations to derive unbiased and minimum error variance estimates of missing data. Both the ratio and covariance between stations were used in calculation of these weights. In this way, interpolation of missing data in the temperature dataset was improved, and best linear unbiased estimates (BLUE) were obtained. Using the same dataset, performance of P-BSHADE was compared against three estimators: kriging, inverse distance weighting (IDW), and spatial regression test (SRT). Kriging and IDW assume a homogeneous stochastic field, which may not be the case. SRT employs spatiotemporal data and has the potential to consider temperature nonhomogeneity caused by topographic differences, but has no objective function for the BLUE. Instead, P-BSHADE takes into account geographic spatial autocorrelation and nonhomogeneity, and maximizes an objective function for the BLUE of the target station. In addition to the theoretical advantages of P-BSHADE over the three other methods, case studies for an annual Chinese temperature dataset demonstrate its empirical superiority, except for the SRT from 1950 to 1970.

## 1. Introduction

Nearly all instrumental time series are affected by missing values (Simolo et al. 2010). Currently, there are two mainstream approaches to treat missing data. One is to use only a subset of continuous records in a dataset, and the other is to ignore missing data on the assumption that the remaining records represent one continuous series. In the former approach, prior information will be wasted and true statistical inferences cannot be made, whereas the latter approach shrinks the period of record and thus overestimates the likelihood of extreme events (Di Piazza et al. 2011; Tang et al. 1996).

To overcome these problems, a number of interpolation techniques have been developed, aimed at estimating missing observations in climatic time series (Kaplan et al. 2000; Li et al. 2003; Lin et al. 2002; Reynolds and Smith 1994; Simolo et al. 2010; Snell et al. 2000). These methods differ in concept and mathematical formulation (Burrough et al. 1998; Haining 2003). Among these, common methods for estimation of missing data in climate datasets are regression-based methods, kriging, and inverse distance weighting (IDW).

Regression-based approaches are frequently used to interpolate missing data (Eischeid et al. 1995, 2000). In these, records from surrounding stations or external information (Daly 2006; Stahl et al. 2006) are used as explanatory variables to develop a regression equation, which is subsequently used to estimate the missing data. In addition to the traditional methods, such as simple or multiple regression models (Allen and DeGaetano 2001; Di Piazza et al. 2011; Eischeid et al. 1995), more sophisticated models have been introduced. Geographically weighted regression (GWR) weights observation data by distance (Fotheringham et al. 2002). Spatial lag and spatial error models take spatial autocorrelation into account (Fischer and Wang 2011). The Parameter-Elevation Regressions on Independent Slopes Model (PRISM) allows incorporation of expert knowledge about climate into regressions (Daly et al. 2002). The spatial regression test (SRT) (Hubbard and You 2005a; Hubbard et al. 2005) uses each neighboring station to estimate the measurement at the station of interest; the value at the target station is then calculated as the weighted average of these estimates. The weights reflect the strength of the relationship, as quantified by the root-mean-square error (RMSE) between the station of interest and each neighboring station.

Other methods such as local interpolation only use climatic data from weather stations. Among these, IDW is usually used for missing data estimation (Di Piazza et al. 2011). It computes a weighted average of the surrounding stations, with weights given by the inverse distance between the target and each surrounding station. Several variants of IDW have been developed, with principal focus on weighting schemes (Teegavarapu and Chandramouli 2005; You et al. 2008).

Kriging and its variants (Goovaerts 1997; Isaaks and Srivastava 1989; Olea 1999) have been applied extensively to interpolation of climate data (Boer et al. 2001; Hudson and Wackernagel 1994; Jeffrey et al. 2001). These methods assure unbiased predictions with minimum spatial variance, according to spatial variation of the data (Burrough et al. 1998).

In this study, a method called the point estimation model of BSHADE (P-BSHADE) is introduced to interpolate missing data in a temperature dataset. It is a variant of the Biased Sentinel Hospitals-Based Area Disease Estimation (BSHADE) model proposed by Wang et al. (2011). BSHADE takes into account prior knowledge of geographic spatial autocorrelation and nonhomogeneity of target domains, remedies the biased sample, and maximizes an objective function for best linear unbiased estimation (BLUE) of the regional mean (total) quantity (Wang et al. 2011). P-BSHADE is developed here from BSHADE as a point estimation model. We assume in the model that the spatial distribution of temperature is nonhomogeneous. In addition, the correlation and ratio between stations are considered. The effectiveness of the technique is empirically evaluated in terms of an annual temperature dataset from 1950 to 2000 in China.

## 2. Method

### a. Objective

The objective is to interpolate missing records in the temperature dataset based on observed data from other stations. The estimate is a weighted sum of the records at the observed stations:

$$\hat{y}_0 = \sum_{i=1}^{n} w_i y_i, \tag{1}$$

where $w_i$ denotes the weight (contribution) of the $i$th observed station to the station to be interpolated, $y_i$ is the record at observed station $i$, and $\hat{y}_0$ is an estimate of the missing record $y_0$. The two required properties of the estimate of Eq. (1) are unbiasedness,

$$E(\hat{y}_0) = E(y_0), \tag{2}$$

and minimum estimation variance,

$$\min_{w}\left[\sigma_{\hat{y}_0}^{2} = E(\hat{y}_0 - y_0)^{2}\right]. \tag{3}$$

### b. Ratio between stations

The ratio between stations is one of the most important inputs for estimation of missing temperature records in a climate dataset, and it is an index of heterogeneity in the temperature spatial distribution. P-BSHADE is based on consideration of actual situations, in which the distribution of phenomena such as surface temperature is nonhomogeneous [i.e., $E(y_i) \neq E(y_0)$]. The relationship between two locations can be expressed as

$$E(y_i) = b_i E(y_0), \tag{5}$$

where $b_i$ is the ratio between temperatures at the two stations. Substituting Eq. (1) into the unbiasedness condition of Eq. (2) and using Eq. (5) gives $E(\hat{y}_0) = \sum_{i=1}^{n} w_i E(y_i) = E(y_0)\sum_{i=1}^{n} w_i b_i$, so the condition can be written as (see the appendix)

$$\sum_{i=1}^{n} w_i b_i = 1. \tag{6}$$

This equation is generally valid for a nonhomogeneous condition. Clearly, determination of $\hat{y}_0$ requires calculation of coefficients $w_i$ ($i = 1, \ldots, n$), which is addressed in the following section.

### c. Estimation of weight

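In view of the above considerations, the estimation problem is to find the weights $w_i$ ($i = 1, \ldots, n$) in Eq. (6) that satisfy the unbiased condition [Eq. (2)] and minimize estimation variance [Eq. (3)]. The second condition implies that these weights can be calculated by minimizing the estimation variance of Eq. (1), that is,

$$\sigma_{\hat{y}_0}^{2} = E(\hat{y}_0 - y_0)^{2} = C(\hat{y}_0, \hat{y}_0) + C(y_0, y_0) - 2C(\hat{y}_0, y_0), \tag{7}$$

where $C$ denotes statistical covariance between two stations at different locations. Minimizing with respect to weights $w_i$ ($i = 1, \ldots, n$) and taking into account the unbiasedness, Eq. (6) gives

$$\begin{bmatrix} C(y_1, y_1) & \cdots & C(y_1, y_n) & b_1 \\ \vdots & \ddots & \vdots & \vdots \\ C(y_n, y_1) & \cdots & C(y_n, y_n) & b_n \\ b_1 & \cdots & b_n & 0 \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_n \\ \mu \end{bmatrix} = \begin{bmatrix} C(y_1, y_0) \\ \vdots \\ C(y_n, y_0) \\ 1 \end{bmatrix}, \tag{8}$$

where $\mu$ is a Lagrange multiplier (see the appendix). The minimized estimation error variance can then be written as

$$\sigma_{\hat{y}_0}^{2} = \sigma_{y_0}^{2} - \sum_{i=1}^{n} w_i C(y_i, y_0) - \mu. \tag{9}$$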
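As a minimal sketch of how the weights and the Lagrange multiplier might be computed, the pure-Python code below assumes the linear system takes the usual Lagrangian form $\sum_j w_j C(y_i, y_j) + \mu b_i = C(y_i, y_0)$ for each $i$, together with $\sum_i w_i b_i = 1$. The function names and the small covariance values are hypothetical, chosen only for illustration, and the sign convention for $\mu$ is an assumption of this sketch.

```python
def solve_linear(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix, so row operations update both sides together.
    m = [[float(v) for v in row] + [float(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def pbshade_weights(cov, cov0, b):
    """Return (weights, mu) from the bordered system [[cov, b], [b, 0]]."""
    n = len(b)
    a = [list(cov[i]) + [b[i]] for i in range(n)]
    a.append(list(b) + [0.0])
    sol = solve_linear(a, list(cov0) + [1.0])
    return sol[:n], sol[n]


def pbshade_estimate(y, w):
    """Weighted sum of the observed records."""
    return sum(wi * yi for wi, yi in zip(w, y))


def error_variance(var0, cov0, w, mu):
    """Minimized error variance under the sign convention assumed above."""
    return var0 - sum(wi * ci for wi, ci in zip(w, cov0)) - mu


# Hypothetical inputs: two neighbors, one target station.
cov = [[1.0, 0.5], [0.5, 1.0]]   # covariances between neighbors
cov0 = [0.8, 0.6]                # covariances between neighbors and target
b = [1.1, 0.9]                   # ratios of neighbor mean to target mean
w, mu = pbshade_weights(cov, cov0, b)
print(w, mu, pbshade_estimate([11.0, 9.2], w), error_variance(1.0, cov0, w, mu))
```

A dense solver is sufficient here because only 5 to 15 neighboring stations enter each system; a library routine would normally replace `solve_linear` in practice.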

## 3. Test of P-BSHADE

### a. Data for case study

An annual climate dataset from 1950 to 2000 was used to illustrate the performance of P-BSHADE. This dataset was constructed using monthly data from the Chinese National Meteorological Centre (CNMC), which was quality controlled and homogenized by Li et al. (2009, 2010). In the early 1950s, stations were very sparse and located mainly in eastern China. The number of stations increased sharply until around 1960, after which it gradually reached a stable level. It was at its maximum from 1971 to 2000, a period of high data quality. For this period, all series with no missing yearly values were used as reference series, giving a total of 582 stations. Throughout the historical period, stations were evenly distributed across eastern China, whereas the distance between stations was greater in the western and northern parts of the country. Figure 1 shows the number of stations in each year from 1950 to 2000. Figure 2 shows the distribution of stations in 1950, 1960, 1970, and 1971–2000. Station numbers between 1971 and 2000 were stable, so their distribution in this period is presented in a single figure.

### b. P-BSHADE algorithm

Estimation of missing data in the temperature dataset using P-BSHADE included four steps, as follows:

1. Calculating relationships between stations in reference series. Stations in the reference series from 1971 to 2000 were used to compute ratios $b_i$, covariances $C_{ij}$, and correlations $R_{ij}$; ratios $b_i$ were calculated by average values in the reference time series between two stations. The $b_i$ was assigned to station $i$, and $C_{ij}$ and $R_{ij}$ linked records at stations $i$ and $j$.
2. Determining relationships between two stations in the annual record. The ratio $b_i$, covariance $C_{ij}$, and correlation $R_{ij}$ between two stations for a year were derived from the relationship of those stations in the reference series, which is actually used in P-BSHADE. This approach is based on the fact that this series was of high quality and reliably representative of the annual relationship between the two stations. The relationship between stations and reference series is shown in Fig. 3. For a given year, $y_0$ is a missing station to be interpolated, and $y_1, y_2, \ldots, y_n$ are observed stations. Also, $C_{ij}$ is covariance between stations $i$ and $j$; $b_i$ is the ratio between observed station $i$ and predicted $y_0$. The values of $C_{ij}$ and $b_i$ for the annual record are calculated by their corresponding reference series.
3. Solving P-BSHADE. As the $b_i$ and $C_{ij}$ parameters were obtained, for each estimation weights $w_i$ and $\mu$ were computed in terms of Eq. (8). In the calculation, we selected neighboring observation stations with the highest correlations with predicted stations, positive weights and smallest estimated error variance. The numbers of neighboring stations used in the calculations were 5, 10, and 15.
4. Estimation of missing records using Eq. (1), simultaneous with estimation error variance calculation by Eq. (9).
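The first two steps can be sketched as follows. This is an illustrative pure-Python outline, not the authors' implementation: the function names are hypothetical, the sample covariance with an $n - 1$ denominator is an assumption (the text does not say which estimator was used), and the ratio is taken as neighbor mean divided by target mean, following the description of $b_i$ as the ratio between an observed station and the predicted one.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance (n - 1 denominator) of two equally long series."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


def reference_parameters(target_ref, neighbor_refs, k):
    """Select the k neighbors most correlated with the target in the
    reference series and return (indices, b, cov, cov0) for them."""
    ranked = sorted(
        range(len(neighbor_refs)),
        key=lambda i: correlation(neighbor_refs[i], target_ref),
        reverse=True,
    )
    chosen = ranked[:k]
    # Ratio of reference-period means: neighbor over target.
    b = [mean(neighbor_refs[i]) / mean(target_ref) for i in chosen]
    cov = [[covariance(neighbor_refs[i], neighbor_refs[j]) for j in chosen]
           for i in chosen]
    cov0 = [covariance(neighbor_refs[i], target_ref) for i in chosen]
    return chosen, b, cov, cov0


# Hypothetical reference series (degrees Celsius) for one target and three neighbors.
target = [10.0, 10.5, 9.8, 10.2]
neighbors = [
    [12.0, 12.6, 11.76, 12.24],  # proportional to the target
    [5.0, 4.8, 5.3, 5.1],        # negatively correlated with the target
    [8.1, 8.4, 7.9, 8.3],        # positively correlated with the target
]
print(reference_parameters(target, neighbors, 2))
```

The returned `b`, `cov`, and `cov0` are the inputs needed to solve for the weights in step 3. The additional screening for positive weights and smallest error variance described in that step is not shown here.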

### c. Performance

To assess P-BSHADE performance, annual station records in China from 1950 to 2000 were estimated by leave-one-out cross-validation (Kohavi 1995). Results were compared with ordinary kriging, IDW, and SRT. For IDW, inverse square distance was used for weights of neighboring stations. For SRT, the width of the time window was set to 24 yr; the number of “best fit” neighboring stations was taken from the three selection schemes described below. Because stations were unevenly distributed across China in space and time (Fig. 2), candidate surrounding stations were not selected by specifying a radius; instead, their number was set to twice the number of neighbors in the selection scheme.
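The comparison procedure can be illustrated with a short sketch of leave-one-out cross-validation around an inverse-square-distance estimator. The planar coordinates, station values, and function names are hypothetical; real station data would need great-circle distances and a fixed number of nearest neighbors, neither of which is shown here.

```python
from math import hypot


def idw_estimate(target_xy, stations, power=2.0):
    """Inverse distance weighted estimate at target_xy.

    stations is a list of ((x, y), value) pairs; power=2 gives the
    inverse square distance weighting mentioned in the text.
    """
    num = 0.0
    den = 0.0
    for (x, y), value in stations:
        d = hypot(x - target_xy[0], y - target_xy[1])
        if d == 0.0:
            return value  # coincident station: use its value directly
        weight = 1.0 / d ** power
        num += weight * value
        den += weight
    return num / den


def leave_one_out(stations, estimator):
    """Withhold each station in turn, estimate it from the others,
    and return the list of errors (estimate minus observation)."""
    errors = []
    for i, (xy, observed) in enumerate(stations):
        others = stations[:i] + stations[i + 1:]
        errors.append(estimator(xy, others) - observed)
    return errors


# Hypothetical stations: planar coordinates and annual mean temperature.
stations = [((0.0, 0.0), 10.0), ((1.0, 0.0), 12.0), ((0.0, 2.0), 14.0)]
print(leave_one_out(stations, idw_estimate))
```

Any estimator with the same call signature can be passed to `leave_one_out`, which is how several interpolation methods could be scored on identical withheld records.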

Three neighbor selection schemes (5, 10, and 15 neighboring stations) were used in the calculation. Results (Table 1; Figs. 4 and 5) show that for the five periods 1950–60, 1961–70, 1971–80, 1981–90, and 1991–2000 the average RMSE and mean absolute error (MAE) of P-BSHADE and SRT were much lower than those of the two other methods. For 1950–70, P-BSHADE using five neighboring stations had slightly lower errors than P-BSHADE using 10 and 15 neighboring stations, but these were somewhat higher than the errors of SRT. For 1971–2000, P-BSHADE with 15 neighbors had the smallest errors, lower than those of the other three methods and of the other two neighbor selection schemes (5 and 10). The estimates of P-BSHADE are compared with those of the other three methods in the following text.

Figure 4 shows RMSE for the four methods. Compared with kriging and IDW, the RMSE of P-BSHADE was by far the lowest in all three neighbor selection schemes. For example, using five neighboring stations, average RMSEs from 1950 to 2000 for P-BSHADE, kriging, and IDW were 0.24°, 2.36°, and 2.43°C, respectively. Average RMSE of P-BSHADE in the reference period from 1971 to 2000 was very small (0.16°C); going back in time, it rose slightly from 0.23°C in 1970 to 0.50°C in 1950. Compared with SRT, the RMSEs of P-BSHADE were somewhat higher for 1950–70 and slightly lower for 1971–2000.

MAEs of the four methods are shown in Fig. 5. All MAEs were smaller than their corresponding RMSEs, which indicates that the estimates include some large errors, especially for kriging and IDW. For P-BSHADE, the lowest MAE in 1950–70 was obtained with five neighboring stations, whereas in 1971–2000 it was obtained with 15 neighboring stations. As with RMSE, P-BSHADE had clearly lower MAEs than kriging and IDW in all periods for the three neighbor selection schemes. Compared with SRT, the MAEs of P-BSHADE were somewhat higher for 1950–70 and slightly lower for 1971–2000.
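The relation between the two indexes can be made concrete with a small sketch. RMSE is never smaller than MAE; the two coincide when all errors have the same magnitude, and the gap widens when a few errors are much larger than the rest. The error values below are invented for illustration only.

```python
from math import sqrt


def rmse(errors):
    """Root-mean-square error of a list of estimation errors."""
    return sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    """Mean absolute error of a list of estimation errors."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical error sets (degrees Celsius).
uniform = [0.2, -0.2, 0.2, -0.2]      # equal magnitudes
with_outlier = [0.1, -0.1, 0.1, -2.0]  # one large error

for name, errs in (("uniform", uniform), ("with_outlier", with_outlier)):
    print(name, rmse(errs), mae(errs))
```

Because squaring gives large errors more influence, a large RMSE-to-MAE gap points to a few badly estimated stations rather than uniformly poor estimates.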

The maximum absolute error over all stations in each year can be used to test the stability and worst-case performance of the interpolation methods. Figure 6 shows yearly maximum absolute error of annual temperature from 1950 to 2000, using five neighboring stations. Results indicate that yearly maximum errors of P-BSHADE and SRT were much smaller than those of the other two methods. Compared with SRT, the index of P-BSHADE was somewhat higher for 1950–70, but was generally lower for 1971–2000. This result is consistent with Fig. 7, which provides several scatterplots of observed annual temperature against the values estimated by the four methods.

Annual absolute error at each station for the four methods is shown in Fig. 8 (P-BSHADE) and Fig. S1 (IDW, kriging, and SRT) in the supplemental material for the years 1950, 1960, 1970, 1980, 1990, and 2000. Stations with largest errors using P-BSHADE were mainly around the eastern edge of the Qinghai-Tibet Plateau. In regions of homogeneous terrain, such as the interior of that plateau and the plains of eastern China, errors were smaller. Reasons for this are discussed in detail below. Error distributions calculated by kriging and IDW were similar. Greater errors were mainly west of 105°E, where there were fewer stations and complex terrain, with mountains and plateaus. Larger errors of SRT were mainly in mountainous regions with few stations.

## 4. Discussion

A novel method called P-BSHADE for estimating missing data in annual temperature series has been introduced. P-BSHADE has theoretical advantages over kriging, IDW, and SRT because of a more realistic assumption. Case studies using an annual temperature dataset from China demonstrated its empirical superiority, except for SRT during 1950–70.

Compared with the other three methods, the P-BSHADE model has an advantage in interpolation of missing records within the temperature dataset. The coexistence of homogeneity and heterogeneity in land surface temperatures (Wang et al. 2009, 2010a) is considered in P-BSHADE, whereas kriging and IDW assume homogeneity of the target field. Heterogeneity is reflected in SRT, which has the potential to resolve systematic differences caused by different topography or location. In the P-BSHADE model, heterogeneity is expressed as the ratio $b_i$ between two stations. Comprehensive and accurate information of series from recent decades was used in the analysis. These choices rest on the following considerations: 1) Temperatures at nearby stations tend to be more strongly correlated because of their similar geographic location and land surface characteristics (Böhm et al. 2001). This is a general phenomenon for the spatial distribution of temperature. 2) Surface air temperature is frequently affected by various local factors, such as terrain, land-use type, and others (Daly et al. 2002; Karl and Jones 1989; Yan et al. 2010). For example, Böhm et al. (2001) found that distant stations can have higher correlation than nearby ones because of similar geography. This results in nonhomogeneous temperature distributions. 3) Temperature series from recent decades, which are often used as reference series, embody comprehensive datasets of high quality. They can therefore be used to provide stable information for relationships between annual stations in the dataset. These relationships are useful for station pairs in all time series, especially for the early period with its sparse stations, to compensate for scarce and imprecise information.

The P-BSHADE calculation of covariance between stations and its selection of neighboring stations contrast with those of kriging and IDW. 1) In kriging, covariance between stations is calculated using the variogram, which rests on the theoretical assumption of second-order spatial stationarity (Goovaerts 1997; Isaaks and Srivastava 1989). This is not realistic in some cases (Wang et al. 2009). To compensate for this deficiency, covariance in P-BSHADE was instead calculated from a reference series from recent decades. Because of equipment improvement in recent years and an increased number of stations, data quality in this period is the highest of the historical record. Thus, the relationship between stations is robust and reliable for the recent period. 2) Kriging and IDW select neighboring stations in space, based on the first law of geography (Wang et al. 2010b). If the local geographic environment is substantially different from its surrounding area, this produces larger errors. Instead, the P-BSHADE model uses correlation between stations, selecting the observation stations with the greatest correlation. Stations with greater errors from kriging and IDW were mainly in western China (Fig. S1 in the supplemental material) because of complex terrain and sparse stations in that region. Larger errors from P-BSHADE were mainly around the eastern edge of the Qinghai-Tibet Plateau because it is a transition zone between plateau and mountain. Moreover, the East Asian and South Asian monsoons strongly influence temperature in the area. Table S1 in the supplemental material shows that the five stations with largest errors have smaller correlations between observation stations, whereas the five stations with smallest errors have higher correlations between observation stations.

The SRT estimate is a weighted summation of several best-fit estimates for the target station, each obtained from a neighboring station, rather than of the observations at neighboring stations as in the other methods. Spatiotemporal data have to be employed in SRT, whereas P-BSHADE only uses spatially neighboring stations with parameters derived from reference series. In addition, if data are missing at a station for a long period, SRT fails to work, because the regression equation between the series of the target station and those of neighboring stations cannot be established in such a case.

The P-BSHADE model is applicable to datasets with complete, stable, and high-quality records in the reference period, from which the relationship between annual stations is derived. In addition, one of the model assumptions is that the relationship between stations is stable across the entire time series, so the relationship determined from the reference series can be applied even in the early period. This is appropriate for the temperature dataset used here. In our study, estimation errors of P-BSHADE in 1950–70 were somewhat greater than those in 1971–2000 because there are some uncertainties in applying relationships of the reference series to their corresponding yearly stations (see the supplemental material).

For a small number of stations, the P-BSHADE model yields negative weights. This may have several causes, such as the screen effect (Goovaerts 1997; Isaaks and Srivastava 1989). To deal with the problem, the approach introduced by Szidarovszky et al. (1987) was employed in the study. Its core idea is to search all possible subsets of neighboring stations for the combination that has positive weights and the smallest error variance. With this method, 99.98% of predicted stations have neighboring stations with positive weights. For the very few remaining stations, negative weights were eliminated using the method proposed by Deutsch (1996), which resets the negative weights to 0 and restandardizes the remaining weights to sum to 1. The method guarantees an unbiased estimation, although it does not guarantee the minimum error variance, which might introduce slight uncertainty into the result.
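The negative-weight correction attributed to Deutsch (1996) can be sketched in a few lines. The function name and the example weights are hypothetical. The sketch rescales to a unit sum exactly as described above; whether the rescaling should instead restore the ratio-weighted constraint of P-BSHADE is not stated in the text and is left open here.

```python
def correct_negative_weights(weights):
    """Reset negative weights to zero and rescale the rest to sum to 1."""
    clipped = [max(w, 0.0) for w in weights]
    total = sum(clipped)
    if total == 0.0:
        raise ValueError("no positive weights to rescale")
    return [w / total for w in clipped]


# Hypothetical weights with one negative entry.
print(correct_negative_weights([0.7, 0.5, -0.2]))
```

Because the corrected weights no longer solve the original minimization, the resulting estimate is not guaranteed to have minimum error variance, which matches the caveat in the text.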

The P-BSHADE method can be extended to other applications, for example, to interpolate temperatures at any location, not only at stations with missing data. It can also be used to assess uncertainty in the existing record, because there are always errors in the dataset. The P-BSHADE model can predict observed values robustly and accurately.

## Acknowledgments

This study was supported by CAS (XDA05090102), NSFC (41023010; 41271404), and MOST (2012CB955503; 2012ZX10004-201; 2011AA120305) grants.

### APPENDIX

#### Derivation of P-BSHADE

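As above, let $y_i$ be the record at meteorological station $i$, and let $y_0$ be temperature at a station to be estimated. One can estimate $y_0$ by the weighted sum of the sample station records, that is, by Eq. (1). The estimate $\hat{y}_0$ satisfies two conditions: (i) It is an unbiased estimate of actual temperature $y_0$, and (ii) it minimizes mean square estimation error (MSEE) [Eq. (3)], so that it is a best linear unbiased estimator (BLUE). The first condition implies that $E(\hat{y}_0) = \sum_{i=1}^{n} w_i E(y_i) = E(y_0)$, or $\sum_{i=1}^{n} w_i b_i E(y_0) = E(y_0)$, which leads to Eq. (6).

Concerning the second condition, MSEE is given by Eq. (7). The first term on the right of Eq. (7) is

$$C(\hat{y}_0, \hat{y}_0) = V\!\left(\sum_{i=1}^{n} w_i y_i\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} w_i w_j C(y_i, y_j), \tag{A1}$$

where $V$ denotes statistical variance.

The second term on the right of Eq. (7) is

$$C(y_0, y_0) = V(y_0) = \sigma_{y_0}^{2}, \tag{A2}$$

and the third term is

$$2C(\hat{y}_0, y_0) = 2\sum_{i=1}^{n} w_i C(y_i, y_0). \tag{A3}$$

Combining these three terms again, we have the following expression for error variance:

$$\sigma_{\hat{y}_0}^{2} = \sum_{i=1}^{n}\sum_{j=1}^{n} w_i w_j C(y_i, y_j) + \sigma_{y_0}^{2} - 2\sum_{i=1}^{n} w_i C(y_i, y_0). \tag{A4}$$

Minimizing Eq. (A4) subject to the unbiasedness condition of Eq. (6) is a standard constrained optimization problem that leads to minimization of the quantity (Christakos 1992)

$$L(w_1, \ldots, w_n, \mu) = \sigma_{\hat{y}_0}^{2} + 2\mu\left(\sum_{i=1}^{n} w_i b_i - 1\right), \tag{A5}$$

where $\mu$ is a Lagrange multiplier. Next, the partial derivatives of $L$ with respect to $w_i$ and $\mu$ are set to zero.

The relationship $\partial L / \partial \mu = 0$ gives the unbiasedness condition of Eq. (6).

Furthermore, differentiation with respect to the weights produces the equations

$$\sum_{j=1}^{n} w_j C(y_i, y_j) + \mu b_i = C(y_i, y_0), \quad i = 1, \ldots, n. \tag{A6}$$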

## REFERENCES

*Principles of Geographical Information Systems.* Oxford University Press, 333 pp.

*Random Field Models in Earth Sciences.* Academic Press, 474 pp.

*Spatial Data Analysis: Models, Methods and Techniques.* Springer, 82 pp.

*Geographically Weighted Regression: The Analysis of Spatially Varying Relationships.* John Wiley & Sons, 282 pp.

*Geostatistics for Natural Resources Evaluation.* Oxford University Press, 483 pp.

*Spatial Data Analysis: Theory and Practice.* Cambridge University Press, 432 pp.

*Applied Geostatistics.* Oxford University Press, 561 pp.

*Proc. 14th Int. Joint Conf. on Artificial Intelligence,* Montreal, QC, Canada, AAAI, 1137–1145.

*Geostatistics for Engineers and Earth Scientists.* Kluwer, 303 pp.

*Spatial Data Analysis.* Science Press, 301 pp.

## Footnotes

Supplemental information related to this paper is available at the Journals Online website: http://dx.doi.org/10.1175/JCLI-D-12-00633.s1.