1. Introduction
The spatial generalization of point source observations has long been a challenging issue in climatological applications. Topographical factors, nonlinear climate gradients, and the complex mix of local and synoptic-scale forcing all serve to complicate the inherent unknowns implicit in estimating the spatial relationships of point observations. While numerous interpolation techniques have been developed either to extrapolate to new points or to derive gridded products from point source data, these are usually based on assumptions about the distance-decay function relating a point to the local region, and where possible, by inferring additional information about the spatial structure (such as through the use of standard lapse rates when interpolating temperature).
The increased application of climate model output brings a new perspective and need to this issue. There has been some discussion in the literature of whether global climate models produce area-averaged or point values. To some degree, the answer depends on the parameter being considered and the way in which it is being used. If GCM grid box data are to be averaged up to larger spatial regions it is probably appropriate to regard them as point data. If the purpose is downscaling GCM precipitation to finer spatial resolutions, it is probably more appropriate to regard them as area-averaged data (see, e.g., the discussion by Skelly and Henderson-Sellers 1996 and Osborn 1997). When developing observational datasets for model evaluation purposes, we follow Osborn and Hulme (1998) and regard climate models as producing outputs that are closer to area-average values rather than point data. Consequently model validation requires comparable observational data, necessitating some form of extrapolation from the irregular spatial distribution of station observations. The method by which this is accomplished is potentially a source of error, introducing additional difficulties in distinguishing between model error and error due to the interpolation. The critical need for the evaluation of models at the spatial and temporal scales of end user application (as opposed to the tendency in the modeling literature to report time and space averages), is an essential task for the appropriate utility of model results, not least of which is the growing use of climate models in the development of regional climate change scenarios. Thus there exists an urgent need to develop observational datasets to facilitate the growing community of end users of climate model products for climate impact assessment. 
In particular, estimating the area-average rainfall from point observations to evaluate the relationship of model-derived rainfall to station observations, is especially important as much of the work in the impacts community addresses issues in the hydrological components of the coupled climate system.
Over the decades numerous approaches to interpolation have been developed (e.g., Cressman 1959; Willmott et al. 1985; Biau et al. 1999), and range from simple linear interpolation to more sophisticated approaches. However, with a few exceptions (e.g., Hulme 1992, 1994; Osborn and Hulme 1997), most interpolation schemes provide an estimate of the observations at new point locations, subject to a variable set of assumptions underlying the interpolation procedure, which remains at odds with the objective of determining area averages. While some interpolation schemes have been developed for daily datasets (e.g., Piper and Stewart 1996), more often the interpolation is undertaken on time-averaged values (monthly, annual, etc.). This simplifies the task (as the spatial variability in daily atmospheric forcing is largely averaged out), but fails to address the key needs of end users for higher temporal resolution data. Osborn and Hulme (1997), for example, develop seasonal area-averaged precipitation data using an estimate of the variance as a function of number of observations in the grid box and the correlation decay length of the observing stations (a function of the correlation between pairs of stations and the separation distance between the stations). From this they derived a standard deviation for the grid box. They also describe a method for estimating rainday frequencies. This approach appears to be effective for producing seasonal comparisons and is used to evaluate the precipitation fields from 12 atmospheric general circulation models from the Atmospheric Model Intercomparison Project (Osborn and Hulme 1998).
Nonetheless, in principle, estimating daily area-average values may be accomplished, assuming a valid interpolation to a fine spatial resolution (in comparison to some desired grid resolution), and then integrating the interpolated points as representative of a spatial “response” surface. For temperature, given the spatially continuous nature and the relatively robust influence of lapse rates, such an approach seems conceptually appropriate. By contrast, precipitation (especially daily) proves to be a problematic variable, and many of the common interpolation assumptions are fundamentally flawed. For example, the assumption that the spatial representation of an observing station is proportional to, say, the inverse distance squared from the station, is questionable under convective rainfall systems. Additionally, inferences about the effects of topography, while sensible for temperature, may be very inaccurate for rainfall and may be geographically dependent. These problems arise largely out of the fact that precipitation alone among atmospheric variables is spatially discontinuous.
While any interpolation is an estimate, ideally, spatial interpolation should maximize the use of the information content of all available station data, while minimizing error from subjective decisions and assumptions. This paper seeks to identify some key issues associated with the interpolation of precipitation data, and to address some of the problematic assumptions. In doing so a new approach is developed that attempts to mitigate the potential error from such assumptions. While methodologies may always be refined and extended, we present here the core of a procedure to derive a grid-based regional-scale area-average estimate of daily precipitation from station observations, comparable (in principle) to the characteristics of precipitation produced by dynamical climate models.
2. Interpolation assumptions
Interpolation draws on the information content of the source data along with additional assumptions that may be physically justified (such as lapse-rate effects in the case of temperature). Along with the premise that the point source information has some varying association with the surrounding region, this information forms the foundation of any given interpolation procedure. Fundamental to this are assumptions about the distance–decay relationship between point observations and the surrounding spatial region—effectively determining a spatial representivity that is likely to be both directionally dependent and conditional on the prevailing weather system. Conceptually, the information content of point data with respect to the magnitude and variance reflects some variable mix of forcing, comprised of local forcing (such as vegetation state or soil moisture) and the larger-scale synoptic forcing (e.g., frontal conditions or convective states). The ratio of the local to synoptic forcing that underlies the measured point data is unknown, and is likewise directionally dependent and conditional on the synoptic forcing.
Thus to estimate area averages, a key concept is that the point observation is assumed to be a spot sample of some continuous surface responding to (dominantly) atmospheric forcing. The interpolation in effect seeks to estimate this response surface. However, precipitation adds an additional complexity in that the spatial response surface is a bounded continuum—wherein the upper limit is not fixed, and the lower limit is bounded by zero. Precipitation is therefore a spatially discontinuous surface. For example, consider two stations, one with zero rainfall and the other with a known quantity. In reality it may well be that there is zero rainfall for a large area surrounding the one station, and that actual rainfall may only begin in close proximity to the second station. However, a simple linear interpolation based on some distance weighting will interpolate a value at all increments away from the zero rainfall station, overestimating the area integral.
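The two-station example can be made concrete with a small numerical sketch. The transect geometry, the 20 mm amount, and the assumption that rain falls only over the last 20% of the distance toward the wet station are hypothetical values chosen purely for illustration.

```python
def transect_means(wet_amount=20.0, wet_fraction=0.2, n=1000):
    """Compare mean rainfall along a unit-length transect between a dry
    station (x = 0) and a wet station (x = 1).

    linear  : distance-weighted linear interpolation between the stations.
    bounded : hypothetical "true" field that is zero until close to the wet
              station, then ramps up to the observed amount.
    """
    xs = [(i + 0.5) / n for i in range(n)]          # midpoints of n cells
    linear = [wet_amount * x for x in xs]
    start = 1.0 - wet_fraction                       # where rain begins
    bounded = [
        wet_amount * (x - start) / wet_fraction if x > start else 0.0
        for x in xs
    ]
    return sum(linear) / n, sum(bounded) / n


linear_mean, bounded_mean = transect_means()
print(f"linear: {linear_mean:.2f} mm, bounded: {bounded_mean:.2f} mm")
```

Under these assumed values the linear interpolation yields a transect mean several times larger than the bounded field, which is the overestimation of the area integral described above.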
Thus the bounded continuum nature of precipitation requires estimating the spatial distribution of two attributes: first the phase, in other words the spatial distribution of rain/no-rain, and second, the magnitude. The former is generally not addressed by interpolation techniques. Both attributes, however, are important, and the spatial relation of both to the surrounding region is likely to be expressed by functions that are imperfectly defined by the spatial distribution of the observational data, are directionally dependent, and conditional on the prevailing weather system which determines the ratio of local to synoptic forcing under different synoptic conditions. The present paper describes an interpolation scheme (conditional interpolation or CI) that interpolates to a grid (i.e., it produces point data). By explicitly addressing the problems noted earlier, and by interpolating to a very fine spatial grid, the resulting data can be averaged to produce an area-average dataset at a resolution comparable to current numerical regional climate models. How these problems are addressed will, to a greater or lesser degree, induce error in the interpolated field. Specifically, the following may be considered as some of the principal problematic assumptions that may affect a chosen methodology:
The point observations are representative of an unbounded continuous surface. This aspect may readily lead to interpolation bias, principally evidenced as an overestimation of the spatial coverage of the rainfall.
The distance–decay function of a station’s spatial representativeness is temporally consistent. This is clearly not true. For example, on the seasonal scale one may conceive of a location where the winter is characterized by frontal activity wherein the station may have a large spatial representivity with relatively homogeneous directional attributes, versus summer where, under convective situations, the station may have a greatly reduced spatial representivity. The temporal dependency is most apparent at the daily time scales, and becomes minimized as one averages in space and time.
The distance–decay function of a station’s spatial representativeness is the same for all stations. Interpolation routines may make simple assumptions about the form of the distance–decay function (e.g., inverse distance weighting) and apply this uniformly to all stations. However, a station within regions of steep topography is likely to have a notably different relation to surrounding stations as a function of distance compared to a station in the middle of a coastal plain.
The distance–decay function is constant for all radial directions away from the station in question. A technique that estimates the spatial covariance structure over time may address this, but this alone does not consider the distance–decay function as variable in space and time.
The distance–decay function is monotonic with increasing distance. This is a common assumption. However, consider three stations, one in a valley and the remaining two on opposing hilltops surrounding the valley. Under moist airflow situations, one may conceive of orographic rainfall on the hilltops while no rain is experienced in the valley. In this case, there is a discontinuity in the distance–decay function of one hilltop station—with the strength of the spatial relationship decreasing initially to zero and then increasing again.
All stations within a finite radius of a target location have equal contribution to the interpolated value (subject to some distance–decay function) regardless of their radial distribution. Especially in sparse data (where stations may only report irregularly) this gives rise to a varying radial concentration of stations on any given day. While some interpolation schemes account for radial distribution (e.g., Willmott et al. 1985; evolving from Shepard 1968), on any given day only the reporting stations are employed, losing the spatial covariance information inherent from other stations not reporting on this day.
The intent here is not to provide a definitive comparison of relative strengths and weaknesses of each possible interpolation procedure (e.g., Kurtzman and Kadmon 1999), but rather to present an approach that endeavors to address as many of the earlier assumptions as possible in a cohesive interpolation scheme. We then implement this over South Africa, a region with a variable density of the observational network in space and time, and regional climates characterized by desert, winter, summer, and all year-round rainfall, but for which we have an extensive array of daily precipitation data.
The method presented expressly attempts to utilize the full information content of the source data while minimizing assumptions about the spatial relationships between stations—seeking to meet the objective of a best estimate of the space–time characteristics of precipitation. In particular, the technique addresses the assumptions raised earlier through the following aspects:
Allowing the interpolation parameters to be conditioned by the prevailing synoptic state. In this respect the interpolation dynamically adjusts and adapts to the spatial characteristics of the daily synoptic events.
Addressing the bounded nature of precipitation response. In this regard it is recognized that the interpolation of rainfall inherently requires two interpolations: first the interpolation of a wet or dry state (phase), and subsequently, if wet, the interpolation of the actual rainfall amount.
A distance–decay function that is not monotonic as a function of distance away from an observing station, but includes modifying the influence of geographic distance in response to the covariance structure between stations.
Accommodating the variable radial distribution of stations around the target location.
3. Interpolation framework
Before describing the interpolation approach in detail, it is perhaps useful to consider the information content of station data. The key attribute is naturally the absolute magnitude recorded. However, by considering the stations collectively we can recognize a number of additional forms of information that can be utilized:
(1) The spatial pattern of the recorded station data is a reflection of the synoptic state. These may be generalized over time to derive common response patterns reflecting “types” of synoptic-scale atmospheric forcing. This is a valuable attribute as atmospheric data to define the prevailing weather system may not be readily available, may be erroneous, or nonexistent.
(2) For all occurrences of a given synoptic forcing type, the time average of the wet/dry relationship between stations is a measure of the in-phase response of two stations to the synoptic forcing state. Given the spatial field of such a phase relationship measure between one station and all surrounding stations, and knowing the wet/dry state at stations, one may infer a probability field of phase state spanning the area between the one station and its neighbors.
(3) The phase index in (2) may be of further value in modifying the weighted influence of the geographic distance between stations. In this manner, a station of very similar phase response to another is “closer” by virtue of the phase similarity than a third station that may be geographically nearer, but have a poor phase similarity.
(4) As in (2), for all occurrences of a given synoptic forcing type, the time average of the difference in precipitation magnitude between stations reflects the systematic bias between one station and those surrounding it. The spatial field of this bias is then of value in estimating the scalar factor at locations in between stations.
(5) Information on the radial distribution of stations around a given location may be usefully incorporated to limit bias from increased station density in one sector.
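As a rough illustration of the pairwise phase and bias information described in this list, the following sketch computes, for one synoptic type, the fraction of days on which two stations share the same wet/dry state and the mean difference in their recorded amounts. This is not the paper's definition of its phase or bias indices; the function name, the wet threshold, and the choice to average over all co-reporting days are assumptions made for illustration.

```python
def phase_and_bias(series_a, series_b, types, wanted_type, wet_threshold=0.0):
    """Return (phase agreement, mean bias of b relative to a) for the days
    classified as `wanted_type`, or None if no co-reporting days exist.

    series_a, series_b : daily precipitation (None = not reporting)
    types              : synoptic type label for each day
    """
    n = 0
    agree = 0
    diff_sum = 0.0
    for a, b, t in zip(series_a, series_b, types):
        if t != wanted_type or a is None or b is None:
            continue
        n += 1
        # Same phase when both stations are wet or both are dry.
        if (a > wet_threshold) == (b > wet_threshold):
            agree += 1
        diff_sum += b - a
    if n == 0:
        return None
    return agree / n, diff_sum / n


station_a = [0.0, 5.0, 0.0, 10.0, None, 2.0]
station_b = [0.0, 7.0, 3.0, 0.0, 1.0, 4.0]
day_types = [1, 1, 1, 2, 1, 1]
print(phase_and_bias(station_a, station_b, day_types, wanted_type=1))
```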
The interpolation methodology presented next seeks to utilize the maximum information content in the precipitation dataset, and in the process minimize the potential sources of error arising from the issues and assumptions outlined earlier. For the interpolation we use daily station observations of precipitation over South Africa. This is a collation of all available station data from the national weather service and from agricultural and water resource networks. The data and quality control procedures are described by Lynch (2002). While the station density is variable in space and time, it is sufficient that for a 0.1° target grid there are always between 5 and 200 stations within a 0.75° radius of each grid point.
In order to maximize the information content in defining synoptic states (seeking to maximize the accuracy of the estimation on any given day) we retain in the dataset any station with at least a (subjectively determined) 5-yr reporting presence in the period of interest (1950–99). In this manner 5921 stations are available for the interpolation over an area of 1.5 million km². Figure 1 shows the spatial domain for the analysis, where each dot represents a reporting station retained (top panel), and the number of reporting stations at any given time over the 50-yr period (bottom panel). Of note is the trailing off of the number of stations in the latter period that potentially could influence trend in the interpolated data. This, however, forms the topic of discussion in a subsequent paper on trend analysis of the interpolated dataset.
4. Interpolation methodology
The interpolation procedure is divided into a number of steps as follows:
Establish a grid of target locations for interpolation.
Define a set of generalized synoptic states for each input station.
Construct a phase relationship and bias field around each station, for each synoptic state.
Then, for each day:
(a) Interpolate to target locations the phase index from each of the surrounding stations conditional on the synoptic state, the summation of which determines a wet (>0.5) or dry (<0.5) condition at the target location.
(b) If a wet condition is inferred, interpolate the magnitude of precipitation taking cognizance of the bias fields and radial distribution of stations.
a. Target grid
In view of the objective to derive area-average precipitation, the target grid should be of sufficiently high resolution (spatial density) to facilitate subsequent spatial integration to area averages comparable to model resolutions. Most regional climate model (RCM) simulations currently have resolutions of the order of 30–60 km, and thus an interpolated product should be of some significantly higher resolution. In this example we have selected a 0.1° (∼10 km) grid. This is a subjective decision based on the desired area-averaged resolution. The accuracy of the final area-averaged product will vary as the target grid resolution changes—degrading as the target resolution approaches the final area-averaged resolution. There is likely to be some optimum target resolution that could be determined by repeating the analysis at multiple resolutions. This is likely to change, however, as a function of region and season. In this case, we simply chose a 0.1° (∼10 km) grid, which would give about 10–40 points to average for a typical RCM data grid. From here on the procedural discussion focuses on a single target location, as the method is simply replicated for all other targets.
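The subsequent spatial integration from the fine target grid to a coarser area average can be sketched as a simple block mean. The block size of 5 (0.1° points aggregated to a 0.5° box, giving 25 points per box) is an assumed example, and the unweighted mean ignores the variation of cell area with latitude, which a full implementation might account for.

```python
def block_average(grid, block):
    """Average a fine 2D grid (list of rows) over non-overlapping
    block x block windows. Cells set to None (undefined) are skipped;
    a window with no defined cells yields None."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r0 in range(0, rows - block + 1, block):
        out_row = []
        for c0 in range(0, cols - block + 1, block):
            cells = [
                grid[r][c]
                for r in range(r0, r0 + block)
                for c in range(c0, c0 + block)
                if grid[r][c] is not None
            ]
            out_row.append(sum(cells) / len(cells) if cells else None)
        out.append(out_row)
    return out


fine = [
    [0.0, 0.0, 4.0, 4.0],
    [0.0, 4.0, 4.0, 4.0],
    [None, None, 1.0, 3.0],
    [None, None, 5.0, 7.0],
]
print(block_average(fine, 2))
```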
b. Selection of observational stations
For each target location, all neighboring stations are identified. A subjective search radius is defined, more to limit the computational demands than for determining the accuracy of the final product. As outlined later, only the N-closest stations are used in any given interpolation, and hence this is a pragmatic decision, not deterministic of the final outcome. Given that stations have variable presence in time, this step simply determines the pool of potential source data for interpolation to a given target on any given day. In this example we define a radius of 0.75°, which for any target location results in identifying between 5 and more than 200 surrounding stations.
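A minimal sketch of this selection step follows. It assumes distances are measured directly in degrees on a flat longitude/latitude plane, and the station identifiers and coordinates are hypothetical.

```python
import math


def stations_within(target, stations, radius=0.75):
    """Return station ids within `radius` degrees of the target,
    ordered from nearest to farthest.

    target   : (lon, lat) of the target grid point
    stations : dict mapping station id -> (lon, lat)
    """
    tlon, tlat = target
    found = []
    for sid, (lon, lat) in stations.items():
        d = math.hypot(lon - tlon, lat - tlat)
        if d <= radius:
            found.append((d, sid))
    found.sort()
    return [sid for _, sid in found]


pool = {
    "A": (18.5, -33.9),
    "B": (18.9, -33.6),
    "C": (19.6, -33.9),
    "D": (18.5, -34.6),
}
print(stations_within((18.5, -34.0), pool))
```

Because the result is sorted by distance, taking the first N entries that report on a given day gives the N-closest stations used later in the interpolation.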
c. Definition of synoptic states
Earlier it was noted that the relation of a station to its surroundings is conditional on the nature of the synoptic state of the atmosphere. In this it is recognized that some synoptic events are large scale and dominate the regional response of precipitation across many stations, while in other cases the synoptic events, although establishing the environment for the regional weather, are secondary to smaller-scale rainfall producing processes such as thunderstorms. Consequently, it is desirable to undertake interpolation with explicit recognition of the synoptic state and the influence on the characteristics of regional rainfall.
The logical approach would be to utilize the synoptic field of some atmospheric variable(s), for example, sea level pressure. However, this introduces a constraint that one may condition the interpolation only when adequate atmospheric data are available. While the advent of global gridded reanalysis data, currently extending back to 1947, has to some degree made this possible, these datasets are nonetheless still limited in the temporal extent. In addition, for many parts of the world these data are problematic prior to the introduction of satellite observations, due to the sparseness of surface observations over the ocean and many land areas that are required to constrain the GCM used for the reanalysis. Tennant (2002), for example, demonstrates a significant discontinuity in the National Centers for Environmental Prediction (NCEP; Kalnay et al. 1996) reanalysis data over the Southern Hemisphere around 1979 (the advent of satellite data for the reanalysis). In contrast, the station precipitation record in some locations may be considerably longer than the available circulation data.
Nonetheless, as discussed earlier, the spatial distribution of rainfall is in itself a reflection of the synoptic state. As such, it would seem reasonable to use the precipitation spatial patterns as a basis for identifying common response patterns to synoptic forcing. Consequently, categorizing the local domain precipitation field (in this paper, the radius of 0.75° surrounding each station) provides a basis for inferring the synoptic state forcing the precipitation. The only constraint then being that, on any given day, there are enough stations reporting within the region of interest to adequately characterize the pattern response to the synoptic state.
The area of climate studies often referred to as “synoptic climatology” has a rich heritage of algorithmic approaches to stratifying data into types. The literature is extensive, and is well summarized in Yarnal (1993). Most of these techniques, however, are related to standard clustering methods, which in themselves incorporate significant subjectivity in determining clusters or synoptic types (e.g., Key and Crane 1986). A new approach to this task is presented by Hewitson and Crane (2002), based on the technique of self-organizing maps (SOMs). Compared with more traditional clustering algorithms, this approach:
Maintains coherence of the derived clusters as one scales from a few clusters with a high degree of generalization to many clusters with finer differences.
Explicitly approaches the data as a continuum without hard categorical boundaries.
Maintains robustness in the presence of missing data elements.
Provides relative insensitivity to any subjective implementation decisions.
Provides a powerful visualization approach for examining data structures.
We use the SOM approach here to derive the generalized precipitation spatial response patterns to the atmospheric forcing, and implicitly accept these as reflections of the synoptic conditions. Full details of the SOM methodology are outlined in Hewitson and Crane (2002), and here we confine ourselves to a simple analogy of the procedure for explanatory purposes.
Effectively, a SOM identifies archetypal points within the data space. In doing so it identifies locations that span the continuum of the data, selecting more archetypes where there is supporting data to warrant it, and fewer where the information content of the data is sparse. The archetypes span the continuum of the dataset and have a clear relationship to one another. In this example the data are the precipitation time series (daily) in N dimensions where N is the number of stations used to define the spatial patterns. Each daily pattern across the N stations may then be associated with the closest archetype, in effect clustering the data observations. This is notably different from many traditional clustering techniques as the clusters are not discrete, nonoverlapping groups; all data points contribute to the definition of all clusters, and the clustering takes place in a postprocessing phase and is not inherent in the SOM itself.
One may consider the process as somewhat akin to a manual classification where, if presented with N synoptic patterns, one could classify these into an array of X-by-Y identified patterns by placing similar patterns together, and where each group is a marginal departure in similarity from adjacent piles. At the end, one would have an array of classified observations where the physical position of groups in the X-by-Y array indicates position in the continuum of states. Similar groups are located close to each other, while very different groups are further apart in the array. The final product is a set of generalized states spanning the continuum of the observational data, with all days clustered in terms of these states.
For this application we undertake classifications of the spatial patterns of precipitation around each and every station, as defined by the set of surrounding stations within the established radius. In the SOM analysis the data are represented as the logarithm of the precipitation in order to avoid biasing the procedure by infrequent high-magnitude rainfall events. Selecting one observing station Oi and the set of surrounding stations within the defined radius, a time series of vectors is generated where each vector is the set of reported precipitation across the stations for a single day. On any given day there may be any number of missing data, and a subjective minimum of five reporting stations is used in order for the day to be included in the analysis. In nearly all cases many more than this minimum are present, and it is noted that the SOM is robust in handling missing data—working effectively with a subsample of the stations in estimating where in the data space the observation vector lies.
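The preparation of a daily vector and its association with the closest archetype can be sketched as follows. Several details are assumptions rather than the authors' implementation: log(1 + p) is used so that dry days remain defined, the archetype (node) vectors are taken as already trained, and distance is the mean squared difference over the stations that report on the day.

```python
import math


def to_som_vector(day_values, min_reporting=5):
    """Log-transform one day's precipitation across the station set.
    Returns None when fewer than `min_reporting` stations report."""
    reporting = sum(v is not None for v in day_values)
    if reporting < min_reporting:
        return None
    # log1p is an assumed choice: it keeps zero rainfall at zero.
    return [None if v is None else math.log1p(v) for v in day_values]


def best_matching_node(vector, nodes):
    """Index of the node closest to `vector`, comparing only the
    components for which the day has data."""
    best, best_dist = None, float("inf")
    for idx, node in enumerate(nodes):
        pairs = [(v, w) for v, w in zip(vector, node) if v is not None]
        dist = sum((v - w) ** 2 for v, w in pairs) / len(pairs)
        if dist < best_dist:
            best, best_dist = idx, dist
    return best


# Two hypothetical archetypes: uniformly dry, and uniformly ~10 mm.
nodes = [[0.0] * 6, [math.log1p(10.0)] * 6]
wet_day = to_som_vector([0.0, None, 12.0, 9.0, 5.0, 8.0])
print(best_matching_node(wet_day, nodes))
```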
For this example an array of 6 by 4 archetypes, or SOM nodes, is presented, allowing for 24 generalized synoptic states. Smaller or larger SOM arrays may be used; however, 24 states are subjectively considered adequate to capture the range of modes. The SOM procedure then finds archetypes spanning the data space, such that the two diagonals of the node array are analogous to the first two principal components, or EOFs, of the data (although they are not necessarily orthogonal to each other). The nodes in between represent the transition states between these extremes. For precipitation data this results in one axis capturing the very dry to very wet states, and the second axis capturing spatial variation in the dominant precipitation patterns. Figure 2 shows the archetypes of precipitation patterns around one station (station Oi, located at the center of each of the node grids). It can be clearly seen that each node is similar to adjacent nodes, and reflects a continuum of very diverse precipitation response patterns. Note that in this case no days actually mapped to any of the nodes in the last column of the array, and the figure therefore shows only the first five columns of the SOM array.
Once the spatial classification is derived, the number of days assigned to each node is computed to produce a two-dimensional histogram of precipitation patterns. Figure 2 also includes the derived frequency for each node. The frequency product can afford valuable insight into the data, as shown by Hewitson and Crane (2002), or Tennant and Hewitson (2002).
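Computing the node frequencies amounts to counting the days assigned to each node and arranging the counts on the SOM array. The sketch below assumes that nodes are indexed in row-major order on an array of 6 columns by 4 rows, and that frequencies are expressed as a percentage of classified days; both are illustrative choices.

```python
from collections import Counter


def node_frequencies(assignments, n_cols=6, n_rows=4):
    """Percentage of days mapped to each node, returned as rows x columns.
    `assignments` holds one node index per classified day."""
    total = len(assignments)
    if total == 0:
        raise ValueError("no classified days")
    counts = Counter(assignments)
    return [
        [100.0 * counts.get(r * n_cols + c, 0) / total for c in range(n_cols)]
        for r in range(n_rows)
    ]


days = [0, 0, 1, 7, 23, 7, 7, 0]
for row in node_frequencies(days):
    print(row)
```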
d. Determination of phase and bias spatial relationships for a station
e. Weighting procedures
Following this, spatial interpolations are undertaken of the PMI and BSI parameters, and ultimately interpolation of the precipitation amounts. The interpolation scheme employed is essentially that described by Willmott et al. (1985), with an additional modification described later. The basic interpolation is not complex, and implements the following procedure:
- Determine a weighting function based on distance. The distance function is based initially on geographic distances: it is a simple inverse distance for the first third of a defined radius from the target, and thereafter proportional to the inverse distance squared, decreasing to zero at the defined radius. Thus, for a defined radius R around station Oi, and a nearby station Oj within R but separated by a distance dij, the distance weight sij for station Oj is defined piecewise, with one expression for dij < R/3 and a second for R/3 < dij < R. This piecewise definition produces a weighting function as shown in Fig. 4. For comparison a standard inverse-distance squared function is included in the figure. The function shown here has the added characteristics of decreasing to zero at a distance R, with a slower decrease in influence for stations in close proximity to the target. This distance weighting may be further modified if additional information about station “distance” is available. Thus, as described later, the PMI is used to modulate the distance, as the PMI is an additional measure of “closeness” in terms of common response.
- A second set of weights (tij) are then defined that account for the radial distribution of the stations in relation to the target location (again based on Willmott et al. 1985). For a given station Oj within the radius of station Oi, the directional isolation tij of nearby station Oj is a function of the average cosine of the angles subtended at Oi by Oj with each of the other nearby stations (Ok). The distance weight calculated earlier is included to accommodate stations that are in close radial alignment, but where one is far from the target and the other close to the target. In the directional isolation weight, the (1 − cos θjik)/2 factor ranges between 0 and 1.0 and acts as a weighting function to the distance weight derived earlier. Thus two coaligned stations would have a directional isolation factor of 0.0, two stations subtending a 90° angle would have a factor of 0.5, and diametrically opposed stations would have a factor of 1.0. The average of these values across all other neighbor stations thus reflects a measure of directional isolation of a given station Oj.
- A composite weighting factor Wj for station Oj is then determined from the distance weight and the directional isolation factor.
- Having determined the set of weights contingent on the distance/radial distribution for the n surrounding stations, and given a value at each surrounding station (V), the interpolated value (I) at the target location is obtained by combining the station values according to these weights.
In practice, while there may be many stations within R of the target, we impose a limit of using the 10 closest reporting stations. If, as in a few cases, there are fewer than 10 reporting stations within R, then a minimum of three reporting stations is required. With fewer than three, we take a conservative approach and assign an undefined value to the target.
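The weighting procedure can be sketched end to end as below. The specific expressions are assumptions, since they are not reproduced in the text here: the distance weight uses the Shepard (1968) style form adopted by Willmott et al. (1985), namely 1/d inside R/3 and (27/4R)(d/R − 1)² between R/3 and R; the directional isolation is a distance-weighted mean of (1 − cos θ)/2 over the other stations; the composite weight is taken as s²(1 + t); and the interpolated value is the normalized weighted mean. Coordinates are treated as planar, and the PMI-based modification of distance is omitted.

```python
import math


def distance_weight(d, R):
    """Assumed distance weight: inverse distance inside R/3, then a
    quadratic taper that is continuous at R/3 and reaches zero at R."""
    if d <= 0.0:
        raise ValueError("distance must be positive")
    if d <= R / 3.0:
        return 1.0 / d
    if d < R:
        return 27.0 / (4.0 * R) * (d / R - 1.0) ** 2
    return 0.0


def interpolate(target, stations, R, n_max=10, n_min=3):
    """Interpolate to `target` from stations given as (x, y, value), where
    value is None for a non-reporting station. Returns None (undefined)
    when fewer than n_min reporting stations lie within R."""
    tx, ty = target
    near = []
    for x, y, v in stations:
        if v is None:
            continue
        d = math.hypot(x - tx, y - ty)
        if d == 0.0:
            return v  # target coincides with a reporting station
        if d < R:
            near.append((d, x - tx, y - ty, v))
    near.sort(key=lambda item: item[0])
    near = near[:n_max]              # keep only the closest reporting stations
    if len(near) < n_min:
        return None
    s = [distance_weight(d, R) for d, _, _, _ in near]
    weights = []
    for j, (dj, xj, yj, _) in enumerate(near):
        num = den = 0.0
        for k, (dk, xk, yk, _) in enumerate(near):
            if k == j:
                continue
            cos_theta = (xj * xk + yj * yk) / (dj * dk)   # angle at the target
            num += s[k] * (1.0 - cos_theta) / 2.0
            den += s[k]
        t = num / den                                     # directional isolation
        weights.append(s[j] ** 2 * (1.0 + t))             # assumed composite form
    total = sum(weights)
    return sum(w * item[3] for w, item in zip(weights, near)) / total


ring = [(0.3, 0.0, 10.0), (-0.3, 0.0, 20.0), (0.0, 0.3, 30.0), (0.0, -0.3, 40.0)]
print(interpolate((0.0, 0.0), ring, R=0.75))
```

With four stations placed symmetrically around the target, all weights are equal and the result reduces to the simple mean of the station values, which is a convenient sanity check on any implementation.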
f. Determining phase state (wet/dry) at each target location
The first step toward interpolating precipitation to a target location is now to identify, for each day, the phase state at the target—is it wet or dry? A target location is surrounded by a finite set of irregularly spaced observations (Ok). Each of these (Oi) is characterized by a unique spatial field of the PMI that relates this station to the region (conditional on the synoptic state). The first step is thus to interpolate the PMI of Oi for the current synoptic state to the target location.
The local PMI has been independently calculated for every station with respect to its neighbors, and one has, therefore, a set of locally conditioned overlapping phase index surfaces that provide information on the strength of the phase relationship between stations in the local area. The PMI field for each station (Oi) is interpolated to the target point to compute a probability of whether the target will have the same phase as Oi. For the phase state on a given day at Oi, the wet/dry state probability at the target location is determined by assigning a value of (+1.0 * PMI) if precipitation is reported at Oi, or (−1.0 * PMI) if no precipitation is reported. The summed value of (+1,−1)*PMI at the target from all surrounding stations indicates the probability of it being wet or dry at the target, in accordance with the strength of phase relation to the surrounding stations. Consequently, an interpolated value at the target location > zero indicates a greater than 50% chance of it raining at this location, and vice versa for a value < zero. Across all target locations this defines the spatial extent of the bounded precipitation surface. Figure 5 shows the reported station precipitation values over a small subdomain for a given example day, and the derived phase state at all target locations, demonstrating that realistic spatial boundaries for the precipitation have been determined.
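As a minimal sketch of the wet/dry decision just described, the function below sums the signed PMI contributions at a single target. It assumes that the PMI of each surrounding station has already been interpolated to the target for the current synoptic state; the function and argument names are hypothetical.

```python
import numpy as np

def phase_state(pmi_at_target, station_precip):
    """Sum (+1 or -1) * PMI over surrounding stations; positive means wet."""
    pmi = np.asarray(pmi_at_target, dtype=float)
    wet = np.asarray(station_precip, dtype=float) > 0.0
    sign = np.where(wet, 1.0, -1.0)      # +1 if the station reports rain
    score = float(np.sum(sign * pmi))
    return score, score > 0.0

# Two wet stations and one dry station around the target.
score, is_wet = phase_state([0.8, 0.6, 0.3], [5.0, 0.0, 2.0])
```

Here the dry station has a fairly strong phase relation to the target (PMI of 0.6), but the two wet stations together outweigh it, so the target is classed as wet.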
g. Interpolating precipitation magnitude
Having determined the spatial extent for wet and dry states, it remains only to determine the bias factor for each station in relation to the target, and apply this in determining the magnitude of the precipitation at locations identified as “wet.” The bias field of each station surrounding the target is interpolated to the target using the same approach as earlier, determining a BSI factor between the target and each of the neighboring stations, conditional on the synoptic state.
The relevance of this step is tied to the fact that interpolation works from the absolute magnitude of the station data. Hence, if two stations both report X mm of rainfall, and one uses regular linear interpolation to interpolate to a location between the two stations, the resultant value will be something approximating X mm (dependent on the interpolation procedure). However, in reality, it is likely that the target location has, over time, a systematic bias in relation to each of the stations. Hence the simple interpolation would introduce (potentially large) errors to the interpolated value—artifacts arising from excluding the information inherent in the bias field, and the magnitude of the error subject to the synoptic state. Thus interpolating the precipitation magnitude requires adjusting the reported values at each station by the BSI value relative to the target location.
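The two-station argument above can be illustrated numerically. The sketch assumes, purely for illustration, that the BSI acts as a multiplicative factor relating each station's magnitude to the target; the definition of the BSI is not restated in this section, and the factor values 0.5 and 0.7 are invented.

```python
import numpy as np

def bias_adjusted_mean(values, bsi, weights):
    """Weighted mean of station values after applying a per-station bias factor."""
    values, bsi, weights = (np.asarray(a, dtype=float) for a in (values, bsi, weights))
    adjusted = values * bsi   # assumption: BSI is a multiplicative factor
    return float(np.sum(weights * adjusted) / np.sum(weights))

# Both stations report 10 mm and are equally weighted.
plain = bias_adjusted_mean([10.0, 10.0], [1.0, 1.0], [1.0, 1.0])
adjusted = bias_adjusted_mean([10.0, 10.0], [0.5, 0.7], [1.0, 1.0])
```

Without the bias information the estimate is 10 mm; with the hypothetical factors it drops to about 6 mm, which is the kind of systematic difference that simple interpolation would miss.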
Following the preceding, an additional modification is introduced whereby we adjust the distance between target and station (dij, defined earlier), and hence the calculated distance weight, as a function of the relevant PMI. This adjustment reflects the fact that geographic distance alone is not an ideal measure of the relevance of a neighboring station’s data, that is, that the distance decay may be nonmonotonic. As it stands, the distance weight varies from 1.0 (dij = 0) to 0.0 (dij = R). We modify this by adding to dij a PMI-dependent proportion of the difference between the geographic dij and the radius of influence (R; the distance at which the distance weight equals zero). This effectively increases dij towards the limit R, resulting in a weighting function that reduces the influence of this station in direct proportion to the strength of the PMI. Formally, we first convert the PMI (ranging between −1.0 and +1.0) to a scaling factor (α) between 0 and 1. This scaling factor is then used to add a proportion of the distance between dij and R to the distance dij to derive a modified distance d′ij.
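A minimal sketch of this distance modification follows. The exact mapping from PMI to α is not given in this text; the linear form α = (1 − PMI)/2 is an assumption chosen so that a perfect in-phase relation (PMI = +1) leaves the distance unchanged and an inverse relation (PMI = −1) pushes the distance out to R, where the weight is zero.

```python
def pmi_scaled_distance(d, R, pmi):
    """Move d towards R by a PMI-dependent fraction of the gap (R - d)."""
    alpha = (1.0 - pmi) / 2.0      # assumed mapping: +1 -> 0, -1 -> 1
    return d + alpha * (R - d)

in_phase = pmi_scaled_distance(0.30, 0.75, 1.0)    # unchanged distance
neutral = pmi_scaled_distance(0.30, 0.75, 0.0)     # halfway to R
inverse = pmi_scaled_distance(0.30, 0.75, -1.0)    # pushed out to R
```

The modified distance would then replace the geographic distance when the distance weight is evaluated.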
h. Collocation of target and stations
A final consideration is to accommodate situations where a station is collocated with the target, or closer to the target than half the target grid increment (TGI/2—effectively the target grid cell boundary relative to the target location). As it stands, the distance–weighting function increases to infinity as dij approaches zero. If the intention is to interpolate point data then in principle this is perhaps correct. However, in attempting to estimate gridded cell average precipitation—reflecting the integral of the response surface continuum—one needs to consider the implications of multiple stations within the grid cell boundaries of the target location. For example, consider one station collocated with the target (the center of the grid cell), and a second station a mere fraction of the grid increment farther away. It is sensible to expect that the second station should have a contribution to the interpolated value for the grid cell, yet as matters stand, the inverse distance weighting leads to the collocated station having infinite influence. To address this we consider that each station within TGI/2 of the target location should have equal influence—in effect stating that we interpolate a value for a target grid cell, and not a target grid point. This is accomplished by evaluating each dij, and where less than the TGI/2 we set dij to the target grid cell boundary.
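The collocation rule amounts to a floor on the station distance. A short sketch, with hypothetical names and an illustrative grid increment of 0.25:

```python
import numpy as np

def clamp_to_cell_boundary(d, tgi):
    """Set any distance below TGI/2 to TGI/2 so in-cell stations weigh equally."""
    return np.maximum(np.asarray(d, dtype=float), tgi / 2.0)

# A collocated station, a station just inside the cell, and one outside it.
clamped = clamp_to_cell_boundary([0.0, 0.05, 0.40], tgi=0.25)
```

The first two stations both end up at a distance of 0.125 and therefore receive the same distance weight, while the station outside the cell is unaffected.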
With the preceding modifications to the station precipitation and interpolation weights, the precipitation is now interpolated to those target locations where the phase state has been determined to be wet. Figure 7 shows the resulting precipitation from the earlier procedure for the same day and domain as used in Fig. 5.
In summary, the earlier procedure performs an interpolation whereby an initial estimate is made of the wet/dry state at the interpolation point (addressing the bounded surface issue), the distance metric reflects the covariance structure appropriate to the synoptic state, and the radial distribution of source data is accounted for.
5. Assessment of interpolated fields
The assessment of interpolation always presents a problem in that objective comparative data is not available. One possibility is cross-validation, wherein one removes one station from the procedure, interpolates to the same location and compares the interpolated value with the station held back. However, this assumes first, that one is interpolating to a point, whereas in the procedure presented here the intent is explicitly to estimate the regional response surface of precipitation. Second, we recognize that an observation represents a magnitude that is a composite of the regional response plus some additional variance unique to the station and not reflected in neighboring stations—an independent component of the magnitude that no interpolation procedure can possibly determine. Consequently, in cross-validation under this situation there is no basis to expect the interpolated value to match the station removed, and in fact it would be unusual to find the station perfectly matched by the interpolated value. Further, there is no way to disaggregate any mismatch into the error due to the interpolation, or differences due to real independent variance in the station data.
Nonetheless, some forms of assessment are still possible. First we look at the two case samples from Figs. 5 and 7, then do a more general comparison of the present technique’s ability to produce accurate estimates of rainfall extent with that derived from the Cressman interpolation. The case example is from 17 January 1960, reflecting summer convective rainfall over a domain that spans the coastal margin through to high elevations. It is apparent that the interpolated phase-state field reflects the observations closely, and defines a wet/dry boundary between wet and dry stations. Similarly, the interpolated target grid cell precipitation values closely reflect the station observations lying within and close to the target grid cells. Overall the spatial gradients in rainfall, as evident from the station observations, are well captured by the interpolated surface. Subjective visual examination of multiple days (not shown) from all seasons and for different domains shows the same degree of performance from the interpolation routine.
Note that the technique as presented does not explicitly account for topographic influences (as may often be the case in some interpolation procedures in current use). However, Fig. 8 shows the topography for the example domain and demonstrates that, while some topographic dependency of station location is apparent, the density and distribution of reporting stations is such that much of the topographic forcing is effectively captured by the PMI/BSI. This may simply be fortuitous in that we began with a high station density and enough stations to capture some of the topographic variability. This may not be the case in other areas (the western United States for example). In that case, provided that there are some stations in the dataset that represent the range of elevation and aspect in the region, additional parameters analogous to the PMI/BSI could be developed specifically to account for synoptically conditioned topographic relationships. Otherwise, as with any other interpolation procedure, the investigator must either neglect topographic effects, or develop some other empirically derived elevation-dependent adjustment to the interpolated product (e.g., Daly et al. 1994; New et al. 1999). Such adjustments are likely to be domain dependent. Consequently, in the interests of presenting the methodology in a domain-independent context, this extension is not included here.
The case example may be further extended by comparison with another interpolation procedure. In this case we present an interpolated surface from the same station data using Cressman interpolation (Cressman 1959). This procedure is in common use (e.g., CPC U.S. Unified Precipitation data; available from the Climate Diagnostics Center, Boulder, Colorado; see Web site at http://www.cdc.noaa.gov) and is a standard function in the popular Grid Analysis and Display System (GrADS) visualization software package (see web site http://grads.iges.org). Cressman interpolation uses repeated passes with successively smaller radii in order to make corrections to an initial estimate. In this example we specify five iterations, with the smallest radius defined as the same radius of influence used in the conditional interpolation (0.75°) in order to provide a basis for comparison.
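For readers unfamiliar with the scheme, the following is a simplified sketch of Cressman-style successive correction, not the GrADS or CPC implementation. It uses the commonly quoted Cressman weight (R² − d²)/(R² + d²) inside the radius. The zero first guess, the planar distances, and the shortcut of tracking the analysis at the station points with the same correction formula are all assumptions made to keep the example short.

```python
import numpy as np

def cressman_weights(dist, radius):
    """Cressman weight (R^2 - d^2) / (R^2 + d^2) inside R, zero outside."""
    w = (radius ** 2 - dist ** 2) / (radius ** 2 + dist ** 2)
    return np.where(dist < radius, w, 0.0)

def cressman(obs_xy, obs_val, grid_xy, radii, first_guess=0.0):
    """Successive correction over a sequence of decreasing radii."""
    obs_xy = np.asarray(obs_xy, dtype=float)
    grid_xy = np.asarray(grid_xy, dtype=float)
    obs_val = np.asarray(obs_val, dtype=float)
    # Distances from grid points and from stations to every station.
    d_grid = np.linalg.norm(grid_xy[:, None, :] - obs_xy[None, :, :], axis=2)
    d_obs = np.linalg.norm(obs_xy[:, None, :] - obs_xy[None, :, :], axis=2)
    grid = np.full(len(grid_xy), float(first_guess))
    at_obs = np.full(len(obs_xy), float(first_guess))
    for r in radii:
        resid = obs_val - at_obs                 # observation minus analysis
        for d, field in ((d_grid, grid), (d_obs, at_obs)):
            w = cressman_weights(d, r)
            wsum = w.sum(axis=1)
            corr = np.divide(w @ resid, wsum,
                             out=np.zeros_like(wsum), where=wsum > 0)
            field += corr                        # in-place update of the analysis
    return grid

# One wet station at the origin; grid points at 0, 1, and about 7 units away.
analysis = cressman([(0.0, 0.0)], [10.0],
                    [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)], radii=[2.0, 1.0])
```

In this toy case the grid point one unit from the station receives the full station value on the first pass, because the single station is the only contributor to the normalized correction. This hints at how a scheme without a wet/dry decision can spread rainfall beyond the area supported by the observations.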
The results for the example day are presented in Fig. 9. Two differences between the interpolated surfaces are immediately apparent. First, the Cressman procedure interpolates rainfall (wet state) to many more target locations than the conditional interpolation, including locations where it is clear that a wet state is inappropriate. Second, where a station lies in very close proximity to the grid cell center (interpolation target), Cressman interpolation interpolates a value very close to the single station. However, this is often in conflict with adjacent stations where minimal, or even no, rain is reported, and consequently, assigning the single station value to the entire grid cell is inappropriate. As discussed earlier, a station magnitude comprises both a regional signal and a local component of unknown spatial relevance. Thus, an interpolation to a grid cell should not be perfectly tied to the reported value of the closest station, but should also take into account the surrounding values and gradients. The conditional interpolation procedure only interpolates grid cell values close to the reporting station maxima when the information at all close surrounding stations supports this; consequently this may be considered a better reflection of the regional response surface.
Examining this example in closer detail, Table 1 shows additional statistics of the total spatially integrated rainfall, the number of grid cells interpolated as wet, and the maximum rainfall in the domain. The total integrated rainfall for both procedures is similar, with only 5.7% difference. However, the Cressman procedure has 17% more grid cells receiving precipitation, and a 28% higher maximum. These values are only specific to the example day in question. Nonetheless the degree to which Cressman overestimates the areal extent of rainfall is illuminating, considering the two procedures have nominally the same area integral.
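The relative differences can be checked against the values quoted in the captions of Figs. 7 and 9. The short calculation below uses only those rounded caption values, so small departures from the percentages in the text are to be expected.

```python
# Values as quoted in the captions of Fig. 7 (CI) and Fig. 9 (Cressman).
ci = {"total_mm": 5274, "max_mm": 64, "wet_cells": 216}
cressman_stats = {"total_mm": 5572, "max_mm": 81, "wet_cells": 252}

# Percentage by which Cressman exceeds the conditional interpolation.
rel_diff = {k: 100.0 * (cressman_stats[k] - ci[k]) / ci[k] for k in ci}

for name, pct in rel_diff.items():
    print(f"{name}: {pct:.1f}%")
```

The total and wet-cell differences reproduce the 5.7% and 17% quoted above. The maxima as rounded in the captions give roughly 27% rather than 28%, which is presumably a rounding effect.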
The preceding example is specifically chosen from the summer convective rain season to highlight the interpolation under a convective rainfall event. If one considers an opposing situation from an alternative seasonal state, the bias between the procedures is even more marked. For a winter day (23 July 1984), representing an event of significantly reduced magnitude, the differences are even more evident (Table 2). In both cases Cressman overestimates the number of cells with rainfall. In the convective case, rainfall is more localized and Cressman simply overestimates the areal extent. In the frontal situation, rainfall is again localized, but precipitation centers are distributed over a much larger area and Cressman overestimates the spatial extent of each of these centers, resulting in a much larger overestimate of the areal extent of rainfall. We attribute this principally to the fact that the Cressman procedure does not have available the information content of spatial relationships between stations as a function of weather state, nor does it expressly accommodate the bounded nature of the precipitation surface. The earlier single-day case study can be further extended to assess the 50-yr climatological mean differences and obtain more robust estimates of the biases between the two procedures. In this case we calculate the 50-yr summation of area-integrated rainfall across the example domain, and the total number of wet grid cells (Table 3). Again, the most striking aspect is the difference in spatial extent of interpolated precipitation.
A more comprehensive evaluation of the two interpolation schemes is achieved by looking at the wet/dry determinations. Using the station data, for each grid cell on each day we have classed the grid cell as dry (no precipitation on any station within the grid cell) or wet (any, even trace, precipitation at any of the stations within the grid cell). Similarly, the Cressman interpolated and conditional interpolation (CI) data are classed as wet (> zero precipitation) or dry (zero precipitation) for grid cells coincident with those classed from the station data. Note that this excludes some grid cells that fall into interpolated regions between station observations, and this comparison only uses grid cells where station observations are present within the grid cell borders.
Further, this comparison ignores magnitude and focuses on the ability of the interpolation to capture the grid cell dry/wet state from the station information—recognizing that individual stations within the grid cell may not ideally represent what may be subjectively considered the consensus of the regional precipitation response. For example, we see in Fig. 7 examples where one isolated station shows a wet/dry state in conflict with the mass of surrounding stations, and we argue that a derived gridded area-average product should be more aligned with the regional consensus.
Following this, we make two comparisons of the interpolation schemes. First, we compare the total number of interpolated dry grid cells with the number of station-determined dry grid cells, and similarly the total number of interpolated wet grid cells with the number of station-determined wet cells. Table 4 clearly shows the propensity for Cressman interpolation to overestimate the wet state and underestimate the dry state. In contrast the CI approach estimates the dry state almost perfectly, while apparently underestimating the wet state by 13%. However, considering the argument presented earlier (that a gridded area-average product should be a consensus of the regional state), this apparent underestimation may in fact be more realistic of the regional state.
As a second comparison, we compare the number of occasions where the interpolated values represent the incorrect state (dry when should be wet, or wet when should be dry) in comparison to the station-determined categories (Table 4). These results support the same conclusions as earlier, that Cressman interpolation creates too many wet states, while CI has an underestimation of wet cells, although this is arguably only an apparent underestimation in the context of the area average. Finally we compare the Cressman-determined states directly with those from the conditional interpolation. Here Cressman interpolation produces only 51.2% of the dry states compared to the CI procedure, and has 397.7% of the wet states produced by the CI procedure.
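The wet/dry comparison described above can be sketched as follows. The data layout (a cell identifier per station and a dictionary of interpolated cell values) and all names are hypothetical; the classification rules follow the text: a cell is wet if any station inside it reports any precipitation, and an interpolated cell is wet if its value exceeds zero.

```python
def classify_cells(cell_of_station, station_precip):
    """Cell is wet if any station inside it reports precipitation > 0."""
    wet = {}
    for cell, p in zip(cell_of_station, station_precip):
        wet[cell] = wet.get(cell, False) or p > 0.0
    return wet

def wet_dry_scores(obs_wet, interp_precip):
    """Count states and mismatches over cells that contain stations."""
    scores = {"obs_wet": 0, "obs_dry": 0, "est_wet": 0, "est_dry": 0,
              "false_wet": 0, "false_dry": 0}
    for cell, is_wet in obs_wet.items():
        est_wet = interp_precip[cell] > 0.0
        scores["obs_wet" if is_wet else "obs_dry"] += 1
        scores["est_wet" if est_wet else "est_dry"] += 1
        if est_wet and not is_wet:
            scores["false_wet"] += 1     # wet when it should be dry
        if is_wet and not est_wet:
            scores["false_dry"] += 1     # dry when it should be wet
    return scores

# Three cells: "a" holds two stations (one with trace rain), "b" and "c" one each.
obs = classify_cells(["a", "a", "b", "c"], [0.0, 0.1, 0.0, 3.0])
scores = wet_dry_scores(obs, {"a": 0.0, "b": 1.2, "c": 2.0})
```

Cells that contain no stations never enter `obs`, which mirrors the restriction of the comparison to grid cells with station observations inside their borders.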
6. Summary and conclusions
While the description of the methodology may appear complex, the interpolation scheme actually follows a few relatively simple steps. In summary,
- For each observing station, categorize the synoptic circulation using as a proxy the local precipitation distribution and assign each day in the record to a particular synoptic state.
- For each station, and for each synoptic state, calculate the local phase relationships (PMI).
- Again, for each station and each synoptic state, calculate the bias relationships between the station and its surroundings (BSI).
- For each day in the record, interpolate to each target the PMI from all surrounding stations and make a rain/no-rain decision.
- For each target location where rain is indicated, interpolate a precipitation amount, where the distance–weighting function is modified by the PMI, with the source station data magnitudes adjusted by the BSI.
This procedure is presented to accommodate the growing need for daily observational precipitation datasets to support research, specifically in relation to the use of gridded climate models. Conventional interpolation techniques suffer from not recognizing the changing spatial representivity of stations as a function of the driving synoptic state. Moreover, with regard to precipitation, such techniques also fail to recognize the bounded nature of the precipitation field—that the precipitation field is spatially discontinuous. Further, interpolation techniques explicitly estimate values at new point locations, and do not directly address the need arising from climate modeling for area-average values.
Conditional interpolation explicitly accommodates many of the assumptions underlying more traditional interpolation approaches, and specifically recognizes that point-scale observations represent a mixture of synoptic forcing shared in common with surrounding stations, and a response that is unique to the station. Consequently the spatial representivity of a station is conditional on the synoptic forcing and a function of the radial direction from the station. The conditional interpolation accommodates this through conditioning the interpolation parameters as a function of the synoptic state. In a two-stage process the spatial pattern of wet/dry conditions is initially estimated, following which the magnitude of the precipitation is derived at those locations determined as “wet.” By taking into account the synoptic dependence and the bounded nature of the precipitation function, this technique offers significant advantages for interpolating daily precipitation values to an areal grid. This technique is relatively simple to operationalize and could be applied anywhere—recognizing that some regions may have to include a more explicit topographic adjustment scheme, and that the overall accuracy may be lower in regions with a lower density of observing stations. For South Africa, the conditional interpolation estimates the spatial extent of the precipitation field well, and derives gridded values representative of the area average. In comparison with the commonly used Cressman approach, both these characteristics appear to be significantly overestimated by Cressman interpolation. Overall the interpolation conditioned by the synoptic state appears to better estimate realistic gridded values appropriate for use with model simulation output.
REFERENCES
Biau, G., E. Zorita, H. von Storch, and H. Wackernagel, 1999: Estimation of precipitation by kriging in the EOF space of the sea level pressure field. J. Climate, 12, 1070–1085.
Cressman, G. P., 1959: An operational objective analysis system. Mon. Wea. Rev., 87, 367–374.
Daly, C., R. P. Neilson, and D. L. Phillips, 1994: A statistical-topographic model for mapping climatological precipitation over mountainous terrain. J. Appl. Meteor., 33, 140–158.
Hewitson, B. C., and R. G. Crane, 2002: Self organizing maps: Application to synoptic climatology. Climate Res., 22, 13–26.
Hulme, M., 1992: A 1951–80 global land precipitation climatology for the evaluation of general circulation models. Climate Dyn., 7, 57–72.
Hulme, M., 1994: Validation of large-scale precipitation fields in general circulation models. Global Precipitation and Climate Change, M. Desbois and F. Désalmand, Eds., Springer-Verlag, 387–406.
Kalnay, E., and Coauthors, 1996: The NCEP/NCAR 40-Year Reanalysis Project. Bull. Amer. Meteor. Soc., 77, 437–471.
Key, J., and R. G. Crane, 1986: A comparison of synoptic classification schemes based on “objective” procedures. J. Climatol., 6, 375–388.
Kurtzman, D., and R. Kadmon, 1999: Mapping of temperature variables in Israel: A comparison of different interpolation methods. Climate Res., 13, 33–43.
Lynch, S. D., 2002: The development of an improved gridded database of annual, monthly, and daily rainfall. WRC Project Rep. K5/1156/0/1, Water Research Commission.
New, M., M. Hulme, and P. Jones, 1999: Representing twentieth-century space–time climate variability. Part I: Development of a 1961–90 mean monthly terrestrial climatology. J. Climate, 12, 829–856.
Osborn, T. J., 1997: Areal and point precipitation intensity changes: Implications for the application of climate models. Geophys. Res. Lett., 24, 2829–2832.
Osborn, T. J., and M. Hulme, 1997: Development of a relationship between station and grid-box rainday frequencies for climate model evaluation. J. Climate, 10, 1885–1908.
Osborn, T. J., and M. Hulme, 1998: Evaluation of the European daily precipitation characteristics from the Atmospheric Model Intercomparison Project. Int. J. Climatol., 18, 505–522.
Piper, S. C., and E. F. Stewart, 1996: A gridded global data set of daily temperature and precipitation for terrestrial biospheric modeling. Global Biogeochem. Cycles, 10, 757–782.
Shepard, D. S., 1968: A two-dimensional interpolation function for irregularly-spaced data. Proc. 23d ACM National Conf., ACM Publication P-68, 517–524.
Skelly, W. C., and A. Henderson-Sellers, 1996: Gridbox or grid point: What type of data do GCMs deliver? Int. J. Climatol., 16, 1079–1086.
Tennant, W. J., 2002: Event characteristics of intra-seasonal climate circulations. Ph.D. thesis, University of Cape Town, 173 pp.
Tennant, W. J., and B. C. Hewitson, 2002: Intra-seasonal rainfall characteristics and their importance to the seasonal prediction problem. Int. J. Climatol., 22, 1033–1048.
Willmott, C. J., C. M. Rowe, and W. D. Philpot, 1985: Small-scale climate maps: A sensitivity analysis of some common assumptions associated with grid-point interpolation and contouring. Amer. Cartogr., 12, 5–16.
Yarnal, B., 1993: Synoptic Climatology in Environmental Analysis. Belhaven Press, 195 pp.
Fig. 1. (top) The spatial domain of study and the location of reporting stations. The inset square is the domain that is focused on in the results presented later, a region of complex topographical forcing. (bottom) A time series of the number of reporting stations on any given day.
Citation: Journal of Climate 18, 1; 10.1175/JCLI3246.1
Fig. 2. Example of SOM-derived archetype precipitation fields (mm) in relation to station Oi. Of note are nodes such as (x = 2, y = 1) showing dominantly localized rain at Oi, compared to (x = 1, y = 2) where the station has similar rain as the broad region to the northwest. Node (x = 4, y = 4), on the other hand, shows a situation where rain at the station is minimal while significant rain occurs to the southeast. The boxed numbers on each node reflect the frequency of occurrence (%). Node (x = 5, y = 2) is the dry state across the entire region, and hence has a very high frequency of occurrence in comparison to the other nodes. Darker shades are higher precipitation (contour values excluded for clarity).
Fig. 3. The array of PMI fields on each SOM node for the example station Oi. Note node (x = 5, y = 2), which in Fig. 2 is the node with zero precipitation across the region. For this node the PMI is a constant 1.0. The rest of the nodes clearly show how markedly different the phase relationship between the station Oi and the region can be. For example, contrast node (x = 4, y = 1) with node (x = 1, y = 3). Darker shades are degrees of phase match (contour values excluded for clarity).
Fig. 4. Comparison of the CI weighting scheme with a standard inverse distance and inverse distance squared weighting, for a specified radius of 0.75.
Fig. 5. Example of calculated PMI values on the target grid for one day (17 Jan 1960). Plotted as numbers are the precipitation amounts (mm) of the reporting stations on this day. White indicates a PMI value of ∼0 (indeterminate wet/dry state), red is PMI > 0 (wet state), and blue is PMI < 0 (dry state).
Fig. 6. The effect of modifying dij as a function of varying PMI from −1 (inverse relation) to +1 (perfect in-phase relation).
Fig. 7. The CI precipitation for the same day and domain as in Fig. 5. White grid cells indicate zero precipitation. Total spatially integrated precipitation = 5274 mm. Maximum precipitation = 64 mm. Number of grid cells receiving rain = 216.
Fig. 8. Topography (m) and the location of stations used in the interpolation. The domain encompasses a coastal plain (ocean to the east) onto an escarpment leading to the continental interior.
Fig. 9. Cressman-interpolated precipitation (mm) for the same day and domain as in Fig. 7. White grid cells indicate zero precipitation. Total integrated precipitation = 5572 mm. Maximum precipitation = 81 mm. Number of grid cells receiving rain = 252.
Table 2. Statistics of interpolated rainfall for a winter rainfall event on 23 Jul 1984.
Table 3. The 50-yr statistics of interpolated rainfall for the domain in Fig. 9. Calculated as the summation of the area integral of precipitation magnitude and number of wet grid cells over the full 50 yr of interpolated data.
Table 4. Comparison of observations with the wet and dry decisions from the Cressman and the conditional interpolation schemes.
It is important to avoid confusion here. We are speaking here of data dimensions, not geographic spatial dimensions.