1. Introduction
Gridded rainfall datasets are essential for various applications in the geosciences, including climate and hydrologic modeling, decision-making in environmental management, water resource planning, and weather forecasting. In Hawai‘i, a number of gridded rainfall products that cover a range of temporal and spatial resolutions have been produced over the past decade. Available gridded rainfall products include climatological rainfall maps at 250-m spatial resolution, based on the period encompassing 1978–2007 (Giambelluca et al. 2013), month-year rainfall maps for the period 1920–2012 (Frazier et al. 2016), daily rainfall maps for the period 1990–2014 (Longman et al. 2019), and a 100-member rainfall ensemble at 1-km resolution (Newman et al. 2019b). While these products have been utilized in many different ways, their end dates are a limiting factor in examining rainfall trends and phenomena that have emerged in recent years (e.g., Lucas et al. 2020; Frazier et al. 2018; Frazier and Giambelluca 2017; Krushelnycky et al. 2016; Mair et al. 2019; Frauendorf et al. 2019).
In this work, available rainfall observations are used to develop monthly gridded estimates of rainfall at 250-m spatial resolution for the period 1990–2019 using an optimized ordinary kriging approach. Kriging is commonly used for spatial interpolation of rainfall at various spatial scales and temporal resolutions (Hattermann et al. 2005; Buytaert et al. 2006; Berezowski et al. 2016; Brinckmann et al. 2016; Frazier et al. 2016) but requires a significant amount of point data to achieve accurate results. To quantify spatial autocorrelation structure in the form of an empirical semivariogram, as required in kriging, at least 100 measurement pairs (ideally 150) are needed (Webster and Oliver 2001) for the interpolation.
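As a rough illustration of the empirical semivariogram mentioned above, the following Python sketch bins station pairs by separation distance and computes the semivariance in each bin. The function name, the bin width, and the toy coordinates are hypothetical and are not drawn from the study itself.

```python
import math
from collections import defaultdict

def empirical_semivariogram(points, values, bin_width):
    """Return {bin_index: (mean_distance, semivariance, n_pairs)}.

    points: list of (x, y) coordinates; values: observations at those points.
    The semivariance of a bin is 0.5 * the mean squared difference of all
    pairs whose separation distance falls in that bin.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # distance sum, squared-diff sum, count
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            acc = sums[int(d // bin_width)]
            acc[0] += d
            acc[1] += (values[i] - values[j]) ** 2
            acc[2] += 1
    return {b: (s[0] / s[2], 0.5 * s[1] / s[2], s[2]) for b, s in sorted(sums.items())}

# Toy example (hypothetical data): four stations on a line, 1 km apart.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
vals = [1.0, 1.2, 0.8, 1.5]
gamma = empirical_semivariogram(pts, vals, bin_width=1.0)
```

Four stations yield only six pairs, far fewer than the 100 or more pairs noted above, which is why a dense station network matters for this step.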
In Hawai‘i, previous work has demonstrated that ordinary kriging is an effective way to map rainfall over complex topography at the monthly time scale (Mair and Fares 2011; Frazier et al. 2016). Frazier et al. (2016) compared three kriging algorithms, namely ordinary kriging, ordinary cokriging, and kriging with an external drift (also termed “universal kriging”; see Hengl et al. 2007), and found that ordinary kriging produced the lowest errors. In their approach, these authors applied a climatologically aided interpolation (CAI) (Willmott and Robeson 1995) method, which combines long-term climate information with monthly station data to develop the monthly gridded estimates. This allows for information propagation from the climatological patterns, based on a denser climatological station network and the use of ancillary information, through to the monthly fields. In Hawai‘i, the climatological network is much denser than the monthly observation network and more completely resolves the steep rainfall gradients associated with orographic processes (Giambelluca et al. 2013; Frazier et al. 2016). The long-term climate mean maps (Giambelluca et al. 2013) used a Bayesian data fusion method to combine rain gauge data with radar rainfall estimates, mesoscale meteorological model output (MM5), Parameter-Elevation Regressions on Independent Slopes Model (PRISM) maps (Daly et al. 1994), and vegetation-based rainfall estimates to improve the overall accuracy of the product. These methods, however, cannot be reproduced for month-year mapping because the predictor variables (vegetation, PRISM, MM5, and radar rainfall) are not available at a monthly temporal resolution.
In CAI, departures from the mean (relative anomalies) on a given month are interpolated using ordinary kriging and then combined with a mean map to produce the final monthly map (Dawdy and Langbein 1960; Willmott and Robeson 1995; Frazier et al. 2016). This approach allows information about complex rainfall patterns found in the long-term mean map to be incorporated into the final product. The CAI approach has also been shown to produce better results than interpolating absolute rainfall values at the global scale (New et al. 2000) and in Hawai‘i (Newman et al. 2019a).
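The three CAI steps described above (compute relative anomalies, interpolate them, and recombine with the mean map) can be sketched as follows. This is a minimal illustration only: inverse-distance weighting is used as a simple stand-in for ordinary kriging, and all station values, coordinates, and the pixel mean are hypothetical.

```python
import math

def idw(x, y, stations, power=2.0):
    """Inverse-distance-weighted estimate at (x, y); a simple stand-in for kriging."""
    num = den = 0.0
    for sx, sy, value in stations:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value  # exact at station locations
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Hypothetical stations: (x, y, observed monthly rainfall in mm, long-term monthly mean in mm)
stations = [(0.0, 0.0, 150.0, 100.0), (10.0, 0.0, 60.0, 50.0), (0.0, 10.0, 240.0, 200.0)]

# Step 1: relative anomaly (observed / mean) at each station.
anomalies = [(x, y, obs / mean) for x, y, obs, mean in stations]

# Step 2: interpolate the anomaly to a target pixel.
pixel_xy, pixel_mean = (5.0, 5.0), 120.0  # hypothetical pixel and climatological mean
pixel_anomaly = idw(*pixel_xy, anomalies)

# Step 3: multiply by the climatological mean at that pixel.
pixel_rainfall = pixel_anomaly * pixel_mean
```

Because the anomaly field is smoother than the rainfall field itself, the spatial detail in the final estimate comes largely from the mean map rather than from the interpolator.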
Other methods such as universal kriging and linear regression that incorporate elevation-dependent covariates to map rainfall at monthly and annual time steps have outperformed ordinary kriging elsewhere (see Goovaerts 2000; Bostan et al. 2012). The strong performance of the ordinary kriging CAI approach in Hawai‘i is explained by two factors: 1) covariate information will only improve the precision of the interpolation if the primary variable is undersampled (i.e., sparse station network, which is not the case in Hawai‘i), and 2) by interpolating anomalies instead of the raw rainfall values, most of the information about the surface (e.g., elevation-dependent features) is already incorporated into the model.
This paper summarizes the data collection efforts and the methods used to create a 30-yr gridded time series (1990–2019) of monthly rainfall maps in Hawai‘i using a CAI-ordinary kriging approach. This endeavor builds on a long history of mapping rainfall in Hawai‘i and improves on previous products in its time period of coverage, the inclusion of new observation stations, and an automated approach to optimize the kriging model. Note that precipitation in Hawai‘i consists of rainfall, different types of frozen precipitation (e.g., snow, sleet, hail, and freezing rain), and fog drip (Giambelluca et al. 2013). For consistency with previous efforts (Giambelluca et al. 2013; Frazier et al. 2016; Longman et al. 2018, 2019), the term “rainfall” is used throughout the paper, as other types of precipitation (e.g., snow, fog drip) are not explicitly measured by rain gauges in Hawai‘i.
2. Data
A 30-yr time series of daily rainfall data is compiled from several sources, including a previously published dataset covering the period 1990–2014 [see Longman et al. (2018) for a comprehensive description of the climate data networks in Hawai‘i]. Additional data, available between 1990 and 2019, are obtained from several national online data repositories including the Hydrometeorological Automated Data System (HADS; https://hads.ncep.noaa.gov/), the National Centers for Environmental Information (NCEI; https://www.ncei.noaa.gov/), and the Soil Climate Analysis Network (SCAN; https://www.wcc.nrcs.usda.gov/scan/). Data are also obtained through various local networks and repositories. In total, 622 unique measurement points are identified in the dataset. A map of all the stations utilized in the rainfall mapping effort is shown in Fig. 1.
Rainfall stations used to create monthly gridded surfaces (n = 622).
Citation: Journal of Hydrometeorology 23, 4; 10.1175/JHM-D-21-0171.1
3. Methods
a. Processing daily rainfall data
Rainfall station data are made available in many different formats; therefore, after the data are acquired, the first step is to convert them into a consistent standard format that includes SI units, time stamp, and time step. Next, data are screened and flagged for extraneous outlying values following methods published by Longman et al. (2018). Rainfall data are initially quality-controlled at the station’s native time step (which can range from 5 min to 24 h) and are then aggregated to a daily 24-h accumulated value. For days in which >1 h of rainfall data is unavailable, the day is set to missing. Next, gaps in the daily record are partially filled using the normal ratio method following the optimization criteria defined by Longman et al. (2020). When gap-filling is not possible, the day is considered missing. Finally, daily data are aggregated to the monthly time step. Only monthly periods with complete daily records (observed or filled values) are aggregated; months with missing daily data are considered to be missing from the record. This dataset is referred to as the Lucas et al. data hereafter.
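The gap-filling and monthly aggregation rules described above can be sketched in Python as follows. The normal ratio estimate is written in its classic textbook form; the optimization criteria of Longman et al. (2020) are not reproduced here, and the function names and sample values are hypothetical.

```python
import calendar

def normal_ratio_fill(target_normal, neighbors):
    """Classic normal ratio estimate for a missing daily value.

    neighbors: list of (neighbor_normal, neighbor_observation) pairs. Each
    neighbor observation is scaled by the ratio of the target station's
    long-term normal to the neighbor's normal, and the results are averaged.
    """
    return sum(target_normal / n * p for n, p in neighbors) / len(neighbors)

def monthly_total(daily, year, month):
    """Sum daily rainfall (mm) for a month; return None if any day is missing.

    daily: dict mapping day-of-month -> rainfall in mm (observed or gap-filled);
    a value of None or an absent key means the day is missing.
    """
    n_days = calendar.monthrange(year, month)[1]
    values = [daily.get(day) for day in range(1, n_days + 1)]
    if any(v is None for v in values):
        return None  # an incomplete month is treated as missing
    return sum(values)

# Hypothetical records for February 2019 (28 days).
complete = {day: 2.5 for day in range(1, 29)}
incomplete = {day: 2.5 for day in range(1, 28)}  # day 28 is missing
filled_value = normal_ratio_fill(100.0, [(50.0, 5.0), (200.0, 20.0)])
```

Returning None for an incomplete month, rather than a partial sum, mirrors the rule that months with missing daily data are excluded from the record.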
In addition to the Lucas et al. data, a monthly rainfall dataset (1990–2012) published by Frazier et al. (2016) is utilized for method validation purposes (henceforth, Frazier et al. data). The Frazier et al. data include 2012 rainfall stations, 811 of which are not included in the Lucas et al. data. Specific uses of these data are described in section 3.
b. Monthly rainfall data preprocessing
1) Anomaly calculation
Relative anomalies (i.e., the ratio of observed to mean rainfall) are preferred over absolute anomalies (i.e., observed minus mean) because the ratio better preserves the relationship between the mean and the variance (New et al. 2000). In addition, relative anomaly interpolation can be justified under the assumption of rainfall data following the lognormal distribution (described below) or something similar for transition.
2) Anomaly transformation
Finally, the rainfall maps are qualitatively evaluated to identify unrealistic features or patterns, including negative values, extremely high (unrealistic) values, spots or large anomalous areas that do not follow typical rainfall gradients, or maps that have unusual rainfall patterns. Of the four constant values tested here, c = 1 produced the fewest unrealistic maps; therefore, this constant value is used in Eqs. (2) and (3).
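The role of the constant c can be illustrated with a short Python sketch. The exact forms of Eqs. (2) and (3) are not reproduced in this section, so the sketch assumes a natural-log transform of the form ln(a + c) and its inverse exp(t) − c, where a is the relative anomaly; the function names are hypothetical.

```python
import math

C = 1.0  # constant added before taking the log; c = 1 is the value selected in the text

def transform(anomaly, c=C):
    """Assumed forward transform: natural log of (relative anomaly + c)."""
    return math.log(anomaly + c)

def back_transform(value, c=C):
    """Assumed inverse transform: exp(value) - c."""
    return math.exp(value) - c

zero_rain = transform(0.0)                   # a dry month maps to log(1) = 0, not -infinity
round_trip = back_transform(transform(1.5))  # recovers the original anomaly
```

Adding a positive constant keeps months with zero rainfall defined under the log transform, which is one reason the choice of c can affect the realism of the back-transformed maps.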
3) Eliminating collocated stations
The autoKrige function used to interpolate rainfall anomalies automatically removes collocated observation points (stations within the same 250 m × 250 m pixel). Given that not all rainfall networks have the same quality standards, it is important to control which stations are removed to ensure that the highest-quality data are used in the interpolation. To accomplish this, a pixel ID is assigned to all observation stations within a given month and stations are ranked according to the quality of the observational network they belong to. The order in which networks are ranked is determined based on the authors’ comprehensive knowledge of historical rainfall monitoring in the state of Hawai‘i. In general, automated data reporting networks with the highest temporal resolution data are ranked first, and networks with manually read gauges are ranked last. Other ranking criteria include network characteristics such as type of instrumentation used, known quality control protocols and procedures, and overall data quality based on past analyses (Giambelluca et al. 2013; Frazier et al. 2016; Longman et al. 2018, 2019). Lower ranked station observations sharing the same pixel ID with higher ranked stations are omitted prior to the interpolation.
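The pixel-based screening of collocated stations can be sketched as follows. The pixel ID scheme, the network names, and the ranks are hypothetical placeholders and do not reflect the authors’ actual network ranking.

```python
def pixel_id(x, y, cell=250.0):
    """Hypothetical pixel ID: integer (column, row) of the 250-m cell containing (x, y)."""
    return (int(x // cell), int(y // cell))

def drop_collocated(stations, network_rank):
    """Keep one station per pixel: the one whose network has the best (lowest) rank."""
    best = {}
    for st in stations:
        pid = pixel_id(st["x"], st["y"])
        current = best.get(pid)
        if current is None or network_rank[st["network"]] < network_rank[current["network"]]:
            best[pid] = st
    return list(best.values())

# Hypothetical networks and ranks (1 = highest quality).
rank = {"automated_hourly": 1, "manual_daily": 2}
stations = [
    {"id": "A", "x": 100.0, "y": 100.0, "network": "manual_daily"},
    {"id": "B", "x": 200.0, "y": 120.0, "network": "automated_hourly"},  # same pixel as A
    {"id": "C", "x": 900.0, "y": 100.0, "network": "manual_daily"},
]
kept = drop_collocated(stations, rank)
```

Performing this screening before the interpolation keeps the choice of the retained station under the analyst’s control instead of leaving it to the interpolation routine.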
c. Monthly rainfall data preprocessing
1) Kriging parameters
Idealized semivariogram with parameters labeled (nugget, sill, and range).
The autoKrige function is designed to optimize the kriging model based on the characteristics of the data. However, applying the default model parameter selection can result in overfitting the semivariogram, resulting in maps with extreme rainfall gradients occurring over short distances (e.g., <1 km) surrounding individual station locations (Fig. 3). These types of errors are typically easy to identify manually at the monthly time step, where maps should have a much smoother transition between wet and dry areas. The identification and correction of these errors is a critical step in producing a quality dataset.
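To make the roles of the nugget, sill, and range concrete, the following Python sketch evaluates a spherical semivariogram model. The spherical form is chosen here only for its simplicity and is an assumption for illustration; the parameter values are hypothetical, and the sill is expressed as a nugget plus a partial sill.

```python
def spherical(h, nugget, partial_sill, rng):
    """Spherical semivariogram: 0 at h = 0, rising to nugget + partial_sill at h >= rng."""
    if h == 0:
        return 0.0
    if h >= rng:
        return nugget + partial_sill
    r = h / rng
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)

# Hypothetical parameters: nugget 0.1, partial sill 0.4, range 10 km.
at_half_range = spherical(5.0, 0.1, 0.4, 10.0)
at_range = spherical(10.0, 0.1, 0.4, 10.0)
```

With a very short fitted range, the modeled correlation between stations vanishes within a small distance, so predictions can change sharply close to each gauge; this is one plausible way in which an overfitted semivariogram leads to the spotted patterns described above.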
Preliminary rainfall map of Hawai‘i Island for March 2012, demonstrating how overfitting of semivariogram parameters can produce an unrealistic monthly rainfall surface.
2) Identifying unrealistic rainfall maps
An initial run using the default kriging model and parameter settings of the autoKrige function is performed on three statewide monthly rainfall datasets. These datasets include the Lucas et al. and the Frazier et al. monthly rainfall data described in section 3a and a monthly dataset consisting of 2247 stations that combines the Lucas et al. and Frazier et al. (2016) datasets, with preference given to the Lucas et al. dataset in instances when the same stations appear in both datasets.
Rainfall maps are produced using all three datasets for a 23-yr overlapping period (1990–2012). For each dataset, a set of monthly rainfall maps is generated for each of four counties in Hawai‘i [Hawai‘i County, Maui County, Honolulu County (O‘ahu), and Kaua‘i County].
The initial autoKrige default run produced 276 monthly maps for each county and for all three datasets, thus providing a total of 3312 maps (1104 for each dataset and 828 maps of each individual county extent). For all of the maps, a manual qualitative inspection is performed to classify the maps into one of three categories: “satisfactory” (acceptable likely rainfall pattern and value range), “poor” (unusual or unlikely rainfall pattern with an acceptable value range), and “unrealistic” (improbable rainfall pattern and/or unfeasible value range). Maps classified as unrealistic were generally characterized as such due to extremely high or low (or negative) rainfall values that are not in agreement with nearby stations or due to an overfitting of the semivariogram parameters [section 3c(1)], typically resulting in a “spotted” or blotchy map. The purpose of the manual inspection is to establish a training dataset that can be used to train a machine learning-based automatic screening algorithm.
The results of this manual classification revealed that, overall, the default settings of the autoKrige function performed well, with only 3.9% (130 of 3312) of the maps manually classified as unrealistic. Even though the percentage of unrealistic maps is low, the problem must still be addressed, as these maps have the potential to bias scientific analyses and results. To resolve the issue, an automated approach is developed to flag unrealistic maps. First, the results of the qualitative map assessment are used to assign a prediction class variable, unrealistic or satisfactory, to each map. Note that the maps previously classified as poor were not included in either category; therefore, these maps were not used to train the automated detection algorithm. Then the maps from each of the two categories are used to train a random forest machine learning classification algorithm (Breiman 2001) to identify dubious maps. The input dataset used to train the classifier comprised 85 spatially derived predictor variables, many of which are topographic index values such as slope, curvature, roughness, and topographic roughness index (TRI), as well as statistical summaries, distance correlations, and statistical significance (Table S1 in the online supplemental material). The random forest was fit and tuned for the mtry (number of variables available for splitting at each tree node) and ntree (number of trees) parameters. A tenfold cross validation with three repeats and random search was used to obtain the best mtry of 51, and a manual grid search was used to determine the optimum ntree of 500. These tuned parameter values were then used to fit and evaluate the final random forest classification model. The tuned random forest machine learning algorithm produced an 87% classification accuracy with a kappa statistic (comparison metric of observed accuracy to random chance expected accuracy) of 0.77.
The end result is an automated approach for detecting unrealistic maps in terms of the probability of a map being unrealistic.
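The relationship between classification accuracy and the kappa statistic reported above can be illustrated with a short Python sketch. The confusion matrix below is hypothetical and is not the authors’ result; kappa is computed in the standard way as (observed accuracy − expected accuracy) / (1 − expected accuracy).

```python
def accuracy_and_kappa(confusion):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] = number of maps with true class i predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / total ** 2
    return observed, (observed - expected) / (1 - expected)

# Hypothetical 2 x 2 matrix (rows: true class, columns: predicted class).
acc, kappa = accuracy_and_kappa([[45, 5], [10, 40]])
```

Kappa is lower than raw accuracy because it discounts the agreement that would be expected by chance alone, which matters when the two classes are unevenly represented.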
3) Auto-kriging and variogram parameter schemes
Since unrealistic rainfall maps can occur using the autoKrige function with default settings, fine-tuning of the parameters is a critical next step to ensure that these unrealistic maps are not included in the final dataset. The autoKrige default setting selects one of five statistical models and then determines the parameters used to fit the model to the empirical semivariogram. Both model type and fitting parameters can also be chosen manually. The sensitivity of model selection and the range of the parameters used to fit the model are tested here to determine the optimal geostatistical model and parameterization. Using the output from the initial model run, model type and fitting parameters are compared across the two qualitative classes (unrealistic and satisfactory) for all of the maps produced with the three datasets. This can be viewed as preliminary model validation/screening, since interpolated results even with optimized parameters are not necessarily the “best” maps due to the violation of geostatistical model assumptions (e.g., nonstationarity and anisotropy of the datasets that are not properly modeled in the current geostatistical model). For maps classified as satisfactory, the model type used in the interpolation is identified and counted. The model parameters (range, sill, and nugget) used to define the semivariogram are also extracted and separated by the two map classifications (unrealistic and satisfactory). Following the extraction of these parameters, a statistical analysis that utilizes a binomial generalized linear model (GLM; R Core Team 2018) of the likelihood of a map being unrealistic is conducted. The likelihood of a map being unrealistic (as a binary) is tested as a function of semivariogram model type (discrete), county (discrete), nugget (continuous), range (continuous), and sill (continuous), with an interaction between county and all other variables as predictor variables.
This is done to determine which variables influence the likelihood of a map being unrealistic. Overall, the GLM had an explained variance of 45%. Each variable in the GLM was then tested using ANOVA with a chi-squared test. Results showed that semivariogram model type, county, nugget, and range were significant (p < 0.05) as individual variables that determine the likelihood of an unrealistic map, while the sill was determined not to be significant.
Given the GLM results, a set of fixed values for both the nugget and range is established based on the 25th percentile, 50th percentile, and 75th percentile of the distributions of these parameters from all of the satisfactory maps. These fixed values are identified for each county to create a set of low, medium, and high parameter values (see Table 1).
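The derivation of low, medium, and high fixed values from the parameter distributions can be sketched in Python as follows. The nugget values are hypothetical, and the "inclusive" quantile method is an assumption; the percentile definition used by the authors is not stated.

```python
from statistics import quantiles

def low_medium_high(values):
    """Return the 25th, 50th, and 75th percentiles of a parameter distribution."""
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return {"low": q25, "medium": q50, "high": q75}

# Hypothetical nugget values taken from the satisfactory maps of one county.
nuggets = [0.02, 0.04, 0.05, 0.07, 0.10]
fixed = low_medium_high(nuggets)
```

Applying the same function separately to the nugget and the range for each county would produce a table of fixed values analogous to Table 1.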
Fixed values for the nugget and range based on the 25th, 50th, and 75th percentile values from satisfactory maps for each of the four counties.
4) Parameter scheme selection and final map selection
To ensure that the final map products are realistic and have the highest validation statistics, we provide a practical map selection strategy using different criteria and rankings. The analysis of model type revealed that 80% of the satisfactory maps use the Matérn model with M. Stein’s parameterization (henceforth, reparametrized Matérn) to fit the variogram. Based on this result, the reparametrized Matérn model is held constant in the autoKrige function for all subsequent iterations of the maps. Beyond this sole fixed covariance model choice, the creation of a month-year rainfall map is a dynamic and tiered process. First-tier rainfall maps are created using the Lucas et al. dataset with four model parameterization schemes: 1) “Free-All,” where all variogram parameters are automatically chosen by the default autoKrige function; 2) “Fixed-Nugget,” where the nugget is fixed at a “low” (25th percentile) value; 3) “Fixed-Range,” where the range is fixed at a “low” (25th percentile) value; and 4) “Free-Sill,” where the sill is selected automatically by the default autoKrige function but the nugget and range are fixed at their low (25th percentile) values. For each month, four versions of the map are created using these different parameterization schemes, and each map is assigned, using the random forest machine learning classification algorithm, to one of the two classification categories (unrealistic or satisfactory). If only one of the four maps is classified as satisfactory, then this map is selected as the final month-year rainfall map.
If more than one of the four maps is classified as satisfactory, then the final month-year rainfall map is selected based on the quality of the map as determined by four error metrics: 1) the percent probability of a map being unrealistic (as determined by the machine learning algorithm described above); 2) the coefficient of determination (R2); 3) the mean absolute error (MAE); and 4) the root-mean-square error (RMSE), derived from cross-validation results (which will be discussed in the next section). First, all four metrics are normalized on a 0–100 scale, including three significant digits. To accomplish this, R2 is converted to a percent, and MAE and RMSE errors are divided by mean rainfall for each respective month-year and then converted to an inverted percent (MAE and RMSE errors greater than 100% are set to 0 in the inversion). The map with the highest median value of the four normalized metrics is selected as the final monthly rainfall map. In the event that the median values are the same, the following criteria are used to break ties in this order: lowest probability of being unrealistic, highest R2, lowest MAE, lowest RMSE, and finally random selection (if all other metrics are equal, one of the maps is randomly chosen).
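The normalization and ranking procedure described above can be sketched as follows. Two points are assumptions and not stated in the text: the probability of being unrealistic is inverted (100 − P) so that larger values are better for all four metrics, and the rounding to three significant digits and the final random tie-break are omitted for brevity. All metric values are hypothetical.

```python
from statistics import median

def normalized_scores(m, mean_rain):
    """Rescale the four metrics to a 0-100 scale on which larger is better."""
    def inverted_pct(err):
        pct = 100.0 * err / mean_rain
        return 0.0 if pct > 100.0 else 100.0 - pct  # errors above 100% score 0
    return [
        100.0 - m["p_unrealistic"],  # assumption: the probability is inverted as well
        100.0 * m["r2"],
        inverted_pct(m["mae"]),
        inverted_pct(m["rmse"]),
    ]

def select_map(candidates, mean_rain):
    """Pick the highest median score; ties fall back to the ordered criteria."""
    def key(m):
        med = median(normalized_scores(m, mean_rain))
        # Sort order: highest median, lowest probability, highest R2, lowest MAE, lowest RMSE.
        return (-med, m["p_unrealistic"], -m["r2"], m["mae"], m["rmse"])
    return min(candidates, key=key)

# Hypothetical satisfactory maps for one month-year (mean rainfall 200 mm).
candidates = [
    {"scheme": "Free-All", "p_unrealistic": 40.0, "r2": 0.80, "mae": 30.0, "rmse": 50.0},
    {"scheme": "Free-Sill", "p_unrealistic": 10.0, "r2": 0.78, "mae": 32.0, "rmse": 52.0},
]
best = select_map(candidates, mean_rain=200.0)
```

In this toy case the Free-Sill map is chosen even though its error statistics are slightly worse, because its much lower probability of being unrealistic raises its median score.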
If none of the four tier-1 maps have a satisfactory classification, then a second-tier mapping process is executed in which medium (50th percentile) values are fixed for the nugget and the range and maps are created using the same four parameterization schemes. If tier-2 maps do not produce a map with a satisfactory classification, then high (75th percentile) values are fixed for the nugget and the range to create a set of tier-3 maps using the same four parameterization schemes. If a satisfactory classification map is not obtained after the execution of all three mapping tiers, then the tier-3 map with the highest median value of the four rescaled error metrics is selected as the final rainfall map. Note that the Free-All method uses the same default parameterization for all three tiers and is performed only once.
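The tiered fallback logic can be summarized in a short Python sketch. The classifier and ranking functions are passed in as arguments, and the map records, class labels, and scores below are hypothetical stand-ins for the random forest output and the normalized metrics.

```python
def select_final(tiers, classify, pick_best):
    """Walk through the tiers in order and return (tier_number, chosen_map).

    tiers: list of lists of candidate maps, tier 1 first.
    classify(m) -> "satisfactory" or "unrealistic".
    pick_best(maps) -> the map with the highest median normalized metric.
    """
    for number, maps in enumerate(tiers, start=1):
        good = [m for m in maps if classify(m) == "satisfactory"]
        if good:
            return number, good[0] if len(good) == 1 else pick_best(good)
    # No satisfactory map in any tier: fall back to the best map of the last tier.
    return len(tiers), pick_best(tiers[-1])

# Hypothetical candidates; "cls" stands in for the classifier output.
tiers = [
    [{"scheme": "Free-All", "cls": "unrealistic", "score": 70.0},
     {"scheme": "Free-Sill", "cls": "unrealistic", "score": 72.0}],
    [{"scheme": "Fixed-Nugget", "cls": "unrealistic", "score": 71.0},
     {"scheme": "Free-Sill", "cls": "satisfactory", "score": 69.0}],
    [{"scheme": "Fixed-Range", "cls": "unrealistic", "score": 75.0}],
]
by_score = lambda maps: max(maps, key=lambda m: m["score"])
tier, chosen = select_final(tiers, lambda m: m["cls"], by_score)
```

In this example the tier-2 Free-Sill map is selected despite its lower score, because a satisfactory classification takes precedence over the error metrics.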
This tiered system ensures that a map with the most plausible rainfall pattern and the highest validation statistics is selected as the final month-year map (Fig. 4).
Workflow for parameter scheme selection and final monthly map selection for each month. For each of the three tiers, maps are created using unique parameter configurations including 1) Free-All, where all variogram parameters are automatically chosen by the autoKrige function; 2) Fixed-Nugget, where the nugget is fixed; 3) Fixed-Range, where the range is fixed; and 4) Free-Sill, where the sill is selected automatically by the autoKrige function but the nugget and range are fixed. Fixed variogram parameters are set at the 25th, 50th, and 75th percentile for tiers 1–3, respectively. Note that the Free-All map is the same for all three tiers.
d. Map validation
For each map, leave-one-out cross validation (LOOCV) statistics are generated for every station used in the interpolation. LOOCV points are generated by sequentially leaving out one measured data point and reproducing it based on the information from the remaining station observations. LOOCV points are then tested against station observations using several validation metrics such as R2, bias, MAE, and RMSE. Bias is calculated as the average difference between the predicted and observed rainfall at each station location. Positive bias indicates an underprediction of observed values. Relative (normalized) percent errors are also calculated and used to assess predictions.
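The LOOCV procedure and the validation metrics can be sketched in Python as follows. Several choices are assumptions made for illustration: inverse-distance weighting replaces the kriging predictor, R2 is computed as the squared Pearson correlation, and bias is computed as observed minus predicted so that a positive value corresponds to underprediction, as stated above. The station data are hypothetical.

```python
import math

def idw(x, y, stations, power=2.0):
    """Inverse-distance weighting; a simple stand-in for the kriging predictor."""
    num = den = 0.0
    for sx, sy, value in stations:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

def loocv(stations, predict=idw):
    """Predict each station from all the others, one station at a time."""
    return [
        predict(x, y, stations[:i] + stations[i + 1:])
        for i, (x, y, _) in enumerate(stations)
    ]

def validation_metrics(pred, obs):
    """Return R2, bias, MAE, and RMSE for paired predictions and observations."""
    n = len(obs)
    mean_p, mean_o = sum(pred) / n, sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return {
        "r2": cov ** 2 / (var_p * var_o),                   # squared Pearson correlation (assumed)
        "bias": sum(o - p for p, o in zip(pred, obs)) / n,  # observed - predicted (assumed)
        "mae": sum(abs(p - o) for p, o in zip(pred, obs)) / n,
        "rmse": math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / n),
    }

# Hypothetical stations: (x in km, y in km, monthly rainfall in mm).
stations = [(0.0, 0.0, 120.0), (5.0, 0.0, 80.0), (0.0, 5.0, 200.0), (5.0, 5.0, 150.0)]
predicted = loocv(stations)
stats = validation_metrics(predicted, [v for _, _, v in stations])
```

Because each prediction is made without the station being tested, the resulting statistics describe how well the method reproduces withheld observations and not how closely the final map fits the data used to build it.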
4. Validation results
a. Error metrics all islands
An example of the mapping algorithm output for a single month in Honolulu County is shown in Fig. 5. In this example the final map is produced with the tier-2, Free-Sill parameterization scheme selected (Fig. 5d). Note that the final map had the lowest percent unrealistic classification, highest R2, and lowest RMSE of the four maps (overall lowest rescaled median value) in this tier. In total, the 30-yr monthly (four county) map dataset consists of 127 365 comparative point values (Fig. 6). Overall, the optimized autoKrige approach produced estimates of rainfall that had high correlation (R2 = 0.78) and low bias (34 mm; −1%) when compared with observations (Table 2). The overall positive bias (underprediction of observations) is driven primarily by the inability of the mapping algorithm to accurately predict the highest rainfall values. This underprediction is a result of a Gaussian distribution-based smoothing that occurs with the predicted values, a well-known problem with many interpolation techniques including kriging (Biau et al. 1999). The MAE and RMSE for the entire dataset are 55 mm (1.4%) and 101 mm (7.8%), respectively. The R2 values are calculated for all individual maps and the distribution of R2 values for the entire LOOCV is shown in Fig. 7. The average R2 is 0.78, and 83.4% of the maps had an R2 of 0.65 or better. Of the 1440 county maps, only 14 maps (<1%; 14/1440) fell below the R2 = 0.25 threshold.
Tier-2 rainfall maps in January 2006 for Honolulu County created with four parameterization schemes: (a) Free-All, (b) Fixed-Nugget, (c) Fixed-Range, and (d) Free-Sill. The tier-2 “medium” (50th percentile) values are used as fixed model parameters; Class is either unrealistic or satisfactory; %Unrealistic is the percent probability of the map being unrealistic. The R2 is the coefficient of determination, RMSE is the root-mean-square error (mm), and MAE is the mean absolute error (mm). The best performing map (lowest median of rescaled errors) in this instance is the Free-Sill parameterization scheme (i.e., Med 50th%tile Fixed-Nugget and Fixed-Range).
Point density validation plot of LOOCV results for all stations/months (n = 127 365) with the magenta box in the LR of the plot representing the 75% of the station observations and black and white diagonal lines are the 1:1 line representing a perfect fit; R2 is the coefficient of determination.
Histogram of R2 values calculated from LOOCV results for all counties and month-years. The red dashed line represents the mean R2 value (0.771), and blue dotted lines represent ±1 standard deviation (0.158).
LOOCV error metrics by county and for all counties, where R2 is the coefficient of determination, Bias is the mean bias error (mm; %), MAE is the mean absolute error (mm; %), MED is the median absolute error (mm; %), and RMSE (mm; %) is the root-mean-square error.
b. Error metrics by county
LOOCV statistics are also evaluated for each county (Fig. 8). In general, results were consistent across the four counties, with mean absolute percent error (MAPE) ranging from 1.1% to 1.8%. Honolulu County had the lowest MAPE (1.1%), and Maui County had the highest (1.8%). An overall underestimation of rainfall (from −1% to −0.7%) occurs across the four counties. LOOCV is commonly used for assessing the quality of an interpolation method (Brinckmann et al. 2016; Longman et al. 2019); however, it is not without its shortcomings for the error prediction. LOOCV commonly overestimates errors at a location on the map being evaluated due to the fact that predicted values are derived at a point where data actually exist. In addition, the interpolated surface can be altered as a result of removing a point at the location that is being cross validated, especially when the data are sparse (Jeffrey et al. 2001). This has the greatest impact in the areas with lowest station density and results in higher LOOCV error values because climate stations in Hawai‘i are irregularly spaced and sparse in many regions.
As in Fig. 6, but for individual counties: (a) Hawai‘i County (n = 38 785), (b) Maui County (n = 34 670), (c) Honolulu County (n = 39 046), and (d) Kaua‘i County (n = 14 864).
LOOCV results are compared between the autoKrige default parameter runs and the optimized parameter runs for all counties. In general, optimization reduces the map error; however, differences between the runs are not statistically significant. This highlights a key finding: error metrics should not be the only source of information used to assess the quality of the final map product. This is clearly illustrated in Fig. 5, where, in general, error metrics are similar for all four of the maps, but a visual inspection of the maps leaves no doubt that the spatial pattern produced with the default parameterization scheme (Fig. 5a) is not realistic. In this figure, the Fixed-Range run (Fig. 5c) produced a slightly lower MAE and RMSE than the Free-Sill run (Fig. 5d); however, the Free-Sill optimization was chosen because this map was classified as satisfactory. The “% unrealistic” metric is used here to select the map produced using the Free-Sill parameterization scheme because this is the only map that produced a satisfactory classification. This example highlights how the machine learning classification is utilized to select the map with the most realistic rainfall surface.
c. Temporal variations in map error metrics
To determine the consistency of the maps through time, all R2 values are grouped by year and plotted over the 30-yr period of record (Fig. 9). While some years had fewer outliers than others, no discontinuities are apparent in the time series. The 30-yr median and mean R2 values across all month-year county maps are 0.815 and 0.775, respectively.
Box plots of annual R2 values over time (1990–2019) for all county maps (each box contains results from 48 maps and black bar in each box is the median). The blue dashed line represents overall R2 value from all stations and month-years (0.771).
Validation metrics, as well as metadata on the number of stations used and final variogram selection information, are contained in the text-based metadata file for each month-year map. Examples of final rainfall maps for each county are shown in Fig. 10.
Final county rainfall maps for a sample month-year (August 2018): (a) Kaua‘i County, (b) Honolulu County, (c) Maui County, and (d) Hawai‘i County.
5. Summary and usage
The purpose of this research effort is to optimize an automated kriging algorithm to interpolate monthly rainfall in a way that produces realistic rainfall patterns over the complex topography of Hawai‘i. The optimization techniques utilized in this study include methods for identifying the most appropriate constant value to use when log-transforming data, choosing the highest quality station data to use in the interpolation, detecting erroneous maps using a machine learning approach, and establishing the most appropriate parameterization scheme for the interpolation model. While the default parameters in the autoKrige function may produce gridded estimates that validate well, this does not necessarily mean that a realistic rainfall surface will be produced. The incorporation of a low-level machine learning algorithm trained to evaluate and classify an unrealistic map output is therefore a critical step in generating a realistic pattern of rainfall. The tiered process allows for the flexibility of selecting a map that does not have the unrealistic features that would otherwise have been produced using the autoKrige default parameterization scheme.
From a technical standpoint, this optimized kriging approach can reproduce monthly estimates of rainfall in a realistic manner but underestimates the highest rainfall observations due to well-known smoothing effects associated with this interpolation technique. In addition, areas with sparse station density and complex topographical features that are not considered in the geostatistical modeling introduce errors into the final data product. The integration of long-term mean climate maps with point observations helps to improve the accuracy of the monthly maps, as the climatologies capture general geographical features that influence rainfall such as the varied terrain, proximity to the coast, exposure to the prevailing winds, and the influence of the trade wind inversion. While the automated approach described here does not explicitly use elevation as a covariate, consistent, elevation-dependent orographic rainfall patterns are captured when interpolated anomalies are combined with long-term estimates. The tiered parameterization approach used here for variogram parameter selection is effective at eliminating the occurrence of unrealistic rainfall surfaces produced when using default parameterization in the autoKrige algorithm. Given the known best practices for mapping rainfall in Hawai‘i, the authors are confident in the selection of this approach and pleased with the quality of the results.
This dataset will serve as a valuable resource to the many researchers who use gridded rainfall estimates for analyses of spatial trends and patterns over time (e.g., Frazier and Giambelluca 2017; Lucas et al. 2020) and during extreme weather events (e.g., Nugent et al. 2020; Longman et al. 2021). Finally, data are being compiled and infrastructure is being developed to update this gridded dataset to present day and produce near-real-time monthly rainfall maps of Hawai‘i using these same methods. This approach demonstrates how, with a moderate amount of data, a low-level machine learning algorithm can be trained to identify unrealistic interpolated rainfall patterns.
Acknowledgments.
We thank Mike Nullet and Gwen Jacobs (University of Hawai‘i at Mānoa), Kevin Kodama (National Weather Service), and the USGS Pacific Islands Climate Adaptation Science Center. We also acknowledge NSF Jetstream resources for supporting the development VM infrastructure (Towns et al. 2014; Stewart et al. 2015). This research was funded by the Hawai‘i EPSCoR Program, a National Science Foundation Research Infrastructure Improvement (RII) Track-1: ‘Ike Wai: Securing Hawai‘i’s Water Future Award OIA-1557349. The technical support and advanced computing resources from University of Hawai‘i Information Technology Services—Cyberinfrastructure, funded in part by the National Science Foundation MRI Award CNS-1920304, are gratefully acknowledged.
Data availability statement.
Monthly rainfall maps have been published on Hydroshare (Tarboton et al. 2014), CUAHSI’s online collaboration environment for sharing data, models, and code (https://doi.org/10.4211/hs.2275657d62794c2294553919fa94b68d; Lucas et al. 2021) and are also available for visualization and download through an interactive discovery, mapping, and decision support interface, the HCDP (http://www.hawaii.edu/climate-data-portal; McLean et al. 2020, 2022). Included in this dataset are the following month-year statewide files for rainfall: kriging input files that contain station rainfall, station rainfall transformations, station transformed anomaly, and denotation of inclusion in the per-county kriging process; statewide gridded rainfall; statewide standard error; statewide gridded rainfall anomaly; statewide gridded rainfall anomaly standard errors; and statewide metadata that contain per-county as well as statewide cross-validation statistics, station counts, a list of station locations, and a readable data quality statement. Although the data have been subjected to review, UH reserves the right to update the datasets as needed pursuant to further analysis and review. No warranty, expressed or implied, is made by UH as to the accuracy of the dataset and related material, nor shall the fact of release constitute any such warranty. Furthermore, the data are released on condition that UH shall not be held liable for any damages resulting from its authorized or unauthorized use. All content and results are in the public domain and may be used freely, with appropriate credit given.
REFERENCES
Berezowski, T., M. Szczeniak, I. Kardel, R. Michalowski, T. Okruszko, A. Mezghani, and M. Piniewski, 2016: CPLFD-GDPT5: High-resolution gridded daily precipitation and temperature data set for two largest Polish river basins. Earth Syst. Sci. Data, 8, 127–139, https://doi.org/10.5194/essd-8-127-2016.
Biau, G., E. Zorita, H. Von Storch, and H. Wackernagel, 1999: Estimation of precipitation by kriging in the EOF space of the sea level pressure field. J. Climate, 12, 1070–1085, https://doi.org/10.1175/1520-0442(1999)012<1070:EOPBKI>2.0.CO;2.
Bostan, P. A., G. B. M. Heuvelink, and S. Z. Akyurek, 2012: Comparison of regression and kriging techniques for mapping the average annual precipitation of Turkey. Int. J. Appl. Earth Obs. Geoinf., 19, 115–126, https://doi.org/10.1016/j.jag.2012.04.010.
Breiman, L., 2001: Random forests. Mach. Learn., 45, 5–32, https://doi.org/10.1023/A:1010933404324.
Brinckmann, S., S. Krähenmann, and P. Bissolli, 2016: High-resolution daily gridded data sets of air temperature and wind speed for Europe. Earth Syst. Sci. Data, 8, 491–516, https://doi.org/10.5194/essd-8-491-2016.
Buytaert, W., R. Celleri, P. Willems, B. De Bièvre, and G. Wyseure, 2006: Spatial and temporal rainfall variability in mountainous areas: A case study from the south Ecuadorian Andes. J. Hydrol., 329, 413–421, https://doi.org/10.1016/j.jhydrol.2006.02.031.
Daly, C., R. P. Neilson, and D. L. Phillips, 1994: A statistical-topographic model for mapping climatological precipitation over mountainous terrain. J. Appl. Meteor., 33, 140–158, https://doi.org/10.1175/1520-0450(1994)033<0140:ASTMFM>2.0.CO;2.
Dawdy, D. R., and W. B. Langbein, 1960: Mapping mean areal precipitation. Int. Assoc. Sci. Hydrol. Bull., 5, 16–23, https://doi.org/10.1080/02626666009493176.
Deutsch, A., and A. G. Journel, 1992: Geostatistical Software Library and User’s Guide. Oxford University Press, 340 pp.
Ekwaru, J. P., and P. J. Veugelers, 2018: The overlooked importance of constants added in log transformation of independent variables with zero values: A proposed approach for determining an optimal constant. Stat. Biopharm. Res., 10, 26–29, https://doi.org/10.1080/19466315.2017.1369900.
Frauendorf, T. C., R. A. MacKenzie, R. W. Tingley, A. G. Frazier, M. H. Riney, and R. W. El-Sabaawi, 2019: Evaluating ecosystem effects of climate change on tropical island streams using high spatial and temporal resolution sampling regimes. Global Change Biol., 25, 1344–1357, https://doi.org/10.1111/gcb.14584.
Frazier, A. G., and T. W. Giambelluca, 2017: Spatial trend analysis of Hawaiian rainfall from 1920 to 2012. Int. J. Climatol., 37, 2522–2531, https://doi.org/10.1002/joc.4862.
Frazier, A. G., T. W. Giambelluca, H. F. Diaz, and H. L. Needham, 2016: Comparison of geostatistical approaches to spatially interpolate month-year rainfall for the Hawaiian Islands. Int. J. Climatol., 36, 1459–1470, https://doi.org/10.1002/joc.4437.
Frazier, A. G., O. Elison Timm, T. W. Giambelluca, and H. F. Diaz, 2018: The influence of ENSO, PDO and PNA on secular rainfall variations in Hawai‘i. Climate Dyn., 51, 2127–2140, https://doi.org/10.1007/s00382-017-4003-4.
Giambelluca, T. W., Q. Chen, A. G. Frazier, J. P. Price, Y. L. Chen, P. S. Chu, J. K. Eischeid, and D. M. Delparte, 2013: Online rainfall atlas of Hawai‘i. Bull. Amer. Meteor. Soc., 94, 312–316, https://doi.org/10.1175/BAMS-D-11-00228.1.
Goovaerts, P., 2000: Geostatistical approaches for incorporating elevation into the spatial interpolation of rainfall. J. Hydrol., 228, 113–129, https://doi.org/10.1016/S0022-1694(00)00144-X.
Hattermann, F. F., M. Wattenbach, V. Krysanova, and F. Wechsung, 2005: Runoff simulations on the macroscale with the ecohydrological model SWIM in the Elbe catchment – Validation and uncertainty analysis. Hydrol. Processes, 19, 693–714, https://doi.org/10.1002/hyp.5625.
Hengl, T., G. B. M. Heuvelink, and D. G. Rossiter, 2007: About regression-kriging: From equations to case studies. Comput. Geosci., 33, 1301–1315, https://doi.org/10.1016/j.cageo.2007.05.001.
Hiemstra, P., 2015: Package “automap.” R package version 1.0-14, 15 pp., https://cran.r-project.org/web/packages/automap/automap.pdf.
Jeffrey, S. J., J. O. Carter, K. B. Moodie, and A. R. Beswick, 2001: Using spatial interpolation to construct a comprehensive archive of Australian climate data. Environ. Modell. Software, 16, 309–330, https://doi.org/10.1016/S1364-8152(01)00008-1.
Kitanidis, P. K., 1997: Introduction to Geostatistics. Cambridge University Press, 36 pp.
Krushelnycky, P. D., F. Starr, K. Starr, R. J. Longman, A. G. Frazier, L. L. Loope, and T. W. Giambelluca, 2016: Change in trade wind inversion frequency implicated in the decline of an alpine plant. Climate Change Responses, 3, 1, https://doi.org/10.1186/s40665-016-0015-2.
Longman, R. J., and Coauthors, 2018: Compilation of climate data from heterogeneous networks across the Hawaiian Islands. Sci. Data, 5, 180012, https://doi.org/10.1038/sdata.2018.12.
Longman, R. J., and Coauthors, 2019: High-resolution gridded daily rainfall and temperature for the Hawaiian Islands (1990–2014). J. Hydrometeor., 20, 489–508, https://doi.org/10.1175/JHM-D-18-0112.1.
Longman, R. J., A. J. Newman, T. W. Giambelluca, and M. Lucas, 2020: Characterizing the uncertainty and assessing the value of gap-filled daily rainfall data in Hawaii. J. Appl. Meteor. Climatol., 59, 1261–1276, https://doi.org/10.1175/JAMC-D-20-0007.1.
Longman, R. J., O. E. Timm, T. W. Giambelluca, and L. Kaiser, 2021: A 20-year analysis of disturbance-driven rainfall on O‘ahu, Hawai‘i. Mon. Wea. Rev., 149, 1767–1783, https://doi.org/10.1175/MWR-D-20-0287.1.
Lucas, M. P., C. Trauernicht, A. G. Frazier, and T. Miura, 2020: Long-term, gridded standardized precipitation index for Hawai‘i. Data, 5, 109, https://doi.org/10.3390/data5040109.
Lucas, M. P., R. J. Longman, T. W. Giambelluca, A. G. Frazier, J. Mclean, S. B. Cleveland, Y. Haung, and J. Lee, 2021: Hawaii 1990–2019 gridded monthly rainfall mm. Hydroshare, accessed 5 December 2021, https://doi.org/10.4211/hs.2275657d62794c2294553919fa94b68d.
Mair, A., and A. Fares, 2011: Comparison of rainfall interpolation methods in a mountainous region of a tropical island. J. Hydrol. Eng., 16, 371–383, https://doi.org/10.1061/(ASCE)HE.1943-5584.0000330.
Mair, A., A. G. Johnson, K. Rotzoll, and D. S. Oki, 2019: Estimated groundwater recharge from a water-budget model incorporating selected climate projections, Island of Maui, Hawai‘i. Scientific Investigations Rep. 2019-5064, 46 pp., https://doi.org/10.3133/sir20195064.
McLean, J. H., S. B. Cleveland, M. P. Lucas, R. J. Longman, T. W. Giambelluca, J. Leigh, and G. A. Jacobs, 2020: The Hawai‘i Rainfall Analysis and Mapping Application (HI-RAMA): Decision support and data visualization for statewide rainfall data. PEARC ’20: Practice and Experience in Advanced Research Computing, Portland, OR, SIGAPP/SIGHPC, 239–245, https://doi.org/10.1145/3311790.3396668.
McLean, J. H., S. B. Cleveland, M. Dodge II, M. P. Lucas, R. J. Longman, T. W. Giambelluca, and G. A. Jacobs, 2022: Building a portal for climate data—Mapping automation, visualization, and dissemination. Concurrency Comput. Pract. Exp., https://doi.org/10.1002/cpe.6727, in press.
Moral, F. J., 2010: Comparison of different geostatistical approaches to map climate variables: Application to precipitation. Int. J. Climatol., 30, 620–631, https://doi.org/10.1002/joc.1913.
New, M., M. Hulme, and P. Jones, 2000: Representing twentieth-century space-time climate variability. Part II: Development of 1901–96 monthly grids of terrestrial surface climate. J. Climate, 13, 2217–2238, https://doi.org/10.1175/1520-0442(2000)013<2217:RTCSTC>2.0.CO;2.
Newman, A. J., M. P. Clark, R. J. Longman, and T. W. Giambelluca, 2019a: Methodological intercomparisons of station-based gridded meteorological products: Utility, limitations, and paths forward. J. Hydrometeor., 20, 531–547, https://doi.org/10.1175/JHM-D-18-0114.1.
Newman, A. J., M. P. Clark, R. J. Longman, E. Gilleland, T. W. Giambelluca, and J. R. Arnold, 2019b: Use of daily station observations to produce high-resolution gridded probabilistic precipitation and temperature time series for the Hawaiian Islands. J. Hydrometeor., 20, 509–529, https://doi.org/10.1175/JHM-D-18-0113.1.
Nugent, A. D., R. J. Longman, C. Trauernicht, M. P. Lucas, H. F. Diaz, and T. W. Giambelluca, 2020: Fire and rain: The legacy of Hurricane Lane in Hawai‘i. Bull. Amer. Meteor. Soc., 101, E954–E967, https://doi.org/10.1175/BAMS-D-19-0104.1.
R Core Team, 2018: R: A language and environment for statistical computing. R Foundation for Statistical Computing, https://www.R-project.org/.
Stein, M. L., 1999: Interpolation of Spatial Data: Some Theory for Kriging. Springer, 93 pp.
Stewart, C. A., and Coauthors, 2015: Jetstream: A self-provisioned, scalable science and engineering cloud environment. Proc. of the 2015 XSEDE Conf.: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, St. Louis, MO, XSEDE, 29, https://doi.org/10.1145/2792745.2792774.
Tarboton, D. G., and Coauthors, 2014: HydroShare: Advancing collaboration through hydrologic data and model sharing. Seventh Int. Congress on Environmental Modelling and Software, San Diego, CA, International Environmental Modelling and Software Society, 29 pp., https://scholarsarchive.byu.edu/iemssconference.
Towns, J., and Coauthors, 2014: XSEDE: Accelerating scientific discovery. Comput. Sci. Eng., 16, 62–74, https://doi.org/10.1109/MCSE.2014.80.
Webster, R., and M. A. Oliver, 2001: Geostatistics for environmental scientists. Statistics in Practice, 2nd ed. John Wiley & Sons, 300 pp.
Willmott, C. J., and S. M. Robeson, 1995: Climatologically aided interpolation (CAI) of terrestrial air temperature. Int. J. Climatol., 15, 221–229, https://doi.org/10.1002/joc.3370150207.