Evaluating the benefits of merging near-real-time satellite precipitation products: A case study in the Kinu basin region, Japan

After the launch of the Global Precipitation Measurement (GPM) mission in 2014, many satellite precipitation products (SPPs) are available at finer spatiotemporal resolution and/or with reduced latency, potentially increasing the applicability of SPPs for near-real-time (NRT) applications. Therefore, there is a need to evaluate the NRT SPPs in the GPM era and investigate whether bias-correction techniques or merging of the individual products can increase the accuracy of these SPPs for NRT applications. This study utilizes five commonly used NRT SPPs, namely, CMORPH RT, GSMaP NRT, IMERG EARLY, IMERG LATE, and PERSIANN-CCS. The evaluation is done for the Kinu basin region in Japan, an area that provides observed rainfall data with high accuracy in space and time. The selected bias correction techniques are the ratio bias correction and cumulative distribution function matching, while the merged products are derived with the error variance, inverse error variance weighting, and simple average merging techniques. Based on the results, all SPPs perform best for lower-intensity rainfall events and have challenges in providing accurate estimates for typhoon-induced rainfall (generally more than 50% underestimation) and at very fine temporal scales. Although the bias correction techniques successfully reduce the bias and improve the performance of the SPPs for coarse temporal scales, it is found that for temporal resolutions finer than 6-hourly, both techniques are in general unable to bring improvements. Finally, the merging results in increased accuracy for all temporal scales, giving new perspectives in utilizing SPPs for NRT applications, such as flood and drought monitoring and early warning systems.


Introduction
Precipitation is a major component of the global water cycle and the main forcing in hydrological processes. Its accurate estimation in space and time is of immense importance for decision-making and planning for a broad range of applications. Lately, due to the limited availability of adequate ground-based observations in many areas and the advances in remote sensing, there is an increasing interest in satellite precipitation products (SPPs). These products have near-global coverage, are freely available, and provide rainfall estimates at reasonably fine spatial and temporal resolution. There exists an extensive literature related to the evaluation of these products and/or the possibility of using them in hydrological applications in different areas, as, for example, for catchments in Asia (Xue et al. 2013; Long et al. 2016; Kim et al. 2017), South America (Collischonn et al. 2008; Dinku et al. 2010), North America (Yilmaz et al. 2005), Africa (Stisen and Sandholt 2010), Australia (Woldemeskel et al. 2013), and Europe (Lo Conti et al. 2014; Duan et al. 2016).
Research has shown that SPPs come with limitations and that their performance varies across different areas. For example, Hughes (2006), Brown (2006), and Asadullah et al. (2008) found that the Precipitation Estimation from Remotely Sensed Information Using Artificial Neural Networks (PERSIANN), a commonly used SPP, overestimated the rainfall in South Africa, the Indian subcontinent, and the high elevations of Uganda, whereas according to Hirpa et al. (2010) it severely underestimated the rainfall at high altitudes in Ethiopia. Xie and Xiong (2011) argued that all SPPs have spatially varying, temporally changing, and range-dependent biases. Many studies have applied bias correction to improve the quality of the products. Such techniques include the mean field bias (Smith and Krajewski 1991; Borga et al. 2002) and the ratio bias correction (Arias-Hidalgo et al. 2013). Probabilistic methods such as quantile mapping are also used (Xie and Xiong 2011).
Nevertheless, errors have still been found after bias correction. These errors are associated with limitations of the satellite sensors, the processing algorithms, and the selected bias-correction techniques (e.g., Madadgar et al. 2014; Xie et al. 2017). Merging the individual SPPs, however, may produce a dataset with fewer errors than the original products, as has already been identified not only for SPPs (Shen et al. 2014; Khairul et al. 2018), but also for rainfall estimates in general (Beck et al. 2017). A variety of merging techniques is available, each with its own merits (Hasan et al. 2016).
Despite the availability of near-real-time (NRT) SPPs, most of the research has focused on the gauge-corrected SPPs that have prolonged latency periods. In contrast, the information provided by NRT SPPs has a latency from a couple of hours to a couple of days and can be very useful for applications such as early warning systems. NRT SPPs have already been used for assessment of the flood extent and consequent early warning systems, as, for example, for the Benue River in Nigeria, for the city of Riyadh in Saudi Arabia (Tekeli and Fouli 2016), for the Awash River in Ethiopia (Koriche and Rientjes 2016), as well as at the global scale (Wu et al. 2014).
After the launch of the Global Precipitation Measurement (GPM) mission in February 2014, some of the NRT SPPs make use of the new satellites and have improved algorithms, resulting in potentially increased accuracy and greater applicability to NRT operations. Therefore, there is a need to evaluate the NRT SPPs in the GPM era, identify their errors, and quantify their accuracy at different rainfall intensities. Moreover, it is necessary to investigate whether bias-correction techniques and merging of the individual sets can increase the accuracy for NRT applications and improve the rainfall estimates, as has already been found for other rainfall products (e.g., Adler et al. 1993; Mitra et al. 2003; Xie and Xiong 2011; Shen et al. 2014; Khairul et al. 2018). This is all the more crucial because the majority of the published works analyze daily up to monthly scales (Nikolopoulos et al. 2013), and thus the merits and limitations of the techniques are not well recorded for fine temporal scales. This is the main motivation of the current study.
In this work five NRT SPPs (Table 1) are evaluated: 1) Climate Prediction Center morphing technique in real time (CMORPH RT); 2) Global Satellite Mapping of Precipitation in near-real time (GSMaP NRT); 3) Integrated Multi-Satellite Retrievals for GPM (IMERG) EARLY; 4) IMERG LATE; and 5) PERSIANN-Cloud Classification System (PERSIANN-CCS). The products are referred to as CMORPH, GSMaP, EARLY, LATE, and PERSIANN, respectively (the bold part of the full names in Table 1). The SPPs are evaluated through a case study in the Kinu region in Japan for the period February 2015 to December 2016, a period that is common for all selected SPPs (the period February-December 2015 is referred to as year 2015 in the rest of this study).

a. Study area
The Kinu basin (1761 km²) and its surrounding area that is used for this study (15 116 km²) lie between 35°42′ and 37°12′N and between 139°06′ and 140°18′E on the Honshu island of Japan (Fig. 1). The Kinu River has a length of about 177 km and is the longest tributary of the Tone River (Niroshinie et al. 2016). The variability in the topographic features of the area results in significant differentiation of the meteorological characteristics. In the headwaters of the catchment the elevation is more than 2000 m and the rainfall varies between 1600 and 2100 mm, while downstream the topography is quite flat with elevation below 200 m and rainfall around 1400 mm (Yasuda et al. 2016). There are substantial seasonal differences in rainfall between the cloudy and rainy period of June-July (plum rain season, called baiu in Japanese; Taniguchi 2016), the wet season (July-September) when typhoons occur, and the remaining months of the year.
Factors influencing the selection of this area include the density of the available rain gauge network and the importance of the basin to the region. The river originates in the Nikko area, whose shrines and temples are listed as UNESCO heritage monuments, while the basin has a total population of about 550 000 people, with many cities serving as commuter locations feeding Tokyo. Another important reason for selecting this basin is the occurrence of three severe typhoon events (Etau, Nangka, and Mindulle) within the study period, which provides the opportunity to analyze the accuracy of the SPPs at high rainfall intensities. In particular, Typhoon Etau brought an accumulated rainfall of up to 600 mm locally in less than 48 h, leading to severe flooding and substantial damage, with more than 7000 affected buildings (Yasuda et al. 2016). Finally, there was an interest in exploring the applicability of SPPs in small basins.
b. Data used

1) RAIN GAUGE OBSERVATIONS
The area has an extensive network of rain gauges operated by two authorities [Automated Meteorological Data Acquisition System (JMA 2019) and Hydrological and Water Quality Database (Water Information System 2019)] with a total of 199 gauges. Although the gauges record every 10 min, the available data are preprocessed and have hourly resolution. After the necessary quality control (maximum of 20% missing data allowed), 137 gauges are used (Fig. 1), with each point of the comparison area being less than 30 km from at least one rain gauge.

2) SATELLITE PRECIPITATION PRODUCTS
There are two categories of meteorological satellites that provide data for rainfall estimates: the low-Earth-orbiting (LEO) satellites that mainly use passive microwave (PMW) sensors and the geostationary (GEO) satellites that mainly use infrared (IR) sensors. The SPPs use data from one or more of these types of satellites.
CMORPH-RT was developed by the National Oceanic and Atmospheric Administration (NOAA; Joyce et al. 2004). It provides 30-min mean precipitation on a 0.0727° × 0.0727° grid over the globe (60°S-60°N) with 2-h latency. The data have been available since the end of January 2015, while from November 2017 onward the data come with 3 h of latency and improved algorithms. CMORPH uses the rainfall estimates derived from the LEO-PMW sensors and propagates them in time using information provided by the GEO-IR sensors.
GSMaP NRT is provided by the Japan Aerospace Exploration Agency (JAXA; Kubota et al. 2007). It gives precipitation information for all longitudes between 60°N and 60°S from October 2008 and has a spatial and temporal resolution of 0.10° × 0.10° and 1 h, respectively. The latency of the product is 4 h. It should be noted that the GSMaP NOW version (0-h latency) was undergoing an update for data correction at the time of this study, and thus it was not used. The GSMaP algorithm combines GEO-IR and LEO-PMW data and employs the Kalman filter to update the rain rate produced by the forward propagation of the precipitation obtained from the microwave sensors (Ushio et al. 2009). Version 6 of the product is used for this study.
The two IMERG products used in this study are provided by NASA (Huffman et al. 2015, 2017), as part of the GPM mission, a joint project of NASA and JAXA. The products come at a 30-min temporal resolution and a grid of 0.10° × 0.10° covering the globe (60°S-60°N). Version 4 (V4A, latest available at the time of the analysis) is used, which provides data from the end of December 2014 onward. EARLY has ~5-h latency, and LATE has 15-h latency (in the latest versions EARLY and LATE have 4- and 12-h latency, respectively). The products use a unified algorithm; the data are derived from a constellation of LEO satellites and are merged and intercalibrated, while data obtained from the GEO satellites are used for rainfall propagation. These products come with a range of embedded precipitation subsets: the subset used in this study is the recommended one, multisatellite precipitation estimate with gauge calibration (precipitationCal).
PERSIANN-CCS was developed at the University of California, Irvine (Hong et al. 2004). It provides hourly precipitation data on a 0.04° × 0.04° resolution grid with global coverage (60°S-60°N), a latency of 1 h, and temporal coverage from January 2003. An algorithm extracts cloud features from GEO-IR data, which are subsequently used to provide rainfall estimates by means of an artificial neural network.

Methodology
The methodology consists of three steps and is illustrated in Fig. 2. For steps 2 and 3 the data are divided into calibration (year 2015) and validation (year 2016) sets.

a. Step 1: Initial data processing
The selected SPPs have different temporal and spatial resolutions. Some products have subhourly resolution and are aggregated to hourly. Moreover, when necessary, SPPs are resampled from their original resolution to 0.10° × 0.10°. In particular, the resampling is done on PERSIANN and CMORPH because their resolution is 0.04° × 0.04° and ~0.07° × 0.07°, respectively. This resampling is done by area-weighted averaging of the old grid cells (e.g., 0.04° × 0.04°) to the 0.10° × 0.10° resolution. The gauges are not used during the resampling of the SPPs. The resolution 0.10° × 0.10° is selected because it is the finest resolution to which all five SPPs can be resampled without introducing additional uncertainty and possible errors by downscaling any of the products (Table 1). The few missing values (less than 2.3% for each grid cell for all SPPs) are infilled by linear interpolation for up to 3-hourly gaps, and with 0 mm for longer gaps, since this value corresponds to the vast majority of the rainfall data; note that the gauge-derived accumulated rainfall at the Kinu basin during the gaps of the SPPs was less than 1.5% of the total rainfall. This filling technique is selected due to the very fine temporal resolution of the SPPs. Also, spatial interpolation would introduce more uncertainty, because most of the missing values are clustered in nearby grid cells. Finally, the time zone of all SPPs is shifted from UTC to the Kinu basin's local time (UTC + 9 h).
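The gap-filling rule described above (linear interpolation for short gaps, 0 mm for longer ones) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `fill_gaps` is hypothetical, and it assumes that an "up to 3-hourly gap" means a run of at most three consecutive missing hourly values.

```python
import numpy as np
import pandas as pd

def fill_gaps(series: pd.Series, max_gap: int = 3) -> pd.Series:
    """Fill NaN runs in an hourly rainfall series (hypothetical helper).

    Runs of up to `max_gap` missing steps are linearly interpolated;
    longer runs (and gaps at the series edges) are set to 0 mm.
    """
    is_nan = series.isna()
    # Label each run of consecutive missing / non-missing values.
    run_id = is_nan.astype(int).diff().ne(0).cumsum()
    # Length of the missing run each time step belongs to (0 if not missing).
    run_len = is_nan.astype(int).groupby(run_id).transform("sum")
    short_gap = is_nan & (run_len <= max_gap)
    interpolated = series.interpolate(method="linear", limit_area="inside")
    # Use interpolated values only inside short gaps, then zero-fill the rest.
    filled = series.where(~short_gap, interpolated)
    return filled.fillna(0.0)

hourly = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, 2.0])
print(fill_gaps(hourly).tolist())
```

In this example the single missing hour is interpolated, while the four-hour gap exceeds the limit and is filled with zeros.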
The missing values in the rain gauge measurements (the mean percentage of missing data across all gauges is less than 1%) are also infilled. The inverse distance weighting (IDW) interpolation method (Xu et al. 2015) is selected for this purpose because of the dense gauge network and the fine temporal resolution of the data. Finally, the ordinary kriging method is used for creating a gridded dataset from the gauge measurements (on the same 0.10° × 0.10° resolution grid as the SPPs).
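For readers unfamiliar with IDW, a minimal sketch of the interpolation is given below. The function name `idw` and the distance exponent of 2 are assumptions made for illustration; the text does not state which exponent or neighbourhood was used.

```python
import numpy as np

def idw(xy_known, values, xy_target, power=2.0):
    """Inverse distance weighting (hypothetical helper, exponent assumed).

    xy_known:  (m, 2) coordinates of points with data
    values:    (m,)   values at those points
    xy_target: (k, 2) coordinates where estimates are needed
    """
    xy_known = np.asarray(xy_known, dtype=float)
    values = np.asarray(values, dtype=float)
    xy_target = np.asarray(xy_target, dtype=float)
    # Pairwise distances between every target and every known point.
    dist = np.linalg.norm(xy_target[:, None, :] - xy_known[None, :, :], axis=2)
    # Avoid division by zero when a target coincides with a known point.
    dist = np.maximum(dist, 1e-12)
    weights = 1.0 / dist**power
    return (weights @ values) / weights.sum(axis=1)

gauges = [[0.0, 0.0], [2.0, 0.0]]
rain = [1.0, 3.0]
print(idw(gauges, rain, [[1.0, 0.0]]))  # midpoint between the two gauges
```

A target halfway between two gauges receives their mean, and a target located at a gauge reproduces (to numerical precision) that gauge's value.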

b. Step 2: Bias correction
For this study the selected bias correction techniques are the ratio bias correction (RBC) and the cumulative distribution function (CDF) matching. The SPPs have varying errors for different types of rainfall events, intensities, seasons, and areas (Dinku et al. 2008; Xie et al. 2011; Gao and Liu 2013); thus, both techniques are applied at the gauge locations and interpolated to the 0.10° × 0.10° resolution grid. The corresponding rainfall of each SPP at each gauge location is derived from the original spatial resolution of the product before any necessary upscaling (e.g., CMORPH, PERSIANN). Moreover, a temporal classification is applied since the precipitation in the study area is seasonal. The data are divided into rainy/wet (June-September) and nonrainy/dry (October-May) seasons. The record length is not sufficient to consider finer temporal scales for calculating the bias correction factors. If a long period of data (rain gauge and SPPs) is available, the monthly scale should be considered (Arias-Hidalgo et al. 2013).

1) RATIO BIAS CORRECTION
The RBC technique for improving the SPPs has been widely used due to its simplicity and good performance (e.g., Adler et al. 2000; Arias-Hidalgo et al. 2013; Bhatti et al. 2016). The particular steps followed are presented below:
(i) Calculation of the correction factor $f_{i,e}$ for each gauge i and season e, by dividing the accumulated rainfall of the gauge by the accumulated rainfall of the corresponding grid cell for each SPP:
$$f_{i,e} = \frac{\sum_{1}^{N} P^{g}_{i,e}}{\sum_{1}^{N} P^{s}_{i,e}} \quad (1)$$
(ii) Interpolation of the correction factors with the IDW technique to create the final gridded correction factor $f_{j,e}$.
(iii) Multiplication of the original SPP at each time step and grid cell by the corresponding gridded factor for the specific season, leading to the bias-corrected version of the product:
$$P^{RBC}_{j} = f_{j,e} \times P^{s}_{j,e} \quad (2)$$
In Eqs. (1) and (2), $P^{g}_{i,e}$ and $P^{s}_{i,e}$ are the hourly rainfall recorded at the rain gauge location i and the season e from the rain gauge and the corresponding grid cell of the SPP, respectively. Parameter N is the total number of rainfall measurements in season e, $P^{RBC}_{j}$ is the RBC-corrected rainfall of the selected SPP at the grid cell j, $f_{j,e}$ is the bias correction factor at the grid cell j for the season e, and $P^{s}_{j,e}$ is the original rainfall of the SPP at the grid cell j for season e.
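A minimal sketch of steps (i) and (iii) of the RBC for a single gauge and season is shown below. The function name `rbc_factor` is hypothetical, the fallback factor of 1 for a rain-free SPP record is an assumption not discussed in the text, and the IDW gridding of step (ii) is omitted for brevity.

```python
import numpy as np

def rbc_factor(gauge: np.ndarray, spp: np.ndarray) -> float:
    """Seasonal ratio factor: accumulated gauge rain / accumulated SPP rain."""
    total_spp = float(np.sum(spp))
    if total_spp == 0.0:
        # Assumption: leave the product unchanged if it recorded no rain.
        return 1.0
    return float(np.sum(gauge)) / total_spp

# Illustrative hourly values (mm) for one gauge and its SPP grid cell.
gauge = np.array([0.0, 2.0, 4.0, 0.0])
spp = np.array([0.0, 1.0, 2.0, 0.0])

factor = rbc_factor(gauge, spp)   # step (i)
corrected = factor * spp          # step (iii), applied at every time step
print(factor, corrected.sum(), gauge.sum())
```

By construction the corrected series has the same seasonal accumulation as the gauge during calibration; zero-rain time steps remain zero, which is one reason a purely multiplicative correction cannot fix missed events.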

2) CDF MATCHING
The CDF method has already been successfully implemented for various rainfall products (e.g., Huffman et al. 2004; Ines and Hansen 2006; Xie and Xiong 2011; Serrat-Capdevila et al. 2016). This technique transforms the rainfall estimates so that their CDF is similar to that of the observed data. For this study, the method is applied at an hourly time step, in order to maintain the low latency of the original products and to be able to utilize them for NRT applications. The steps are as follows:
(i) CDF matching for every gauge and corresponding grid cell at an hourly time step:
$$P^{cor}_{i} = \left(F^{g}_{i}\right)^{-1}\left[F^{s}_{i}\left(P^{s}_{i}\right)\right] \quad (3)$$
(ii) Calculation of the difference between the original and CDF-corrected SPP at each gauge location and each time step ($Dif_{i}$):
$$Dif_{i} = P^{cor}_{i} - P^{s}_{i} \quad (4)$$
(iii) IDW interpolation of the differences to the 0.10° × 0.10° resolution grid ($Dif_{j}$).
(iv) Addition of this gridded difference to each grid cell j of the original satellite data $P^{s}_{j}$, resulting in the bias-corrected version of the product $P^{CDF}_{j}$:
$$P^{CDF}_{j} = P^{s}_{j} + Dif_{j} \quad (5)$$
In Eqs. (3)-(5), $F^{s}_{i}$ is the CDF of the SPP rainfall estimation of the grid cell corresponding to the rain gauge i, $(F^{g}_{i})^{-1}$ is the inverse CDF of the rainfall measurement of the rain gauge i, and $P^{cor}_{i}$ and $P^{s}_{i}$ are the corrected and actual rainfall estimation by the SPP of the grid cell corresponding to the rain gauge i.
It should be noted that due to the extreme skewness of the data (rainfall is zero at ~90% of the time steps) and the inability to fit a theoretical distribution, the nonparametric empirical cumulative distribution and the subsequent quantile mapping are used.
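The empirical quantile mapping of step (i) can be sketched as below. This is an illustrative implementation under stated assumptions, not the authors' code: the name `cdf_match` is hypothetical, the empirical CDF is taken as the fraction of calibration values less than or equal to the estimate, and the inverse gauge CDF returns the smallest gauge value whose empirical CDF reaches that probability.

```python
import numpy as np

def cdf_match(spp_cal, gauge_cal, spp_new):
    """Map SPP values onto the gauge distribution (empirical CDFs).

    spp_cal, gauge_cal: calibration samples for one gauge / grid cell
    spp_new:            SPP values to correct
    """
    spp_sorted = np.sort(np.asarray(spp_cal, dtype=float))
    gauge_sorted = np.sort(np.asarray(gauge_cal, dtype=float))
    n_s, n_g = spp_sorted.size, gauge_sorted.size
    # Number of calibration SPP values <= each new value, i.e. n_s * F_s(x).
    count = np.searchsorted(spp_sorted, spp_new, side="right")
    # Inverse empirical gauge CDF: index ceil(F_s * n_g) - 1, in integer math.
    idx = (count * n_g + n_s - 1) // n_s - 1
    idx = np.clip(idx, 0, n_g - 1)
    return gauge_sorted[idx]

spp_cal = [0.0, 0.0, 1.0, 2.0]
gauge_cal = [0.0, 0.0, 2.0, 4.0]
corrected = cdf_match(spp_cal, gauge_cal, [0.0, 1.0, 2.0])
print(corrected, corrected - np.array([0.0, 1.0, 2.0]))  # P_cor and Dif_i
```

Values above the largest calibration estimate are mapped to the largest gauge value in this sketch; how such extrapolation was handled in the study is not stated in the text.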

c. Step 3: Merging
Following the recommendation of previous studies (Hasan et al. 2016; Khairul et al. 2018), this study evaluates the potential benefits of merging NRT SPPs. For merging, a linear combination of the SPPs is considered. LATE is excluded because its latency (15 h) is significantly larger than the highest latency of the remaining four products (~5 h for EARLY).
The merging is performed for a range of temporal scales. For merging, the optimal version of each SPP at the chosen temporal scale is selected, choosing between the original SPP, and the two bias-corrected versions. The optimal version is defined by the calibration results, based on the lowest normalized root-mean-square error (NRMSE) on the Kinu basin calculated by the corresponding grid cells, and not based on the lumped basin's rainfall statistics. This is preferred for considering in a direct way the spatial variation of the performance of the SPPs.
The variance of the errors (denoted var) is the selected indicator for the merging of the SPPs. For each SPP the error between the gauge observation and the corresponding grid cell is calculated for each time step. The var is derived for the wet and dry season at the gauges and interpolated with the IDW method to the 0.10° × 0.10° resolution grid.
It should be noted that the selection of the variance assumes that the individual products have normally distributed and unbiased errors, which is questionable for the original versions of the SPPs, increasingly so for finer time scales. Previous studies found that the bias correction techniques improve the results for daily and coarser temporal scales (e.g., Xie et al. 2011; Khairul et al. 2018). Thus, to take into account any potential limitations of the selected bias correction techniques for very fine temporal scales, the merging is performed not only at subdaily, but also at daily and coarser scales.
There are three merging techniques used in this study, namely, (i) the error variance, (ii) the inverse error variance weighting (IEVW), and (iii) the simple average (Average). The weight of each SPP for the merging of the products is calculated based on the results of the var and the selected merging techniques. The formula for calculating the final output is given in Eq. (6):
$$P^{Merg}_{j} = \sum_{k=1}^{n} W_{k,e,j} \times P_{k,e,j} \quad (6)$$
where $P^{Merg}_{j}$ is the precipitation of the merged product at the grid cell j; n is the number of satellite products used for merging (here 4); and $W_{k,e,j}$ and $P_{k,e,j}$ are, respectively, the weight and rainfall of the satellite product k for the season e and the grid cell j.

1) ERROR VARIANCE
This method was implemented by Hasan et al. (2016) for combining data from radars and rain gauges, and by Woldemeskel et al. (2013) for combining rain gauge and TRMM data. One of the assumptions of this method is that the errors of the individual products are uncorrelated.
The formula used for calculating the weight of each product is
$$W_{k,e,j} = \frac{1}{n-1} \times \frac{\sum_{k=1}^{n} var_{k,e,j} - var_{k,e,j}}{\sum_{k=1}^{n} var_{k,e,j}} \quad (7)$$
where n is the number of products used for merging, k is the satellite product for which the weight is calculated, e and j are the selected season and grid cell, respectively, and var is the variance of the estimation errors of the product k for the season e at the grid cell j.
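The error variance weighting of Eq. (7) can be sketched for a single grid cell and season as follows. The function name `error_variance_weights` and the variance values are illustrative assumptions only.

```python
import numpy as np

def error_variance_weights(var) -> np.ndarray:
    """Weights per Eq. (7): (sum of variances - own variance) / ((n - 1) * sum)."""
    var = np.asarray(var, dtype=float)
    n = var.size
    total = var.sum()
    return (total - var) / ((n - 1) * total)

# Hypothetical error variances of four products at one grid cell and season.
var = [1.0, 1.0, 2.0, 4.0]
weights = error_variance_weights(var)
print(weights, weights.sum())
```

The weights always sum to one, and a product with a larger error variance receives a smaller weight, although the penalty is milder than with inverse-variance weighting.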

2) INVERSE ERROR VARIANCE WEIGHTING
This technique has already been used in the Global Precipitation Climatology Project (GPCP) for producing a dataset of global monthly precipitation estimates (Huffman et al. 1997). The formula used for calculating the weight of each product is
$$W_{k,e,j} = \frac{1 / var_{k,e,j}}{\sum_{k=1}^{n} \left(1 / var_{k,e,j}\right)} \quad (8)$$
with the explanation of the symbols used being the same as for the error variance method [Eq. (7)].
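A minimal sketch of inverse error variance weighting followed by the linear combination of Eq. (6) is given below. It assumes the standard inverse-variance form (weights proportional to 1/var and normalised to sum to one); the function names and the numerical values are hypothetical.

```python
import numpy as np

def ievw_weights(var) -> np.ndarray:
    """Weights proportional to the inverse error variance (assumed form)."""
    inverse = 1.0 / np.asarray(var, dtype=float)
    return inverse / inverse.sum()

def merge(weights, rainfall) -> np.ndarray:
    """Linear combination of products, Eq. (6), for one grid cell.

    rainfall has shape (n_products, n_time_steps).
    """
    return np.asarray(weights, dtype=float) @ np.asarray(rainfall, dtype=float)

var = [1.0, 1.0, 2.0, 4.0]                 # hypothetical error variances
rain = np.array([[0.0, 2.0, 4.0],          # product 1 (mm per time step)
                 [0.0, 3.0, 5.0],          # product 2
                 [1.0, 1.0, 2.0],          # product 3
                 [0.0, 6.0, 9.0]])         # product 4
weights = ievw_weights(var)
print(weights, merge(weights, rain))
```

Passing equal weights of 1/n to `merge` gives the simple average, so the three merging techniques differ only in how the weights are obtained.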

3) SIMPLE AVERAGE
The weights for each product are equal and depend only on the number of products. The formula used is given in Eq. (9):
$$W_{k,e,j} = \frac{1}{n} \quad (9)$$
with n being the number of products used for merging. In the literature, it is reported that simple averaging of various forecasts for macroeconomic time series outperformed more complicated schemes of weighting of the individual datasets, a finding known as the "forecast combination puzzle" (Stock and Watson 1999). Thus, it can be assessed whether similar findings also apply to merging rainfall estimates of SPPs.

d. Comparison indicators
For the above mentioned three steps, the SPPs are compared with the gauges. The analysis is based on multiple indicators for a range of temporal (hourly up to monthly) and spatial (gauge location, grid cell, basin) scales.
The selected quantitative indicators are the correlation coefficient R [Eq. (10)], relative bias [Eq. (11)], and NRMSE [Eq. (12)]:
$$R = \frac{N \sum P^{g} P^{s} - \sum P^{g} \sum P^{s}}{\sqrt{N \sum (P^{g})^{2} - \left(\sum P^{g}\right)^{2}} \, \sqrt{N \sum (P^{s})^{2} - \left(\sum P^{s}\right)^{2}}} \quad (10)$$
$$\text{rel. bias (\%)} = \frac{\sum \left(P^{s} - P^{g}\right)}{\sum P^{g}} \times 100 \quad (11)$$
$$NRMSE = \frac{\sqrt{\frac{1}{N} \sum \left(P^{s} - P^{g}\right)^{2}}}{\overline{P^{g}}} \quad (12)$$
where $P^{g}$ and $P^{s}$ are the rainfall estimates by gauge and satellite rainfall product, respectively, at each gauge/grid cell or basin level, depending on the spatial scale, $\overline{P^{g}}$ is the average rain gauge rainfall, and N is the number of observations for the various temporal scales of the analysis. The calculations are undertaken for all time steps (unconditional) and for only the time steps where rainfall is recorded according to the gauge data (conditional). The R and NRMSE are calculated for all temporal and spatial scales, while the relative bias is calculated for each spatial scale at the hourly temporal scale since it is not affected by the different temporal resolutions (for the unconditional analysis).
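The three quantitative indicators, in both their unconditional and conditional forms, can be sketched as follows. The function names are hypothetical; the relative bias is written so that positive values indicate overestimation by the SPP, and the NRMSE is normalised by the mean gauge rainfall, in line with the definitions in the text.

```python
import numpy as np

def correlation(gauge, spp) -> float:
    """Pearson correlation coefficient R."""
    return float(np.corrcoef(gauge, spp)[0, 1])

def relative_bias(gauge, spp) -> float:
    """Relative bias in percent; positive means the SPP overestimates."""
    return float(100.0 * (np.sum(spp) - np.sum(gauge)) / np.sum(gauge))

def nrmse(gauge, spp) -> float:
    """Root-mean-square error normalised by the mean gauge rainfall."""
    return float(np.sqrt(np.mean((spp - gauge) ** 2)) / np.mean(gauge))

gauge = np.array([0.0, 2.0, 4.0, 0.0, 2.0])   # illustrative values (mm)
spp = np.array([0.5, 1.0, 2.0, 0.0, 1.0])

# Unconditional: all time steps. Conditional: only steps with gauge rainfall.
wet = gauge > 0
for label, g, s in [("unconditional", gauge, spp),
                    ("conditional", gauge[wet], spp[wet])]:
    print(label, correlation(g, s), relative_bias(g, s), nrmse(g, s))
```

In this toy series the SPP reports 0.5 mm during a dry hour, so the conditional relative bias (-50%) is more negative than the unconditional one, illustrating how spurious light rainfall can mask underestimation in unconditional statistics.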
The contingency table classifies the satellite data based on the correct and false identification of a specific rainfall threshold (Table 2). More specifically, the data are classified as hits (a), when both satellite rainfall (scheme) and gauge rainfall (observation) are above a preset threshold; false alarm (b), when scheme is above the threshold but observation not; misses (c), when scheme is below the threshold while observation is above; and correct negatives (d), when both scheme and observation are below the threshold.
The remotely sensed products are evaluated at the grid cell spatial resolution for a range of precipitation thresholds and for two temporal scales: the hourly temporal scale, with intensities of 0.5, 1, 1.5, 2, 2.5, 5, 7.5, 10, 15, 20, and 30 mm h⁻¹, and the 2-day temporal scale, with intensities of 0.5, 2, 5, 10, 15, 20, 25, 50, 75, 100, 200, and 300 mm (2 days)⁻¹. These two temporal scales are selected for analyzing the performance of the SPPs on the finest possible temporal scale for the particular case study, as well as on a coarser scale that can nevertheless be utilized for NRT applications (in medium/large catchments). To increase the sample size, especially for high rainfall events, which are rare, the data are clustered into lowlands (<750 m; 114 grid cells) and highlands (≥750 m; 38 grid cells) and each group is analyzed as one set. This results in a certain loss of information on the spatial distribution of the indicators but is used due to the limited temporal coverage of the data. Based on the contingency table (Table 2), the following indicators are calculated.
Probability of detection [POD; Eq. (13)]: It measures the fraction of the observed events that are correctly detected by the scheme. Values range from 0 to 1, with 1 being the best score:
$$POD = \frac{a}{a + c} \quad (13)$$
False alarm ratio [FAR; Eq. (14)]: It measures the fraction of the events identified by the scheme that are not observed. Values range from 0 to 1, with 0 being the best score:
$$FAR = \frac{b}{a + b} \quad (14)$$
Heidke skill score [HSS; Eq. (15)]: It measures the fractional improvement of the scheme over the correct identification of an event due to chance. Values range from −∞ to 1. Negative scores indicate that better results can be provided by chance, 0 means no skill, and 1 is the best score:
$$HSS = \frac{2 \times (a \times d - b \times c)}{(a + c) \times (c + d) + (a + b) \times (b + d)} \quad (15)$$
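The contingency-table scores can be sketched as below, using the standard definitions of POD, FAR, and HSS in terms of hits (a), false alarms (b), misses (c), and correct negatives (d). The function name `contingency_scores` and the sample values are hypothetical; events are counted when rainfall is strictly above the threshold.

```python
import numpy as np

def contingency_scores(obs, est, threshold):
    """Return (POD, FAR, HSS) for one rainfall threshold."""
    obs_event = np.asarray(obs) > threshold
    est_event = np.asarray(est) > threshold
    a = int(np.sum(obs_event & est_event))     # hits
    b = int(np.sum(~obs_event & est_event))    # false alarms
    c = int(np.sum(obs_event & ~est_event))    # misses
    d = int(np.sum(~obs_event & ~est_event))   # correct negatives
    pod = a / (a + c) if (a + c) else float("nan")
    far = b / (a + b) if (a + b) else float("nan")
    denom = (a + c) * (c + d) + (a + b) * (b + d)
    hss = 2 * (a * d - b * c) / denom if denom else float("nan")
    return pod, far, hss

gauge = [0.0, 1.0, 2.0, 3.0, 0.0, 5.0]   # illustrative hourly values (mm)
spp = [0.0, 2.0, 0.0, 4.0, 1.0, 6.0]
print(contingency_scores(gauge, spp, threshold=0.5))
```

Scores are returned as NaN when a denominator is zero, which happens at high thresholds where neither the gauges nor the product record any event; this is also why the study pools grid cells into lowland and highland groups.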

Results and discussion
It should be noted that the results of this study refer to the specific versions of the SPPs that are used for the analysis. The subsequent versions might perform differently.

a. Evaluation of SPPs
The SPPs have varying performance, with CMORPH, EARLY, and LATE underestimating the total rainfall for most of the area, while GSMaP and PERSIANN overestimate it (Fig. 3). All the SPPs also have difficulty in depicting the spatial distribution of the accumulated rainfall.
The accumulated monthly rainfall of the SPPs is also quite variable (Fig. 4). EARLY and LATE have substantial differences in their behavior between the two years. For example, whereas the rainfall is underestimated in April, May, and July 2016, it is overestimated

The SPPs with the best performance based on NRMSE and correlation are EARLY and LATE (Table 3). In general, the performance of all products increases with coarser temporal scales, as also concluded by other studies (e.g., Gaona et al. 2016). LATE outperforms EARLY, as it incorporates more information due to its extra latency. CMORPH has very good correlation and outperforms all other products for coarser than daily temporal scales. GSMaP has moderate performance and PERSIANN very low. The drop in correlation at coarse scales for some SPPs can be attributed to the limited sample size available for robust statistics at these scales. The results for conditional and unconditional analysis vary mainly for the NRMSE. As expected, the NRMSE is higher for the unconditional analysis, since the mean rainfall decreases substantially. GSMaP and PERSIANN have the best performance based on relative bias (Table 3), which is nevertheless misleading. Both products underestimate the rainfall in the wet season (Fig. 4) and in the upper parts of the basin, whereas they overestimate the rainfall in the dry season and in the lower parts of the basin, and these differences cancel each other out.
The large differences for all SPPs between the unconditional and conditional results on relative bias can be attributed to the drizzling effect and the temporal scale used to derive the indicator. Due to the drizzling effect the SPPs record very low rainfall intensities when there is no actual rainfall (Piani et al. 2010; Valdés-Pineda et al. 2016). Moreover, at the hourly scale the drizzling effect influences the results more than at coarser scales, where the conditional relative bias approaches the unconditional one (e.g., the conditional relative bias at daily scale for the GSMaP and IMERG is 10.31% and −34.70%, respectively).
Performance is also increased by aggregating in space (Table 4), which is in agreement with other studies (e.g., Bell and Kundu 2003). All SPPs apart from PERSIANN (which performs poorly in general) have increasing coefficient of determination R² from point comparison to grid cell and basin comparison (note that the basin entries are the squares of the correlation entries for 2-day unconditional analysis in Table 3).
When the SPPs are compared for the typhoon-induced rainfall events (Table 5 for all events, and Fig. 5 for the Etau event), all of them fail to capture most of the rainfall, apart from EARLY and LATE during Typhoon Nangka. Previous studies also showed that, in general, SPPs cannot capture high-intensity events accurately (e.g., Bitew and Gebremichael 2010; Nikolopoulos et al. 2013; Huang et al. 2013; Chen et al. 2014; Anjum et al. 2016).
One important reason for the underestimation seems to be the limitations of the satellites' sensors. More specifically, PMW and IR sensors struggle to depict rainfall caused by warm clouds over land (Petty and Krajewski 1996; Hobouchian et al. 2017), suggesting that typhoon-generated rainfall could be challenging. This applies especially to the Etau and Mindulle cyclones, whose centers were close to the study area, resulting in high temperatures. Moreover, LEO satellites' sensors have challenges in depicting the orographic enhancement in complex terrains (Petty and Krajewski 1996; Dinku et al. 2008; Derin and Yilmaz 2014).
One more challenge that is identified from the analysis and could influence the accuracy of the SPPs is a spatial shifting of the rainfall for all the combined PMW-IR-based SPPs that use morphing techniques based on information derived from IR sensors (all besides PERSIANN). This is very clear for the Nangka event and EARLY for particular time steps [Fig. 6; e.g., at 1200 local time (LT) EARLY rainfall is shifted southwest]. To quantify this error, the spatial correlation of EARLY with the gauge-interpolated rainfall is calculated for each time step and a range of shifting combinations (Fig. 6c). The results show that a spatial shifting of the product substantially increases its performance (e.g., at 1200 LT the correlation increases from 0.34 up to 0.86 for shifting EARLY northwest). All SPPs (besides PERSIANN) exhibit similar behavior during all three typhoons. Thus, the shifting error, as already described for GSMaP (Ozawa et al. 2011; Chen et al. 2019), seems to exist for other SPPs as well.
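The shifting test described above can be sketched as a search over grid-cell offsets of up to 4 cells in each direction (81 combinations including the unshifted case). The sketch below is an assumption about how such a test could be coded, not the authors' implementation: the function names are hypothetical, the correlation is computed only over the overlapping part of the two fields, and the mapping of row/column offsets to compass directions depends on the array orientation.

```python
import numpy as np

def shifted_correlation(obs, est, dy, dx) -> float:
    """Spatial correlation after moving `est` by (dy, dx) cells (overlap only)."""
    ny, nx = obs.shape
    obs_part = obs[max(dy, 0): ny + min(dy, 0), max(dx, 0): nx + min(dx, 0)]
    est_part = est[max(-dy, 0): ny + min(-dy, 0), max(-dx, 0): nx + min(-dx, 0)]
    return float(np.corrcoef(obs_part.ravel(), est_part.ravel())[0, 1])

def best_shift(obs, est, max_shift=4):
    """Try all (2 * max_shift + 1)**2 offsets; return (correlation, dy, dx)."""
    offsets = range(-max_shift, max_shift + 1)
    return max((shifted_correlation(obs, est, dy, dx), dy, dx)
               for dy in offsets for dx in offsets)

# Synthetic test field: the "product" is the gauge field displaced by 2 rows
# and 1 column, so the search should recover the opposite displacement.
rng = np.random.default_rng(0)
gauge_field = rng.random((12, 12))
product_field = np.roll(gauge_field, shift=(-2, 1), axis=(0, 1))
print(shifted_correlation(gauge_field, product_field, 0, 0))
print(best_shift(gauge_field, product_field))
```

For real rainfall fields, time steps with little or no rain yield undefined or unstable correlations and would need to be screened before interpreting the best offset.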
As with the quantitative assessment, EARLY and LATE outperform the remaining SPPs in the contingency table analysis (Fig. 7). Again, the performance is improved at coarser temporal resolutions. The results show that the SPPs have two types of errors, both of which increase with higher rainfall intensities; this is an important remark given that one of the potentially useful applications of SPPs is flood forecasting. First, the products fail to detect many precipitation events (Figs. 7a,b), and second, the SPPs often wrongly estimate rainfall above the threshold for lower-intensity events (Figs. 7c,d). These errors have a high impact on the HSS (Figs. 7e,f), which deteriorates remarkably for high intensities. The elevation also affects the performance and, in general, SPPs perform better for the lowlands, possibly due to the challenges in depicting the orographic enhancement in complex terrains. It can be concluded that the errors of the SPPs not only vary spatially and temporally but also depend upon the rainfall magnitude. Similar results were also found in previous studies (e.g., Bitew and Gebremichael 2010). This finding is important when selecting bias correction techniques, since it would be advisable to select methods that take this behavior into consideration (e.g., the CDF matching technique). For high-intensity events, EARLY and LATE are considerably better than the rest in terms of POD. Their performance is in agreement with the results of Gaona et al. (2016), who evaluated a different SPP from the IMERG family (IMERG FINAL) over the Netherlands. CMORPH outperforms the other SPPs in terms of FAR for high-intensity rainfall, because it consistently underestimates the rainfall. PERSIANN's and GSMaP's poor performance in terms of FAR is associated with the overestimation of rainfall in the nonrainy season.
Finally, the error between the SPPs and the gauge-interpolated rainfall is calculated for each time step and grid cell, and the correlation matrix is derived. At the hourly temporal scale, the median correlation of the errors across all grid cells among the different pairs of SPPs varies from 0.40 up to 0.64, except for the EARLY and LATE pair, which has 0.93. For coarser temporal scales the correlation increases (e.g., for daily it ranges from 0.58 to 0.74; EARLY and LATE have 0.98). This can be attributed to the fact that all SPPs make use of similar satellites and techniques in their algorithms. The high correlation between EARLY and LATE gives one more reason for excluding LATE from merging, since their errors and spatiotemporal behavior are quite similar, and a combination of them would have limited added value. Moreover, the fact that the errors of the SPPs are correlated could raise questions regarding the use of the error variance technique. To overcome this problem, there are techniques for either including the covariance matrix or transforming the data for decorrelation (Hasan et al. 2016). Nevertheless, the study of Smith and Wallis (2009), which examined merging techniques for forecasting macroeconomic time series, concluded that it is better to neglect any covariance and calculate the weights based only on the mean squared error. Thus, the use of the error variance method is justifiable.
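A minimal sketch of this error-correlation analysis is given below. It assumes that each product and the gauge-interpolated rainfall are stored as arrays of shape (time steps, grid cells). The function name `median_error_correlation` and the dictionary input are hypothetical, and grid cells with constant errors would need separate handling.

```python
import numpy as np

def median_error_correlation(spps, gauge):
    """Median (over grid cells) correlation matrix of the SPP errors.

    spps  : dict mapping product name -> array of shape (n_times, n_cells)
    gauge : array of shape (n_times, n_cells)
    """
    names = list(spps)
    gauge = np.asarray(gauge, dtype=float)
    # Errors of every product, stacked to shape (n_products, n_times, n_cells)
    errors = np.stack([np.asarray(spps[n], dtype=float) - gauge for n in names])
    n_cells = errors.shape[2]
    # One (n_products x n_products) correlation matrix per grid cell
    per_cell = np.stack([np.corrcoef(errors[:, :, c]) for c in range(n_cells)])
    return names, np.median(per_cell, axis=0)
```

Off-diagonal values close to 1, as reported for the EARLY and LATE pair, suggest that the two products carry largely redundant error information.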

b. Bias correction
As expected, both techniques reduce the bias of the SPPs across the whole study area, with the RBC method giving the best results (Fig. 8). CDF matching results in negative bias for all products, and this is related to the inaccurate estimation of low-intensity precipitation events by the SPPs. More specifically, the SPPs at the hourly temporal scale have more time steps (~90%) with no precipitation than the rain gauges (~85%), leading to an accumulated loss of rainfall even after CDF matching. One way of overcoming this challenge is to implement the technique at coarser scales. With longer time series, the CDF matching could, for example, be performed on a daily time step and then disaggregated to hourly, according to the ratio between hourly and corresponding daily rainfall at each time step for the original SPPs. Therefore, the accumulated rainfall of the gauges that corresponds to the additional nonrainy time steps of the SPPs will be reduced and the relative bias will be improved. Nevertheless, by working on coarser temporal scales, there will be an increase in the latency. Moreover, since the SPPs cannot capture low intensities accurately, the disaggregation would cause an overestimation of the rainfall events. It is therefore challenging to improve the performance of the SPPs on fine temporal scales by employing the CDF matching technique.

FIG. 6. Precipitation pattern during typhoon Nangka from (a1)-(a10) the rain gauges' interpolated rainfall and (b1)-(b10) EARLY on a 0.10° × 0.10° resolution grid. The white color indicates zero rainfall. The mean value of the precipitation for each subplot is presented in the associated box. (c) Spatial correlation of rainfall between EARLY and gauges for a range of spatial shifting combinations. EARLY is shifted from 1 up to 4 grid cells in each direction, resulting in 81 different combinations. The black line refers to the results from the original (nonshifted) location of the product, with the rest of the colors indicating a shift as per the legend. For example, the light green lines refer to the 16 combinations of shifting the product to the southeast (from 1 up to 4 grid cells in each direction). The dotted red rectangles depict the time steps when there is a high indication of shifting error for the SPP.
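The daily CDF matching followed by hourly disaggregation, as discussed above, could be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the CDFs are empirical, the non-exceedance probability counts only values strictly below the estimate (so that zero rainfall stays zero when the gauge record also contains zeros), and the function names `cdf_match` and `disaggregate_to_hourly` are hypothetical.

```python
import numpy as np

def cdf_match(spp_cal, gauge_cal, spp_new):
    """Empirical CDF matching: map SPP values onto the gauge distribution."""
    spp_sorted = np.sort(np.asarray(spp_cal, dtype=float))
    x = np.asarray(spp_new, dtype=float)
    # Fraction of calibration SPP values strictly below each new value
    p = np.searchsorted(spp_sorted, x, side="left") / spp_sorted.size
    # Gauge value at the same non-exceedance probability
    return np.quantile(np.asarray(gauge_cal, dtype=float), p)

def disaggregate_to_hourly(hourly_spp, daily_corrected):
    """Distribute corrected daily totals using the original hourly/daily ratios."""
    hourly = np.asarray(hourly_spp, dtype=float).reshape(-1, 24)
    totals = hourly.sum(axis=1, keepdims=True)
    # Ratio of each hour to its daily total; zero where the original day is dry
    ratio = np.divide(hourly, totals, out=np.zeros_like(hourly), where=totals > 0)
    corrected = ratio * np.asarray(daily_corrected, dtype=float).reshape(-1, 1)
    return corrected.ravel()
```

In this sketch, a day that is dry in the original SPP remains dry after disaggregation, whatever its corrected daily total, and all corrected rainfall is assigned to the hours in which the SPP already reported rain. This mirrors the limitation noted above for low-intensity rainfall.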
Both techniques improve the spatial pattern and magnitude of the accumulated rainfall estimates, on the validation set as well, resulting in increased accuracy, especially for coarse scales (Table 6). For fine scales, though, the accuracy decreases in both the calibration and validation sets. This can be attributed to the very low correlation of the SPPs at fine scales, which makes it very difficult to increase their performance through bias correction.
In particular, one of the limitations of the CDF matching method is that the pairing is not maintained, since the CDFs of the gauges and SPPs are constructed independently, without taking into consideration the time of each observation (Madadgar et al. 2014). This limitation strongly influences the results of the CDF matching correction for GSMaP and PERSIANN, which have systematic overestimation (underestimation) of the rainfall in the dry (wet) season. Thus, the seasonal bias would require seasonal adjustment. Nevertheless, this is not selected in this research due to the limited record length. Similarly, the RBC is performed on a seasonal scale; thus, important variations that occur at finer temporal scales (e.g., monthly), as well as intensity-induced errors, are not corrected. Due to the small temporal coverage, though, it is not advisable to work on a monthly or finer scale for the RBC correction, or to split the data based on intensity or precipitation type (typhoon, convective, high-low intensity, etc.). The method that performs best for each SPP in the calibration set also performs best in the validation set, indicating that the bias correction technique should be selected based on the particular SPP and its characteristics.
One important observation is the mismatch of the results on the relative bias for the calibration set between the Kinu basin and the grid cells for the SPPs (Fig. 8; Table 6), something that also applies to the whole temporal coverage (Fig. 3; Table 3) as well as the validation set. It can be noticed, for example, that for the calibration set the original versions of GSMaP and PERSIANN have minor bias for the Kinu basin, although the actual bias for each grid cell is quite high, either positive or negative. These differences, though, effectively cancel out when the lumped rainfall is calculated for the basin. This result highlights once more that the correction techniques should take into consideration the spatial behavior of the SPPs and not rely on spatially aggregated indicators that could provide misleading information.
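The cancellation effect can be shown with a small worked example. The numbers are invented for illustration only, and the relative bias is assumed to be defined as the difference between the accumulated estimate and the accumulated reference, divided by the accumulated reference.

```python
import numpy as np

def relative_bias(estimate, reference):
    """Relative bias of accumulated rainfall (assumed definition)."""
    return (np.sum(estimate) - np.sum(reference)) / np.sum(reference)

# Illustrative data: rows are time steps, columns are two grid cells
gauge = np.array([[10.0, 10.0],
                  [20.0, 20.0]])
spp = np.array([[14.0, 6.0],
                [28.0, 12.0]])

# Bias evaluated separately for each grid cell
cell_bias = [relative_bias(spp[:, c], gauge[:, c]) for c in range(gauge.shape[1])]

# Bias evaluated on the lumped (basin-average) rainfall
basin_bias = relative_bias(spp.mean(axis=1), gauge.mean(axis=1))

print(cell_bias)   # one cell overestimates by 40%, the other underestimates by 40%
print(basin_bias)  # the two errors cancel in the basin average
```

Here the basin-average bias is zero although each grid cell is biased by 40%, which is why a spatially aggregated indicator alone can be misleading.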

c. Merging
This section presents the results of merging the SPPs for two different analyses. More specifically, the merging weights are calculated at the (i) hourly temporal scale for the dry season and (ii) daily temporal scale for the whole temporal coverage. As mentioned in the methodology, the version of each SPP with the lowest NRMSE on the calibration set is selected for the merging (Table 7). LATE is not considered, as explained in the previous sections. These two analyses are selected for the following reasons: (i) The hourly scale is the finest temporal scale that ensures the least possible latency of the merged products. At this scale, nevertheless, the version with the lowest NRMSE on the calibration set for all SPPs, when the whole temporal coverage is analyzed, is the original version. Thus, the assumption of unbiased errors is not valid. Moreover, most of the rain of the wet season is associated with typhoons, where the SPPs have challenges. This, in combination with the small temporal coverage, results in high errors for very fine temporal scales (Table 3) and instability for coarser scales (Table 8). For the dry season, though, there are no limitations related to typhoon-induced rainfall and, moreover, two SPPs have increased performance after bias correction. Although the assumption of unbiased errors is not [...] (Table 7) and thus comply with the assumption of unbiased errors of the individual products. It should be noted that the best version on the validation set is not the same as on the calibration set, with possible reasons being the short temporal coverage and the inclusion of the typhoon events that the SPPs have challenges in depicting (Table 5).

The results refer to the unconditional analysis (P_g ≥ 0). The best product for each temporal scale (bold) and the best version of each SPP on each temporal scale (italic) are indicated.

The performance of the merged products on the hourly temporal scale and the dry season is improved compared to the individual SPPs. Especially for the Kinu basin spatial scale, the merging improves the results for fine temporal scales on both calibration and validation sets (Table 8). This is very important, because for such scales the bias correction techniques could not bring any additional improvement to the SPPs. On the calibration set, the best individual SPP for the fine scales is the original version of LATE. The IEVW product outperforms this SPP, resulting not only in improved accuracy but also in reduced latency (LATE has 15 h and IEVW has ~5 h). Note that for coarse temporal scales, the merged products do not outperform the individual SPPs on the calibration set. This is because the version of each SPP used, as well as the variance of the errors for calculating the weights, are determined on the hourly scale; thus, the weights are not the optimal ones for coarser scales. Similarly, the merging at daily level improves the results on both calibration and validation sets (Fig. 9) for the grid-cell comparison, not only at the daily temporal scale, but also for all the coarser scales. All merging methods perform equally on the validation set and IEVW has the highest performance on the calibration set. This can be attributed to the equations used [Eqs. (7)-(9)]. More specifically, the IEVW method gives a wider range of values for the weights of each product at each grid cell.

For the wet season, for example, the median weight of the grid cells for PERSIANN (the product with the lowest performance) is 0.11 and for EARLY (the product with the best performance) is 0.34 according to the IEVW. The error variance method has weights of 0.22 and 0.26, respectively, while the simple average assigns 0.25 to all SPPs. Because the calibration set is quite limited and the performance of the SPPs differs between the calibration and validation sets, these more extreme values of the IEVW weights result in lower performance on the validation set. For larger datasets, though, the weights of the SPPs would be more robust, and it is quite possible that the IEVW would outperform the rest of the merging methods, as already found in other studies (Khairul et al. 2018).

FIG. 9. [...] (2015) and validation (2016) sets. Shown for each temporal scale is the best version of each individual SPP and the merged products (merging weights calculated at daily scale). The best version of each SPP is defined based on the lowest median NRMSE between the original and the two bias-corrected versions of the SPP for each temporal scale on the calibration set. The median value of the NRMSE from all the grid cells for each subplot is presented in the associated box.
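To illustrate why the IEVW weights spread more widely than those of the simple average, the following sketch computes both sets of weights per grid cell. Since Eqs. (7)-(9) are not reproduced in this section, the common inverse-error-variance form is assumed, with the mean squared error of each product on the calibration set standing in for its error variance. The error variance method is omitted because its exact formula is not given here. All function names are hypothetical.

```python
import numpy as np

def ievw_weights(errors):
    """Inverse error variance weights per grid cell (assumed standard form).

    errors : array of shape (n_products, n_times, n_cells) holding
             SPP minus gauge on the calibration set.
    """
    mse = np.mean(np.asarray(errors, dtype=float) ** 2, axis=1)  # (n_products, n_cells)
    inv = 1.0 / mse
    return inv / inv.sum(axis=0, keepdims=True)

def simple_average_weights(n_products, n_cells):
    """Equal weights for every product at every grid cell."""
    return np.full((n_products, n_cells), 1.0 / n_products)

def merge(products, weights):
    """Weighted sum of the products; weights are constant in time."""
    products = np.asarray(products, dtype=float)   # (n_products, n_times, n_cells)
    return np.sum(weights[:, None, :] * products, axis=0)
```

With four products the simple average always assigns 0.25, whereas under this form a product whose mean squared error is four times larger than that of another receives one quarter of its weight. A short calibration record can therefore produce weights that transfer less well to the validation period.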

Conclusions and recommendations
This study evaluates the performance of five NRT SPPs in the Kinu basin region in Japan, including the use of two bias correction techniques and three merging methods that are applied taking into consideration the temporal and spatial variability on the SPPs' performance.
At this point, it is important to mention the limitations of this study. The main limitation is the short temporal coverage, which constrains the possibility of generalizing the conclusions and obtaining robust results in the bias correction and merging steps. Moreover, this limitation creates subsequent challenges, for example the inability to perform the bias correction techniques and merging methods with finer temporal splitting of the data. Nevertheless, this temporal coverage is chosen because it is a period during which all selected SPPs have available data. One more limitation is the difficulty of the SPPs in detecting typhoon-induced rainfall. The three typhoon events that occurred in the area resulted in almost 20% of the total rainfall at Kinu basin. Thus, despite the small number of events, they substantially affect the evaluation of the SPPs and the performance of the selected methods on the analyzed temporal coverage and, more importantly, on the rainy season and the fine temporal scales. Finally, although the Kriging interpolation is quite robust and the gauge network is very extensive and of very high quality, there is still inherent uncertainty in every rainfall interpolation.
According to the results, the SPPs' performance depends on the selected temporal and spatial scale and increases with aggregation in time and/or space. Both bias-correction techniques are able to reduce the bias and improve the spatial representation of the rainfall, as well as increase the performance of the products for coarse temporal scales. Finally, the merging of the SPPs improves the results for all temporal and spatial scales, showing that the products could be utilized not only in lumped hydrological models, but also in distributed ones.
This study, moreover, highlights some challenges in the usage of the various SPPs and methods. In the analyzed spatial and temporal coverage, all SPPs, especially PERSIANN, have difficulties in detecting typhoon-induced rainfall, and additionally, spatial-shift error is identified for all SPPs that use data from LEO-PMW sensors and algorithms for rainfall propagation. Also, both bias-correction methods are in general unable to improve the results for finer than 6-h temporal scales. This creates an additional drawback when implementing the selected merging techniques at such temporal scales, where the original versions of the SPPs do not comply with the requirement of unbiased and normally distributed errors. Finally, the CDF matching, when applied at fine temporal scales, has challenges in eliminating the bias, because the SPPs estimate zero precipitation at significantly more time steps than the gauge observations.
Based on the results, there are avenues of future research and important research questions that need to be addressed in further studies. It is crucial to evaluate the performance of SPPs for basins of various sizes, especially in areas affected by typhoons. The analysis can be expanded by comparing the SPPs via the simulation of the rainfall-runoff process through a hydrological model. In the future, similar studies with SPPs over extended temporal coverage should be undertaken. Moreover, it is advisable to explore the performance of additional merging techniques, and the optimal method for a range of meteorological conditions and seasons. A comparison of an ensemble of SPPs together with gauge rainfall may be undertaken as well. Last, it would be useful to identify the optimal combination of rain gauge and NRT satellite data to improve the accuracy of rainfall estimations for disaster risk reduction and water management tasks, such as flood and drought monitoring and early warning systems. The latter is of immense importance for countries with limited resources and poor ground observations that are nevertheless vulnerable to water-induced disasters. In such areas lies the most effective use of SPPs, and any improvement in their applicability is crucial.

[...] comments. Moreover, the first author would like to thank Mr. Jianning Ren for providing useful information regarding the analysis of the shifting error, Prof. Toshio Koike and Mr. Islam M. Khairul for the discussions that contributed to the completion of this study, and Mr. Simon Parry for his fruitful suggestions for improving the quality of this manuscript.