## 1. Introduction

Two trends have emerged in the development of new streamflow forecasting systems: (i) a shift from deterministic to ensemble streamflow predictions (Alfieri et al. 2013; Cloke and Pappenberger 2009; Demargne et al. 2014; Thielen et al. 2009), and (ii) a move toward national/continental scale systems that attempt to describe hydrological fluxes for all reaches over a given domain (Adams and Pagano 2016; Bell et al. 2017; Emerton et al. 2016; Maxey et al. 2012). Meeting these twin aims offers clear benefits to forecast users: ensemble forecasts are usually more accurate than deterministic predictions and give an explicit estimate of forecast uncertainty (Gneiting and Katzfuss 2014; Gneiting et al. 2007), while extensive spatial coverage gives forecast information in reaches/basins where it was previously unavailable.

New Zealand’s National Institute of Water and Atmospheric Research (NIWA) is developing a national scale ensemble streamflow forecasting system for New Zealand, with the aim of informing water management and emergency agencies. New Zealand’s mountainous topography leads to precipitation that varies sharply in space and time, and in turn to rivers that can rise very quickly (Cattoën et al. 2016; Woods et al. 2006). These catchment characteristics pose challenges similar to those found in many mountainous regions (Rossa et al. 2011). Producing useful streamflow forecasts in New Zealand, particularly flood forecasts, requires (i) high-resolution precipitation forecasts to account for orographic effects due to steep mountains and (ii) hydrological models run at an hourly time step.

Rainfall forecasts are produced by NIWA at very high spatial resolution (1.5 km); however, the computational cost of this numerical weather prediction (NWP) model means that it can only produce deterministic forecasts. Fortunately, methods are available to produce ensemble precipitation forecasts through statistical calibration of deterministic NWP outputs (see review by Li et al. 2017). Statistical calibration offers the additional benefits of correcting biases and ensuring “coherence”—i.e., ensuring forecasts are at least as accurate as climatology forecasts (Zhao et al. 2017). These properties are essential prerequisites for using forecasts to force hydrological models.

A key requirement of statistical calibration is the availability of observations at the time step of interest—in our case, hourly precipitation. New Zealand has a sparse rainfall gauge network in relation to the very high spatial variability of rainfall in mountainous regions, meaning that hourly rainfall observations are not available at the national scale. There is a rain radar network covering much of New Zealand, but it is only available on a commercial basis so was not used in this study. Additionally, due to the complex terrain of many New Zealand catchments, radar accuracy can be degraded, and coverage significantly limited. However, daily precipitation data are available across New Zealand in the form of the interpolated and mass-corrected Virtual Climate Station Network (VCSN) dataset (Tait et al. 2006). The VCSN interpolates observed meteorological values onto a grid covering New Zealand at a 5-km spatial resolution at a daily time step. The mass correction is necessary to overcome underestimation of precipitation in mountainous regions (Andréassian et al. 2010; Bartolini et al. 2011; Beck et al. 2019; Hamon 1973; Valéry et al. 2010). The mass correction is performed by comparing rainfall and long-term streamflow records and correcting rainfall to ensure mass balance (Woods et al. 2006).

Lack of subdaily rainfall observations is a problem facing many regions where calibrated rainfall forecasts could be useful. Hourly precipitation datasets with extensive national or continental coverage are unusual—particularly in developing nations (Gruber and Levizzani 2008)—whereas daily precipitation datasets are more common and available over large domains [e.g., for the United States (Peterson et al. 1997), Canada (Vincent and Mekis 2006), and Australia (Jones et al. 2009) as well as global datasets (Beck et al. 2019)].

In this study, we aim to establish a new method to calibrate hourly precipitation forecasts from daily observations. At the time of writing, we are unaware of existing work addressing this issue. We base this on an existing calibration method (Robertson et al. 2013; Shrestha et al. 2015), which combines a Bayesian joint probability (BJP) model to calibrate forecasts with the Schaake shuffle (Clark et al. 2004) to order calibrated forecast ensemble members in space and time. We note that other reordering methods are available, notably ensemble copula coupling (ECC) (Schefzik et al. 2013). ECC is attractive in cases of data paucity because it differs from the Schaake shuffle in its choice of dependency template. For the Schaake shuffle, the template is chosen from past observations, whereas for ECC, the template is the uncalibrated ensemble forecast. For this paper, the uncalibrated forecast is deterministic and thus ECC could not be used.

We conduct three experiments: (i) calibrating to daily observations, and then disaggregating calibrated daily forecasts to hourly; (ii) synthesizing hourly observations from daily data using temporal and spatial patterns from the NWP, and then calibrating directly to these “pseudohourly” observations; and (iii) calibrating to both daily observations and to pseudohourly observations, and using calibrated daily rainfall forecasts to correct daily totals of pseudohourly calibrated forecasts. We compare these to the ideal case where hourly observations are available and demonstrate that the third experiment produces the best performing ensemble forecasts.

The paper is structured as follows: section 2 describes the catchment, observations, and NWP predictions used in this study. Section 3 describes the implementation of the BJP modeling approach for postprocessing subdaily rainfall predictions (control), three experiments to postprocess hourly forecasts with daily data, and the methods used to verify forecasts. Section 4 presents the results of forecast verification and compares the three experiments to the control obtained with hourly data. Section 5 discusses the potential limitations of the methods presented and identifies possible extensions. Section 6 summarizes and concludes the paper.

## 2. Catchment, data, and NWP model

### a. Catchment and data

We use the Hutt catchment in the Wellington region of New Zealand to test our method (Fig. 1). The elevation range of the catchment is large, with mountainous areas (the Tararua and Rimutaka ranges) in the northwest, and an extensive floodplain in the lower reaches. The catchment features a very steep precipitation gradient, with annual rainfall ranging from <900 to >5000 mm over its area of 558 km^{2} (Ballinger et al. 2011; Wellington Regional Council 1995) (Table 1).

Table 1. Rainfall station information for the case study catchment during the 3-yr period May 2014–17.

While our method will eventually be deployed with the VCSN dataset, to test it thoroughly we require hourly observations as a benchmark. We therefore use gauged hourly precipitation for this study. The catchment is densely gauged: hourly observed precipitation data are available from 10 automatic meteorological stations with tipping-bucket rain gauges (Table 1). The tipping buckets are 0.5 mm in volume. For most of the stations an historical archive of hourly precipitation is available since 1972; however, we restrict the records to a 3-yr period, 2014–17, to match the precipitation forecast archive (section 2b). These three years include a moderately dry year (2014/15), a near-normal year (2016/17), and one with rainfall in the top 5% (2015/16), based on the past 40 years of records. Two out of the 10 rainfall stations have missing data ranging from 0.6% to 10.7% (Table 1).

As already noted, we wish to test our methods on daily observations in order to be compatible with the VCSN dataset. To produce daily observations from the precipitation gauging stations, we sum hourly data for each 24-h period beginning at 2100 UTC, the same 24-h aggregation period for which the VCSN observations are calculated.
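The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `daily_totals`, the convention that each timestamp marks the start of a 1-h accumulation, and the choice to drop windows with missing hours are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def daily_totals(hourly, start_hour_utc=21):
    """Sum hourly rainfall (mm) into 24-h totals beginning at start_hour_utc.

    hourly: iterable of (timestamp, value) pairs. Each timezone-aware
    timestamp is assumed to mark the START of a 1-h accumulation.
    Returns {window_start: total} for windows with all 24 hours present
    (dropping incomplete windows is an assumption of this sketch).
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for stamp, value in hourly:
        stamp = stamp.astimezone(timezone.utc)
        # Shift the clock so the window starts at midnight, truncate to the
        # day, then shift back to recover the 2100 UTC window start.
        shifted = stamp - timedelta(hours=start_hour_utc)
        day = shifted.replace(hour=0, minute=0, second=0, microsecond=0)
        window_start = day + timedelta(hours=start_hour_utc)
        sums[window_start] += value
        counts[window_start] += 1
    return {k: sums[k] for k in sorted(sums) if counts[k] == 24}
```

With the default `start_hour_utc=21`, an hourly value stamped 2000 UTC on one day falls in the window that began at 2100 UTC on the previous day.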

### b. NWP model

Rainfall forecasts are generated by the New Zealand Convective Scale Model (NZCSM), a local implementation of the U.K. Met Office Unified Model System (UM), which has been run operationally since 2014. NZCSM is run as a deterministic model, with a grid resolution of 1.5 km and outputs archived at a 30-min time step. Forecasts are run to a lead time of 36 h. NZCSM takes its forcing from the New Zealand Limited Area Model (NZLAM), a regional NWP model run at a 12-km resolution that uses lateral boundary conditions from the global version of the UM run by the Met Office. NZCSM’s initial conditions are generated via a pseudo data assimilation scheme that optimally combines the large-scale features of the NZLAM forecast. The 1.5-km grid resolution of the NZCSM allows an accurate representation of the New Zealand topography, which is especially beneficial in mountainous regions. NZCSM forecasts are issued four times a day, at 0300, 0900, 1500, and 2100 UTC. To avoid ambiguity, we refer to each forecast issue time as a *cycle*, and define the cycles as 0300 cycle, 0900 cycle, 1500 cycle, and 2100 cycle.

An archive of real-time predictions for the ~3-yr period from 1 May 2014 to 31 May 2017 is available for this study (approximately 4500 forecasts, or ~1157 forecasts for each cycle). While a longer record is desirable, it is unavailable because significant upgrades to NZCSM in 2017 substantially improved the outputs of the model.

To minimize potentially undesirable model spinup effects, we avoid the use of the first 6 h of the forecast in the calibration process, which include the forecast incremental analysis and pseudodata assimilation time period (Cattoën et al. 2016). Analysis of rainfall properties as a function of lead time shows that while most of the spinup effects are resolved within the first 3 h, there can still be some impact after 5 to 6 h (Cattoën et al. 2019). The 6-h offset was also chosen to better match the available daily observation period (0900–0900 local time).

A known problem with NZCSM predictions (and some other UM implementations; e.g., Stratton et al. 2018) is occasional predictions of unrealistically large rainfalls. The excess rain is caused by the model failing to conserve mass in certain circumstances. To remove unrealistic values, we fit a log-sinh (Wang et al. 2012) transformed normal distribution to the hourly forecasts for each combination of station, lead time, and issue cycle. We then compute the hourly forecast value with an exceedance probability of 1 in 100 years according to the fitted distribution. We refer to this value as *p*_{extreme}. Any forecast values greater than *p*_{extreme} are set to *p*_{extreme}. These adjustments are made to at most five forecast values (of ~1157) for each lead time. This method for dealing with unrealistically large rainfalls was used in Wang et al. (2019a,b).

## 3. Methods

### a. Forecast calibration

Our forecast calibration uses a censored bivariate normal distribution to relate transformed NWP rainfall forecasts to transformed observations (Robertson et al. 2013). The transformation, censored bivariate normal distribution, and the parameter estimation procedure are described in section 3a(1), section 3a(2), and appendix A, respectively. The calibrated forecasts are reordered with the Schaake shuffle [section 3a(4)], to ensure realistic temporal and spatial rank structures in the ensemble. This postprocessing method has been applied extensively in Australia (Shrestha et al. 2015), where it forms the basis of a preoperational ensemble streamflow forecasting system (Bennett et al. 2014).

#### 1) Data censoring and transformation

Deterministic rainfall forecasts *x* are rounded to the nearest 0.01 mm and a censor threshold of 0 mm is used for the parameter estimation procedure. For observations *y*, a censor threshold of 0.5 mm is applied, to be consistent with the 0.5-mm tipping buckets used in this study.

We first scale rainfall forecasts *x* and observations *y*:

where *x*_{max} and *y*_{max} are the maximum values of *x* and *y* over the full data period used.

The scaling forces the transformed forecasts and observations,

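After scaling, the data are transformed with the log-sinh transformation (Wang et al. 2012) to normalize *x*′ and *y*′ and homogenize their variances:

$$\tilde{x} = \frac{1}{b_x}\ln\left[\sinh\left(a_x + b_x x'\right)\right], \qquad \tilde{y} = \frac{1}{b_y}\ln\left[\sinh\left(a_y + b_y y'\right)\right],$$

where *a*_{x} and *b*_{x} are transformation parameters for *x*′, *a*_{y} and *b*_{y} are those for *y*′, and $\tilde{x}$ and $\tilde{y}$ denote the transformed forecasts and observations.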
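As an illustration, the log-sinh transformation of Wang et al. (2012), *z* = (1/*b*) ln[sinh(*a* + *b* *x*)], and its inverse can be written as below. This is a sketch only: the function names are hypothetical and the parameter values in the usage lines are arbitrary placeholders, not fitted values from this study.

```python
import math


def log_sinh(x, a, b):
    """Log-sinh transform z = (1/b) * ln(sinh(a + b*x)).

    Assumes a > 0, b > 0 and x >= 0 so that sinh(a + b*x) is positive.
    """
    return math.log(math.sinh(a + b * x)) / b


def inv_log_sinh(z, a, b):
    """Inverse transform x = (asinh(exp(b*z)) - a) / b."""
    return (math.asinh(math.exp(b * z)) - a) / b


# Usage with placeholder parameters (hypothetical values).
a, b = 0.1, 0.5
z = log_sinh(2.0, a, b)
x_back = inv_log_sinh(z, a, b)  # recovers 2.0 up to rounding error
```

For small arguments the transform behaves like a logarithm and for large arguments it is close to linear, which is why it is useful for stabilizing the variance of skewed data such as rainfall.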

#### 2) Bivariate normal distribution

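We assume that the transformed forecasts $\tilde{x}$ and transformed observations $\tilde{y}$ follow a bivariate normal distribution:

$$\begin{bmatrix}\tilde{x}\\ \tilde{y}\end{bmatrix} \sim N\left(\boldsymbol{\mu}, \boldsymbol{\Sigma}\right),$$

where

$$\boldsymbol{\mu} = \begin{bmatrix}\mu_x\\ \mu_y\end{bmatrix}$$

and

$$\boldsymbol{\Sigma} = \begin{bmatrix}\sigma_x^{2} & \rho\,\sigma_x\sigma_y\\ \rho\,\sigma_x\sigma_y & \sigma_y^{2}\end{bmatrix},$$

where *μ*_{x} and *μ*_{y} are the means, *σ*_{x} and *σ*_{y} are the standard deviations, and *ρ* is the correlation coefficient of the transformed variables.

A set of parameters *θ* describing this model is estimated with the procedure given in appendix A.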

#### 3) Generating a calibrated forecast

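Given a parameter set *θ*, we can define a univariate normal distribution for the transformed observation $\tilde{y}$ conditional on the transformed forecast $\tilde{x}$:

$$\tilde{y} \mid \tilde{x} \sim N\left(\mu_{y|x}, \sigma_{y|x}^{2}\right), \tag{6}$$

with mean

$$\mu_{y|x} = \mu_y + \rho\,\frac{\sigma_y}{\sigma_x}\left(\tilde{x} - \mu_x\right)$$

and standard deviation

$$\sigma_{y|x} = \sigma_y\sqrt{1 - \rho^{2}},$$

where *μ*_{x}, *μ*_{y}, *σ*_{x}, and *σ*_{y} are the means and standard deviations of the bivariate normal distribution and *ρ* is its correlation coefficient.

We draw *N* = 1000 random samples from (6) to produce an ensemble forecast.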
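The sampling step can be sketched with the standard conditional distribution of a bivariate normal, whose mean is *μ*_{y} + *ρ*(*σ*_{y}/*σ*_{x})(x̃ − *μ*_{x}) and whose standard deviation is *σ*_{y}√(1 − *ρ*²). The code below is an illustration under simplifying assumptions: the function names and the tuple layout of `theta` are hypothetical, the samples stay in transformed space, and the special handling needed when the forecast is at its censor threshold is not shown.

```python
import math
import random


def conditional_params(x_t, mu_x, sigma_x, mu_y, sigma_y, rho):
    """Mean and standard deviation of y given x for a bivariate normal."""
    mean = mu_y + rho * sigma_y / sigma_x * (x_t - mu_x)
    sd = sigma_y * math.sqrt(1.0 - rho ** 2)
    return mean, sd


def sample_ensemble(x_t, theta, n=1000, seed=0):
    """Draw n members in transformed space, conditional on forecast x_t.

    theta is assumed to be the tuple (mu_x, sigma_x, mu_y, sigma_y, rho).
    """
    rng = random.Random(seed)
    mean, sd = conditional_params(x_t, *theta)
    return [rng.gauss(mean, sd) for _ in range(n)]
```

A higher correlation *ρ* both pulls the conditional mean toward the forecast and narrows the ensemble, which is how forecast skill translates into ensemble spread in this kind of model.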

#### 4) Reordering calibrated forecasts

The calibration produces an ensemble forecast at each location and lead time but does not link these forecasts in space or time. We instill spatial and temporal properties in each forecast using the Schaake shuffle (Clark et al. 2004).

The procedure is as follows. We begin with an empty matrix *z*_{i,τ,t}, where *i* is an index of locations, *τ* is an index of forecast lead time with dimension length *L* = 36 h, and *t* is an index of observed dates. The *t* dimension has length *T* = *N* = 1000, corresponding to the number of ensemble members. We then fill this matrix with rainfall sequences from the historical record. For example, we place a sequence of observations at a given location *i*, starting at date *t*, in the *i*th row of the matrix. We then add the sequence for this same time period at each of the other locations and repeat the whole process for *T* different starting dates. If fewer date sequences are available than the number of members *N*, then we repeat the shuffling process *K* times, each time sampling *L* sequences of observations from the *T* available sequences, such that *L* < *T* and *L* × *K* = *N* (in this sentence only, *L* denotes the number of sampled sequences rather than the forecast length).

Once we have filled *z*, we sort its values in the *t* dimension for each location *i* and lead time *τ*. Tied values (e.g., caused by the presence of zeros in the data) can occur in the *t* dimension. As with all applications of the Schaake shuffle, the reordering process can be sensitive to the number of zeros present in the dependence template data (Bellier et al. 2017). We define an index matrix *r*_{i,τ,t} at each location *i* and lead time *τ* that records the rank of each template value within the sorted sequence.

After calibration we have an ensemble forecast at each location *i* and each lead time *τ*, with members *n* = 1, 2, …, *N*, where *N* = 1000 is the size of the ensemble. Note that *N* = *T*. We sort the ensemble in the *n* dimension for each location *i* and lead time *τ*, and then use the index matrix to reorder the sorted members so that their rank structure matches that of the template (Clark et al. 2004).
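The reordering step of the Schaake shuffle (Clark et al. 2004) can be illustrated with a short sketch. It is not the authors' implementation: the function names and nested-list layout are hypothetical, and ties in the template are broken by input order, which is an assumption of this example rather than a statement about the method used in the paper.

```python
def ranks(values):
    """Rank (0 = smallest) of each value; ties keep input order (assumed)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    r = [0] * len(values)
    for rank, k in enumerate(order):
        r[k] = rank
    return r


def schaake_shuffle(ensemble, template):
    """Reorder ensemble members to follow the rank structure of a template.

    ensemble[i][tau] is a list of N calibrated members and template[i][tau]
    is a list of N historical values for location i and lead time tau.
    Position t in every template list must refer to the same historical
    date, which is what carries dependence across locations and lead times.
    """
    shuffled = []
    for ens_i, tmp_i in zip(ensemble, template):
        row = []
        for ens, tmp in zip(ens_i, tmp_i):
            sorted_ens = sorted(ens)
            # Member t receives the forecast value whose rank equals the
            # rank of the template value observed on date t.
            row.append([sorted_ens[r] for r in ranks(tmp)])
        shuffled.append(row)
    return shuffled
```

The marginal distribution at each location and lead time is unchanged, because the same values are only rearranged; only the pairing of values across locations and lead times is altered.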

### b. Experimental design

To establish a new method to calibrate hourly precipitation forecasts from daily observations, we conduct three experiments as summarized in Fig. 2. We compare these to a control experiment, which represents the ideal case. The details are as follows.

Key methodological steps to generate calibrated ensemble forecasts with the control case and experiments 1, 2, and 3.

Citation: Journal of Hydrometeorology 21, 7; 10.1175/JHM-D-19-0246.1


#### 1) Control calibration

The control calibration represents the ideal case where high-quality hourly observations are available. Accordingly, forecasts are calibrated and shuffled with the method described in section 3a (with details in appendix A) using hourly precipitation observations.

#### 2) Experiment 1: Daily calibration

Assuming that observed hourly data are unavailable, we cannot rely on hourly data to inform our calibration. This experiment thus differs from the control in three ways. First, forecasts are calibrated at a daily time step instead of the hourly time step. Second, the Schaake shuffle uses the raw NZCSM hourly deterministic forecasts as template data for reordering the ensemble forecast. Third, calibrated daily forecasts are disaggregated to hourly as part of the shuffling process. The process is as follows.

##### (i) Daily parameter estimation and calibrated forecasts

In this study, we have 36-h forecast length and daily observations aggregated over the hours 2100–2100 UTC. This creates an obvious problem: how do you calibrate a 36-h forecast using 24-h aggregations? A solution is to match the aggregation period of observations with separate forecast issue cycles. This allows us to calibrate two periods: the first 24 h of the forecast (lead 1–24) and the last 24 h (lead 13–36) of the forecast. We then must decide which calibration period to use on the overlapping interval (lead 13–24): we use the calibration of the first 24 h for this period. The procedure is as follows.

We infer calibration parameters from the 2100 cycle and 0900 cycle forecasts, which correspond to the first and last 24-h period of each forecast. For a given cycle and location, denote the archive of NZCSM hourly forecasts by the matrix

$$\mathbf{X} = \left[x_{t,\tau_h}\right], \qquad t = 1, \dots, T, \quad \tau_h = 1, \dots, L,$$

with dimensions *T* × *L*, where *T* = 1157 forecasts and *L* = 36, as described in section 2b.

To calibrate the 2100 cycle forecasts, we sum forecasts over lead times 1–24 h in the *τ*_{h} dimension to create the vector

$$\mathbf{D}^{2100} = \left[D^{2100}_{t}\right], \qquad D^{2100}_{t} = \sum_{\tau_h = 1}^{24} x^{2100}_{t,\tau_h}.$$

We then calibrate **D**^{2100} forecasts to daily observations to generate the parameter set *θ*_{D1}, following the calibration method in section 3a. To calibrate 0900 cycle forecasts, we sum forecasts over lead times 13–36 h in the *τ*_{h} dimension to create the vector

$$\mathbf{D}^{0900} = \left[D^{0900}_{t}\right], \qquad D^{0900}_{t} = \sum_{\tau_h = 13}^{36} x^{0900}_{t,\tau_h}.$$

As with **D**^{2100}, we calibrate **D**^{0900} forecasts to daily observations to generate the parameter set *θ*_{D2}, following section 3a.

Even though the two parameter sets are generated from the 2100 and 0900 cycles, we apply the parameters to all cycles. The method presumes forecasts from different cycles will have similar properties at similar lead times (first 24 h and last 24 h). To do this, we must first sum our deterministic hourly forecasts into daily totals. For each cycle we produce a matrix of daily totals whose elements are the sums of the hourly forecasts over lead times 1–24 h (*τ*_{D} = 1) and 13–36 h (*τ*_{D} = 2).

This matrix has dimensions *T* × *L*_{D}, where *L*_{D} = 2 daily lead times. The calibrated forecasts are unusual in that the summation periods for the two lead times overlap (for lead times of 13–24 h in the *τ*_{h} dimension). We apply *θ*_{D1} to *τ*_{D} = 1 and *θ*_{D2} to *τ*_{D} = 2. For each forecast *t*, this results in a matrix of calibrated ensemble forecasts with dimensions *N* × *L*_{D}, where the *n* dimension indexes the ensemble members and is of length *N* = 1000.
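The two overlapping daily accumulations, and the rule stated earlier that the overlapping hours (lead 13–24) use the calibration of the first 24 h, can be sketched as follows. The function names are hypothetical and the code is only an illustration of the bookkeeping.

```python
def daily_lead_totals(hourly_forecast):
    """Collapse a 36-h hourly forecast into two overlapping 24-h totals.

    hourly_forecast holds values for lead times 1..36 h in order.
    Returns (total for leads 1-24, total for leads 13-36).
    """
    if len(hourly_forecast) != 36:
        raise ValueError("expected 36 hourly values")
    return sum(hourly_forecast[:24]), sum(hourly_forecast[12:])


def daily_lead_for_hour(tau_h):
    """Daily lead time (1 or 2) whose calibration applies to hourly lead tau_h.

    Leads 1-24 use the first 24-h calibration, including the overlap 13-24;
    leads 25-36 use the last 24-h calibration.
    """
    if not 1 <= tau_h <= 36:
        raise ValueError("lead time must be between 1 and 36 h")
    return 1 if tau_h <= 24 else 2
```

Because hours 13 to 24 contribute to both totals, the two daily values for one forecast are not independent accumulations.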

##### (ii) Ensemble reordering and hourly disaggregation

To produce spatially and temporally structured hourly ensemble members, we use the Schaake shuffle. However, as we assume hourly observations are not available for this experiment, we construct our template data from the archive of hourly raw forecasts, using *N* = 1000 template sequences. The procedure is detailed in appendix B.

We repeat the entire procedure of Eqs. (B1)–(B9) in appendix B for each forecast in the archive.

Note that we carry out this procedure in such a way that spatial and temporal patterns from the NZCSM forecasts are implicitly retained in the calibrated forecasts.

#### 3) Experiment 2: Pseudohourly calibration

As with experiment 1, for experiment 2 we assume only daily observations are available for calibration. But instead of calibrating forecasts to daily observations and then applying a disaggregation, we first synthesize hourly “observations” (termed *pseudohourly* observations) and then calibrate the forecasts to the pseudohourly observations. To generate pseudohourly observations, we disaggregate daily observations with temporal and spatial patterns from the NWP (Fig. 3). The disaggregation follows these steps:

Hourly disaggregation process of daily observations to generate pseudohourly observations; hourly temporal patterns from the raw forecasts 0900 cycle are used here.


Matching a forecast to each daily observation (step 1) is complicated by the availability of forecasts from multiple cycles. In choosing which forecast cycle to use for the disaggregation, we do not wish to produce pseudohourly observations that match the timing of rainfalls in the raw forecasts too closely. If correlations between forecasts and pseudohourly observations are unrealistically high, this will cause the BJP to underestimate the true uncertainty in the forecast. For example, if we calibrate forecasts issued for the 1500 cycle against pseudo observations disaggregated from forecasts from the 1500 cycle (i.e., the same forecasts) this will result in much higher correlations between forecasts and the pseudo observations (and hence our calibrated ensemble would be too narrow) than would be expected if we calibrated forecasts against gauged observations. Thus, we must calibrate forecasts from a given cycle against pseudo observations disaggregated with forecasts from a different cycle. Given these constraints, we first choose NWP forecasts issued at the 0900 cycle to produce pseudohourly observations. We use these pseudohourly observations to calibrate forecast cycles 0300, 1500, and 2100. To calibrate the 0900 cycle forecasts, however, we use pseudohourly observations disaggregated with the 1500 cycle.
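One plausible way to carry out such a disaggregation is to split each daily observed total in proportion to the hourly values of the matched raw forecast. The sketch below assumes this proportional rule; the function name is hypothetical, and the even split used when the forecast contains no rain is a placeholder choice that is not taken from the paper.

```python
def pseudohourly(daily_obs, forecast_hours):
    """Split a daily observed total using a forecast's hourly pattern.

    daily_obs: observed 24-h rainfall total (mm).
    forecast_hours: hourly raw forecast values covering the same window,
    taken from a cycle other than the one being calibrated.
    """
    total = sum(forecast_hours)
    if total <= 0.0:
        # Assumption: with no forecast rain there is no temporal pattern
        # to borrow, so the observed total is spread evenly.
        share = daily_obs / len(forecast_hours)
        return [share] * len(forecast_hours)
    return [daily_obs * f / total for f in forecast_hours]
```

By construction the pseudohourly values sum to the daily observation, so the daily mass is preserved while the timing within the day comes entirely from the NWP.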

#### 4) Experiment 3: Daily member matching calibration (DMM)

As we will show, experiment 2 generally produced more reliable hourly forecasts than experiment 1. Conversely, experiment 1 produced more reliable forecasts of 24-h rainfall totals. In this experiment, we wish to combine the best aspects of experiments 1 and 2. To do this, daily accumulated rainfall forecasts from the pseudohourly and daily methods are ranked, matched, and then scaled. The procedure is as follows.

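For a given forecast, location, and issue cycle, the pseudohourly method yields a matrix of calibrated hourly ensemble forecasts with dimensions *N* × *L*. We construct accumulated forecasts for the two daily lead times *τ*_{D} = 1, 2, giving a matrix of pseudohourly accumulations with dimensions *N* × *L*_{D}. We also retrieve the corresponding calibrated daily forecasts from the daily method.

Next, we sort both matrices in the *n* dimension, to produce sorted matrices, and retain the index matrix needed to restore the original member order.

We then compute scale factors to scale the pseudohourly forecast accumulations to match the daily accumulations in the sorted daily matrix. For each ensemble member *n* = 1, 2, …, 1000 and each lead time *τ*_{D} = 1, 2, the scale factor is calculated from the ratio of the sorted daily accumulation to the sorted pseudohourly accumulation, where *ε* is a small positive number used to avoid division by near-zero rainfalls. We tested different thresholds for *ε* = 0.05, 0.1, and 0.5 mm. We found that *ε* = 0.05 mm was the smallest threshold that maximized the number of forecasts to be scaled by the daily forecasts, while avoiding divisions by 0. We unsort the scaling factors with the index matrix so that each factor is returned to its original ensemble member.

The hourly forecasts at each lead time *τ*_{h} are then multiplied by the scale factor for the corresponding ensemble member and daily lead time.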
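The rank, match, and scale idea can be sketched for a single accumulation window as below. This is an illustration under stated assumptions rather than the authors' procedure: the function and argument names are hypothetical, and leaving members unscaled when their accumulation does not exceed *ε* is one possible reading of how the threshold is used.

```python
def dmm_scale(hourly_members, daily_members, eps=0.05):
    """Rescale pseudohourly members so their totals match daily members.

    hourly_members: N lists of hourly values for one accumulation window
    (from the pseudohourly calibration).
    daily_members: N calibrated daily totals for the same window.
    Members are paired by rank of their accumulated rainfall.
    """
    n = len(hourly_members)
    acc = [sum(m) for m in hourly_members]
    # Index that sorts the pseudohourly accumulations; used to "unsort".
    order = sorted(range(n), key=lambda k: acc[k])
    sorted_daily = sorted(daily_members)
    scaled = [None] * n
    for rank, k in enumerate(order):
        # Assumption: accumulations at or below eps are left unscaled to
        # avoid dividing by near-zero rainfall.
        factor = sorted_daily[rank] / acc[k] if acc[k] > eps else 1.0
        scaled[k] = [v * factor for v in hourly_members[k]]
    return scaled
```

Each member keeps its own hourly timing and its position in the ensemble, while its accumulated total is replaced by the daily-calibrated value of the same rank.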

### c. Forecast verification by cross validation

We assess three key performance aspects of the ensemble rainfall forecasts: errors, bias, and reliability. Errors show the accuracy of the forecast, while bias indicates a general tendency to over- or underpredict observations. Reliability indicates the appropriateness of the ensemble spread; i.e., the ensemble spread is correctly distributed, and is neither too wide nor too narrow. Bias is often considered a component of reliability, because biased forecasts are usually not reliable. However, in the case of highly skewed variables such as rainfall, a few outlying values can cause strong bias while forecasts remain reliable overall.

For each method and each station, the performance of rainfall forecasts is assessed against observed station data (available hourly). Performance is assessed at individual lead times, and for cumulative totals with 12-h accumulations (lead time 1–12, 13–24, and 25–36), with 24-h accumulations (lead time 1–24), and with 36-h accumulations (lead time 1–36). This enables us to independently assess the univariate calibration method and the reordering ensemble generation method.

We use a leave-one-month-out cross-validation procedure to ensure that the forecasts are verified independently of model fitting. For all methods, Bayesian joint probability parameters are inferred using all available data except one month. All the forecasts in that left-out month are then verified against the corresponding hourly station observations.
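A leave-one-month-out split can be generated as in the sketch below. The function name is hypothetical, and treating each (year, month) pair as one fold is an assumption; the text does not say whether months are pooled across years.

```python
from datetime import date


def leave_one_month_out(issue_dates):
    """Yield (held_out_month, train_idx, test_idx) for each month on record.

    issue_dates: list of date objects, one per forecast.
    A month is identified by its (year, month) pair (an assumption).
    """
    months = sorted({(d.year, d.month) for d in issue_dates})
    for ym in months:
        test = [k for k, d in enumerate(issue_dates) if (d.year, d.month) == ym]
        train = [k for k, d in enumerate(issue_dates) if (d.year, d.month) != ym]
        yield ym, train, test
```

Parameters would be fitted on the `train` indices only, and the forecasts at the `test` indices verified, so that no forecast is scored with parameters that saw its own observations.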

#### 1) Forecast reliability

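We check forecast reliability with the probability integral transform (PIT). Given the cumulative distribution function of a forecast *F*_{t} and the corresponding observation *y*(*t*), the PIT is given by

$$\pi_t = \begin{cases} F_t\left[y(t)\right], & y(t) > 0,\\ U \times F_t(0), & y(t) = 0, \end{cases}$$

where *U* is a random number drawn from the uniform distribution on [0, 1].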

For a reliable set of forecasts, PIT values should be uniformly distributed. The treatment of PIT values at *y*(*t*) = 0 is necessary to allow reliable predictions to produce uniformly distributed PIT values when zero rainfalls occur (Wang and Robertson 2011).

We check uniformity by plotting PIT values as histograms. We calculate PIT values for individual lead times and for accumulated rainfalls. Forecasts of accumulated rainfalls can only be reliable if the ensemble has realistic spatial and temporal patterns.
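For an ensemble forecast, a PIT value can be approximated with the empirical CDF of the members. The sketch below follows the treatment of zero observations that we understand from Wang and Robertson (2011), in which the PIT is drawn uniformly between 0 and *F*_{t}(0); the function name and the use of the empirical CDF are assumptions of this example.

```python
import random


def pit_value(ensemble, obs, rng=random):
    """Empirical PIT of one observation against one ensemble forecast.

    For obs > 0 this is the fraction of members at or below obs.
    For obs == 0 a pseudo-PIT is drawn uniformly from [0, F(0)], so that
    reliable forecasts of dry periods still give uniform PIT values.
    """
    n = len(ensemble)
    cdf_at_obs = sum(1 for m in ensemble if m <= obs) / n
    if obs <= 0.0:
        return rng.uniform(0.0, cdf_at_obs)
    return cdf_at_obs
```

Collecting these values over all forecasts and plotting them as a histogram gives the PIT histogram; a flat histogram indicates reliable forecasts.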

#### 2) Forecast bias

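We measure forecast bias with relative bias, i.e., the difference between the forecast and observed rainfall totals expressed as a percentage of the observed total. Positive values indicate overprediction and negative values indicate underprediction.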
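A common form of relative bias divides the summed forecast error by the summed observations. The sketch below uses that form with the ensemble mean as the forecast; both choices, and the function name, are assumptions, since the exact expression used in the study is not reproduced here.

```python
def relative_bias(forecast_means, observations):
    """Relative bias in percent: 100 * (sum(f) - sum(o)) / sum(o).

    forecast_means: one value per forecast, e.g. the ensemble mean.
    observations: matching observed values.
    Positive values mean the forecasts overpredict on average.
    """
    total_obs = sum(observations)
    if total_obs == 0:
        raise ValueError("relative bias is undefined when no rain was observed")
    return 100.0 * (sum(forecast_means) - total_obs) / total_obs
```

Because the totals are dominated by the largest events, a few poorly forecast heavy rainfalls can produce a large relative bias even when most forecasts are well calibrated.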

#### 3) Forecast accuracy

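We measure errors in probabilistic forecasts with the continuous ranked probability score (CRPS). For a set of forecasts at *t* = 1, 2, …, *T*,

$$\mathrm{CRPS} = \frac{1}{T}\sum_{t=1}^{T}\int_{-\infty}^{\infty}\left\{F_t(u) - H\left[u - y(t)\right]\right\}^{2}\,du,$$

where *F*_{t} is the cumulative distribution function (CDF) of the forecast distribution, *y*(*t*) is the corresponding observation, and *H* is the Heaviside step function.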

CRPS reduces to the mean absolute error (MAE) for deterministic predictions, allowing us to compare errors in uncalibrated deterministic forecasts to errors in calibrated probabilistic forecasts. CRPS is negatively oriented: smaller scores indicate better forecasts, with zero being a perfect forecast. We use bootstrap resampling to assess the significance of reduction in CRPS error relative to the raw forecasts.
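For an ensemble, the CRPS integral with the empirical CDF has a well-known closed form, E|X − y| − ½ E|X − X′|, where X and X′ are independent draws from the ensemble. The sketch below uses that identity; the function names are hypothetical and the double loop is written for clarity rather than speed.

```python
def crps_ensemble(members, obs):
    """CRPS of one ensemble forecast, using its empirical CDF.

    Computes mean|x_i - obs| - 0.5 * mean|x_i - x_j| over all member pairs.
    With a single member this reduces to the absolute error.
    """
    n = len(members)
    term1 = sum(abs(m - obs) for m in members) / n
    term2 = sum(abs(a - b) for a in members for b in members) / (2.0 * n * n)
    return term1 - term2


def mean_crps(forecasts, observations):
    """Average CRPS over a set of (ensemble, observation) pairs."""
    scores = [crps_ensemble(f, y) for f, y in zip(forecasts, observations)]
    return sum(scores) / len(scores)
```

Passing a one-member list for each raw deterministic forecast gives the MAE, which is what makes the raw and calibrated forecasts directly comparable.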

## 4. Results

### a. Forecast reliability

Figure 4 presents PIT histograms calculated for all gauge stations at individual lead times. The control method generates PIT histograms that are close to ideal. The pseudohourly forecasts produce a slight peak to the right of the histogram, indicating a faint negative bias. This negative bias is exacerbated somewhat in the daily forecasts. The DMM method produces results that combine aspects of the pseudohourly and daily methods: largely reliable, with slight evidence of negative bias. Slight overpopulation in the last bin of the hourly histograms (Fig. 4) can be observed for all sites; this is due to missed rainfall events caused by inaccurate timing or underestimation in the forecast rainfall. We note that poor NWP timing can be amplified in our method because both the pseudohourly observations and the forecasts rely on the timing of NWP forecast rainfall being right.

Hourly PIT histograms for each method and all sites as a function of selected lead times.


Figure 5 presents PIT histograms of rainfall accumulations for selected 12-, 24-, and 36-h totals. The daily method produces forecast accumulations that are almost perfectly reliable, especially for the 24-h periods. Conversely, PIT histograms for the control and pseudohourly method deviate from the horizontal line with lower counts for higher PIT values. Scaling pseudohourly calibrated forecasts using daily calibrated forecasts significantly improves reliability of accumulations; PIT histograms from the DMM method display nearly flat histograms compared to those of the pseudohourly method.

Accumulated PIT histograms for each method and all sites for 12-, 24-, and 36-h totals.


We believe the imperfect reliability of accumulated forecasts from the control and pseudohourly methods to be related to the mismatch in autocorrelation of gauged rainfall and raw NWP forecasts. NWP forecasts tend to vary much more smoothly in time than gauged rainfalls, i.e., NWP forecasts are more autocorrelated than gauged observations. This appears to cause rainfall accumulations to be underconfident even though forecasts at individual lead times are highly reliable. We have identified this issue in other work and are currently investigating the exact cause. In previous applications (Bennett et al. 2014; Robertson et al. 2013; Shrestha et al. 2015), the BJP calibration was applied to areally averaged rainfalls, which tend to exhibit similar autocorrelation to NWP rainfall. In these previous studies, calibrated forecasts of accumulated rainfalls were highly reliable. As the aim of this present study is to generate a method that will ultimately be applied to an areally averaged rainfall product (VCSN) rather than to gauged rainfalls, this is not a major failing of the method applied here. We also note that the DMM largely resolves this issue by matching ensemble daily precipitation totals to values calibrated to daily observations.

Additionally, the spatial dependence structure of ensemble forecasts is preserved in the DMM method with the Schaake shuffle [section 3a(4) and appendix B]. This is illustrated by the PIT plots of spatial catchment average for both hourly lead times and accumulations of 12, 24, and 36 h, provided in the online supplemental material (Figs. S1 and S2).

### b. Forecast bias

We assess forecast bias for each method by presenting boxplots of the mean forecast bias values over the different sites.

Figure 6 presents hourly bias in the raw NWP and calibrated forecasts. Calibrated forecasts have markedly smaller bias than the raw forecasts. The control method displays little bias (close to zero) at all lead times. This is to be expected: by construction the BJP method optimizes parameters to produce unbiased forecasts. Pseudohourly forecasts tend to be positively biased, although the biases are reasonably small, particularly in contrast to the daily method. The daily method produces forecasts that exhibit strong bias for all sites, with up to 40% negative bias at early and late lead times and up to 40% positive bias around lead time 18–20 h. This is because the calibration minimizes bias at the daily time step but is given no information to minimize biases at subdaily lead times. Note that the strong bias sometimes evident in the daily method (Fig. 6) does not always manifest strongly in the PIT histograms (Fig. 4). This is because of the strongly skewed nature of rainfall: a small number of very large differences between observations and forecasts are sufficient to cause large biases (Fig. 6). However, because these instances are few, they are not strongly evident in the PIT histograms (Fig. 4). The DMM method produces smaller biases than either the pseudohourly or daily method, and biases are fairly consistent across all lead times.

Hourly relative bias for the deterministic NWP and postprocessed forecasts, for each method and across all sites at lead times 1, 6, 12, 18, 24, 30, and 36 h. Unbiased forecasts lie along the dashed line. The box is drawn between the 25th and 75th percentiles, with a line indicating the median. The whiskers extend above and below the box to the most extreme data points that are within a distance to the box equal to 1.5 times the interquartile range (Tukey boxplot). Points outside the whisker ranges are plotted.

Citation: Journal of Hydrometeorology 21, 7; 10.1175/JHM-D-19-0246.1

Figure 7 presents bias of rainfall accumulations for the raw and calibrated forecasts. Overall bias of accumulated forecasts is smaller than bias for individual lead times. This is because errors at individual lead times tend to compensate for each other in the accumulated totals. Calibrated forecast bias is significantly smaller than raw forecast bias for 12-, 24-, and 36-h accumulations. The smallest accumulated biases across all sites are for forecasts from the control method. These are centered around zero and have a narrow spread across sites. Forecasts using the pseudohourly method consistently overforecast accumulated precipitation by 10%. The daily calibration produces forecasts with bias centered around zero for the 24-h accumulation. This is expected, as this method calibrates forecasts directly to daily data. The DMM method fulfils its objective by improving the performance of the pseudohourly method: mean forecast bias of precipitation accumulation is small and centered around zero.
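The compensation of hourly errors in accumulated totals can be illustrated with a small numerical sketch. The definition of relative bias used here (difference between summed mean forecasts and summed observations, as a percentage of summed observations) is an assumption made for illustration, and the rainfall values are hypothetical rather than taken from the study.

```python
import numpy as np

def relative_bias(forecast_mean, observed):
    """Relative bias (%) of mean forecasts against observations, pooled over events.

    Assumed definition for illustration: 100 * (sum(f) - sum(o)) / sum(o).
    """
    forecast_mean = np.asarray(forecast_mean, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return 100.0 * (forecast_mean.sum() - observed.sum()) / observed.sum()

# Hypothetical data (mm): rows are forecast events, columns are hourly lead times.
obs = np.array([[2.0, 4.0, 4.0, 2.0],
                [1.0, 3.0, 3.0, 1.0]])
# A forecast that is too dry at early lead times and too wet at later ones.
fcst = np.array([[1.0, 3.0, 5.0, 3.0],
                 [0.5, 2.5, 3.5, 1.5]])

# Bias evaluated separately at each lead time.
hourly_bias = [relative_bias(fcst[:, k], obs[:, k]) for k in range(obs.shape[1])]

# Bias of the totals accumulated over all lead times.
accumulated_bias = relative_bias(fcst.sum(axis=1), obs.sum(axis=1))

print("hourly bias (%):", np.round(hourly_bias, 1))
print("accumulated bias (%):", round(accumulated_bias, 1))
```

In this constructed case the hourly biases range from -50% to +50%, while the accumulated bias is zero because the early underforecasts and late overforecasts cancel.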

Fig. 7. Accumulated relative bias for the deterministic NWP and postprocessed forecasts, for each method and across all sites, as a function of 12-, 24-, and 36-h accumulations. Quantiles in boxplots are as for Fig. 6.

### c. Forecast accuracy

We assess forecast accuracy for each method by presenting boxplots of the mean forecast CRPS and MAE values over the different sites.

Calibrated forecasts have substantially lower average errors than the raw NWP predictions at all sites and lead times (Fig. 8). As expected, calibrated forecasts using the control method have the lowest errors, followed very closely by forecasts using the pseudohourly and DMM methods. The daily method produces the worst accuracy of the calibrated forecasts at individual lead times.
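Comparing MAE for the deterministic forecasts with CRPS for the ensemble forecasts is meaningful because the CRPS reduces to the absolute error when the forecast is a single value. The sketch below uses the standard ensemble estimator of the CRPS; the function name and the numerical values are hypothetical and are not taken from the study's code.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Ensemble estimator of the CRPS for one observation.

    CRPS = mean|x_i - y| - 0.5 * mean|x_i - x_j|, over members x_i, x_j.
    """
    members = np.asarray(members, dtype=float)
    spread_to_obs = np.mean(np.abs(members - obs))
    spread_within = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return spread_to_obs - spread_within

# A one-member "ensemble" is a deterministic forecast: CRPS equals |x - y|.
deterministic_score = crps_ensemble([3.0], 5.0)

# A small ensemble centred on the observation scores better than that.
ensemble_score = crps_ensemble([4.0, 5.0, 6.0], 5.0)

print(deterministic_score, ensemble_score)
```

Averaging the single-member case over many events gives the MAE, so both scores are expressed in the same units (mm) and lower values are better in each case.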

Fig. 8. Hourly MAE for raw deterministic NWP forecasts and CRPS for postprocessed forecasts for each method and across all sites, as a function of lead time 1, 6, 12, 18, 24, 30, and 36 h. Quantiles in boxplots are as for Fig. 6.

Figure 9 presents CRPS and MAE of accumulated rainfalls summarized for all sites. All calibration methods outperform raw forecasts for all sites, and all offer similar performance. Interestingly, forecasts based on the DMM method have the highest accuracy for the 36-h accumulation, though the difference in accuracy between all calibration methods is very small.

Fig. 9. Accumulated MAE for raw deterministic NWP forecasts and CRPS for postprocessed forecasts for each method and across all sites, as a function of 12-, 24-, and 36-h accumulations. Quantiles in boxplots are as for Fig. 6.

For both hourly and accumulated forecasts, all the calibration methods lead to statistically significant reductions in error relative to the raw forecasts. This is illustrated in Fig. S3 for the DMM method.

## 5. Discussion

The effectiveness of the daily member matching (DMM) method (experiment 3) stems from combining the best aspects of the daily method (experiment 1) and the pseudohourly method (experiment 2). The DMM method produces reliable and bias-free accumulated forecasts (a property of the daily calibration) without a bias pattern at hourly lead times (a property of the pseudohourly calibration).

As with other postprocessing methods, the DMM calibration requires a reasonably large set of template data and a reasonably long forecast archive, which is often a challenge because homogeneous NWP forecast archives are of limited availability. From a hydrological perspective, a 3-yr archive is a short record from which to establish a climatology of observed rainfall and space–time patterns for the Schaake shuffle. In addition, extreme rainfall may be missed, affecting the calibration for extreme events. Long reforecast archives are very valuable for detecting and correcting systematic errors in forecasts, especially forecasts of relatively rare events (Hamill et al. 2013). Longer reforecast archives also make it simple to generate longer records of template data for ensemble reordering, which better reflect the full historical range of spatiotemporal precipitation patterns.

A key assumption in the calibration method is that the NWP characterizes the spatial and temporal patterns of rainfall well. NWP spatial and temporal patterns underpin the pseudohourly observations, used as the “truth” to which forecasts are calibrated, as well as the basis of the Schaake shuffle. NWP models often differ from observations in crucial ways: for example, there may be a mismatch in diurnal patterns (Shrestha et al. 2015; Surcel et al. 2010). In these cases, the BJP method may overestimate true correlations between observations and forecasts, because the pseudohourly observations are much more like the forecasts than actual observations. This can lead the calibration to amplify overestimation or underestimation in the raw forecasts, causing the pseudohourly method to produce biases at individual lead times. Given these difficulties, we do not recommend the use of the pseudohourly method on its own. We reiterate, however, that the DMM method successfully mitigates these problems.

Future work could include reordering methods that preferentially select past observations with atmospheric states similar to those of the current forecast (Schefzik 2016; Scheuerer et al. 2017). This could improve the predicted hourly temporal structure, as the Schaake shuffle would be informed by a more representative sample of historic events. These could be stratified using meteorological analogs (Bellier et al. 2017, and citations therein) and could be particularly valuable when separating stratiform from convective precipitation, as these have distinct temporal patterns. Although our calibration method is applied to a raw deterministic NWP forecast, it could be applied to a raw ensemble NWP forecast, with ensemble copula coupling (ECC) used to preserve the spatial and temporal dependency structure in calibrated forecasts (Schefzik et al. 2013).

Joint or individual calibration of other variables (e.g., temperature) may be required for developing a national-scale flow forecasting system in New Zealand (Monhart et al. 2019). For example, snow and glacier melt is an important contributor to runoff in many rivers. A national-scale calibration approach may need special handling of distant stations or grid points in the ensemble reordering step and may face computational constraints associated with a larger domain.

## 6. Summary

This study establishes a new method (daily member matching or DMM) to calibrate hourly precipitation forecasts from daily observations. The DMM method combines a daily calibration approach with an hourly calibration approach using hourly forecast patterns, by matching daily ensemble forecast values. The method is evaluated for ten stations in a catchment in New Zealand with steep rainfall gradients. The method performs similarly well to an ideal case where hourly data are available: calibrated forecasts have much lower bias and substantially smaller errors than the raw forecasts. In addition, the method produces reliable forecasts at individual lead times and for forecasts of precipitation accumulations.

Generating a statistically calibrated ensemble forecast from deterministic NWP predictions and daily data is likely to be of significant benefit for the expansion of streamflow forecasting services. Deterministic forecasts are routinely available (in New Zealand and elsewhere) at subdaily time steps (sometimes even subhourly) while daily precipitation observation datasets are much more common than subdaily datasets, and often available over large domains (national or continental scales). Scarcity of hourly observations is a problem in many regions, and particularly in developing countries.

Reliable, accurate and bias-free forecasts of catchment-scale precipitation are required to produce useable streamflow forecasts. This study is an important step toward the development of a national scale flow forecasting system in New Zealand to support a range of emergency services and water managers.

## Acknowledgments

The authors gratefully acknowledge various parties for assistance in providing data (in particular the Greater Wellington Regional Council). This research was funded by the New Zealand Ministry of Business, Innovation and Employment Natural Hazards Research Platform under contract C05X0907/Subcontract 2017-NIW-03-NHRP and by NIWA through the Resilience to Hazards Research Programme. The authors wish to acknowledge the contribution of NeSI to the results of this research. New Zealand’s national compute and analytics services and team are supported by the New Zealand eScience Infrastructure (NeSI) and funded jointly by NeSI’s collaborator institutions and through the Ministry of Business, Innovation and Employment (http://www.nesi.org.nz). Figures in this paper were produced using the Gramm MATLAB package (Morel 2018). The editor and the two reviewers are gratefully acknowledged for their valuable and thoughtful feedback.

## APPENDIX A

### Parameter Estimation Procedure

Parameter estimation is carried out in stages. In the first stage, the transformation and marginal distribution parameters *a*, *b*, *μ*, *σ* are estimated separately for observations and forecasts. We describe the procedure for forecasts; an identical estimation is carried out for observations. To ease inference, we reparameterize and infer the parameter set *θ*_{x}. The posterior density of *θ*_{x} is given by

$$p(\theta_x \mid \mathbf{x}) \propto p(\theta_x)\, p(\mathbf{x} \mid \theta_x), \tag{A1}$$

where *p*(*θ*_{x}) is the prior distribution and *p*(**x**|*θ*_{x}) is the likelihood. The likelihood is given by

where

The prior is given by

where

The priors for *a*_{x}, *a*_{x} ≤ 1 because for values of *a*_{x} > 1 the log-sinh transformation has little effect on the skewness of data. The prior on log(*b*_{x}) is informative, and encourages *b*_{x} to be close to 1 if the data are not strongly skewed.
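The remark that large values of *a*_{x} leave skewness almost unchanged can be checked numerically. The sketch below uses the log-sinh transformation in the form given by Wang et al. (2012), z = ln[sinh(a + bx)]/b. The parameter values and the synthetic rainfall sample are hypothetical, and the reparameterization used in this appendix is not reproduced.

```python
import numpy as np

def log_sinh(x, a, b):
    """Log-sinh transformation z = ln(sinh(a + b*x)) / b (Wang et al. 2012)."""
    return np.log(np.sinh(a + b * np.asarray(x, dtype=float))) / b

def inverse_log_sinh(z, a, b):
    """Back-transformation x = (arcsinh(exp(b*z)) - a) / b."""
    return (np.arcsinh(np.exp(b * np.asarray(z, dtype=float))) - a) / b

def skewness(v):
    """Sample skewness (third standardized moment)."""
    v = np.asarray(v, dtype=float)
    d = v - v.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

# Hypothetical, strongly right-skewed "rainfall" sample (mm).
rng = np.random.default_rng(0)
rain = rng.gamma(shape=0.5, scale=4.0, size=2000)

skew_raw = skewness(rain)
# Small a: the transformation is log-like for small values and reduces skewness.
skew_small_a = skewness(log_sinh(rain, a=0.01, b=0.05))
# Large a: sinh(u) is close to exp(u)/2, so the transformation is nearly linear.
skew_large_a = skewness(log_sinh(rain, a=5.0, b=0.05))

print(skew_raw, skew_small_a, skew_large_a)
```

With a = 5 the transformed sample keeps essentially the same skewness as the raw data, which is consistent with restricting *a*_{x} to values no greater than 1.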

The maximum posterior density of Eq. (A1) is found with the shuffled complex evolution (SCE) algorithm (Duan et al. 1992).

As noted above, the process of finding *θ*_{x} [Eqs. (A1)–(A5)] is repeated for observations. Once transformation and marginal distribution parameters are estimated for both observations and forecasts, *θ*_{x,y} are fixed.

The second stage is to estimate the correlation parameter

to give *θ*_{x,y} is fixed. The posterior density of *θ* is given by

where **y**′. Because of the presence of zeros in both observations and forecasts, the likelihood in Eq. (A7) must consider four cases:

where the Jacobian for forecast values is given by

and all other terms are as defined earlier. We do not impose an informative prior on

As with the transformation and marginal distribution parameters, we maximize the posterior density [Eq. (A7)] using the SCE algorithm.

## APPENDIX B

### Daily Calibration: Ensemble Reordering and Hourly Disaggregation

The daily calibration produces a matrix of daily accumulated and calibrated ensemble forecasts

We therefore construct our template data from the hourly forecasts *t* is excluded if *T* = 250 by randomly removing excess forecasts.

We are now left with a subset of forecasts *T* × *L*_{D}:

where *τ*_{D}, we sort *T* dimension to give

Accumulated forecasts *τ*_{D}th column of *T* = 1157) × (*L*_{D} = 2). We define two index matrices

where **r**_{1} and **r**_{2}, which map the unsorted forecasts in *τ*_{D} = 1 and *τ*_{D} = 2, respectively. Following the Schaake shuffle [see Eq. (13) in section 3a(4)], _{1} and _{2} are used to reorder the forecast *N* = 1000 is larger than the available template data size, we sample a first set of 250 members from *τ*_{D}, we sort

We disaggregate

We now have hourly forecasts

We then multiply the calibrated daily totals by the weights to produce ranked calibrated hourly forecasts:

The matrix

We now have a matrix *T* = 250 members.

Equations (B1)–(B8) are carried out four times, each time with a different 250 forecasts that are randomly sampled (without replacement) from

We now have four calibrated, shuffled, and disaggregated forecast matrices, each containing 250 ensemble members and lead times of 1–36 h. We concatenate these along the ensemble dimension to create a forecast of 1000 ensemble members:
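The batching and concatenation described above can be sketched as follows. This is a generic illustration of the Schaake shuffle applied to four disjoint batches of 250 members against a 250-member template; it omits the hourly disaggregation step, and the array names, random inputs, and use of `numpy.random.Generator.permutation` to form the batches are assumptions made for illustration.

```python
import numpy as np

def schaake_shuffle(ensemble, template):
    """Reorder each column of `ensemble` to follow the rank order of `template`.

    Both arrays have shape (members, lead_times). Ties in the template are
    broken by position (stable sort).
    """
    sorted_ens = np.sort(ensemble, axis=0)
    # Rank of each template member within its column (0 = smallest).
    ranks = np.argsort(np.argsort(template, axis=0, kind="stable"),
                       axis=0, kind="stable")
    return np.take_along_axis(sorted_ens, ranks, axis=0)

rng = np.random.default_rng(42)
n_members, n_template, n_lead = 1000, 250, 36

# Hypothetical calibrated ensemble and template data (mm).
calibrated = rng.gamma(shape=2.0, scale=1.0, size=(n_members, n_lead))
template = rng.gamma(shape=2.0, scale=1.0, size=(n_template, n_lead))

# Split the 1000 members into four disjoint batches (sampling without replacement).
batches = rng.permutation(n_members).reshape(4, n_template)

# Shuffle each batch against the same template, then stack along the member axis.
shuffled = [schaake_shuffle(calibrated[idx], template) for idx in batches]
forecast = np.concatenate(shuffled, axis=0)

print(forecast.shape)
```

Each batch inherits the rank structure of the template across lead times, while the set of forecast values at every lead time is unchanged by the reordering.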

For a given cycle, we repeat the entire procedure of Eqs. (B1)–(B9) for all 1157 forecasts in

## REFERENCES

Adams, T. E. I., and T. C. Pagano, 2016: *Flood Forecasting: A Global Perspective*. Academic Press, 478 pp.

Alfieri, L., P. Burek, E. Dutra, B. Krzeminski, D. Muraro, J. Thielen, and F. Pappenberger, 2013: GloFAS – Global ensemble streamflow forecasting and flood early warning. *Hydrol. Earth Syst. Sci.*, 17, 1161–1175, https://doi.org/10.5194/hess-17-1161-2013.

Andréassian, V., C. Perrin, E. Parent, and A. Bárdossy, 2010: The Court of Miracles of Hydrology: Can failure stories contribute to hydrological science? *Hydrol. Sci. J.*, 55, 849–856, https://doi.org/10.1080/02626667.2010.506050.

Ballinger, J., B. Jackson, A. Reisinger, and K. Stokes, 2011: *The Potential Effects of Climate Change on Flood Frequency in the Hutt River*. Victoria University of Wellington, 40 pp.

Bartolini, E., P. Allamano, F. Laio, and P. Claps, 2011: Runoff regime estimation at high-elevation sites: A parsimonious water balance approach. *Hydrol. Earth Syst. Sci.*, 15, 1661–1673, https://doi.org/10.5194/hess-15-1661-2011.

Beck, H. E., E. F. Wood, M. Pan, C. K. Fisher, D. G. Miralles, A. I. J. M. Dijk, T. R. McVicar, and R. F. Adler, 2019: MSWEP V2 global 3-hourly 0.1° precipitation: Methodology and quantitative assessment. *Bull. Amer. Meteor. Soc.*, 100, 473–500, https://doi.org/10.1175/BAMS-D-17-0138.1.

Bell, V. A., H. N. Davies, A. L. Kay, A. Brookshaw, and A. A. Scaife, 2017: A national-scale seasonal hydrological forecast system: Development and evaluation over Britain. *Hydrol. Earth Syst. Sci.*, 21, 4681–4691, https://doi.org/10.5194/hess-21-4681-2017.

Bellier, J., G. Bontron, and I. Zin, 2017: Using meteorological analogues for reordering postprocessed precipitation ensembles in hydrological forecasting. *Water Resour. Res.*, 53, 10 085–10 107, https://doi.org/10.1002/2017WR021245.

Bennett, J. C., D. E. Robertson, D. L. Shrestha, Q. J. Wang, D. Enever, P. Hapuarachchi, and N. K. Tuteja, 2014: A System for Continuous Hydrological Ensemble Forecasting (SCHEF) to lead times of 9 days. *J. Hydrol.*, 519, 2832–2846, https://doi.org/10.1016/j.jhydrol.2014.08.010.

Cattoën, C., H. McMillan, and S. Moore, 2016: Coupling a high-resolution weather model with a hydrological model for flood forecasting in New Zealand. *J. Hydrol.*, 55 (1), 1–23.

Cattoën, C., S. Moore, and T. Carey-Smith, 2019: Enhanced probabilistic flood forecasting using optimally designed numerical weather prediction ensembles. Natural Hazards Research Platform Contest 2017, 42 pp., https://www.naturalhazards.org.nz/haz/content/download/14088/74777/file/NHRP%20Contest%202017%20Cattoen%20Final%20Report.pdf.

Clark, M., S. Gangopadhyay, L. Hay, B. Rajagopalan, and R. Wilby, 2004: The Schaake shuffle: A method for reconstructing space-time variability in forecasted precipitation and temperature fields. *J. Hydrometeor.*, 5, 243–262, https://doi.org/10.1175/1525-7541(2004)005<0243:TSSAMF>2.0.CO;2.

Cloke, H. L., and F. Pappenberger, 2009: Ensemble flood forecasting: A review. *J. Hydrol.*, 375, 613–626, https://doi.org/10.1016/j.jhydrol.2009.06.005.

Demargne, J., and Coauthors, 2014: The science of NOAA’s operational Hydrologic Ensemble Forecast Service. *Bull. Amer. Meteor. Soc.*, 95, 79–98, https://doi.org/10.1175/BAMS-D-12-00081.1.

Duan, Q. Y., S. Sorooshian, and V. Gupta, 1992: Effective and efficient global optimization for conceptual rainfall-runoff models. *Water Resour. Res.*, 28, 1015–1031, https://doi.org/10.1029/91WR02985.

Emerton, R. E., and Coauthors, 2016: Continental and global scale flood forecasting systems. *Wiley Interdiscip. Rev.: Water*, 3, 391–418, https://doi.org/10.1002/wat2.1137.

Gneiting, T., and M. Katzfuss, 2014: Probabilistic forecasting. *Annu. Rev. Stat. Appl.*, 1, 125–151, https://doi.org/10.1146/annurev-statistics-062713-085831.

Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. *J. Roy. Stat. Soc.*, 69B, 243–268, https://doi.org/10.1111/j.1467-9868.2007.00587.x.

Gruber, A., and V. Levizzani, 2008: Assessment of global precipitation products. WCRP Series Rep. 128 and WMO/TD-1430, 55 pp., http://www.wcrp-climate.org/documents/AssessmentGlobalPrecipitationReport.pdf.

Hamill, T. M., G. T. Bates, J. S. Whitaker, D. R. Murray, M. Fiorino, T. J. Galarneau Jr., Y. Zhu, and W. Lapenta, 2013: NOAA’s second-generation global medium-range ensemble reforecast dataset. *Bull. Amer. Meteor. Soc.*, 94, 1553–1565, https://doi.org/10.1175/BAMS-D-12-00014.1.

Hamon, W. R., 1973: Computing actual precipitation. Distribution of precipitation in mountainous areas, Vol. 1, WMO Rep. 362, 159–174.

Jones, D., W. Wang, and R. Fawcett, 2009: High-quality spatial climate data-sets for Australia. *Aust. Meteor. Oceanogr. J.*, 58, 233–248, https://doi.org/10.22499/2.5804.003.

Li, W., Q. Duan, C. Miao, A. Ye, W. Gong, and Z. Di, 2017: A review on statistical postprocessing methods for hydrometeorological ensemble forecasting. *Wiley Interdiscip. Rev.: Water*, 4, e1246, https://doi.org/10.1002/WAT2.1246.

Maxey, R., M. Cranston, A. Tavendale, and P. Buchanan, 2012: The Use of deterministic and probabilistic forecasting in Countrywide Flood Guidance in Scotland. *11th BHS National Symp.*, University of Dundee, Dundee, United Kingdom, British Hydrological Society, 7 pp.

Monhart, S., M. Zappa, C. Spirig, C. Schär, and K. Bogner, 2019: Subseasonal hydrometeorological ensemble predictions in small- and medium-sized mountainous catchments: Benefits of the NWP approach. *Hydrol. Earth Syst. Sci.*, 23, 493–513, https://doi.org/10.5194/hess-23-493-2019.

Morel, P., 2018: Gramm: Grammar of graphics plotting in Matlab. *J. Open Source Software*, 3, 568, https://doi.org/10.21105/joss.00568.

Peterson, T., H. Daan, and P. Jones, 1997: Initial selection of a GCOS surface network. *Bull. Amer. Meteor. Soc.*, 78, 2145–2152, https://doi.org/10.1175/1520-0477(1997)078<2145:ISOAGS>2.0.CO;2.

Robertson, D. E., D. L. Shrestha, and Q. J. Wang, 2013: Post-processing rainfall forecasts from numerical weather prediction models for short-term streamflow forecasting. *Hydrol. Earth Syst. Sci.*, 17, 3587–3603, https://doi.org/10.5194/hess-17-3587-2013.

Rossa, A., K. Liechti, M. Zappa, M. Bruen, U. Germann, G. Haase, C. Keil, and P. Krahe, 2011: The COST 731 Action: A review on uncertainty propagation in advanced hydro-meteorological forecast systems. *Atmos. Res.*, 100, 150–167, https://doi.org/10.1016/j.atmosres.2010.11.016.

Schefzik, R., 2016: A similarity-based implementation of the Schaake shuffle. *Mon. Wea. Rev.*, 144, 1909–1921, https://doi.org/10.1175/MWR-D-15-0227.1.

Schefzik, R., T. L. Thorarinsdottir, and T. Gneiting, 2013: Uncertainty quantification in complex simulation models using ensemble copula coupling. *Stat. Sci.*, 28, 616–640, https://doi.org/10.1214/13-STS443.

Scheuerer, M., T. M. Hamill, B. Whitin, M. He, and A. Henkel, 2017: A method for preferential selection of dates in the Schaake shuffle approach to constructing spatiotemporal forecast fields of temperature and precipitation. *Water Resour. Res.*, 53, 3029–3046, https://doi.org/10.1002/2016WR020133.

Shrestha, D. L., D. E. Robertson, J. C. Bennett, and Q. J. Wang, 2015: Improving precipitation forecasts by generating ensembles through postprocessing. *Mon. Wea. Rev.*, 143, 3642–3663, https://doi.org/10.1175/MWR-D-14-00329.1.

Stratton, R. A., and Coauthors, 2018: A Pan-African convection-permitting regional climate simulation with the Met Office Unified Model: CP4-Africa. *J. Climate*, 31, 3485–3508, https://doi.org/10.1175/JCLI-D-17-0503.1.

Surcel, M., M. Berenguer, and I. Zawadzki, 2010: The diurnal cycle of precipitation from continental radar mosaics and numerical weather prediction models. Part I: Methodology and seasonal comparison. *Mon. Wea. Rev.*, 138, 3084–3106, https://doi.org/10.1175/2010MWR3125.1.

Tait, A., R. D. Henderson, R. Turner, and X. Zheng, 2006: Thin plate smoothing spline interpolation of daily rainfall for New Zealand using a climatological rainfall surface. *Int. J. Climatol.*, 26, 2097–2115, https://doi.org/10.1002/joc.1350.

Thielen, J., J. Bartholmes, M. H. Ramos, and A. de Roo, 2009: The European flood alert system – Part 1: Concept and development. *Hydrol. Earth Syst. Sci.*, 13, 125–140, https://doi.org/10.5194/hess-13-125-2009.

Valéry, A., V. Andréassian, and C. Perrin, 2010: Regionalization of precipitation and air temperature over high-altitude catchments – Learning from outliers. *Hydrol. Sci. J.*, 55, 928–940, https://doi.org/10.1080/02626667.2010.504676.

Vincent, L. A., and É. Mekis, 2006: Changes in daily and extreme temperature and precipitation indices for Canada over the twentieth century. *Atmos.–Ocean*, 44, 177–193, https://doi.org/10.3137/ao.440205.

Wang, Q. J., and D. E. Robertson, 2011: Multisite probabilistic forecasting of seasonal flows for streams with zero value occurrences. *Water Resour. Res.*, 47, W02546, https://doi.org/10.1029/2010WR009333.

Wang, Q. J., D. L. Shrestha, D. E. Robertson, and P. Pokhrel, 2012: A log-sinh transformation for data normalization and variance stabilization. *Water Resour. Res.*, 48, W05514, https://doi.org/10.1029/2011WR010973.

Wang, Q. J., Y. Shao, Y. Song, A. Schepen, D. E. Robertson, D. Ryu, and F. Pappenberger, 2019a: An evaluation of ECMWF SEAS5 seasonal climate forecasts for Australia using a new forecast calibration algorithm. *Environ. Modell. Software*, 122, 104550, https://doi.org/10.1016/j.envsoft.2019.104550.

Wang, Q. J., T. Zhao, Q. Yang, and D. Robertson, 2019b: A seasonally coherent calibration (SCC) model for postprocessing numerical weather predictions. *Mon. Wea. Rev.*, 147, 3633–3647, https://doi.org/10.1175/MWR-D-19-0108.1.

Wellington Regional Council, 1995: Surface water hydrology. Vol. 1, Hydrology of the Hutt Catchment, Wellington Regional Council Rep., 196 pp.

Woods, R., J. Hendrikx, R. D. Henderson, and A. Tait, 2006: Estimating mean flow of New Zealand rivers. *J. Hydrol.*, 45, 95–110.

Zhao, T., J. C. Bennett, Q. J. Wang, A. Schepen, A. W. Wood, D. E. Robertson, and M.-H. Ramos, 2017: How suitable is quantile mapping for postprocessing GCM precipitation forecasts? *J. Climate*, 30, 3185–3196, https://doi.org/10.1175/JCLI-D-16-0652.1.