## 1. Introduction

Numerical weather prediction (NWP) models are routinely run in major climate and weather centers throughout the world. The model outputs form an important basis for operational weather forecasts for the short term (e.g., 10 days ahead). For the model outputs to be informative, raw forecasts from the NWP models need to be postprocessed to generate calibrated forecasts, often by using statistical models (Li et al. 2017; Wilks and Hamill 2007).

Calibration aims to remove unconditional (overall) and conditional (e.g., magnitude dependent) biases of the raw forecasts (Li et al. 2017; Zhao et al. 2017). If the raw forecasts are deterministic, the calibration may statistically generate ensemble forecasts, with the ensemble spread representing forecast uncertainty (Hamill et al. 2008; Wu et al. 2011). If the raw forecasts are already ensembles, the calibration may tune the ensemble spread to make it reliable in representing forecast uncertainty (Raftery et al. 2005; Sloughter et al. 2007). The calibration also aims to extract the maximum forecast skill from NWP model outputs to give the most accurate forecasts possible.

Good statistical calibration models should also aim to produce forecasts that are coherent (Krzysztofowicz 1999; Zhao et al. 2017). Coherent calibrated forecasts are never less valuable to a user than a climatology prior distribution (Krzysztofowicz 1999). Therefore, as the skill of raw NWP output gets lower with increasing forecast lead time, the calibrated ensemble forecasts should increasingly approach the observed climatology. Many weather and climate variables display seasonal variation in climatology. Calibrated ensemble forecasts should reflect this seasonal variation in climatology so that they will produce, for example, more precipitation in a wet season and less in a dry season. In the literature, there is generally a lack of evaluation of calibration models on their ability to generate forecasts that are coherent with a seasonally varying climatology.

Statistical calibration models are established by using archived forecast (and/or hindcast) data along with observed data. As NWP models tend to be updated regularly, most climate and weather centers do not generate hindcasts with each update. However, there is usually a short period of experimental forecasts generated before a new model is made operational. That period may be just a few months to a year. Therefore, the forecast data for establishing calibration models for use with a newly operationalized NWP model are often very limited.

One approach to establish calibration models when data are limited is to neglect seasonality in the calibration models (Robertson et al. 2013; Shrestha et al. 2015). Such an approach may be acceptable when the underlying skill of raw forecasts is high. The seasonality is already embedded in the raw forecasts and will largely be transferred to the calibrated forecasts. However, as lead time increases and forecast skill decreases, the calibrated forecasts will increasingly approach the same climatology for all forecast issue dates. For locations with marked seasonality in climate, the calibrated ensemble forecasts will produce, for example, too little precipitation in a wet season and too much in a dry season. An alternative approach considers seasonality by using data from a short window preceding the forecast initialization date to establish calibration models (Gneiting et al. 2005; Raftery et al. 2005; Sloughter et al. 2007). This approach produces one calibration model for each forecast initialization date, and potentially overfits the available data.

The length of record used to establish a calibration model raises another climatology-related issue—the period of observations required to establish a representative climatology. Using observations for the same period as the archive of raw forecast data (e.g., Robertson et al. 2013), or using data for a short window preceding forecast initialization (Gneiting et al. 2005; Raftery et al. 2005), are both common practices for establishing calibration models. When the skill of the raw NWP forecasts is low, neither of these strategies is likely to generate a climatology-like forecast that reflects the long-term statistical characteristics of observations. This will be particularly true when the record of observations used to establish the calibration model is as short as one year or less.

In this study, we set out to develop a practical tool for generating calibrated forecasts that are free from bias, reliable in ensemble spread, as skillful as possible and coherent with respect to a seasonally varying climatology. We need to resolve three issues. The first is to construct a calibration model that is sophisticated enough to allow for seasonality in the statistical characteristics of raw forecasts and observations. The second is to bring into the calibration model a climatology that is representative of long-term statistical characteristics of observations. The third is to enable the calibration model to work effectively with very limited archived raw forecast data.

In this paper, we present how we resolve these three issues by developing a seasonally coherent calibration (SCC) model for postprocessing raw forecasts from NWP models. The SCC model formulated here is for the calibration of daily precipitation forecasts, but it can be easily adapted for other time resolutions and for other weather variables.

## 2. Model formulation

Here we consider deterministic NWP model forecasts. For a given raw forecast of precipitation, *x*(*t*), we wish to calibrate it to give a probabilistic forecast of what will be observed, *y*(*t*). The probabilistic forecast is to represent forecast uncertainty and will be in the form of ensemble members. To do this, we need to establish a calibration model. We will establish separate calibration models for different lead times and locations. Later in the paper, we will discuss how we can link the ensemble forecasts for different days ahead and for different locations.

Within a model for one lead time and location, we will distinguish some of the statistical characteristics of both raw forecasts and observations between different months of the year. For days within one month, we regard the statistical characteristics as uniform. We call the final model presented in this section a seasonally coherent calibration (SCC) model. An overview of the model formulation is given in Table 1.

Table 1. An overview of the model formulation.

### a. A joint probability model

Given data series of raw forecasts *x*(*t*) and observations *y*(*t*), *t* = 1, 2, …, *T*, we apply log–sinh transformations (Wang et al. 2012) to normalize these variables to *f*(*t*) and *o*(*t*). Details on the transformations including parameter estimation are given in the appendix. We assume *f*(*t*) and *o*(*t*) are drawn from a bivariate normal distribution:

$$
\begin{bmatrix} f(t) \\ o(t) \end{bmatrix}
\sim N\!\left(
\begin{bmatrix} \mu_f[m(t)] \\ \mu_o[m(t)] \end{bmatrix},
\begin{bmatrix}
\sigma_f^2[m(t)] & \rho[m(t)]\,\sigma_f[m(t)]\,\sigma_o[m(t)] \\
\rho[m(t)]\,\sigma_f[m(t)]\,\sigma_o[m(t)] & \sigma_o^2[m(t)]
\end{bmatrix}
\right),
\tag{1}
$$

where *m*(*t*) is a function that marks the month of the year for day *t*, *m*(*t*) = *k* ∈ {1, 2, …, 12}; *μ*_{f}[*m*(*t*)] and *σ*_{f}[*m*(*t*)] are the mean and standard deviation of the marginal distribution of *f*(*t*); *μ*_{o}[*m*(*t*)] and *σ*_{o}[*m*(*t*)] are the mean and standard deviation of the marginal distribution of *o*(*t*); and *ρ*[*m*(*t*)] is the correlation between *f*(*t*) and *o*(*t*).

For month *k*, *μ*_{f}(*k*) and *σ*_{f}(*k*) characterize in the transformed domain the NWP model forecast climatology, and *μ*_{o}(*k*) and *σ*_{o}(*k*) the observed climatology.

The bivariate normal distribution applies to continuous variables, but here *f*(*t*) and *o*(*t*) are subject to thresholds because precipitation has a lower bound of zero. To overcome this problem, we treat a threshold data value as a censored data record, with an unknown exact value that is equal to or below the threshold (Wang and Robertson 2011). Furthermore, we use nonzero thresholds *x*_{c} and *y*_{c} for *x*(*t*) and *y*(*t*), to better fit *f*(*t*) and *o*(*t*) to bivariate normal distributions and to deal with the characteristics of the observation process (Robertson et al. 2013; Shrestha et al. 2015). The exact values used for the thresholds are specified in section 3c. The corresponding thresholds for *f*(*t*) and *o*(*t*) are *f*_{c} and *o*_{c}.
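For illustration, the log–sinh transformation and its inverse can be sketched as below. The functional form follows Wang et al. (2012), but the parameter values `ALPHA` and `BETA` are illustrative placeholders, not fitted values; in the model they are estimated as described in the appendix.

```python
import numpy as np

# Illustrative log-sinh parameters (placeholders; estimated by maximum
# likelihood in the actual model, see the appendix)
ALPHA, BETA = 1.0, 0.3

def log_sinh(x, alpha=ALPHA, beta=BETA):
    """Log-sinh transform of Wang et al. (2012): z = log(sinh(alpha + beta*x)) / beta."""
    return np.log(np.sinh(alpha + beta * x)) / beta

def inv_log_sinh(z, alpha=ALPHA, beta=BETA):
    """Inverse transform: x = (arcsinh(exp(beta*z)) - alpha) / beta."""
    return (np.arcsinh(np.exp(beta * z)) - alpha) / beta

x = np.array([0.0, 1.0, 5.0, 20.0])  # daily precipitation (mm)
z = log_sinh(x)                      # transformed, approximately normal
x_back = inv_log_sinh(z)
```

The transform compresses large precipitation amounts while remaining monotone, which is why it can normalize strongly skewed daily precipitation data.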

This model is similar to the rainfall postprocessor (RPP) model (Robertson et al. 2013), which is a simplified Bayesian joint probability (BJP) model (Wang and Robertson 2011; Wang et al. 2009). The key difference is that the RPP model uses only one set of parameter values for all months, while this model purposely makes distinctions among the months.

If a long period of data, say 10–20 years, is available for both *x*(*t*) and *y*(*t*), it is possible to estimate the parameters of the model month by month, yielding for each month accurate correlation estimates and representative long-term forecast and observed climatologies (Wilks and Hamill 2007; Wu et al. 2011). In practice, however, the archived data of an NWP model are usually available for only a short period, sometimes even less than one year. Therefore, a more sophisticated approach is needed for parameter estimation.

### b. Climatology of observations

For estimating *μ*_{o}(*k*) and *σ*_{o}(*k*), we do not need to restrict data use to only the period when archived NWP data are available. Here we use a long period of observation data, say the last 10–20 years. Denote the data series as *y*(*i*), *i* = 1, 2, …, *I*, and the corresponding transformed data series as *o*(*i*). We use the index *i* (rather than *t*) to mark the difference from the data series referred to in section 2a.

We assume the transformed *o*(*i*) is normal:

$$o(i) \sim N\{\mu_o[m(i)],\, \sigma_o^2[m(i)]\}.$$

Allowing for censoring of values at or below the threshold *o*_{c}, the likelihood function for month *k* can be written as

$$
L\{\mu_o(k), \sigma_o(k)\} =
\prod_{\substack{m(i)=k \\ o(i)>o_c}} \mathrm{pdf_N}\{o(i);\, \mu_o(k), \sigma_o(k)\}
\times
\prod_{\substack{m(i)=k \\ o(i)\le o_c}} \mathrm{cdf_N}\{o_c;\, \mu_o(k), \sigma_o(k)\},
$$

where pdf_{N} is the probability density function and cdf_{N} the cumulative distribution function of a normal distribution.

A numerical search method, such as the Nelder–Mead downhill simplex method (Nelder and Mead 1965), can be used to find the parameter values that maximize the log-likelihood function. The process is repeated for all 12 months.
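A minimal sketch of this estimation step is given below, using SciPy's Nelder–Mead implementation on synthetic censored-normal data; the function name, data, and starting values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def neg_log_lik(params, o, o_c):
    """Censored-normal negative log-likelihood for one month of transformed
    data: pdf terms for values above the threshold o_c, cdf terms for
    values censored at o_c."""
    mu, sigma = params
    if sigma <= 0.0:
        return np.inf
    above = o > o_c
    ll = norm.logpdf(o[above], mu, sigma).sum()
    ll += (~above).sum() * norm.logcdf(o_c, mu, sigma)
    return -ll

# Synthetic illustration: censored draws from N(1.5, 1.0), threshold 0
rng = np.random.default_rng(0)
o_c = 0.0
o = np.maximum(rng.normal(1.5, 1.0, size=2000), o_c)

res = minimize(neg_log_lik, x0=[0.0, 1.0], args=(o, o_c), method="Nelder-Mead")
mu_hat, sigma_hat = res.x
```

With ample data the censored-likelihood estimates recover the underlying mean and standard deviation despite a fraction of the record sitting at the threshold.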

### c. Reparameterization of the joint probability model

Having found a solution to estimating *μ*_{o}(*k*) and *σ*_{o}(*k*), we now turn our attention to estimating *μ*_{f}(*k*) and *σ*_{f}(*k*), and *ρ*(*k*). As noted before, with very limited archived NWP data, it is not possible to robustly estimate these parameters separately for each month. In fact, when the archived NWP data are absent for some months of the year, it is impossible to directly estimate the parameters for these months.

We reparameterize the monthly forecast climatology and correlation parameters in terms of the observed climatology parameters. For *k* = 1, 2, …, 12,

$$\mu_f(k) = a + b\,\mu_o(k), \tag{5}$$

$$\sigma_f(k) = c + d\,\sigma_o(k), \tag{6}$$

$$\rho(k) = r, \tag{7}$$

with *b* ≥ 0, *c* ≥ 0, *d* ≥ 0, and 0 ≤ *r* ≤ 1.

This reparameterization replaces 36 parameters {*μ*_{f}(*k*) and *σ*_{f}(*k*), and *ρ*(*k*), *k* = 1, 2, …, 12} with just five parameters {*a*, *b*, *c*, *d*, and *r*}. It is now feasible to use a shorter period of archived NWP data for parameter estimation.

Equations (5) and (6) assume that the NWP model, if run long enough, can reproduce the correct pattern of seasonal variation in observed climatology, but some adjustments are needed to match in scale. Equation (7) assumes that the underlying forecast skill of the NWP model is constant throughout the year. The constraints of the parameter values are to prevent nonsensical relationships. In practice, the relationships are meant to be approximately correct only, serving to link up all 12 months parsimoniously for model parameter estimation. We will further investigate these relationships in our case study.

Equation (5) is to find the equivalent mean of the forecasts if forecast data were available for the same long-term observation period. The same can be said about Eq. (6) regarding standard deviation. Of course, individual events depart from the long-term values. These are handled through the joint probability distribution. Over the shorter period when forecast data are available, if forecast average is low, the observation average is expected to be low also. The same can be said about forecast variability and observation variability. When forecasts have little useful information (*r* approaching 0 such as at long lead times), the calibrated forecasts will approach long-term climatology of observations, rather than the distribution of observations over the shorter period. In other words, when the NWP forecasts are unable to give useful guidance on what to expect, we turn to long-term climatology of observations for guidance, rather than what happened in the recent short period.
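As a sketch, assuming Eqs. (5)–(7) take the linear forms *μ*_{f}(*k*) = *a* + *b μ*_{o}(*k*), *σ*_{f}(*k*) = *c* + *d σ*_{o}(*k*), and *ρ*(*k*) = *r* (consistent with the parameter constraints stated above), the mapping from the five parameters to the 36 monthly parameters is:

```python
import numpy as np

def monthly_forecast_params(a, b, c, d, r, mu_o, sigma_o):
    """Map the five SCC parameters and the 12 monthly observed-climatology
    parameters to the 36 monthly forecast-side parameters:
    mu_f(k) = a + b*mu_o(k), sigma_f(k) = c + d*sigma_o(k), rho(k) = r."""
    mu_o = np.asarray(mu_o, float)
    sigma_o = np.asarray(sigma_o, float)
    mu_f = a + b * mu_o
    sigma_f = c + d * sigma_o
    rho = np.full(mu_o.shape, r)  # constant underlying correlation
    return mu_f, sigma_f, rho

# Illustrative values only
mu_f, sigma_f, rho = monthly_forecast_params(
    0.5, 1.2, 0.1, 0.9, 0.6, np.linspace(-1.0, 1.0, 12), np.ones(12))
```

Because the forecast-side parameters are tied to the observed climatology, the seasonal cycle is carried by the (long-record) observed parameters while only five scalars need to be estimated from the short forecast archive.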

The model parameters to be estimated are now {*a*, *b*, *c*, *d*, and *r*}. Given data series *f*(*t*) and *o*(*t*), *t* = 1, 2, …, *T*, the likelihood function can be written as

$$L(a, b, c, d, r) = \prod_{t=1}^{T} l(t), \tag{8}$$

where each term allows for censoring at the thresholds *f*_{c} and *o*_{c}:

$$
l(t) =
\begin{cases}
\mathrm{pdf_{BN}}[f(t), o(t)] & f(t) > f_c,\; o(t) > o_c, \\
\mathrm{pdf_N}[f(t)]\;\mathrm{cdf_N}[o_c \mid f(t)] & f(t) > f_c,\; o(t) \le o_c, \\
\mathrm{pdf_N}[o(t)]\;\mathrm{cdf_N}[f_c \mid o(t)] & f(t) \le f_c,\; o(t) > o_c, \\
\mathrm{cdf_{BN}}(f_c, o_c) & f(t) \le f_c,\; o(t) \le o_c,
\end{cases}
\tag{9}
$$

where pdf_{BN} and cdf_{BN} are the probability density function and cumulative distribution function of the bivariate normal distribution of Eq. (1), and the conditional cdf_{N} terms follow from the corresponding conditional normal distributions. In Eqs. (8) and (9), *μ*_{f}[*m*(*t*)], *σ*_{f}[*m*(*t*)], and *ρ*[*m*(*t*)] are functions of {*a*, *b*}, {*c*, *d*}, and {*r*}, respectively, as in Eqs. (5)–(7).

As in section 2b, a numerical search method can be used to find the parameter values for {*a*, *b*, *c*, *d*, and *r*} that maximize the log-likelihood function. Fast algorithms are available to calculate the cdf_{N} and cdf_{BN} terms in Eq. (9). Note that for a given set of parameters, cdf_{BN} needs to be evaluated only once for each month of the year.
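One likelihood term of the four censoring cases can be sketched as follows; the case forms are assumed from standard censored bivariate-normal likelihoods (pdf terms for uncensored values, conditional or bivariate cdf terms for censored ones), with SciPy supplying the distributions.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def log_lik_term(f, o, mu_f, sig_f, mu_o, sig_o, rho, f_c, o_c):
    """Log of one likelihood term l(t), with four censoring cases (a sketch,
    not the paper's implementation)."""
    cov = [[sig_f**2, rho * sig_f * sig_o],
           [rho * sig_f * sig_o, sig_o**2]]
    bn = multivariate_normal(mean=[mu_f, mu_o], cov=cov)

    def cond(mu1, sig1, mu2, sig2, x2):
        # conditional normal of variable 1 given variable 2 = x2
        m = mu1 + rho * sig1 / sig2 * (x2 - mu2)
        s = sig1 * np.sqrt(1.0 - rho**2)
        return m, s

    if f > f_c and o > o_c:                      # both uncensored
        return bn.logpdf([f, o])
    if f > f_c and o <= o_c:                     # observation censored
        m, s = cond(mu_o, sig_o, mu_f, sig_f, f)
        return norm.logpdf(f, mu_f, sig_f) + norm.logcdf(o_c, m, s)
    if f <= f_c and o > o_c:                     # forecast censored
        m, s = cond(mu_f, sig_f, mu_o, sig_o, o)
        return norm.logpdf(o, mu_o, sig_o) + norm.logcdf(f_c, m, s)
    return bn.logcdf([f_c, o_c])                 # both censored
```

Summing these log terms over all days gives the log-likelihood to be maximized over {*a*, *b*, *c*, *d*, *r*} by a numerical search, as in section 2b.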

### d. Model use for calibrating new forecasts

Given a new raw forecast *x*(*t*), we want to derive a calibrated probabilistic forecast for *y*(*t*), in the form of an ensemble. Denoting the corresponding transformed variables as *f*(*t*) and *o*(*t*) as before, the distribution of *o*(*t*) conditional on the known *f*(*t*) is

$$
o(t) \mid f(t) \sim N\!\left(
\mu_o[m(t)] + \rho[m(t)]\,\frac{\sigma_o[m(t)]}{\sigma_f[m(t)]}\,\{f(t) - \mu_f[m(t)]\},\;
\sigma_o^2[m(t)]\,\{1 - \rho^2[m(t)]\}
\right).
\tag{14}
$$

When *f*(*t*) > *f*_{c}, we draw random samples from this conditional distribution to form an ensemble of *o*(*t*).

When *f*(*t*) ≤ *f*_{c}, we first draw a random sample of *f*(*t*) from its marginal distribution, conditional on *f*(*t*) ≤ *f*_{c} [i.e., a normal distribution truncated from above at *f*_{c}]. We then draw a random sample from the conditional distribution of Eq. (14) to produce an ensemble member of *o*(*t*). We repeat these two steps multiple times to form an ensemble of *o*(*t*).

Finally, we need to convert all the ensemble members of *o*(*t*) to *y*(*t*). Denote *o*_{0} as the value of *o* corresponding to *y* = 0. When an ensemble member *o*(*t*) > *o*_{0}, we back-transform *o*(*t*) to *y*(*t*) by using Eq. (A6) in the appendix. When an ensemble member *o*(*t*) ≤ *o*_{0}, we set *y*(*t*) = 0.
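The ensemble generation and back-transform steps for an uncensored forecast can be sketched as below. The conditional normal follows Eq. (14); the inverse-transform parameters (1.0, 0.3) are illustrative placeholders standing in for Eq. (A6) of the appendix.

```python
import numpy as np

def sample_calibrated_ensemble(f, mu_f, sig_f, mu_o, sig_o, rho, o0,
                               n=1000, rng=None):
    """Draw an ensemble of o(t) from the conditional normal
    o | f ~ N(mu_o + rho*(sig_o/sig_f)*(f - mu_f), sig_o^2*(1 - rho^2)),
    back-transform to precipitation, and set members at or below o0
    (the transformed value corresponding to y = 0) to zero."""
    if rng is None:
        rng = np.random.default_rng()
    m = mu_o + rho * sig_o / sig_f * (f - mu_f)
    s = sig_o * np.sqrt(1.0 - rho**2)
    o = rng.normal(m, s, size=n)
    # illustrative inverse log-sinh back-transform (placeholder for Eq. (A6))
    y = (np.arcsinh(np.exp(0.3 * o)) - 1.0) / 0.3
    y[o <= o0] = 0.0
    return y

# o0 under the same illustrative transform parameters
o0 = np.log(np.sinh(1.0)) / 0.3
ens = sample_calibrated_ensemble(1.0, 0.0, 1.0, 0.0, 1.0, 0.9, o0,
                                 n=5000, rng=np.random.default_rng(1))
```

The resulting ensemble is nonnegative, with a point mass at zero representing the forecast probability of a dry day.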

From our previous experience in forecast calibration using a joint probability model together with variable transformations, unrealistically large calibrated forecasts may very occasionally occur. The problem is caused by extremely large raw forecast values from NWP models, so large that the transformed forecast *f*(*t*) is considered improbable according to the fitted climatology of transformed raw forecasts. We propose a pragmatic approach here to circumvent this problem.

With the climatology of *f*(*t*) as established in section 2c, we check if the following is true:

$$\mathrm{cdf_N}\{f(t);\, \mu_f[m(t)], \sigma_f[m(t)]\} > P_{\mathrm{extreme}},$$

where *P*_{extreme} is a high nonexceedance probability to indicate an extreme threshold. If true, we set *f*(*t*) back to the extreme threshold:

$$f(t) = \mathrm{invcdf_N}\{P_{\mathrm{extreme}};\, \mu_f[m(t)], \sigma_f[m(t)]\},$$

where invcdf_{N} is the inverse cumulative distribution function of a normal distribution. We then proceed as usual to draw random samples from the conditional distribution of Eq. (14) to form an ensemble of *o*(*t*). The exact value of *P*_{extreme} used in this study is specified in section 3a.
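A minimal sketch of this safeguard, assuming SciPy for the normal cdf and inverse cdf (the value of `P_EXTREME` is the one stated in section 3a):

```python
from scipy.stats import norm

P_EXTREME = 1.0 - 0.01 / 365  # section 3a: ~100-yr average recurrence interval

def clip_extreme(f, mu_f, sig_f, p=P_EXTREME):
    """Pull an improbably large transformed forecast back to the extreme
    threshold of the fitted forecast climatology before calibration."""
    if norm.cdf(f, mu_f, sig_f) > p:
        return norm.ppf(p, mu_f, sig_f)
    return f
```

Ordinary forecast values pass through unchanged; only values judged improbable under the fitted climatology are pulled back.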

## 3. Case study

### a. Data

We select a location in north-eastern Australia where precipitation is strongly seasonal to demonstrate the performance of the SCC model formulation.

Hourly precipitation observations are obtained from the Bureau of Meteorology for the site 532060: Murray River at Murray Flats (18.0297°S, 145.9242°E), which forms part of the rainfall observation network supporting flood forecasting operations. Hourly precipitation observations are accumulated to daily totals to match the daily precipitation forecasts. For this study, observations for the period 2007–16 are used to represent the longer-term climatology of precipitation data.

We use precipitation forecasts from the global version of the Australian Community Climate and Earth System Simulator (ACCESS-G) NWP model. Specifically, we use version one of the Australian Parallel Suite (APS1). APS1 ACCESS-G is based on the Met Office Unified Model/Variational System and has a horizontal resolution of approximately 40 km and 70 vertical levels. Forecasts are issued twice daily at 0000 and 1200 UTC for lead times of up to 240 h, and precipitation forecasts are available at 3-h temporal resolution. Details of the model components and parameterizations can be found in Bureau of Meteorology (2010, 2012).

For this study we use precipitation forecasts issued at 1200 UTC and aggregate 3-h totals to daily values. We use data readily available at the commencement of this study, specifically the 3-yr period covering 2011–13. We use a 3-yr period to ensure that forecast evaluation results are robust and less likely to be subject to sampling errors.

### b. Checking model reparameterization assumptions

Here we go back to the initial joint probability model of Eq. (1). We will use the three years of data available for *x*(*t*) and *y*(*t*), *t* = 1, 2, …, *T*, to estimate model parameters month by month separately. It is true that the derived climatologies will not be representative of long-term statistics of station rainfall. Here the parameter values derived from three years of data are for examining the relative functional forms of parameter relationships as expressed in Eqs. (5)–(7).

For each month *k* = 1, 2, …, 12, we apply the same procedure as in section 2b to estimate parameters *μ*_{f}(*k*) and *σ*_{f}(*k*) from *f*(*t*), and *μ*_{o}(*k*) and *σ*_{o}(*k*) from *o*(*t*), *t* = 1, 2, …, *T*. We then apply the method of maximum likelihood to estimate parameter *ρ*(*k*). The likelihood function can be written as

$$L\{\rho(k)\} = \prod_{m(t)=k} l(t),$$

where *l*(*t*) is evaluated using the same equation as Eq. (9). We repeat the process for each of the forecast lead times.

We then display *μ*_{f}(*k*) versus *μ*_{o}(*k*), and *σ*_{f}(*k*) versus *σ*_{o}(*k*) to see if linear relationships as in Eqs. (5) and (6) are evident. We also display *ρ*(*k*) against *k* to see if there is any strong seasonal pattern to invalidate the assumption of a steady underlying correlation.

### c. SCC model setup and forecast evaluation

The SCC model is set up for application to the case study dataset. The censoring thresholds *x*_{c} and *y*_{c} are set to 0.01 and 0.2 mm day^{−1}, respectively, following Robertson et al. (2013). The extreme nonexceedance probability threshold for the predictor value is set to *P*_{extreme} = 1 − 0.01/365, which is equivalent to an average recurrence interval of 100 years. The generated ensemble of calibrated forecasts has 1000 members. The model is applied to each of the 10 forecast lead times.

For forecast evaluation, we set up leave-one-month-out cross validation. For the 3-yr period when NWP raw forecast data are available, data for *x*(*t*) and *y*(*t*) from one of the 36 months are left out when estimating model parameters. The established model is subsequently applied to calibrate *x*(*t*) for the left-out month to produce calibrated ensemble forecasts. This process is repeated for all of the 36 months.
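The leave-one-month-out loop can be sketched generically as below; `fit` and `predict` are hypothetical placeholders for SCC parameter estimation and calibration, not functions from the paper.

```python
import numpy as np

def leave_one_month_out(months, fit, predict):
    """Leave-one-month-out cross validation. `months` labels each day with
    its month index (0..35 over a 3-yr record); `fit` takes training-day
    indices and returns a model; `predict` applies the model to the
    left-out days."""
    months = np.asarray(months)
    results = {}
    for m in np.unique(months):
        train_idx = np.where(months != m)[0]
        test_idx = np.where(months == m)[0]
        model = fit(train_idx)
        results[m] = predict(model, test_idx)
    return results

# Toy illustration: 3 "months" of 2 days each
months = [0, 0, 1, 1, 2, 2]
res = leave_one_month_out(months,
                          fit=lambda idx: idx.sum(),
                          predict=lambda model, idx: (model, idx.tolist()))
```

Each month's forecasts are thus produced by a model that never saw that month's data, which keeps the evaluation out of sample.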

The calibrated ensemble forecasts are evaluated against observations *y*(*t*) for each month of the year on the following:

- average monthly total precipitation,
- average percentage of wet days,
- daily precipitation distribution, and
- forecast skill and reliability.

Details on the evaluation diagnostics used are explained in section 4b.

In the introduction, we point out the problem of not building seasonal variation in climatology into a calibration model. To demonstrate this point, we apply a joint probability model with one constant set of parameters year-round. We call this model a seasonally invariant calibration (SIC) model. In estimating the parameters of the SIC model, the procedure described in section 3b for one month can be applied by treating the data from all 12 months for *x*(*t*) and *y*(*t*) as if they were from one (big) month. For forecast evaluation, we use the same cross-validation setup as for the SCC model. The SIC model is very similar to the RPP model mentioned earlier, with minor differences in the parameter estimation formulation.

## 4. Results

### a. Model reparameterization assumptions

Results of parameter values estimated month by month separately are shown in Figs. 1–3. The parameters representing the mean of forecasts and the mean of observations strongly follow a linear relationship (Fig. 1). The parameters representing the standard deviation of forecasts and the standard deviation of observations show weaker relationships with greater scatter around a linear relationship (Fig. 2). Statistically, standard deviations are subject to greater sampling variability than means. Therefore, the greater scatter seen in Fig. 2 is not unexpected. Use of linear relationships across months may help reduce sampling errors. In both Figs. 1 and 2, the data positions for Day 6 are markedly different from other lead times. This is caused by differences in the transformation parameter values, possibly due to a few unusual raw forecasts that appear only for Day 6. To avoid this problem, one may adopt the same transformation parameter values for all lead times, but this approach was not investigated in our study.

Relationships between standard deviation of transformed forecasts and standard deviation of transformed observation.

Citation: Monthly Weather Review 147, 10; 10.1175/MWR-D-19-0108.1


Correlation coefficient vs month. Lines are averages for different lead times.


The correlation between forecasts and observations fluctuates considerably with month (Fig. 3). In the last three months of the year, the correlation is higher than the 12-month averages for the respective lead times, but this pattern is well within the effects of sampling given the large month-to-month variations. For this reason, the assumption of a constant underlying correlation throughout the year is considered reasonable and may in fact help reduce sampling errors. The clear decrease in average correlation with increasing lead time is consistent with the expected decline in forecast skill.

Overall, the assumptions as expressed by Eqs. (5)–(7) are supported by this analysis, and may lead to reduced sampling errors in the final estimates of parameters {*μ*_{f}(*k*) and *σ*_{f}(*k*), and *ρ*(*k*), *k* = 1, 2, …, 12}.

### b. Forecast evaluation

Here we evaluate the SCC calibrated forecasts and compare them with the raw forecasts and the SIC calibrated forecasts. Both the SCC and SIC calibrated forecasts are generated under leave-one-month-out cross validation.

#### 1) Average monthly precipitation

Here we present the results of average monthly precipitation for the 3-yr forecast period (Fig. 4). The values for the 10-yr period used to construct the long-term climatology are also plotted. The raw forecasts are clearly too low compared with the observations. However, the seasonal pattern is broadly consistent with the observed in relative terms. The SIC calibrated forecasts are too high for dry months and too low for some of the wet months. The SCC calibrated forecasts are much improved and mostly consistent with observations, although there are clear differences for January and March. From a visual inspection of the top panel of Fig. 4, the raw forecasts do not indicate unseasonably low precipitation for January and high precipitation for March (one may use February as a reference to aid the visual inspection). The SCC model follows the guidance of the raw forecasts and is therefore unable to produce forecasts that are strongly divergent from long-term averages.

Average monthly precipitation of forecasts compared with observed.


#### 2) Average percentage of wet days

Here we present the results on the average percentage of wet days with precipitation greater than 0.2 mm (Fig. 5). The raw forecasts produce many more wet days than observed. However, the seasonal pattern is broadly consistent with the observations in relative terms. The SIC calibrated forecasts better match the overall percentage of wet days, but seasonality is not well reproduced, especially at longer lead times. As a result, there are not enough wet days in wet seasons and too many wet days in dry seasons. The SCC calibrated forecasts reproduce the seasonal variation in the percentage of wet days much better. The match is not perfect, especially for January. From a visual inspection of the top panel of Fig. 5, the raw forecasts do not indicate an unseasonably low number of wet days for January (again one may use February as a reference to aid the visual inspection). For this reason, the SCC calibrated forecasts are more in line with the average condition for January.

Average percentage of wet days of forecasts compared with observed.


#### 3) Daily precipitation distribution

In Fig. 6, we show the daily precipitation distribution of the SCC calibrated forecasts and compare it with the observed distribution. For each month, the 3-yr observed daily precipitation amounts are ranked in ascending order and plotted against a standard normal variate, as for a normal probability plot. The median and the 50% and 90% intervals of the forecasts are derived as follows for each month.

Step 1. For each forecast day, one ensemble member of the forecast for that day is randomly drawn. This is repeated for all forecast days. The drawn ensemble members for all the days are ranked, giving a ranked ensemble member data series.

Step 2. Repeat step 1 many times (say 1000) to give many ranked ensemble member data series.

Step 3. For the same rank, find the median and the 50% and 90% intervals from the data of all series. This is repeated for all the ranks.

Step 4. Plot the values of the median, 50% and 90% intervals against the standard normal variate. The intervals represent the uncertainty of the daily precipitation distribution of the SCC forecasts.
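Steps 1–3 above can be sketched as follows; `ensembles` is a hypothetical (days × members) array of calibrated ensemble forecasts for one month.

```python
import numpy as np

def ranked_ensemble_intervals(ensembles, n_rep=1000, rng=None):
    """Steps 1-3: per repetition, draw one member per forecast day and rank
    the draws; then, at each rank, take quantiles across repetitions."""
    if rng is None:
        rng = np.random.default_rng()
    n_days, n_members = ensembles.shape
    ranked = np.empty((n_rep, n_days))
    for i in range(n_rep):
        picks = ensembles[np.arange(n_days), rng.integers(n_members, size=n_days)]
        ranked[i] = np.sort(picks)
    # rows: 90% interval bounds, 50% interval bounds, and median, at each rank
    return np.quantile(ranked, [0.05, 0.25, 0.5, 0.75, 0.95], axis=0)

# Degenerate check case: identical members each day, so all quantiles coincide
ensembles = np.repeat(np.array([[5.0], [2.0], [7.0]]), 4, axis=1)
qs = ranked_ensemble_intervals(ensembles, n_rep=50, rng=np.random.default_rng(0))
```

The spread across repetitions at each rank gives the uncertainty bands plotted in step 4.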

Distribution of SCC calibrated forecast daily precipitation compared with observed. Blue line: forecast median; dark gray shade: forecast [0.25, 0.75] quantile range; light gray shade: forecast [0.05, 0.95] range; red dot: observed.


Figure 6 is specifically for day 10 ahead forecasts. At this lead time, forecast skill is low, and forecasts mostly follow climatology. It is clear from Fig. 6 that the daily precipitation distribution of the SCC calibrated forecast is consistent with the observations. In contrast, as shown in Fig. 7, the SIC calibrated forecasts lead to unseasonal daily precipitation distribution, with too much precipitation in dry months and not enough precipitation in wet months.

Distribution of SIC calibrated forecast daily precipitation compared with observed. Blue line: forecast median; dark gray shade: forecast [0.25, 0.75] quantile range; light gray shade: forecast [0.05, 0.95] range; red dot: observed.


For shorter lead times, results on daily precipitation distribution for the SCC are similar, but results for the SIC improve as better skill leads to greater transfer of the seasonality already embedded in raw forecasts to calibrated forecasts. However, even for Day 1 ahead forecasts, the daily precipitation distribution of the SIC calibrated forecasts is still unsatisfactory (results not shown).

#### 4) Forecast reliability

We use PIT (probability integral transform) uniform probability plots to check forecast reliability (Laio and Tamea 2007; Wang and Robertson 2011; Wang et al. 2009), for example, whether forecasts are biased, too wide or too narrow in ensemble spread (forecast uncertainty). Details on PIT uniform probability plots, including how to deal with forecasts with threshold values and how to read PIT uniform probability plots, can be found in Wang and Robertson (2011).
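An empirical PIT computation can be sketched as below; the plotting position count/(*n* + 1) is an assumption for illustration, and handling of observations at the zero-precipitation threshold (via a randomized pseudo-PIT, as in Wang and Robertson 2011) is omitted.

```python
import numpy as np

def pit_values(ensembles, obs):
    """Empirical PIT of each observation within its forecast ensemble,
    computed as the fraction of members at or below the observation
    (continuous case only)."""
    ensembles = np.asarray(ensembles, float)
    obs = np.asarray(obs, float)
    n = ensembles.shape[1]
    return (ensembles <= obs[:, None]).sum(axis=1) / (n + 1)

pit = pit_values([[1.0, 2.0, 3.0, 4.0]], [2.5])
```

If the forecasts are reliable, the PIT values over many forecast days are uniformly distributed, which is what the 1:1 line in a PIT uniform probability plot checks.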

PIT uniform probability plots for the SCC calibrated forecasts are given in Fig. 8. Perfectly reliable forecasts should follow the 1:1 line. For most months, the PIT values are aligned well with the 1:1 line, indicating good reliability. For January, the PIT values are mostly below the 1:1 line, indicating mild positive bias in the forecasts. For March, the opposite is true. This is consistent with the results and interpretation presented earlier on average monthly precipitation.

PIT uniform probability plot of SCC calibrated forecasts.


#### 5) Forecast skill

We use the continuous ranked probability score (CRPS) as a measure of errors between ensemble forecasts and observations (Hersbach 2000). For CRPS to be low, ensemble forecasts need to be both accurate and have the right spread (not too narrow or too wide). We also evaluate CRPS for ensemble forecasts generated from the fitted climatology for each month. This second CRPS is used as a reference, so that the percentage reduction in CRPS relative to the reference can be calculated to give a CRPS skill score. A skill score of zero indicates that the forecasts are only as good as the reference forecasts. A negative (positive) skill score indicates that the forecasts are poorer (better) than the reference forecasts. Forecasts reach a skill score of 100% when perfectly matching observations.
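The score and skill score can be sketched with the sample-based form of the CRPS (mean absolute error against the observation minus half the mean absolute difference between members); this is a generic sketch, not Hersbach's (2000) decomposition.

```python
import numpy as np

def crps_ensemble(members, y):
    """Sample-based CRPS for one ensemble forecast and one observation:
    mean|X - y| - 0.5 * mean|X - X'|."""
    members = np.asarray(members, float)
    term1 = np.abs(members - y).mean()
    term2 = np.abs(members[:, None] - members[None, :]).mean()
    return term1 - 0.5 * term2

def crps_skill_score(crps_fcst, crps_ref):
    """Percentage reduction in CRPS relative to a reference (here, the
    fitted monthly climatology)."""
    return 100.0 * (1.0 - crps_fcst / crps_ref)
```

For a single-member (deterministic) forecast the spread term vanishes and the CRPS reduces to the absolute error, which is how the raw forecasts are scored below.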

Results of the CRPS skill score for the SCC calibrated forecasts are shown in Fig. 9. Under the leave-one-month cross validation, the skill score is mostly positive, generally high at shorter lead times and approaching zero at longer lead times. Overall, the skill is low, as expected for precipitation. Also shown in Fig. 9 are results of the CRPS skill score for the raw deterministic NWP forecasts. Here CRPS reduces to the absolute error because the forecasts are deterministic. The skill score of the raw forecasts is mostly negative, largely because of bias but also because deterministic forecasts carry no representation of uncertainty. The improvement from SCC calibration is clearly demonstrated.

Fig. 9. CRPS skill score of SCC calibrated and raw forecasts.


## 5. Discussion

Our motivation for this work is to develop a calibration method that is highly robust and effective, suitable for use even when only a short period of archived NWP data is available for establishing statistical calibration models. Many NWP models have only a year or less of archived data when they become officially operational. However, in our case study, we have deliberately chosen an NWP dataset that covers three years. As stated in section 3a, this longer dataset makes it possible to check the assumptions involved in the reparameterization. It also means that the forecast evaluation results are subject to smaller sampling effects and are thus more robust for drawing conclusions.

We acknowledge that the testing and evaluation of the SCC calibration model presented in this paper are limited to a single case study. In further work, we have applied the model to postprocess forecasts from ACCESS-G APS2 at locations across Australia. APS2 is the current operational model of the Australian Bureau of Meteorology and has a shorter period of archived data. The results of this application will be reported in the future. For future applications, we speculate that an archived NWP forecast record of at least six months is needed for estimating the model parameters {*a*, *b*, *c*, *d*, and *r*}. This needs to be confirmed in future research.

The SCC calibration models presented in this study are established separately for different lead times and locations. When multiple lead times and multiple locations are involved, we need to sensibly connect the calibrated ensemble members for the different lead times and locations, so that temporal and spatial correlation structures resemble observed precipitation. An empirical copula method, the Schaake shuffle, has proven to be a popular and effective tool for connecting ensemble members (Clark et al. 2004; Robertson et al. 2013; Wu et al. 2011). We recommend the use of this method.
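The Schaake shuffle itself is simple to sketch: at each lead time (or location), the calibrated ensemble values are re-ranked to follow the rank order of a historical observation template, so that the reordered ensemble inherits the template's space–time correlation structure while keeping each marginal distribution unchanged. A minimal illustration (our own function names, not code from Clark et al. 2004):

```python
import numpy as np

def schaake_shuffle(ensemble, template):
    """Reorder calibrated ensemble members following the Schaake shuffle.

    ensemble: (n_members, n_steps) independently calibrated values,
              where a "step" is a lead time or a location.
    template: (n_members, n_steps) historical observations whose rank
              structure across steps is to be reproduced.
    At each step, the sorted ensemble values are assigned the ranks of
    the template members, so the output copies the template's temporal
    and spatial rank-correlation pattern.
    """
    ensemble = np.asarray(ensemble, float)
    template = np.asarray(template, float)
    out = np.empty_like(ensemble)
    for j in range(ensemble.shape[1]):
        ranks = np.argsort(np.argsort(template[:, j]))  # rank of each template member
        out[:, j] = np.sort(ensemble[:, j])[ranks]
    return out
```

Because only the ordering of members changes, the calibrated marginal distribution at each lead time and location is preserved exactly.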

## 6. Conclusions

In this study, a seasonally coherent calibration (SCC) model is developed for postprocessing raw forecasts from NWP models. Our aim is to be able to produce calibrated forecasts that are unbiased, reliable in ensemble spread, as skillful as possible and coherent with respect to a seasonally varying climatology. In the development of the model, three issues are resolved.

The first issue is to construct a calibration model that is sophisticated enough to allow for seasonality in statistical characteristics of raw forecasts and observations. Building on previous work on forecast calibration (Robertson et al. 2013), we introduce monthly variations in statistical characteristics of both raw forecasts and observations to reflect seasonal change. However, this introduces a large number of model parameters.

The second issue is to bring climatology that is representative of long-term statistical characteristics of observations into the calibration model. For estimating model parameters related to observed climatology, we make use of a long period of observation data, say the last 10 to 20 years. In this way, we can accurately estimate the parameters for each month, and produce climatology that represents the long-term statistical characteristics of observations.

The third issue is to enable the calibration model to work effectively with very limited archived raw forecast data. Many NWP models have only a short period of archived data, typically one year or less, when they become officially operational. We introduce three relationships to reparameterize 36 of the model parameters to just five. This reparameterization makes it feasible to estimate the model parameters with only a short period of archived NWP data.

In a case study using raw forecast data from an NWP model and historical observations of precipitation, the assumptions underlying the three relationships for reparameterization are well supported by our analysis. Furthermore, by linking the 12 months together through these relationships, we expect the reparameterization to reduce sampling effects on parameter estimation.

In our case study, the SCC calibrated forecasts are cross validated, and the results are evaluated. We find that the developed SCC model largely achieves the aim of producing calibrated forecasts that are unbiased, reliable in ensemble spread, as skillful as possible, and coherent with the seasonal climatology.

## Acknowledgments

We thank Dr. Prasantha Hapuarachchi from the Australian Bureau of Meteorology for supplying the case study data. Funding source for this work is not applicable. The authors declare that there is no conflict of interest regarding the publication of this article.

## APPENDIX

### Data Normalization Using the Log–Sinh Transformation

The log–sinh transformation (Wang et al. 2012) normalizes a variable *z* through

$$y = \frac{1}{\lambda}\ln\left[\sinh\left(\varepsilon + \lambda\,\frac{z}{c}\right)\right],$$

where *y* is the transformed variable, assumed to follow a normal distribution with mean *μ* and standard deviation *σ*; *ε* and *λ* are transformation parameters; and *c* is a scaling factor, introduced to bring the scaled variable (*z*/*c*) to a similar range in different applications so that the parameter values for *ε* and *λ* are also likely to be in similar ranges, making it easier to look for the optimal parameter values (and set constraints if needed).

Given data *z*(*i*), *i* = 1, 2, …, *I*, with some records having a left-censored value of ≤ *z*_{c}, we set *c* = max[*z*(*i*), *i* = 1, 2, …, *I*]/5 so that the scaled variable [*z*(*i*)/*c*] is in the range of 0–5. The method of maximum likelihood can be applied to estimate the transformation parameters as well as the normal distribution parameters. The likelihood function is

$$L = \prod_{z(i) > z_{c}} \left.\frac{dy}{dz}\right|_{z(i)} \frac{1}{\sigma}\,\phi\!\left[\frac{y(i)-\mu}{\sigma}\right] \prod_{z(i) \le z_{c}} \Phi\!\left(\frac{y_{c}-\mu}{\sigma}\right),$$

where *ϕ* and *Φ* are the standard normal density and cumulative distribution functions, and *y*_{c} is the transformed value of the censoring threshold *z*_{c}. A numerical search method may be used to find the parameter values that maximize the log-likelihood function. A transformed value *y* is converted back to *z* by inverting the transformation:

$$z = \frac{c}{\lambda}\left[\sinh^{-1}\!\left(e^{\lambda y}\right) - \varepsilon\right].$$

For the application in this study, we derive one set of transformation parameters for observed precipitation, and another set for the raw forecast precipitation.

It is noted that other transformation functional forms may also be suitable for precipitation data normalization. A systematic comparison will be made in future research.
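As a sketch of how the appendix procedure might be implemented (our own function names and synthetic data, not code from the paper), the transformation parameters and normal distribution parameters can be estimated jointly by numerically maximizing the censored log-likelihood, for example with the Nelder–Mead simplex method (Nelder and Mead 1965):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def log_sinh(z, eps, lam, c):
    # Log-sinh transform with scaling factor c (Wang et al. 2012)
    return np.log(np.sinh(eps + lam * z / c)) / lam

def neg_log_likelihood(params, z, z_cens, c):
    # Negative log-likelihood of the transformed-normal model with
    # left-censoring at z_cens
    eps, lam, mu, sigma = params
    if eps <= 0 or lam <= 0 or sigma <= 0:
        return np.inf
    censored = z <= z_cens
    y = log_sinh(z, eps, lam, c)
    # Uncensored records: normal density of y times the Jacobian dy/dz
    log_jac = -np.log(c * np.tanh(eps + lam * z / c))
    ll = np.sum(norm.logpdf(y[~censored], mu, sigma) + log_jac[~censored])
    # Censored records: probability of falling at or below the threshold
    y_c = log_sinh(z_cens, eps, lam, c)
    ll += censored.sum() * norm.logcdf(y_c, mu, sigma)
    return -ll

# Fit to synthetic precipitation-like data (gamma-distributed amounts)
rng = np.random.default_rng(0)
z = rng.gamma(0.8, 4.0, size=500)
z_cens = 0.1                 # values <= 0.1 mm treated as censored
c = z.max() / 5.0            # scaling so z/c lies roughly in 0-5
x0 = np.array([0.5, 1.0, 0.0, 1.0])   # starting eps, lam, mu, sigma
res = minimize(neg_log_likelihood, x0, args=(z, z_cens, c),
               method="Nelder-Mead")
eps, lam, mu, sigma = res.x
```

The gamma data, censoring threshold, and starting values here are illustrative assumptions only; in practice the observed or raw forecast precipitation records take the place of `z`.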

## REFERENCES

Bureau of Meteorology, 2010: Operational implementation of the ACCESS Numerical Weather Prediction systems. NMOC Operations Bull. 83, Bureau of Meteorology, 34 pp., https://protect-au.mimecast.com/s/62kdCr8Dz5sGR7kzc7SrLG?domain=bom.gov.au.

Bureau of Meteorology, 2012: APS1 upgrade of the ACCESS-G Numerical Weather Prediction system. NMOC Operations Bull. 93, Bureau of Meteorology, 31 pp., https://protect-au.mimecast.com/s/0e9ACvl0E5u3nqYPFXvXGn?domain=bom.gov.au.

Clark, M. P., S. Gangopadhyay, L. Hay, B. Rajagopalan, and R. Wilby, 2004: The Schaake shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields. *J. Hydrometeor.*, **5**, 243–262, https://doi.org/10.1175/1525-7541(2004)005<0243:TSSAMF>2.0.CO;2.

Gneiting, T., A. E. Raftery, A. H. Westveld, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. *Mon. Wea. Rev.*, **133**, 1098–1118, https://doi.org/10.1175/MWR2904.1.

Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. *Mon. Wea. Rev.*, **136**, 2620–2632, https://doi.org/10.1175/2007MWR2411.1.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. *Wea. Forecasting*, **15**, 559–570, https://doi.org/10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.

Krzysztofowicz, R., 1999: Bayesian theory of probabilistic forecasting via deterministic hydrologic model. *Water Resour. Res.*, **35**, 2739–2750, https://doi.org/10.1029/1999WR900099.

Laio, F., and S. Tamea, 2007: Verification tools for probabilistic forecasts of continuous hydrological variables. *Hydrol. Earth Syst. Sci.*, **11**, 1267–1277, https://doi.org/10.5194/hess-11-1267-2007.

Li, W., Q. Duan, C. Miao, A. Ye, W. Gong, and Z. Di, 2017: A review on statistical postprocessing methods for hydrometeorological ensemble forecasting. *Wiley Interdiscip. Rev.: Water*, **4**, e1246, https://doi.org/10.1002/wat2.1246.

Nelder, J. A., and R. Mead, 1965: A simplex method for function minimization. *Comput. J.*, **7**, 308–313, https://doi.org/10.1093/comjnl/7.4.308.

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. *Mon. Wea. Rev.*, **133**, 1155–1174, https://doi.org/10.1175/MWR2906.1.

Robertson, D. E., D. L. Shrestha, and Q. J. Wang, 2013: Post-processing rainfall forecasts from numerical weather prediction models for short-term streamflow forecasting. *Hydrol. Earth Syst. Sci.*, **17**, 3587–3603, https://doi.org/10.5194/hess-17-3587-2013.

Shrestha, D. L., D. E. Robertson, J. C. Bennett, and Q. J. Wang, 2015: Improving precipitation forecasts by generating ensembles through postprocessing. *Mon. Wea. Rev.*, **143**, 3642–3663, https://doi.org/10.1175/MWR-D-14-00329.1.

Sloughter, J. M., A. E. Raftery, T. Gneiting, and C. Fraley, 2007: Probabilistic quantitative precipitation forecasting using Bayesian model averaging. *Mon. Wea. Rev.*, **135**, 3209–3220, https://doi.org/10.1175/MWR3441.1.

Wang, Q. J., and D. E. Robertson, 2011: Multisite probabilistic forecasting of seasonal flows for streams with zero value occurrences. *Water Resour. Res.*, **47**, W02546, https://doi.org/10.1029/2010WR009333.

Wang, Q. J., D. E. Robertson, and F. H. S. Chiew, 2009: A Bayesian joint probability modeling approach for seasonal forecasting of streamflows at multiple sites. *Water Resour. Res.*, **45**, W05407, https://doi.org/10.1029/2008WR007355.

Wang, Q. J., D. L. Shrestha, D. Robertson, and P. Pokhrel, 2012: A log-sinh transformation for data normalization and variance stabilization. *Water Resour. Res.*, **48**, W05514, https://doi.org/10.1029/2011WR010973.

Wilks, D. S., and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. *Mon. Wea. Rev.*, **135**, 2379–2390, https://doi.org/10.1175/MWR3402.1.

Wu, L., D.-J. Seo, J. Demargne, J. D. Brown, S. Cong, and J. Schaake, 2011: Generation of ensemble precipitation forecast from single-valued quantitative precipitation forecast for hydrologic ensemble prediction. *J. Hydrol.*, **399**, 281–298, https://doi.org/10.1016/j.jhydrol.2011.01.013.

Zhao, T., J. C. Bennett, Q. J. Wang, A. Schepen, A. W. Wood, D. E. Robertson, and M.-H. Ramos, 2017: How suitable is quantile mapping for postprocessing GCM precipitation forecasts? *J. Climate*, **30**, 3185–3196, https://doi.org/10.1175/JCLI-D-16-0652.1.