1. Introduction
Accurate precipitation forecasts are crucial for social and economic sectors. Quantitative precipitation forecasts (QPFs) of daily accumulated rainfall play an important role not only in water supply, flood risk, and drought mitigation, but also in guiding agricultural management and the operations of hydroelectric power plants (Theis et al. 2005). For example, accurate and spatially detailed QPFs are of particular interest and importance in California (Dettinger et al. 2011; Corringham et al. 2019). Statewide water management suffers from a distinct spatial mismatch: 75% of the state's rain and snow falls on the watersheds north of Sacramento, California, yet 80% of the demand comes from south of Sacramento. Furthermore, California experiences uniquely high variability in precipitation (Dettinger et al. 2011), governed by the presence or absence of a relatively small number of large storms, typically landfalling atmospheric rivers (ARs) in the winter months (Dettinger and Cayan 2014; Oakley et al. 2018; Corringham et al. 2019). Reliable, accurate, and timely predictions of precipitation at weather time scales have the potential to inform operational decisions related to reservoir management and flood emergency response, e.g., through the Forecast Informed Reservoir Operations initiative, which incorporates forecast information into water management decisions (Jasperse et al. 2020).
The current state-of-the-art approach to forecasting precipitation events is numerical weather prediction (NWP), which relies on dynamical weather models. However, the accuracy of NWP is still limited by errors in initial conditions, numerical approximations, incomplete understanding of the underlying physical processes, and the chaotic nature of the atmosphere (Lorenz 1963; Gleick 2008). These errors propagate into subsequent hydrological model simulations and affect the quality and uncertainty of the end products. In this context, machine learning (ML) methods can be designed to correct a portion of the errors that contaminate dynamical model estimates in a postprocessing framework.
ML methods include traditional statistical methods, neural-network-based algorithms, and deep learning (DL) approaches (Goodfellow et al. 2016). Statistical postprocessing methods can be categorized into parametric and nonparametric methods, depending on whether a distribution is assumed a priori. More recently, Vannitsem et al. (2021) refer to them as “distribution-based” and “distribution-free” approaches, respectively. Nonparametric methods include weather analogs (Hamill and Whitaker 2006; Delle Monache et al. 2013; Alessandrini et al. 2015; Hamill et al. 2015; Scheuerer et al. 2020) and quantile regression (Bremnes 2004). These methods do not assume a prior distribution of the variable of interest; rather, the model structure is determined from data. In contrast, parametric methods typically assume a prior distribution of the predictand and estimate the associated distributional parameters. Examples include ensemble model output statistics (EMOS; Wilks 2009; Scheuerer and Hamill 2015), the mixed-type meta-Gaussian distribution (MMGD; Herr and Krzysztofowicz 2005; Wu et al. 2011), and Bayesian methods (Raftery et al. 2005; Wang et al. 2009).
More recently, neural-network-based algorithms have become increasingly popular for postprocessing NWP output. Rasp and Lerch (2018) demonstrated how neural networks (NNs) can be used for probabilistic postprocessing of ensemble forecasts within the distributional regression framework. For 2-m temperature forecasts, an NN was trained to learn the parameters of a Gaussian predictive distribution. Taillardat et al. (2019) combined a random forest technique with a parametric distribution to calibrate rainfall ensemble forecasts and concluded that the hybrid approach produced the largest skill improvements for forecasting heavy rainfall events. Ghazvinian et al. (2021, 2022) proposed a hybrid NN–nonhomogeneous regression scheme that uses an NN to learn the parameters of a censored, shifted gamma distribution (CSGD). This approach provides a unified way to postprocess precipitation forecasts at multiple lead times and seasons. The advantage of an NN lies in its ability to reconstruct highly nonlinear functions and to explore vast amounts of data. NNs are therefore well suited as postprocessing methods for coping with the high nonlinearity and dimensionality of a weather system.
One disadvantage of NNs, however, is their limited ability to extract spatial features with fully connected layers. DL with convolutional neural networks (CNNs) has therefore been found promising for abstracting spatial information from high-resolution forecasts. It has been successfully applied to the statistical downscaling of temperature and precipitation over complex terrain (Pan et al. 2019; Sha et al. 2020a,b), and past work has shown its capability to discern fine-grained spatial details. For precipitation forecasting, Li et al. (2022) proposed using a CNN to learn the distributional parameters of the CSGD, but the input to the CNN consists of predictors from a 7 × 7 patch of grid points centered on the grid point to be predicted, rather than the entire domain. Chapman et al. (2019) first introduced denoising autoencoders for postprocessing deterministic forecasts of integrated vapor transport (IVT) and integrated water vapor (IWV), with the model taking predictors over the entire domain. This approach was later extended to probabilistic forecasts (Chapman et al. 2022), but the study domain was confined to a narrow band along the western coast and the spatial resolution was limited to 0.5°.
Inspired by the rich literature (Ronneberger et al. 2015; Chapman et al. 2019; Ghazvinian et al. 2021; Chapman et al. 2022; Ghazvinian et al. 2022; Han et al. 2022; Li et al. 2022; Badrinath et al. 2022), we propose to use the Unet, a type of denoising autoencoder, to generate high-resolution, accurate, and reliable probabilistic quantitative precipitation forecasts (PQPFs) in this work. Both the input and output of the Unet are high-resolution maps that cover the entire study domain, which avoids training and running separate models at each grid point. We aim to address three important questions in this work:
- How can we adopt a Unet architecture for generating high-resolution precipitation forecasts?
- How do we train such a network to characterize the forecast uncertainty?
- What are the strengths and limitations of using a large model compared to other state-of-the-art benchmark methods?
The rest of the paper is organized as follows: section 2 describes the observations and forecasts used in this study. Section 3 introduces the design of the Unet and the benchmark postprocessing methods. Specifically, section 3b(1) proposes the solution to the first question on the Unet architecture, and section 3b(2) addresses the second question on model training. Section 4 presents the results and provides evidence on the strengths and limitations of the Unet, addressing the third question. Finally, section 5 provides a summary of the work and additional discussion.
2. Data
a. High-resolution spatial climate data
Precipitation ground truth is collected from the Parameter Elevation Regression on Independent Slopes Model (PRISM) (Daly et al. 2002; Strachan and Daly 2017). PRISM provides a daily gridded precipitation dataset over the continental United States with a 4-km spatial resolution. It leverages multiple data sources including surface precipitation gauge networks as well as radar observations. Its ingestion of a digital elevation model (DEM) allows PRISM to accurately account for complex climate regimes associated with orography, rain shadows, temperature inversions, slope aspect, coastal proximity, and other factors. PRISM has been widely used in various precipitation studies, e.g., Lewis et al. (2017), Ishida et al. (2015), Buban et al. (2020). It is suitable for this study as it provides a high-resolution gridded precipitation product with a continuous time series. Precipitation records from 1986 to 2019 have been analyzed in this study.
b. West-WRF reforecast
Reforecast products are forecasts from the same modeling system spanning multiple decades. One of their benefits is that they enable researchers to evaluate the predictability of historical high-impact precipitation events with a current model. Reforecast products can reveal important information about model performance over a wide range of atmospheric and hydrological conditions, and they are often used to improve current predictions and to build postprocessing methods.
The West Weather Research and Forecasting (West-WRF) Model reforecast (Martin et al. 2018) is run for a total of 34 water years from 1986 to 2019. There are two integration domains, with 9- and 3-km grid spacing; the 3-km domain is one-way nested within the 9-km domain. A cumulus scheme is used in the 9-km domain but not in the 3-km domain. The radiation scheme is RRTMG, as described in Iacono et al. (2008); other physics schemes are described in Martin et al. (2018). An adaptive time step is used for all domains, starting at 5Δx and ranging between 1Δx and 8Δx (where Δx denotes the horizontal grid spacing), targeting a domain-wide vertical Courant–Friedrichs–Lewy number of 1.28.
In this study, all years refer to the water year starting in December of the specified year and ending on the last day of March of the following year. For example, water year 2019 refers to the period from 1 December 2019 to 31 March 2020. The 3-km domain is used in this work (see the boundaries of the full model domains and the study domain in the online supplemental material, Fig. 1). Forecasts are initialized at 0000 UTC and run 120 h (5 days) into the future. However, the precipitation accumulation period of a “PRISM day” is 1200 to 1200 UTC. As a result, only four overlapping lead days of forecasts are available.
Figure 1 shows the study domain between 32.47° and 41.49°N and between 124.44° and 116.21°W, the region of overlap between West-WRF and PRISM. Forecasts from the 3-km grid have been collected, and the predictors are listed in Table 1; they have previously been found to be highly effective at capturing the state of the atmosphere for atmospheric rivers and precipitation events (Chapman et al. 2019, 2022). West-WRF is regridded to the PRISM grid using nearest-neighbor interpolation prior to model training and data analysis.
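A minimal sketch of nearest-neighbor regridding with a KD-tree is given below; the function, the flattened coordinate handling, and the use of degree-space distances are illustrative assumptions rather than the authors' regridding pipeline.

```python
import numpy as np
from scipy.spatial import cKDTree

def regrid_nearest(src_lon, src_lat, src_vals, dst_lon, dst_lat):
    """Map a field from the source (e.g., West-WRF) grid to a destination
    (e.g., PRISM) grid by nearest neighbor. All coordinate arrays are 1D
    flattened grids; degree-space distance is used as a simple proxy."""
    tree = cKDTree(np.column_stack([src_lon, src_lat]))
    _, idx = tree.query(np.column_stack([dst_lon, dst_lat]), k=1)
    return src_vals[idx]

# Hypothetical usage with random source points and a small destination grid:
rng = np.random.default_rng(0)
src_lon, src_lat = rng.uniform(-124, -116, 500), rng.uniform(32, 42, 500)
dst_lon, dst_lat = np.meshgrid(np.linspace(-124, -116, 10), np.linspace(32, 42, 10))
field = regrid_nearest(src_lon, src_lat, rng.gamma(2.0, 5.0, 500),
                       dst_lon.ravel(), dst_lat.ravel())
print(field.reshape(10, 10).shape)
```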

Fig. 1. Study domain and terrain elevation (color shading).
Table 1. Abbreviations and descriptions of the weather variables used as predictors.
3. Methods
a. Benchmark methods
1) Censored, shifted gamma distribution
The CSGD and its variants have been shown to generate calibrated and highly skillful PQPFs across a variety of hydrometeorological conditions and spatiotemporal scales, e.g., Ishida et al. (2015), Baran and Nemoda (2016), Zhang et al. (2017), Bellier et al. (2017), Scheuerer et al. (2017), Baran and Lerch (2018), Hamill and Scheuerer (2018), Taillardat et al. (2019), Scheuerer et al. (2020), Ghazvinian et al. (2020), Lei et al. (2022). A notable limitation of EMOS schemes, CSGD included, is that their performance can depend on prescribed, inflexible predictor–predictand relationships, as pointed out by Rasp and Lerch (2018) and Ghazvinian et al. (2021). Recent studies (Ghazvinian et al. 2020, 2021; Valdez et al. 2022) showed that the predictive CSGD (Scheuerer and Hamill 2015) tends to underestimate the probability of precipitation (PoP) because of its reliance on climatological shift parameters. This bias was found to be more evident at shorter lead times and, to some extent, to directly affect the overall predictive performance of the CSGD. In this study, separate CSGD regression models are trained for each pixel and each lead time.
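For reference, the following is a minimal sketch of how a censored, shifted gamma distribution can be evaluated with SciPy, using the mean–standard deviation–shift parameterization of the underlying gamma as in Scheuerer and Hamill (2015); the function name and the parameter values are illustrative only and are not taken from the authors' code.

```python
import numpy as np
from scipy.stats import gamma

def csgd_cdf(y, mu, sigma, delta):
    """CDF of a censored, shifted gamma distribution.

    The underlying gamma has mean mu and standard deviation sigma; it is
    shifted by delta (delta <= 0) and censored at zero, so all probability
    mass below zero collapses onto a point mass at zero (no precipitation).
    """
    shape = (mu / sigma) ** 2          # gamma shape parameter
    scale = sigma ** 2 / mu            # gamma scale parameter
    y = np.asarray(y, dtype=float)
    cdf = gamma.cdf(y - delta, a=shape, scale=scale)
    return np.where(y < 0.0, 0.0, cdf)

# Probability of exceeding 1 mm for one grid point, with hypothetical parameters:
mu, sigma, delta = 8.0, 6.0, -1.5
print(1.0 - csgd_cdf(1.0, mu, sigma, delta))
```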
2) Mixed-type meta-Gaussian distribution
MMGD uses the meta-Gaussian distribution (Kelly and Krzysztofowicz 1997) to estimate the conditional distribution of the observation given the forecast, which relies on a parametric normal quantile transformation (NQT) of positive observation–forecast pairs. See Wu et al. (2011), Ghazvinian et al. (2020), and Ghazvinian (2021) for details of the MMGD parameter estimation and derivation. We estimate the MMGD parameters locally, for each grid point separately, using the training sample pooled across all months (excluding the test period). We model the marginal distributions using a gamma distribution, as it provides the best fit to our data.
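As an illustration of the parametric NQT step, the sketch below fits a gamma marginal to positive amounts and maps them to standard normal space; it is a simplified, assumption-laden sketch of the general procedure described in Wu et al. (2011), not the authors' implementation.

```python
import numpy as np
from scipy.stats import gamma, norm

def fit_nqt(positive_amounts):
    """Fit a gamma marginal and return a parametric normal quantile transform."""
    shape, _, scale = gamma.fit(positive_amounts, floc=0.0)   # location fixed at zero
    def to_normal(x):
        # Map precipitation amounts to standard normal space via the fitted CDF.
        return norm.ppf(np.clip(gamma.cdf(x, a=shape, scale=scale), 1e-6, 1 - 1e-6))
    return to_normal

# Hypothetical positive amounts (mm) at one grid point:
obs = np.array([2.1, 5.4, 12.0, 0.8, 30.5, 7.7])
to_z = fit_nqt(obs)
print(to_z(obs))   # transformed values used to estimate the Gaussian dependence structure
```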
A number of past studies have evaluated or compared the MMGD with other schemes over different regions of the continental United States, including Wu et al. (2011), Brown et al. (2014), Zhang et al. (2017), Wu et al. (2018), Kim et al. (2018), and Ghazvinian et al. (2019, 2020, 2021). The results collectively indicate that, while MMGD is a highly parsimonious mechanism and can preserve the skill in the raw forecast, it has a notable limitation: a tendency to under-forecast heavy-to-extreme precipitation amounts, owing to conditional biases and, among other factors, an inability to adequately capture forecast heteroscedasticity through the NQT. This under-forecasting problem is consequential because the performance of ensemble streamflow forecasts from the U.S. National Weather Service Hydrologic Ensemble Forecast Service (HEFS) depends critically on the PQPF performance of the MMGD. In this study, separate MMGD regression models are trained for each pixel and each lead time.
3) Analog ensemble
Analog ensemble (AnEn) (Delle Monache et al. 2013; Hu et al. 2020) is a technique for generating forecast ensembles from deterministic predictions. It differs from the previous parametric methods, where a prescribed distribution for forecast precipitation is first assumed and the distributional parameters are then fitted to the data. AnEn instead relies on prediction similarity and historical observations to generate forecast ensembles.
Once the similarity metric is calculated for all historical times t′, the historical forecasts with the smallest distances are selected as analog forecasts, and their associated observations comprise the AnEn members. This process is repeated independently at all grid points and for all lead times. The generated ensemble can then be used to build an empirical distribution and converted to probabilistic forecasts, consistent with the previous parametric methods.
The most important parameters of the AnEn are the predictor weights and the number of analog members. Both are optimized using the training dataset (data excluding the testing period) and a constrained, extensive grid search. For all experiments, 15 members are generated by the AnEn.
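The following is a minimal, single-grid-point sketch of the analog search in the spirit of the Delle Monache et al. (2013) formulation, assuming standardized predictors and omitting the temporal window of the full similarity metric for brevity; the function and variable names are illustrative.

```python
import numpy as np

def analog_ensemble(current_pred, hist_preds, hist_obs, weights, n_members=15):
    """Return an analog ensemble for one grid point and one lead time.

    current_pred : (n_vars,) standardized predictors of the current forecast
    hist_preds   : (n_times, n_vars) standardized historical predictors
    hist_obs     : (n_times,) verifying observations paired with hist_preds
    weights      : (n_vars,) predictor weights (optimized offline)
    """
    # Weighted Euclidean distance between the current forecast and each
    # historical forecast (the temporal window of the full metric is omitted).
    dist = np.sqrt(((hist_preds - current_pred) ** 2 * weights).sum(axis=1))
    best = np.argsort(dist)[:n_members]
    return hist_obs[best]            # observations of the closest analogs

# Hypothetical example with 3 predictors and 1000 historical forecasts:
rng = np.random.default_rng(0)
members = analog_ensemble(rng.normal(size=3), rng.normal(size=(1000, 3)),
                          rng.gamma(2.0, 3.0, size=1000), np.ones(3) / 3)
print(np.percentile(members, [10, 50, 90]))   # empirical forecast quantiles
```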
b. Proposed deep learning for forecast uncertainty
1) U-net architecture
The proposed DL postprocessing model is built on the Unet (Ronneberger et al. 2015; Long et al. 2015) architecture. The architecture features a U-shaped diagram with an encoder (left), a bottleneck (bottom), and a decoder (right), as shown in Fig. 2. Rectangles represent multidimensional tensors and arrows represent operations. Spatial dimensions are denoted within square brackets, and the number of features in each tensor is indicated atop each rectangle.

Fig. 2. Unet architecture for probabilistic precipitation forecasts.
To begin with, forecasts of the 13 variables listed in Table 1 are standardized and then input into the encoder branch, where they are spatially compressed through repeated convolutions. Meanwhile, additional features that capture high-level spatial information are generated by the convolutions. The output of the encoder branch is usually referred to as the bottleneck. A reconstruction stage follows, in which the bottleneck features are fed into a decoder branch composed of repeated deconvolutions (or transpose convolutions) and skip connections (Drozdzal et al. 2016). The decoder structure is indispensable for image-to-image problems because it expands the spatial domain from the lower-resolution representation and reconstructs the desired output dimensions. As a result, the output of the decoder has the same spatial dimensions as the input. Each grid point has a distinct set of three variables that determine the CSGD at that location.
Skip connections, shown as gray arrows in Fig. 2, are an indispensable component of the Unet architecture: they improve training stability and help preserve fine features in the high-resolution input (Mao et al. 2016). Deep networks suffer from exploding and vanishing gradients (Mao et al. 2016; Tong et al. 2017). During weight optimization and backward propagation, following the chain rule, error gradients are multiplied as they pass along the network, and in the long chain of multiplications (following the U-shaped path) they can become numerically unstable. Skip connections provide an additional path for error terms to pass through the network by concatenating features from the encoder stage, making model training more stable. (Table A1 in appendix A lists the model parameters and the configuration of each layer in the proposed Unet model.)
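A heavily simplified PyTorch sketch of this encoder–bottleneck–decoder layout with skip connections is given below; the number of levels, feature counts, activations, and any output constraints are placeholders rather than the configuration listed in Table A1.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions with ReLU: the basic building block of the Unet."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class MiniUnet(nn.Module):
    """Simplified two-level Unet: 13 predictor channels in, 3 output channels
    (one per CSGD parameter) out. Only the encoder/bottleneck/decoder layout
    with skip connections is illustrated."""
    def __init__(self, in_ch=13, out_ch=3, base=32):
        super().__init__()
        self.enc1 = ConvBlock(in_ch, base)
        self.enc2 = ConvBlock(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = ConvBlock(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = ConvBlock(base * 4, base * 2)   # skip concatenation doubles channels
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = ConvBlock(base * 2, base)
        self.head = nn.Conv2d(base, out_ch, 1)      # 1x1 convolution to the parameters

    def forward(self, x):
        e1 = self.enc1(x)                                  # encoder level 1
        e2 = self.enc2(self.pool(e1))                      # encoder level 2
        b = self.bottleneck(self.pool(e2))                 # bottleneck features
        d2 = self.dec2(torch.cat([self.up2(b), e2], 1))    # skip connection from e2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], 1))   # skip connection from e1
        # Constraints ensuring valid distributional parameters (e.g., positive
        # mean and standard deviation) are omitted in this sketch.
        return self.head(d1)

x = torch.randn(8, 13, 128, 128)      # minibatch of standardized predictor maps
print(MiniUnet()(x).shape)            # torch.Size([8, 3, 128, 128])
```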
2) Loss function and model training
As shown in Fig. 2, the Unet produces distributional parameters at each grid point. To optimize the learned distribution, we use the CRPS (Hersbach 2000) as the loss function during model training. The CRPS is a probabilistic scoring metric that integrates the squared difference between the CDFs of forecasts and observations. Minimizing the CRPS encourages a sharper and bias-corrected forecast distribution. Empirically, the CRPS can be calculated by aggregating Brier scores over all possible thresholds. For deterministic predictions, the CRPS collapses to the mean absolute error (MAE).
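A closed-form CRPS exists for the CSGD (Scheuerer and Hamill 2015); as a simpler illustration of the loss, the sketch below uses a Monte Carlo estimate of the CRPS from draws of the predictive distribution, which is one possible (but more expensive) alternative and not necessarily what was implemented here.

```python
import torch

def sample_crps(samples, obs):
    """Monte Carlo estimate of the CRPS: E|X - y| - 0.5 * E|X - X'|.

    samples : (n_draws, batch, ny, nx) draws from the predictive distribution
    obs     : (batch, ny, nx) verifying observations
    """
    term1 = (samples - obs.unsqueeze(0)).abs().mean(dim=0)
    term2 = (samples.unsqueeze(0) - samples.unsqueeze(1)).abs().mean(dim=(0, 1))
    return (term1 - 0.5 * term2).mean()   # averaged over the batch and grid points

# Hypothetical usage: 20 draws for a minibatch of eight 64 x 64 precipitation maps.
draws = torch.rand(20, 8, 64, 64) * 20.0
y = torch.rand(8, 64, 64) * 20.0
print(sample_crps(draws, y))
```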
Another possible solution is to apply early stopping, but in practice this resulted in too few training epochs (fewer than five in our case); the Unet failed to converge and underperformed. We also found (not shown) that the Unet ensemble tends to be underdispersive if no constraint is imposed on the forecast uncertainty.
In this study, one Unet is trained for the entire domain, and separate Unet models are trained for each lead day. Each Unet learns (outputs) the three distributional parameters of a CSGD simultaneously. Four water years are selected for testing: 1997, one of the strongest El Niño events in the past 60 years of records; 2011, identified as a La Niña year; and two El Niño–Southern Oscillation (ENSO) neutral years, 2016 and 2013, which were wet and dry, respectively. El Niño forcing and background internal variability have been shown to influence precipitation predictability (Chapman et al. 2021); therefore, these four years are selected to test the robustness of the proposed algorithm. Models are trained for each test water year independently, with the previous year used for validation and all other years for training.
Hyperparameter tuning is carried out using the training and validation data, excluding the four years used for testing. Stochastic gradient descent with minibatches (Li et al. 2014) of size 8 and a momentum of 0.9 is used during weight optimization. A cyclical learning rate (Smith 2017) is used with a maximum of 10^−2, a minimum of 10^−5, a step size of 5, and a shrink factor of 2. The step size is the number of training iterations it takes to go from the minimum to the maximum learning rate. After each cyclic walk (maximum to minimum and then back to maximum), the learning rates are divided by the shrink factor. The maximum number of training iterations is set to 200, but early stopping is engaged if no improvement in the validation loss has been observed for 20 consecutive iterations.
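Under these stated settings, a sketch of the optimizer and learning-rate schedule in PyTorch might look as follows; the "triangular2" mode halves the learning-rate amplitude after each cycle, which corresponds to the shrink factor of 2, while the stand-in model and the number of inspected iterations are placeholders.

```python
import torch

model = torch.nn.Conv2d(13, 3, 1)                  # stand-in for the trained Unet
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-5, max_lr=1e-2,
    step_size_up=5,              # iterations from the minimum to the maximum rate
    mode="triangular2",          # amplitude halved each cycle (shrink factor of 2)
    cycle_momentum=False)        # keep the momentum fixed at 0.9

for it in range(20):             # inspect the first few cycles of the schedule
    optimizer.step()             # the weight update would follow loss.backward()
    scheduler.step()
    print(it, scheduler.get_last_lr())
```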
To determine the best scaling factor α, a sequence of values from 0 to 1 with an increment of 0.025 was tested. Although there are no theoretical upper and lower bounds for α, values below zero or above one lead to unrealistic uncertainty estimates; therefore, the parameter search is confined to the interval between 0 and 1. An α value of 0.35 is chosen for training the final Unet because it achieved the best CRPS and spread–skill correlation (not shown) on the validation data.
4. Results
In this section, we present precipitation forecast comparisons and their verification over the 4-yr test period. Unet forecasts are compared to the baseline Weather Research and Forecasting (WRF) model forecasts and three different postprocessing schemes, AnEn, MMGD, and CSGD. Since WRF is a deterministic NWP model and all other methods generate probabilistic forecasts, the distribution mean is used to compare with WRF forecasts when a deterministic form is desired. Additional descriptions of the verification metrics used are provided in appendix B.
a. Deterministic predictions and forecast uncertainty
The ability of a model to accurately predict the accumulated precipitation throughout the rainy season is critical for water management. Figure 3 shows the mean areal precipitation over the Yuba–Feather River watershed, which drains the western slope of the Sierra Nevada and sits to the northeast of Sacramento. This watershed includes several hydropower projects, and accurate precipitation forecasts are critical to their operation. The time series shows precipitation aggregated over days in the rainy season starting in December 2016, a particularly wet year.

Fig. 3. Mean areal precipitation over the Yuba–Feather River watershed, with a topographic map at the lower right. The time series shows precipitation aggregated over days in the water year starting in December 2016. For Unet, AnEn, CSGD, and MMGD, the distributional mean is used to calculate the time series.
As shown in Fig. 3, WRF remained close to PRISM until 18 January 2017, when it started to overpredict. There were several major rain events on 9 January, 19 January, 21 February, and 27 February 2017, and WRF showed significant overprediction during the latter three events, enlarging its difference from PRISM. On the other hand, MMGD, CSGD, and AnEn all underpredicted early in the water year, around 13 December 2016, with MMGD consistently producing the largest underprediction. The Unet showed mixed performance throughout the water year, underpredicting the rain event between 5 and 13 January 2017 but overpredicting the event between 1 and 10 February 2017. Overall, however, the Unet closely follows PRISM throughout the year, and its prediction of the year-round total precipitation is the most accurate among the compared forecasts.
During the 2016/17 rainy season, the largest daily accumulated rainfall was received on 9 January 2017, and Fig. 4 compares forecasts and observations for this event. PRISM is shown in Fig. 4a and WRF in Fig. 4b. The distributional means from MMGD, CSGD, AnEn, and Unet are shown in Figs. 4c–f, with their standard deviations σ plotted below. Forecasts for the first lead day are shown.

Fig. 4. Forecasts and observations of the precipitation event on 9 Jan 2017: (a) the observations from PRISM; (b) the deterministic WRF reforecast; (c)–(f) the distributional mean from MMGD, CSGD, AnEn, and Unet, respectively; and (g)–(j) the standard deviation σ. For better visualization, the precipitation difference is shown in supplemental Fig. 2.
On 9 January 2017, a large amount of precipitation fell over the Sierra Nevada, with the northern rain region extending into northwestern Nevada. An average of 90 mm of precipitation was received for the day along the northern coast and over the Russian River watershed. WRF overpredicted the precipitation over the Sierra Nevada and at Sacramento, California. It also overpredicted to the west of the Mojave Desert and failed to predict the light rain in southern California.
The four postprocessing methods showed mixed results. MMGD underpredicted both the intensity and the extent of precipitation, for example, in the Central Valley and the Sierra Nevada. CSGD and AnEn both performed better at estimating the rain intensity over the Sierra Nevada, but these forecasts still underpredicted precipitation over the Russian River watershed in northern California. Finally, the Unet best resembles the ground truth, PRISM, for example, in the Sierra Nevada and the Russian River watershed. Additionally, the Unet corrected the WRF overprediction to the west of the Mojave Desert and improved the predictions of light rain in southern California. On average, the root-mean-square errors (RMSEs) for this day are 20.01 mm (WRF), 21.44 mm (MMGD), 16.43 mm (CSGD), 16.16 mm (AnEn), and 12.55 mm (Unet).
The forecast uncertainty, quantified using the distributional standard deviation, is plotted below the precipitation maps. In general, forecast uncertainty is correlated with precipitation intensity. A low uncertainty produces a sharp forecast, but the uncertainty estimate should also reflect the expected predictive skill of the forecast. Although the Unet has higher uncertainty than the other methods over the Sierra Nevada (Fig. 4j), it provides a timely warning that the coming event is hard to predict, and this information matches the rarity and intensity of the rain event. On the other hand, the Unet produced a correct forecast with lower uncertainty over the southern Coastal Range and to the west of the Mojave Desert, where all other methods failed. This difference suggests that the additional spatial information in the Unet helps to correct small-scale precipitation mismatches.
It is worth pointing out the visually smoother precipitation field generated by the Unet compared with the benchmark methods. The main reason is that only one Unet is trained across the domain, and the convolution-based architecture tends to generate smoother output. MMGD, CSGD, and AnEn, in contrast, are all applied grid by grid, and their results therefore appear to better capture local variability. However, local variability does not necessarily indicate predictive skill; there is a balance between resolving fine-scale features and improving forecast skill.
To provide a systematic evaluation, we calculate two deterministic metrics, RMSE and Pearson correlation, over the 4-yr testing period and across the entire study domain (48 108 grid points), shown in Fig. 5. RMSE skill scores are calculated using the climatological CSGD as the reference forecast.
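As a sketch of how such a skill score and its bootstrap confidence interval can be computed, assuming simple resampling over forecast days and a climatological reference given as a deterministic value:

```python
import numpy as np

def rmse(pred, obs):
    return np.sqrt(np.mean((pred - obs) ** 2))

def rmse_skill_score(pred, ref, obs, n_boot=1000, seed=0):
    """RMSE skill score of `pred` against reference `ref` (1 is perfect, 0 equals
    the reference), with a bootstrap 95% confidence interval over forecast days."""
    score = 1.0 - rmse(pred, obs) / rmse(ref, obs)
    rng = np.random.default_rng(seed)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(obs), len(obs))      # resample days with replacement
        boot.append(1.0 - rmse(pred[idx], obs[idx]) / rmse(ref[idx], obs[idx]))
    low, high = np.percentile(boot, [2.5, 97.5])
    return score, (low, high)

# Hypothetical daily values (mm) for one lead time:
rng = np.random.default_rng(1)
obs = rng.gamma(2.0, 5.0, size=400)
pred = obs + rng.normal(0.0, 3.0, size=400)    # postprocessed forecast
ref = np.full_like(obs, obs.mean())            # stand-in for the climatological reference
print(rmse_skill_score(pred, ref, obs))
```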

Fig. 5. (a) RMSE skill score and (b) correlation as a function of forecast lead time. Skill scores are calculated using the climatological CSGD as the reference forecast. Vertical dashes indicate a 95% confidence interval from bootstrapping. Note that the ranges of the vertical axes vary across panels.
RMSE weights samples disproportionately by their error magnitude, penalizing larger errors more heavily. In Fig. 5a, all postprocessing methods show improvements over WRF, and their performance is close to one another; however, the Unet remains at the top of the diagram with the best predictive skill. In terms of correlation (Fig. 5b), the results are similar: the Unet consistently outperforms all benchmarks across all forecast lead times, having the highest correlation. Although the Unet and CSGD share the same prescribed distribution, the Unet has the additional DL architecture that encodes spatial features and improves prediction accuracy.
b. CRPS
The CRPS is used to evaluate the quality of the probabilistic forecasts. Since West-WRF is a deterministic system, the MAE is calculated for it instead of the CRPS. The CRPS skill score is calculated using the climatological CSGD as the reference forecast.
In Fig. 6, all postprocessing methods yield significant improvement over WRF. CSGD and MMGD have similar performance, but AnEn and the Unet outperform both. AnEn and the Unet are the two data-driven methods, and the results suggest they can better exploit the large dataset than CSGD and MMGD. The Unet is the best-performing method, having the highest CRPS skill scores. This is probably because the AnEn is limited by its similarity metric when locating weather analogs: it relies on a set of weather analogs to derive the forecast distribution, whereas the Unet directly learns the distributional parameters without the need to form an ensemble.

Fig. 6. CRPS skill score calculated using the climatological CSGD as the reference forecast. Vertical dashes indicate a 95% confidence interval from bootstrapping.
Another way to evaluate performance is to visualize the spatial variation of the CRPS. Figure 7 shows maps of CRPS skill scores for WRF and the four postprocessing methods in the first five panels; green indicates that the evaluated method outperforms the climatological CSGD, and red indicates the opposite. As a benchmark, WRF outperforms the climatological CSGD for most of the study domain, except for scattered patches in the north of California and around the Northern Basin and Range in Nevada. MMGD and CSGD predictions are most skillful around the Sierra Nevada and along coastal regions; these two methods, however, yield less improvement in the northeast of the domain. AnEn and the Unet both perform well across the domain, with higher CRPS skill scores than the previous methods. These results suggest that AnEn and Unet predictions are skillful at locations with disparate climatologies.

Fig. 7. CRPS skill score maps averaged over all forecast lead times for (a) WRF, (b) MMGD, (c) CSGD, (d) AnEn, and (e) Unet. The skill scores are calculated using the climatological CSGD as the reference forecast; (f) the CRPS skill score of Unet against AnEn.
The last panel (Fig. 7f) further compares AnEn and the Unet by calculating the CRPS skill score of the Unet against AnEn. Blue indicates that the Unet predictions are more skillful, and red indicates that the AnEn predictions are more skillful. The Unet has similar or better skill than AnEn over most of the domain. It outperforms AnEn in Northern California and the Sierra Nevada, where most of the rain is typically received, but variation exists in the southern Central Valley and Death Valley; these areas are among the driest in the western United States, and AnEn shows better skill there. One difference between the Unet and AnEn is that AnEn searches for weather analogs independently at each grid point, whereas a single Unet is trained to predict the entire domain. Aside from its computational advantage, the Unet is also highly effective at extracting spatial information and learning a skillful relationship between forecasts and observations across a spatial domain.
c. Spread–skill relationship, reliability, and resolution
High accuracy is critical but is only one part of building a robust probabilistic postprocessing workflow. The overall quality of probabilistic forecasts also depends on ensemble consistency and forecast reliability.
Figure 8 evaluates and compares the binned spread–skill correlation (Van den Dool 1989; Wang and Bishop 2003; Leutbecher and Palmer 2008) for the postprocessing methods, aggregated across lead times and the study domain. Forecast spread is estimated using the standard deviation, and the standard error is calculated as the RMSE of the distributional mean. The diagonal line on each panel depicts a perfect correlation between spread and standard error. A high correlation indicates that the forecast spread is consistent with the expected skill and can be a reliable first-order estimate of the flow-dependent error. The spread–skill metric is suitable for evaluating heteroscedastic predictor–predictand relationships with time- or ensemble-varying spread: it shows whether errors are larger when the spread/uncertainty is larger. It is not a diagnostic of the source of spread, only a metric showing whether spread is a reliable indicator of forecast uncertainty (Hopson 2014).
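A minimal sketch of the binned spread–skill computation is given below: forecasts are binned by predicted spread, and within each bin the mean spread is compared with the RMSE of the corresponding errors; the quantile binning and the synthetic data are illustrative choices.

```python
import numpy as np

def binned_spread_skill(spread, error, n_bins=10):
    """Bin forecasts by predicted spread and return, per bin, the mean spread
    and the RMSE of the corresponding errors (obs minus distributional mean)."""
    edges = np.quantile(spread, np.linspace(0, 1, n_bins + 1))
    which = np.clip(np.digitize(spread, edges[1:-1]), 0, n_bins - 1)
    mean_spread = np.array([spread[which == b].mean() for b in range(n_bins)])
    bin_rmse = np.array([np.sqrt(np.mean(error[which == b] ** 2)) for b in range(n_bins)])
    return mean_spread, bin_rmse   # points near the diagonal indicate good consistency

# Hypothetical data in which the spread roughly tracks the error magnitude:
rng = np.random.default_rng(2)
spread = rng.gamma(2.0, 2.0, size=5000)
error = rng.normal(0.0, spread)            # errors drawn with spread-dependent std
print(np.round(binned_spread_skill(spread, error), 2))
```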

Fig. 8. Binned spread–skill correlation of (a) CSGD, (b) MMGD, (c) AnEn, and (d) Unet. Spread is estimated using the standard deviation and skill using the RMSE. The diagonal line shows a perfect correlation. Vertical dashes show the 95% confidence interval.
To begin with, all four methods show a high level of consistency between forecast skill and spread, suggesting that forecast uncertainty can be effectively estimated with statistical methods. MMGD has the best overall spread–skill correlation for forecasts with smaller spread (<10 mm), most likely because it uses two separate distributions for dry and wet conditions. However, it becomes overdispersive for forecasts with large spread (>10 mm). CSGD, AnEn, and the Unet all have underdispersive forecast ensembles overall. The difference lies in forecasts with large spread (>10 mm), for which AnEn and the Unet have a high spread–skill correlation, most likely because of their ability to incorporate more predictor variables than CSGD and MMGD.
The Unet was previously found to produce underdispersive ensembles when applied to IVT forecasts [Fig. 5 in Chapman et al. (2022)], and also in this work when the regularization term on forecast uncertainty is not used (not shown). When the regularization in Eq. (5) is added, the final uncertainty estimate depends on both output parameters of the Unet, σ and μ. This improvement in ensemble consistency suggests that regularizing the learned uncertainty is necessary and effective for training a reliable model.
Figure 9 shows the Brier score (first row) and its decomposition into reliability (second row) and resolution (third row). The Brier score is an error metric measuring the average squared difference between forecast probabilities and observed outcomes; a lower Brier score is better. The Brier score can be decomposed into reliability, resolution, and uncertainty (Murphy 1973). Uncertainty measures the inherent variability of the event outcomes and is not conditioned on the forecasts. Resolution quantifies the ability of the method to discriminate event probabilities that differ from climatology, so a higher resolution is preferred. Reliability measures the calibration mismatch between forecast probabilities and observed frequencies, so a lower reliability term is better. When calculating Brier scores for a continuous variable, a predefined threshold is needed to binarize the ground truth and to convert the CDF to a probability. Three thresholds are therefore used: 1 mm, for evaluating how well the different methods predict PoP, and two quantiles (95% and 99%), for evaluating skill for extreme events. In terms of the Brier score (Figs. 9a–c), the Unet outperforms all other methods at all three thresholds, suggesting that it generates the most skillful predictions for a wide range of rain events.
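For completeness, a sketch of the Murphy (1973) decomposition from binned forecast probabilities is given below; the binning granularity and the toy data are illustrative choices.

```python
import numpy as np

def brier_decomposition(prob, event, n_bins=10):
    """Decompose the Brier score into reliability - resolution + uncertainty
    (Murphy 1973) using binned forecast probabilities.

    prob  : forecast probabilities in [0, 1]
    event : binary outcomes (1 if the threshold was exceeded)
    """
    bs = np.mean((prob - event) ** 2)
    base_rate = event.mean()
    bins = np.minimum((prob * n_bins).astype(int), n_bins - 1)
    rel = res = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            p_k, o_k, n_k = prob[mask].mean(), event[mask].mean(), mask.sum()
            rel += n_k * (p_k - o_k) ** 2          # calibration mismatch within the bin
            res += n_k * (o_k - base_rate) ** 2    # departure of the bin from climatology
    rel, res = rel / len(prob), res / len(prob)
    unc = base_rate * (1.0 - base_rate)
    return bs, rel, res, unc                       # bs is approximately rel - res + unc

rng = np.random.default_rng(3)
prob = rng.uniform(size=10000)
event = (rng.uniform(size=10000) < prob).astype(float)   # well-calibrated toy forecast
print(brier_decomposition(prob, event))
```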

Fig. 9. (top) Brier score and its decomposition into (middle) reliability and (bottom) resolution for three thresholds: (left) 1 mm, (center) 95%, and (right) 99% of the location-specific climatological distribution. Percentile values are calculated for each grid point using the associated climatology. Vertical dashes indicate a 95% confidence interval. Note that the ranges of the vertical axes vary across panels.
In terms of reliability (Figs. 9d–f; lower is better), MMGD has the best calibration for forecasting PoP (Fig. 9d) because it separates dry and wet conditions and estimates parameters for each. However, it becomes less reliable for more extreme events (Figs. 9e,f). Both CSGD and the Unet assume a censored, shifted gamma distribution and model no-rain events by applying a shift to the distribution; this distributional assumption helps in predicting extreme events, as shown in Figs. 9e and 9f. AnEn and the Unet are comparable for extreme events, but the Unet slightly outperforms AnEn for light rain events. It is worth mentioning the oscillating behavior of the Unet as a function of lead time in Figs. 9e and 9f; although the other methods show similar trends, the oscillation is most pronounced for the Unet. This might be related to the observational dataset used in this study, but we are currently unsure of the exact cause.
In terms of resolution (Figs. 9g–i; higher is better), the Unet again outperforms the other methods at all three thresholds, meaning that it typically has better predictive skill for events that differ from climatology. The improvement in resolution outweighs the underperformance in reliability, so the overall Brier score is improved (lowered) by the Unet. This outperformance suggests that the CRPS, being a proper scoring rule, encourages optimization of both accuracy and reliability. With a large parameterization (over one million model parameters), the Unet has the flexibility and capacity to distinguish a wide array of precipitation events and to produce accurate and reliable probabilistic forecasts.
Figure 10 visualizes the spatial variation of the Brier skill scores of the Unet against MMGD, CSGD, and AnEn for the same three thresholds (1 mm, 95%, and 99%). The panels in the last column are generated from PRISM: Fig. 10j shows the map of PoP using 1 mm as the threshold, and Figs. 10k and 10l show the precipitation maps for the 95th and 99th percentiles.

Fig. 10. Brier skill scores of Unet averaged over all forecast lead times for three thresholds: (top) 1 mm, (middle) 95%, and (bottom) 99% of the location-specific climatological distribution, against (a),(d),(g) MMGD, (b),(e),(h) CSGD, and (c),(f),(i) AnEn. Panels (j)–(l) are generated with PRISM: (j) a map of PoP using 1 mm as the threshold, and (k),(l) the precipitation maps for the 95th and 99th percentiles, respectively.
In terms of PoP (Figs. 10a–c), the Unet outperforms the benchmark methods across the domain, which suggests that its predictions are more skillful at detecting rain events. The largest improvement is achieved relative to CSGD (Fig. 10b), which suggests that the Unet architecture is effective at ingesting additional predictor variables and detecting rain-related spatial patterns through convolutions.
In terms of extreme events (Figs. 10d–i), the Unet largely outperforms the benchmarks when the 95th percentile threshold is used; its predictions have better skill in the Sierra Nevada and the Central Basin and Range in southern Nevada. With the 99th percentile threshold, however, the performance of the Unet varies by region: MMGD, CSGD, and AnEn perform better in the Mojave Desert and the Nevada Basin and Range, while the Unet performs better in the rest of the domain. Since only one Unet model is trained while the benchmark methods are applied grid by grid, this could be a limitation of the Unet when it is applied to a large spatial domain and evaluated with a high threshold. On the other hand, the Unet typically outperforms the other methods in the high-impact regions where precipitation is copious, and its larger biases are found in relatively drier regions.
d. Length of training
The Unet is a data-intensive method because of its large model parameterization. This section presents results on the sensitivity of model performance (predictive skill) to the length of the training data and answers the question of how much data are needed to train a skillful model.
Figure 11 shows the CRPS of AnEn and the Unet as functions of the number of years in the training data. We compare only these two methods because they are the best-performing methods in this work. The water year selected for testing is 2016 because it is one of the wettest years in the dataset and the most recent year in the test data, so the historical record is the longest. Training years are included retrospectively; for example, two training years means that the two years prior to 2016 are used for training (searching, in the case of AnEn). The MAE is shown for WRF. Both methods improve over WRF with only two years of training data. However, with such limited training data the Unet is likely to overfit and produces only a slight improvement, making AnEn the preferred method. This finding is consistent with prior studies (Delle Monache et al. 2013; Eckel and Delle Monache 2016; Hu and Cervone 2019) showing that a few years of search data can already yield satisfactory results.

Fig. 11. CRPS for Unet and AnEn, and MAE for WRF, as a function of the number of years in the training data. Unet and AnEn are tested on 2016, with training from previous years. Verification results are aggregated over all lead times and locations.
Since the Unet possesses a large number of model parameters, it underperforms AnEn when only a few years of training data are used, but its performance improves as the number of years increases. After 12 years, the performance of AnEn converges and reaches a plateau, suggesting that AnEn can no longer benefit from an even larger historical repository. This can be traced to the limitation of the similarity metric (Hu et al. 2023): even though more similar forecasts can be found, they might not be better weather analogs that improve the final prediction accuracy. In contrast, the Unet maintains its momentum in improving predictive skill, and the CRPS keeps decreasing as more training data are used. This demonstrates the capability of the Unet to ingest a large amount of data and still identify patterns that lead to more accurate forecasts. The experiments stop at 30 years, the maximum number of years available.
Similarly, Fig. 12 shows the Brier scores of the Unet and AnEn as an increasing number of training years is used. The two thresholds are consistent with the previous analysis: 95% and 99%. In this comparison, AnEn again starts as the prevailing method when two years of training data are used, and its performance stagnates after 12 training years. The Unet, on the other hand, overtakes AnEn earlier, at four training years, as opposed to eight years in the previous evaluation. This suggests that the Unet is well suited to predicting high-impact precipitation events and, with proper training, is capable of capturing patterns pertinent to high-intensity precipitation even when the training dataset is limited. Finally, given its large size, the Unet benefits from more training data, and it is encouraging that its performance keeps improving after AnEn has reached its potential.

Fig. 12. Brier scores for AnEn and Unet as a function of the number of years in the training data. The thresholds are (a) 95% and (b) 99%. Unet and AnEn are tested on 2016, with training from previous years. Verification results are aggregated over all lead times and locations.
e. Model sensitivity to predictors
The Unet is a nonlinear model whose weights are learned in a data-driven fashion. Although it is difficult to pinpoint which mechanisms have been identified and learned to help postprocess precipitation forecasts, we can still visualize the relative sensitivity of model forecasts to small changes in the model input via the integrated gradient (IG) method.
The IG (Sundararajan et al. 2017; Mudrakarta et al. 2018; Sayres et al. 2019; McCloskey et al. 2019; Sundararajan et al. 2019) is a gradient-based attribution method that quantifies the contribution of each input predictor by calculating the product of gradients and input values, similar to a linear system in which the contribution of each predictor is the product of the predictor value and its coefficient. Given a model input and a reference input, the IG integrates the gradient and quantifies how a change in the input, relative to the reference, affects the model output. A reference input can be a Gaussian-blurred version of the model input (a background signal) that provides little to no detailed information for the Unet to make a useful prediction. The IG then calculates gradients to measure the relationship between changes to an input feature and changes in the model output. It is a nonintrusive method, meaning that it can be applied directly to any trained DL model without modifying the architecture.
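A minimal sketch of the Riemann-sum approximation of the integrated gradients (Sundararajan et al. 2017) is shown below; the stand-in model, the zero baseline in place of a Gaussian-blurred input, and the choice of scalar output summary are all illustrative assumptions, and attribution libraries such as Captum offer equivalent off-the-shelf implementations.

```python
import torch

def integrated_gradients(model, x, baseline, steps=50):
    """Riemann-sum approximation of integrated gradients for a scalar summary
    of the model output (here, the domain mean of the first output channel).

    x, baseline : tensors of shape (channels, height, width)
    """
    total_grad = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = (baseline + (k / steps) * (x - baseline)).unsqueeze(0)
        point.requires_grad_(True)
        target = model(point)[0, 0].mean()     # scalar to attribute (e.g., the mean parameter)
        target.backward()
        total_grad += point.grad.squeeze(0)
    return (x - baseline) * total_grad / steps  # per-pixel attribution

model = torch.nn.Conv2d(13, 3, 3, padding=1)    # stand-in for the trained Unet
x = torch.randn(13, 64, 64)                     # standardized predictor maps
baseline = torch.zeros_like(x)                  # stand-in for a Gaussian-blurred baseline
attr = integrated_gradients(model, x, baseline)
print(attr.abs().sum(dim=(1, 2)) / attr.abs().sum())   # per-predictor share of sensitivity
```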
Figure 13 shows the sensitivity of the forecast μ and σ to small changes in the predictors. The feature sensitivity is calculated using the IG (Mudrakarta et al. 2018; Adebayo et al. 2020) and then normalized so that all features sum to one. It is worth pointing out that feature sensitivity is not a direct measure of feature importance, but it is a good indicator of how each input feature changes the prediction output. For both precipitation intensity and forecast uncertainty, precipitation appears as the most sensitive input feature, as expected given its high autocorrelation with the predictand. IVT and IWV are identified as the next two most sensitive features after precipitation. This is consistent with our knowledge that they are among the most important variables for explaining variations in precipitation. For example, ARs are long and narrow corridors of enhanced IWV and IVT, primarily driven by a pre-cold-frontal low-level jet stream of an extratropical cyclone (American Meteorological Society 2022). On the West Coast, over the study domain, ARs account for 30%–50% of the annual precipitation (Oakley et al. 2018). The Unet successfully identified this nonlinear relation between IWV/IVT and precipitation through data-driven model training.

Fig. 13. Forecast sensitivity of (a) μ and (b) σ with respect to changes in model predictors. Feature sensitivity is estimated using the integrated gradient and then normalized so that all features sum to one.
Comparing Figs. 13a and 13b, the sensitivity of the forecast uncertainty is spread across multiple features, in contrast to the predominant influence of precipitation when predicting rain intensity. This suggests that, when estimating forecast uncertainty, the Unet tends to identify patterns by examining the interaction between multiple features.
The benefit of having multiple features can be further observed in the supplemental material, Fig. 3. It is worth noting that CSGD and MMGD use precipitation as the only predictor, whereas data-driven methods like AnEn and the Unet are more flexible at ingesting multiple predictors. Having multiple predictors improves the performance of both AnEn and the Unet, although the deep architecture appears to play the larger role, as suggested by the smaller performance difference (shown by the solid red and dashed red lines).
5. Summary and conclusions
In this work, the Unet has been applied to postprocess NWP forecasts and to generate high-resolution, 0–4-day probabilistic precipitation forecasts. The Unet learns the distributional parameters of a censored, shifted gamma distribution. The objective evaluation shows that the Unet outperforms the benchmark methods at all lead times and performs best for extreme events, i.e., the 95th and 99th percentiles of the distribution.
Compared to traditional parametric and nonparametric postprocessing methods, the Unet benefits from its DL architecture, which can easily incorporate more predictor variables and extract spatial information. Since these complex spatial patterns are stored as bottleneck features (Fig. 2), the Unet benefits from a large parameterization so that more patterns can be encoded. These patterns are not location-dependent, meaning that they can be detected across different parts of the domain. In terms of forecast errors and spread, all four methods show high spread–skill correlations, which indicates that the spread generated by postprocessed probabilistic forecasts can be a good first-order estimate of the predictive skill, although AnEn and the Unet show better correlations for forecasts with large spread (>10 mm).
Learning a distribution and its forecast uncertainty from deterministic models and observations is a challenging task because we have only one realization of the model and one reality that we can observe. One difficulty when training the Unet with CRPS is that the network parameters are not constrained by physics. As a result, model training can lead to unreasonable estimates and cause numerical instability in the computation of CRPS. Ideally, it would be useful to have a “true” forecast probability distribution (Anderson 1996) obtained with perfect initialization and a perfect model. In reality, observational errors are inevitable, and the verifying observations are merely samples from the “true” forecast probability distribution. In this study, we used an additional regularization term to constrain the optimization of the standard deviation. This regularization has been found effective for avoiding under-dispersive distributions and improving model stability.
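The exact form of the regularizer is not given in this excerpt; purely as an illustration, a hinge-type penalty that discourages very small predicted standard deviations could be added to the mean CRPS loss as

\[
\mathcal{L}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N}\mathrm{CRPS}\!\left(F_{\mu_i,\sigma_i},\, y_i\right)
\;+\; \lambda\,\frac{1}{N}\sum_{i=1}^{N}\max\!\left(0,\; \sigma_{\min}-\sigma_i\right)^{2},
\]

where $\lambda$ and $\sigma_{\min}$ are hypothetical tuning constants. Penalizing $\sigma_i$ below a floor is one way such a term could discourage under-dispersive predictive distributions and stabilize the CRPS computation.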
In this work, we also tested the sensitivity of performance to the volume of training data. Results show that traditional statistical postprocessing methods usually perform better with limited data. However, as more data become available (roughly more than eight years of training data in our case), ML starts to outperform the traditional methods. This result demonstrates the ability of ML to learn the highly nonlinear relationship between forecasts and observations when presented with enough training data.
The proposed framework in this work can provide robust and reliable input for downstream applications such as hydrological forecasting and water resource management. However, one limitation is that spatially and temporally consistent members are usually needed for hydrological simulations, whereas the Unet only generates the distributional parameters. Future research could focus on producing spatially and temporally consistent ensemble members from the prescribed distributions. This work also builds on a deterministic forecast model, West-WRF. Future work can investigate how the Unet can be applied to the calibration of ensemble model output. Alternatively, future research could study how various encoder–decoder architectures, e.g., the generative adversarial network (Goodfellow et al. 2014), could be used to preserve detailed spatial information and generate realistic precipitation patterns. On a different note, this project primarily focuses on precipitation forecasting, but since the Unet is capable of learning spatial dependencies, it would be interesting to test its multivariate performance. For example, future research could apply copula-based techniques, such as ensemble copula coupling (Schefzik et al. 2013; Schefzik 2017), to CSGD and compare with the Unet.
Acknowledgments.
This work is supported by the California Department of Water Resources Ph3 AR research program (Award 4600014294) and the Forecast Informed Reservoir Operations Award (USACE W912HZ1920023).
Data availability statement.
The West-WRF data are available from the forecast research landing page of the Center for Western Weather and Water Extremes at https://cw3e.ucsd.edu/west-wrf/. The PRISM data are available from Oregon State University at https://www.prism.oregonstate.edu/.
APPENDIX A
Unet Model Parameters and Configurations
Table A1 lists the model parameters and configuration for each layer in the Unet. The layers are connected in sequence from the top of the table to the bottom. For example, the input layer is connected to Pad_0; the output of Pad_0 is then fed into Downsample_0_Conv_0, and so forth.
Table A1. Unet parameter details and configurations. Padding and cropping layer parameters include [[top_pad, bottom_pad], [left_pad, right_pad]]. Convolutional layer parameters include [height, width, features], stride, padding. MaxPooling layer parameters include size, stride, padding. The X represents the sample size. Output shapes include [samples, height, width, features].


Skip connections are represented by the layers Concatenate_0, Concatenate_1, and Concatenate_2. These layers aggregate the outputs of previous layers without running additional convolutional operations; they simply concatenate the features. As a result, skip connections are not trainable and do not introduce additional parameters into the model.
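As a minimal illustration of how a concatenation skip connection adds no trainable parameters, the following Keras-style sketch joins encoder and decoder features; the layer names and sizes are illustrative only and do not reproduce the configuration in Table A1.

import tensorflow as tf
from tensorflow.keras import layers

# Toy encoder-decoder fragment with one concatenation skip connection.
inp = layers.Input(shape=(32, 32, 8))
down = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)    # encoder features
pooled = layers.MaxPooling2D(2)(down)
up = layers.Conv2DTranspose(16, 3, strides=2, padding="same")(pooled)  # decoder features
skip = layers.Concatenate()([up, down])   # concatenation only: no trainable parameters
out = layers.Conv2D(3, 1)(skip)
model = tf.keras.Model(inp, out)
model.summary()                           # the Concatenate layer reports 0 parameters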
The layers DeConv_0, DeConv_1, and DeConv_2 are deconvolutional layers, also referred to as transposed or fractionally strided convolutions. In essence, a convolution defined by a kernel can be expressed as a matrix multiplication, and the forward pass applies this matrix to the input. A deconvolution instead multiplies by the transpose of that matrix, mapping the output shape of the corresponding convolution back to its input shape. Details of deconvolutions can be found in sections 4.1 and 4.2 of Dumoulin and Visin (2016).
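The matrix view can be made concrete with a toy one-dimensional example. The sketch below, which is illustrative only, builds the convolution matrix C explicitly and shows that multiplying by its transpose maps a signal of the convolution's output length back to the input length.

import numpy as np

# Stride-1 "valid" convolution as a matrix product, and its transpose as the
# corresponding deconvolution (upsampling map).
kernel = np.array([1.0, 2.0, 3.0])
n_in = 6                                   # input length
n_out = n_in - len(kernel) + 1             # output length of the valid convolution

# Build the (n_out x n_in) convolution matrix C so that y = C @ x.
C = np.zeros((n_out, n_in))
for i in range(n_out):
    C[i, i:i + len(kernel)] = kernel

x = np.arange(1.0, n_in + 1.0)
y = C @ x                                   # forward pass: the convolution

# The transposed convolution multiplies by C.T, recovering the input length.
x_up = C.T @ y
print(y.shape, x_up.shape)                  # (4,) (6,)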
APPENDIX B
Probabilistic Metrics
a. CRPS
We chose the CRPS as a proper score because probabilistic verification of the forecast distribution is of interest in this work. To ensure a fair comparison of different types of forecasts, the CRPS is computed as the integral of Brier scores over all thresholds for all models in verification. For CSGD, MMGD, and the Unet, the Brier score is computed from the predictive CDF. For AnEn (an ensemble forecast), since the forecast is not associated with a prescribed distribution, the CRPS is computed from the empirical distribution of the ensemble members.
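For reference, the CRPS of a predictive CDF $F$ for a verifying observation $y$ (Matheson and Winkler 1976; Hersbach 2000) can be written as

\[
\mathrm{CRPS}(F, y) \;=\; \int_{-\infty}^{\infty} \left[ F(x) - H(x - y) \right]^{2} \, dx ,
\]

where $H$ is the Heaviside step function. This is equivalent to integrating the Brier score over all thresholds $x$, which is the form used for the comparison described above.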
b. Brier score
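The terms discussed below refer to the standard decomposition of the Brier score (Brier 1950; Murphy 1973), which, for $K$ distinct forecast probability values issued $N$ times in total, can be written as

\[
\mathrm{BS} \;=\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \left( p_k - \bar{o}_k \right)^2}_{\text{reliability}}
\;-\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \left( \bar{o}_k - \bar{o} \right)^2}_{\text{resolution}}
\;+\; \underbrace{\bar{o}\left(1 - \bar{o}\right)}_{\text{uncertainty}} ,
\]

where $n_k$ is the number of forecasts with probability $p_k$, $\bar{o}_k$ is the observed relative frequency for those forecasts, and $\bar{o}$ is the overall climatological frequency of the event.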
The first term quantifies reliability: it measures how close the forecast probabilities are to the conditionally observed frequencies, so the lower the value, the better the reliability. The second term quantifies resolution: it measures how much the observed frequencies, conditioned on the different forecast values, differ from climatology. In other words, resolution reflects the predictive skill of a forecast system in predicting events that deviate from climatology, so the higher the value, the better the resolution. The third term quantifies the inherent uncertainty in the outcomes of the event.
c. Binned spread–skill correlation
Binned spread–skill correlation (Murphy 1988) measures the consistency between the spread of an ensemble system and its predictive skill. The forecast spread is usually calculated as the ensemble standard deviation, and the skill as the RMSE of the ensemble mean. However, even for a perfect ensemble system, the pointwise correlation can be rather low because the spread is only an estimate of the expected magnitude of the error. Therefore, forecast–observation pairs are usually binned first, before the spread–skill relationship is evaluated.
The steps to calculate and visualize the binned spread–skill relationship are as follows (a code sketch follows the list):
- Calculate the variance of the ensemble.
- Calculate the squared error of the ensemble mean.
- Create bins with intervals of equal size, or intervals with the same number of samples, based on the ensemble variance.
- Calculate the average variance in each bin and take the square root to obtain the standard deviation.
- Calculate the RMSE in each bin. For ensemble forecasts, the mean squared error needs to be multiplied by n/(n + 1), with n being the number of ensemble members, before taking the root.
- Plot the RMSE as a function of the standard deviation.
A well-calibrated ensemble system should closely follow the one-to-one diagonal line. Lines lying to the upper left of the reference line indicate overconfidence, and lines lying to the lower right of the reference line indicate underconfidence.
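A minimal NumPy sketch of these steps, using equally populated bins and a synthetic, well-calibrated ensemble, is given below; the function name, bin count, and synthetic data are illustrative assumptions, not the verification code used in this study.

import numpy as np

def binned_spread_skill(ens, obs, n_bins=10):
    """Return (spread, rmse) per bin, following the steps listed above.
    ens: (n_samples, n_members) ensemble forecasts; obs: (n_samples,) observations."""
    n_members = ens.shape[1]
    var = ens.var(axis=1, ddof=1)                     # step 1: ensemble variance
    sq_err = (ens.mean(axis=1) - obs) ** 2            # step 2: squared error of the mean
    order = np.argsort(var)                           # step 3: equally populated bins by variance
    bins = np.array_split(order, n_bins)
    spread, rmse = [], []
    for idx in bins:
        spread.append(np.sqrt(var[idx].mean()))       # step 4: mean variance -> standard deviation
        # step 5: correct the MSE by n/(n + 1) before taking the root
        rmse.append(np.sqrt(sq_err[idx].mean() * n_members / (n_members + 1)))
    return np.array(spread), np.array(rmse)           # step 6: plot rmse against spread

# Synthetic demo: a calibrated ensemble should fall close to the one-to-one line.
rng = np.random.default_rng(0)
sigma = rng.uniform(1.0, 10.0, size=5000)
obs = rng.normal(0.0, sigma)
ens = rng.normal(0.0, sigma[:, None], size=(5000, 20))
print(binned_spread_skill(ens, obs))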
REFERENCES
Adebayo, J., J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim, 2020: Sanity checks for saliency maps. arXiv, 1810.03292v3, https://doi.org/10.48550/arXiv.1810.03292.
American Meteorological Society, 2022: Atmospheric river. Glossary of Meteorology. https://glossary.ametsoc.org/wiki/Atmospheric_river.
Alessandrini, S., L. Delle Monache, S. Sperati, and J. N. Nissen, 2015: A novel application of an analog ensemble for short-term wind power forecasting. Renewable Energy, 76, 768–781, https://doi.org/10.1016/j.renene.2014.11.061.
Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. J. Climate, 9, 1518–1530, https://doi.org/10.1175/1520-0442(1996)009<1518:AMFPAE>2.0.CO;2.
Badrinath, A., L. Delle Monache, N. Hayatbini, W. Chapman, F. Cannon, and M. Ralph, 2022: Improving precipitation forecasts with convolutional neural networks. Wea. Forecasting, 38, 291–306, https://doi.org/10.1175/WAF-D-22-0002.1.
Baran, S., and D. Nemoda, 2016: Censored and shifted gamma distribution based EMOS model for probabilistic quantitative precipitation forecasting. Environmetrics, 27, 280–292, https://doi.org/10.1002/env.2391.
Baran, S., and S. Lerch, 2018: Combining predictive distributions for the statistical post-processing of ensemble forecasts. Int. J. Forecasting, 34, 477–496, https://doi.org/10.1016/j.ijforecast.2018.01.005.
Bellier, J., G. Bontron, and I. Zin, 2017: Using meteorological analogues for reordering postprocessed precipitation ensembles in hydrological forecasting. Water Resour. Res., 53, 10 085–10 107, https://doi.org/10.1002/2017WR021245.
Bremnes, J. B., 2004: Probabilistic forecasts of precipitation in terms of quantiles using NWP model output. Mon. Wea. Rev., 132, 338–347, https://doi.org/10.1175/1520-0493(2004)132<0338:PFOPIT>2.0.CO;2.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
Brown, J. D., L. Wu, M. He, S. Regonda, H. Lee, and D.-J. Seo, 2014: Verification of temperature, precipitation, and streamflow forecasts from the NOAA/NWS Hydrologic Ensemble Forecast Service (HEFS): 1. Experimental design and forcing verification. J. Hydrol., 519, 2869–2889, https://doi.org/10.1016/j.jhydrol.2014.05.028.
Buban, M. S., T. R. Lee, and C. B. Baker, 2020: A comparison of the U.S. climate reference network precipitation data to the Parameter-Elevation Regressions on Independent Slopes Model (PRISM). J. Hydrometeor., 21, 2391–2400, https://doi.org/10.1175/JHM-D-19-0232.1.
Chapman, W. E., A. C. Subramanian, L. Delle Monache, S. P. Xie, and F. M. Ralph, 2019: Improving atmospheric river forecasts with machine learning. Geophys. Res. Lett., 46, 10 627–10 635, https://doi.org/10.1029/2019GL083662.
Chapman, W. E., A. C. Subramanian, S.-P. Xie, M. D. Sierks, F. M. Ralph, and Y. Kamae, 2021: Monthly modulations of ENSO teleconnections: Implications for potential predictability in North America. J. Climate, 34, 5899–5921, https://doi.org/10.1175/JCLI-D-20-0391.1.
Chapman, W. E., L. D. Monache, S. Alessandrini, A. C. Subramanian, F. M. Ralph, S.-P. Xie, S. Lerch, and N. Hayatbini, 2022: Probabilistic predictions from deterministic atmospheric river forecasts with deep learning. Mon. Wea. Rev., 150, 215–234, https://doi.org/10.1175/MWR-D-21-0106.1.
Corringham, T. W., F. M. Ralph, A. Gershunov, D. R. Cayan, and C. A. Talbot, 2019: Atmospheric rivers drive flood damages in the western United States. Sci. Adv., 5, eaax4631, https://doi.org/10.1126/sciadv.aax4631.
Daly, C., W. P. Gibson, G. H. Taylor, G. L. Johnson, and P. Pasteris, 2002: A knowledge-based approach to the statistical mapping of climate. Climate Res., 22, 99–113, https://doi.org/10.3354/cr022099.
Delle Monache, L., F. A. Eckel, D. L. Rife, B. Nagarajan, and K. Searight, 2013: Probabilistic weather prediction with an analog ensemble. Mon. Wea. Rev., 141, 3498–3516, https://doi.org/10.1175/MWR-D-12-00281.1.
Demargne, J., and Coauthors, 2014: The science of NOAA’s operational hydrologic ensemble forecast service. Bull. Amer. Meteor. Soc., 95, 79–98, https://doi.org/10.1175/BAMS-D-12-00081.1.
Dettinger, M. D., and D. R. Cayan, 2014: Drought and the California Delta—A matter of extremes. San Francisco Estuary Watershed Sci., 12, 7, https://doi.org/10.15447/sfews.2014v12iss2art4.
Dettinger, M. D., F. M. Ralph, T. Das, P. J. Neiman, and D. R. Cayan, 2011: Atmospheric rivers, floods and the water resources of California. Water, 3, 445–478, https://doi.org/10.3390/w3020445.
Drozdzal, M., E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal, 2016: The importance of skip connections in biomedical image segmentation. Deep Learning and Data Labeling for Medical Applications, G. Carneiro et al., Eds., Lecture Notes in Computer Science, Springer International Publishing, 179–187, https://doi.org/10.1007/978-3-319-46976-8_19.
Dumoulin, V., and F. Visin, 2016: A guide to convolution arithmetic for deep learning. arXiv, 1603.07285v2, https://doi.org/10.48550/arXiv.1603.07285.
Eckel, F. A., and L. Delle Monache, 2016: A hybrid NWP–analog ensemble. Mon. Wea. Rev., 144, 897–911, https://doi.org/10.1175/MWR-D-15-0096.1.
Ghazvinian, M., 2021: Improving probabilistic quantitative precipitation forecasting using machine learning and statistical postprocessing methods. Ph.D. thesis, Dept. of Civil Engineering, The University of Texas at Arlington, 194 pp., https://rc.library.uta.edu/uta-ir/handle/10106/30923.
Ghazvinian, M., D.-J. Seo, and Y. Zhang, 2019: Improving medium-range probabilistic quantitative precipitation forecast for heavy-to-extreme events through the conditional bias-penalized regression. 2019 Fall Meeting, San Francisco, CA, Amer. Geophys. Union, Abstract H33P-2245, https://agu.confex.com/agu/fm19/meetingapp.cgi/Paper/517742.
Ghazvinian, M., Y. Zhang, and D.-J. Seo, 2020: A nonhomogeneous regression-based statistical postprocessing scheme for generating probabilistic quantitative precipitation forecast. J. Hydrometeor., 21, 2275–2291, https://doi.org/10.1175/JHM-D-20-0019.1.
Ghazvinian, M., Y. Zhang, D.-J. Seo, M. He, and N. Fernando, 2021: A novel hybrid artificial neural network—Parametric scheme for postprocessing medium-range precipitation forecasts. Adv. Water Resour., 151, 103907, https://doi.org/10.1016/j.advwatres.2021.103907.
Ghazvinian, M., Y. Zhang, T. M. Hamill, D.-J. Seo, and N. Fernando, 2022: Improving probabilistic quantitative precipitation forecasts using short training data through artificial neural networks. J. Hydrometeor., 23, 1365–1382, https://doi.org/10.1175/JHM-D-22-0021.1.
Gleick, J., 2008: Chaos: Making a New Science. Penguin, 384 pp.
Gneiting, T., A. E. Raftery, A. H. Westveld, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118, https://doi.org/10.1175/MWR2904.1.
Goodfellow, I., Y. Bengio, and A. Courville, 2016: Deep Learning. MIT Press, 800 pp.
Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, 2014: Generative adversarial nets. Proc. 27th Int. Conf. on Neural Information Processing Systems, Z. Ghahramani et al., Eds., MIT Press, Vol. 2, 2672–2680.
Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. Mon. Wea. Rev., 134, 3209–3229, https://doi.org/10.1175/MWR3237.1.
Hamill, T. M., and M. Scheuerer, 2018: Probabilistic precipitation forecast postprocessing using quantile mapping and rank-weighted best-member dressing. Mon. Wea. Rev., 146, 4079–4098, https://doi.org/10.1175/MWR-D-18-0147.1.
Hamill, T. M., M. Scheuerer, and G. T. Bates, 2015: Analog probabilistic precipitation forecasts using GEFS reforecasts and climatology-calibrated precipitation analyses. Mon. Wea. Rev., 143, 3300–3309, https://doi.org/10.1175/MWR-D-15-0004.1.
Han, L., H. Liang, H. Chen, W. Zhang, and Y. Ge, 2022: Convective precipitation nowcasting using U-net model. IEEE Trans. Geosci. Remote Sens., 60, 1–8, https://doi.org/10.1109/TGRS.2021.3100847.
Herr, H. D., and R. Krzysztofowicz, 2005: Generic probability distribution of rainfall in space: The bivariate model. J. Hydrol., 306, 234–263, https://doi.org/10.1016/j.jhydrol.2004.09.011.
Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559–570, https://doi.org/10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.
Hopson, T., 2014: Assessing the ensemble spread–error relationship. Mon. Wea. Rev., 142, 1125–1142, https://doi.org/10.1175/MWR-D-12-00111.1.
Hu, W., and G. Cervone, 2019: Dynamically optimized unstructured grid (DOUG) for analog ensemble of numerical weather predictions using evolutionary algorithms. Comput. Geosci., 133, 104299, https://doi.org/10.1016/j.cageo.2019.07.003.
Hu, W., D. Del Vento, and S. Su, Eds., 2020: Proceedings of the 2020 improving scientific software conference. UCAR/NCAR Tech. Rep., NCAR/TN-564+PROC, https://doi.org/10.5065/P2JJ-9878.
Hu, W., G. Cervone, G. Young, and L. Delle Monache, 2023: Machine learning weather analogs for near-surface variables. Bound.-Layer Meteor., 186, 711–735, https://doi.org/10.1007/s10546-022-00779-6.
Iacono, M. J., J. S. Delamere, E. J. Mlawer, M. W. Shephard, S. A. Clough, and W. D. Collins, 2008: Radiative forcing by long-lived greenhouse gases: Calculations with the AER radiative transfer models. J. Geophys. Res., 113, D13103, https://doi.org/10.1029/2008JD009944.
Ishida, K., M. L. Kavvas, and S. Jang, 2015: Comparison of performance on watershed-scale precipitation between WRF and MM5. World Environmental and Water Resources Congress 2015, Austin, TX, American Society of Civil Engineers, 989–993, https://doi.org/10.1061/9780784479162.095.
Jasperse, J., and Coauthors, 2020: Lake Mendocino forecast informed reservoir operations final viability assessment. Scripps Institution of Oceanography, 142 pp., https://escholarship.org/uc/item/3b63q04n.
Kelly, K. S., and R. Krzysztofowicz, 1997: A bivariate meta-Gaussian density for use in hydrology. Stochastic Hydrol. Hydraul., 11, 17–31, https://doi.org/10.1007/BF02428423.
Kim, S., and Coauthors, 2018: Assessing the skill of medium-range ensemble precipitation and streamflow forecasts from the Hydrologic Ensemble Forecast Service (HEFS) for the upper trinity river basin in North Texas. J. Hydrometeor., 19, 1467–1483, https://doi.org/10.1175/JHM-D-18-0027.1.
Lei, H., H. Zhao, and T. Ao, 2022: A two-step merging strategy for incorporating multi-source precipitation products and gauge observations using machine learning classification and regression over China. Hydrol. Earth Syst. Sci., 26, 2969–2995, https://doi.org/10.5194/hess-26-2969-2022.
Leutbecher, M., and T. N. Palmer, 2008: Ensemble forecasting. J. Comput. Phys., 227, 3515–3539, https://doi.org/10.1016/j.jcp.2007.02.014.
Lewis, W. R., W. J. Steenburgh, T. I. Alcott, and J. J. Rutz, 2017: GEFS precipitation forecasts and the implications of statistical downscaling over the western United States. Wea. Forecasting, 32, 1007–1028, https://doi.org/10.1175/WAF-D-16-0179.1.
Li, M., T. Zhang, Y. Chen, and A. J. Smola, 2014: Efficient mini-batch training for stochastic optimization. Proc. 20th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, Association for Computing Machinery, 661–670, https://doi.org/10.1145/2623330.2623612.
Li, W., B. Pan, J. Xia, and Q. Duan, 2022: Convolutional neural network-based statistical post-processing of ensemble precipitation forecasts. J. Hydrol., 605, 127301, https://doi.org/10.1016/j.jhydrol.2021.127301.
Long, J., E. Shelhamer, and T. Darrell, 2015: Fully convolutional networks for semantic segmentation. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, CVF, 3431–3440, https://openaccess.thecvf.com/content_cvpr_2015/html/Long_Fully_Convolutional_Networks_2015_CVPR_paper.html.
Lorenz, E. N., 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 130–141, https://doi.org/10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2.
Mao, X., C. Shen, and Y.-B. Yang, 2016: Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. Advances in Neural Information Processing Systems 29 (NIPS 2016), D. Lee et al., Eds., Vol. 29, Curran Associates, Inc., https://proceedings.neurips.cc/paper/2016/hash/0ed9422357395a0d4879191c66f4faa2-Abstract.html.
Martin, A., F. M. Ralph, R. Demirdjian, L. DeHaan, R. Weihs, J. Helly, D. Reynolds, and S. Iacobellis, 2018: Evaluation of atmospheric river predictions by the WRF Model using aircraft and regional Mesonet observations of orographic precipitation and its forcing. J. Hydrometeor., 19, 1097–1113, https://doi.org/10.1175/JHM-D-17-0098.1.
Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. Manage. Sci., 22, 1087–1096, https://doi.org/10.1287/mnsc.22.10.1087.
McCloskey, K., A. Taly, F. Monti, M. P. Brenner, and L. J. Colwell, 2019: Using attribution to decode binding mechanism in neural network models for chemistry. Proc. Natl. Acad. Sci. USA, 116, 11 624–11 629, https://doi.org/10.1073/pnas.1820657116.
Mudrakarta, P. K., A. Taly, M. Sundararajan, and K. Dhamdhere, 2018: Did the model understand the question? Proc. 56th Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers), Melbourne, Australia, Association for Computational Linguistics, 1896–1906, https://aclanthology.org/P18-1176.
Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600, https://doi.org/10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
Murphy, J. M., 1988: The impact of ensemble forecasts on predictability. Quart. J. Roy. Meteor. Soc., 114, 463–493, https://doi.org/10.1002/qj.49711448010.
Oakley, N. S., F. Cannon, E. Boldt, J. Dumas, and F. M. Ralph, 2018: Origins and variability of extreme precipitation in the Santa Ynez River Basin of Southern California. J. Hydrol. Reg. Stud., 19, 164–176, https://doi.org/10.1016/j.ejrh.2018.09.001.
Pan, B., K. Hsu, A. AghaKouchak, and S. Sorooshian, 2019: Improving precipitation estimation using convolutional neural network. Water Resour. Res., 55, 2301–2321, https://doi.org/10.1029/2018WR024090.
Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133, 1155–1174, https://doi.org/10.1175/MWR2906.1.
Rasp, S., and S. Lerch, 2018: Neural networks for postprocessing ensemble weather forecasts. Mon. Wea. Rev., 146, 3885–3900, https://doi.org/10.1175/MWR-D-18-0187.1.
Ronneberger, O., P. Fischer, and T. Brox, 2015: U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, N. Navab et al., Eds., Lecture Notes in Computer Science, Vol. 9351, Springer International Publishing, 234–241, https://doi.org/10.1007/978-3-319-24574-4_28.
Sayres, R., and Coauthors, 2019: Using a deep learning algorithm and integrated gradients explanation to assist grading for diabetic retinopathy. Ophthalmology, 126, 552–564, https://doi.org/10.1016/j.ophtha.2018.11.016.
Schefzik, R., 2017: Ensemble calibration with preserved correlations: Unifying and comparing ensemble copula coupling and member-by-member postprocessing. Quart. J. Roy. Meteor. Soc., 143, 999–1008, https://doi.org/10.1002/qj.2984.
Schefzik, R., T. L. Thorarinsdottir, and T. Gneiting, 2013: Uncertainty quantification in complex simulation models using ensemble copula coupling. Stat. Sci., 28, 616–640, https://doi.org/10.1214/13-STS443.
Scheuerer, M., and T. M. Hamill, 2015: Statistical postprocessing of ensemble precipitation forecasts by fitting censored, shifted gamma distributions. Mon. Wea. Rev., 143, 4578–4596, https://doi.org/10.1175/MWR-D-15-0061.1.
Scheuerer, M., T. M. Hamill, B. Whitin, M. He, and A. Henkel, 2017: A method for preferential selection of dates in the Schaake shuffle approach to constructing spatiotemporal forecast fields of temperature and precipitation. Water Resour. Res., 53, 3029–3046, https://doi.org/10.1002/2016WR020133.
Scheuerer, M., M. B. Switanek, R. P. Worsnop, and T. M. Hamill, 2020: Using artificial neural networks for generating probabilistic subseasonal precipitation forecasts over California. Mon. Wea. Rev., 148, 3489–3506, https://doi.org/10.1175/MWR-D-20-0096.1.
Sha, Y., D. J. Gagne II, G. West, and R. Stull, 2020a: Deep-learning-based gridded downscaling of surface meteorological variables in complex terrain. Part I: Daily maximum and minimum 2-m temperature. J. Appl. Meteor. Climatol., 59, 2057–2073, https://doi.org/10.1175/JAMC-D-20-0057.1.
Sha, Y., D. J. Gagne II, G. West, and R. Stull, 2020b: Deep-learning-based gridded downscaling of surface meteorological variables in complex terrain. Part II: Daily precipitation. J. Appl. Meteor. Climatol., 59, 2075–2092, https://doi.org/10.1175/JAMC-D-20-0058.1.
Smith, L. N., 2017: Cyclical learning rates for training neural networks. 2017 IEEE Winter Conf. on Applications of Computer Vision (WACV), Santa Rosa, CA, Institute of Electrical and Electronics Engineers, 464–472, https://doi.org/10.1109/WACV.2017.58.
Strachan, S., and C. Daly, 2017: Testing the daily PRISM air temperature model on semiarid mountain slopes. J. Geophys. Res. Atmos., 122, 5697–5715, https://doi.org/10.1002/2016JD025920.
Sundararajan, M., A. Taly, and Q. Yan, 2017: Axiomatic attribution for deep networks. Proc. 34th Int. Conf. on Machine Learning, Vol. 70, Sydney, Australia, PMLR, 3319–3328, https://proceedings.mlr.press/v70/sundararajan17a.html.
Sundararajan, M., S. Xu, A. Taly, R. Sayres, and A. Najmi, 2019: Exploring principled visualizations for deep network attributions. Joint Proc. ACM IUI 2019 Workshops, Vol. 4, Los Angeles, CA, ACM IUI, 11 pp., https://ceur-ws.org/Vol-2327/IUI19WS-ExSS2019-16.pdf.
Taillardat, M., A.-L. Fougères, P. Naveau, and O. Mestre, 2019: Forest-based and semiparametric methods for the postprocessing of rainfall ensemble forecasting. Wea. Forecasting, 34, 617–634, https://doi.org/10.1175/WAF-D-18-0149.1.
Theis, S. E., A. Hense, and U. Damrath, 2005: Probabilistic precipitation forecasts from a deterministic model: A pragmatic approach. Meteor. Appl., 12, 257–268, https://doi.org/10.1017/S1350482705001763.
Tong, T., G. Li, X. Liu, and Q. Gao, 2017: Image super-resolution using dense skip connections. Proc. IEEE Int. Conf. on Computer Vision (ICCV), Venice, Italy, CVF, 4799–4807, https://openaccess.thecvf.com/content_iccv_2017/html/Tong_Image_Super-Resolution_Using_ICCV_2017_paper.html.
Valdez, E. S., F. Anctil, and M.-H. Ramos, 2022: Choosing between post-processing precipitation forecasts or chaining several uncertainty quantification tools in hydrological forecasting systems. Hydrol. Earth Syst. Sci., 26, 197–220, https://doi.org/10.5194/hess-26-197-2022.
Van den Dool, H., 1989: A new look at weather forecasting through analogues. Mon. Wea. Rev., 117, 2230–2247, https://doi.org/10.1175/1520-0493(1989)117<2230:ANLAWF>2.0.CO;2.
Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a Big Data world. Bull. Amer. Meteor. Soc., 102, E681–E699, https://doi.org/10.1175/BAMS-D-19-0308.1.
Wang, Q. J., D. E. Robertson, and F. H. S. Chiew, 2009: A Bayesian joint probability modeling approach for seasonal forecasting of streamflows at multiple sites. Water Resour. Res., 45, W05407, https://doi.org/10.1029/2008WR007355.
Wang, X., and C. H. Bishop, 2003: A comparison of breeding and ensemble transform Kalman filter ensemble forecast schemes. J. Atmos. Sci., 60, 1140–1158, https://doi.org/10.1175/1520-0469(2003)060<1140:ACOBAE>2.0.CO;2.
Wilks, D. S., 2009: Extending logistic regression to provide full-probability-distribution MOS forecasts. Meteor. Appl., 16, 361–368, https://doi.org/10.1002/met.134.
Wilks, D. S., 2019: Statistical forecasting. Statistical Methods in the Atmospheric Sciences, D. S. Wilks, Ed., Elsevier, 298–299.
Wu, L., D.-J. Seo, J. Demargne, J. D. Brown, S. Cong, and J. Schaake, 2011: Generation of ensemble precipitation forecast from single-valued quantitative precipitation forecast for hydrologic ensemble prediction. J. Hydrol., 399, 281–298, https://doi.org/10.1016/j.jhydrol.2011.01.013.
Wu, L., Y. Zhang, T. Adams, H. Lee, Y. Liu, and J. Schaake, 2018: Comparative evaluation of three Schaake shuffle schemes in postprocessing GEFS precipitation ensemble forecasts. J. Hydrometeor., 19, 575–598, https://doi.org/10.1175/JHM-D-17-0054.1.
Zhang, Y., L. Wu, M. Scheuerer, J. Schaake, and C. Kongoli, 2017: Comparison of probabilistic quantitative precipitation forecasts from two postprocessing mechanisms. J. Hydrometeor., 18, 2873–2891, https://doi.org/10.1175/JHM-D-16-0293.1.