1. Introduction
The precipitation associated with atmospheric rivers (ARs), “a long, narrow, and transient corridor of strong horizontal water vapor transport” (American Meteorological Society 2019), replenishes the water supply but can also result in flooding over the western United States. Class-4 and class-5 ARs, on the scale developed by Ralph et al. (2019), cause median economic losses in the tens to hundreds of millions of dollars. ARs have been identified as the primary source of flooding in the western United States (Corringham et al. 2019). Accurate and reliable predictions of precipitation can help in minimizing losses attributable to ARs or other weather phenomena (e.g., cutoff lows, narrow cold-frontal rainbands) and in better managing the water supply in the western United States (O’Donnell et al. 2020; Jasperse et al. 2020).
Numerical weather prediction (NWP) consists of dynamical models that are built on current state-of-the-science knowledge of key atmospheric physics and numerical procedures. However, NWP accuracy is affected by initial condition errors, numerical approximations, and incomplete understanding and representation of all the relevant physical processes (Delle Monache et al. 2013; Vannitsem and Ghil 2017; Collins and Allen 2002; Nicolis and Nicolis 2007).
NWP postprocessing methods are designed to correct for the aforementioned deficiencies by learning the characteristics of NWP errors from a historical dataset to then try to anticipate today’s forecast biases. These include downscaling methods, Kalman filters, model output statistics (MOS), and machine learning methods such as neural network models, decision trees, and multilinear regression models (Louka et al. 2008; Glahn and Lowry 1972). Historically, postprocessing methods, including machine learning methods, have operated on a point-by-point basis (Rasp and Lerch 2018).
Traditional point-by-point approaches have been shown to be effective in improving the raw estimates of dynamical models (Glahn and Lowry 1972), and are particularly valuable for certain applications, e.g., renewable energy (Alessandrini et al. 2015; Cervone et al. 2017). However, since these methods typically ignore spatial interdependence unless additional feature engineering is applied, they at times produce nonphysical fields with statistical anomalies (Vannitsem and Ghil 2017). Further, for the specific case of precipitation forecasting, the predominance of “no rain” data points in the precipitation field often poses a problem for out-of-the-box machine learning methods (i.e., those applied without additional techniques), which rely on balanced classes of data.
Neural networks have been proposed for forecasting precipitation as far back as Hall et al. (1999). More recently, Rasp and Lerch (2018) proposed neural networks to postprocess ensemble weather forecasts. Deep neural networks have since been used for postprocessing and forecasting tasks with success (Haupt et al. 2021; Vannitsem et al. 2021). Grönquist et al. (2021) achieved an improvement of over 14% in continuous ranked probability score (CRPS) over traditional ensemble methods using deep learning for postprocessing weather forecasts. Ghazvinian et al. (2021) developed a neural network-based scheme for rainfall prediction that minimizes the CRPS of a prescribed parametric forecast distribution (censored, shifted gamma).
Convolutional neural networks (CNNs) are a type of neural network that leverages spatial interdependence by construction. CNNs have previously been shown to be a powerful tool in the domain of image analysis (Krizhevsky et al. 2012). In 2016, CNNs were proposed to correct satellite retrievals (Tao et al. 2016). In the domain of forecasting, Ham et al. (2019) showed the utility and improved accuracy of CNNs for multiyear ENSO forecasts. Further, recent work by Chapman et al. (2019, 2022) has highlighted the efficacy of a CNN as an NWP postprocessing method for the prediction of integrated vapor transport (IVT): the CNN-based prediction yielded 9%–17% improvements in RMSE compared to other methods (Chapman et al. 2019). Meech et al. (2020) postprocessed rainfall using a CNN, resulting in noteworthy overall error reduction, though at the expense of underestimating the highest precipitation, incurring a large model bias.
To explore the potential of further improving point-based predictions of accumulated precipitation by capturing spatial precipitation patterns, we test and further develop a recently proposed CNN machine learning method: the U-Net architecture (Ronneberger et al. 2015). In comparison to other architectures, the U-Net typically outputs predictions of the same dimensionality as the input, with skip connections that preserve high-resolution spatial information. U-Nets have been used for both classification and regression tasks, primarily on spatial data in the field of biomedical imaging (Ronneberger et al. 2015). Additionally, we propose and test a dual-regression model structure and a binary masking model that leverage both regression and classification to rectify the class imbalance in the sparse precipitation data. To address the underprediction issues observed by Meech et al. (2020), we propose and use a modified loss function that strongly penalizes underprediction.
2. Data and methodology
a. Observational data
The observed precipitation data used in this study are from the Parameter-elevation Relationships on Independent Slopes Model (PRISM) dataset (PRISM Climate Group 2004), which is constructed using data from the Cooperative Observer Program (COOP) and Snowpack Telemetry (SNOTEL) networks, along with a variety of smaller networks (Daly et al. 2008). PRISM provides estimates of 24-h accumulated precipitation over the last 40 years across the contiguous United States (CONUS) at a spatial resolution of 4 km. Here we focus on the western U.S. region, comprising California and Nevada.
The PRISM dataset was chosen as ground truth in this study due to its accuracy, comparable spatial resolution to the model reforecast data, and length of record. PRISM uses a comprehensive linear precipitation–elevation correction scheme that applies weights based on location to nearby stations, proximity to coast, topographic facets, boundary layer conditions, surrounding terrain height, and other terrain features (Daly et al. 2008). PRISM has been shown to perform well in challenging complex terrain settings when tested against independent station data (Daly et al. 2017). It has also been shown to produce reliably similar estimates of precipitation extremes when compared to other national in situ gridded datasets, while performing notably better than various reanalysis products (Gibson et al. 2019).
b. Model reforecast data
The NWP reforecast was developed at the Center for Western Weather and Water Extremes. As input to the U-Net CNN, we use weather forecasts over a domain (Fig. 4) with a horizontal grid spacing of 3 km. We use 34 water years (1985–2019) of reforecasts from the Western Weather Research and Forecasting (West-WRF) regional model (Martin et al. 2019), covering the western United States, including California and Nevada. This model is based on WRF version 4.1.2. The forecasts are driven by initial and boundary conditions from the Global Forecast System. The West-WRF regional model has shown forecast skill with a low intensity error for IVT and reduced dry and wet biases for precipitation over lead times from 1 to 5 days. For additional details about the physical parameterization schemes used in the West-WRF reforecasts, technical aspects of model development, and an in-depth analysis of its improvements over existing dynamical methods, please refer to Cannon et al. (2020).
To align the forecasts spatially with the observations at 4-km resolution, we use a nearest-neighbor interpolation scheme, which retains existing precipitation patterns and preserves domain-wide precipitation means. For temporal alignment with the observations, and given that the forecasts are initialized daily at 0000 UTC, we calculate the accumulated daily precipitation with a 12-h offset to account for model spinup. That is, data from 12 to 36 h after initialization of each West-WRF forecast are labeled as the day-1 forecast and aligned with the PRISM ground truth.
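As an illustration, this alignment can be sketched as follows; the xarray-based workflow, file and variable names, and the assumption that the forecasts are stored as hourly precipitation amounts are ours, not a description of the exact implementation.

```python
# Sketch of the spatial and temporal alignment described above. The xarray
# workflow, file names, and hourly-amount storage format are illustrative
# assumptions.
import xarray as xr

wrf = xr.open_dataarray("west_wrf_hourly_precip.nc")   # (time, lat, lon), 3 km
prism = xr.open_dataarray("prism_daily_precip.nc")     # (day, lat, lon), 4 km

# Nearest-neighbor interpolation onto the 4-km PRISM grid: values are
# copied rather than averaged, preserving precipitation patterns and means.
wrf_4km = wrf.interp(lat=prism.lat, lon=prism.lon, method="nearest")

# Day-n forecast: accumulate forecast hours 12 + 24*(n-1) through 12 + 24*n,
# skipping the first 12 h of model spinup.
def day_n_accumulation(forecast, n):
    start = 12 + 24 * (n - 1)
    return forecast.isel(time=slice(start, start + 24)).sum("time")
```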
c. Machine learning approach
The proposed CNN architecture is structurally similar to the widely used U-Net, named after its distinctive U-shape model diagram (Ronneberger et al. 2015). Historically, this type of CNN has been used for biomedical image segmentation, but its application with weather forecasts is promising given its strength in rectifying spatial biases (Chapman et al. 2019).
Our model architecture consists of two phases, with a total of 13 layers, as shown in Fig. 1. The total number of layers was selected via a grid search over U-Net configurations with 5, 9, and 13 convolutional layers (corresponding to depths of 2, 3, and 4), choosing the configuration that minimized the loss function in Eq. (1) on the validation set. Note that the depth refers to the number of times the pooling and upsampling layers are applied (denoted by the orange and red arrows in Fig. 1).
In the contraction phase, denoted by the left side of Fig. 1 containing the orange arrows, the model learns salient spatial feature representations of the input data through repeated convolutional and pooling layers. Note that in this study, convolution refers strictly to two-dimensional convolution as described in Ronneberger et al. (2015). Throughout the contraction phase, the data’s first two spatial dimensions, corresponding to the number of rows and columns, shrink due to the pooling layers, which apply a chosen aggregation function over a “pool” of adjacent pixels (the convention is maximum pooling, which yields the maximum pixel value in a set of adjacent pixels). Correspondingly, the number of output feature dimensions increases as the number of filters (i.e., the chosen number of output channels, corresponding to the final dimension) within the convolutional layers increases.
This is followed by an expanding phase, denoted by the right side of Fig. 1 containing the red arrows, in which the output image is reconstructed from the learned feature representations through repeated convolutional and upsampling layers. Further, the purple arrows from corresponding convolutional layer outputs in the contraction phase to the expanding phase, as shown in Fig. 1, denote skip connections. These allow high-dimensional spatial featurization to be retained despite the shrinking row and column dimensions of the data. In this study, we implement skip connections by concatenating corresponding outputs from the contraction phase with the inputs of the expanding phase, as in the sketch below.
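For concreteness, a minimal Keras sketch of one contraction step, one expansion step, and a skip connection is given below; the spatial size and channel counts are illustrative and do not reproduce the exact 13-layer configuration of Fig. 1.

```python
# Minimal sketch of the U-Net contraction/expansion structure with a skip
# connection (illustrative sizes, not the exact Fig. 1 architecture).
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(256, 256, 7))  # rows, cols, predictor channels

# Contraction: convolutions learn spatial features; max pooling halves
# the row/column dimensions while the filter count doubles.
c1 = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
p1 = layers.MaxPooling2D(2)(c1)
c2 = layers.Conv2D(128, 3, padding="same", activation="relu")(p1)

# Expansion: upsampling restores the row/column dimensions; the skip
# connection concatenates the matching contraction output so that
# high-resolution spatial information is retained.
u1 = layers.UpSampling2D(2)(c2)
m1 = layers.Concatenate()([u1, c1])
c3 = layers.Conv2D(64, 3, padding="same", activation="relu")(m1)

out = layers.Conv2D(1, 1, activation="relu")(c3)  # postprocessed precipitation
model = Model(inp, out)
```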
We modified the U-Net architecture as introduced by Ronneberger et al. (2015) in several ways (as detailed below) to adapt it to the postprocessing task of improving the skill of precipitation forecasts. The model along with these modifications is referred to as the modified U-Net CNN from here onward.
West-WRF model output variables are used as predictors in the CNN for the regression problem of numerical precipitation prediction. In particular, to generate day-1 predictions, the 24-h accumulated precipitation and the 6-, 12-, and 18-h forecasts of 5-m specific humidity and 2-m temperature since forecast initialization are used as input to the model. All data are normalized by dividing by their maximum values across the training data. Similarly, for greater lead times, we use the same predictors offset by the lead time: for day-2 predictions, we use the 48-h forecast of accumulated precipitation and the 30-, 36-, and 42-h forecasts of 5-m specific humidity and 2-m temperature. These predictors are used because they are physically informative about the observed precipitation.
We determined the ideal set of input features using the validation loss. Starting with a baseline feature set of specific humidity and accumulated precipitation, we performed a grid search over the remaining features (temperature, wind, pressure) and over a time granularity of every 6 h versus every 12 h. The validation loss indicated that increasing the number of input parameters beyond these predictors and this time granularity decreased both the computational efficiency and the accuracy of the model (Anelli et al. 2019). Pressure and wind were excluded because they provided negligible improvement in the validation metrics.
The loss function used for the modified U-Net CNN is an asymmetric adaptation of the mean squared error that penalizes underprediction more than overprediction, as shown in Eq. (1). Preliminary testing showed that the U-Net CNN tended to systematically underpredict the top 10% of precipitation events sorted by greatest average rainfall; hence, we chose to correct this bias as follows. We assign a hyperparameter (i.e., a model configuration-related parameter that is not optimized through iterative procedures such as gradient descent) ws > 1 that multiplicatively weights underpredicted values. The value of ws is determined by minimizing the loss function in Eq. (1) on the validation dataset, consistent with the procedure used for all hyperparameters. For lead times of 1 and 2 days, ws was set to 1.1, and for lead times of 3 and 4 days, ws was set to 1.3; these were chosen from candidate values of 1.1, 1.3, and 1.5. This effectively penalizes underprediction by an extra 10% at shorter lead times and by an extra 30% at longer lead times. Validation confirmed that underprediction was more severe at later lead times, and hence a higher penalty was applied.
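A sketch of such an asymmetric loss is shown below. It reproduces only the penalty structure described above, not necessarily the exact form of Eq. (1).

```python
# Sketch of an asymmetric MSE consistent with the description of Eq. (1):
# squared errors from underprediction (y_pred < y_true) are weighted by
# w_s > 1. This illustrates the penalty structure, not the exact Eq. (1).
import tensorflow as tf

def asymmetric_mse(w_s):
    def loss(y_true, y_pred):
        err = y_true - y_pred                  # positive where underpredicted
        weight = tf.where(err > 0, w_s, 1.0)   # extra weight on underprediction
        return tf.reduce_mean(weight * tf.square(err))
    return loss

# e.g., model.compile(optimizer="adam", loss=asymmetric_mse(1.1)) for lead
# times of 1-2 days, or asymmetric_mse(1.3) for lead times of 3-4 days.
```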
We propose using two different regression models to mitigate the class imbalance between the occurrence of heavy and moderate precipitation events in the regression task; henceforth, we refer to this as the “dual-regression model approach.” For extreme events, we observed that machine learning methods tended to underestimate the upper tail of the distribution and overestimate the moderate cases, due to the relatively low probability of extreme values in the distribution (Meech et al. 2020). To address this issue, we create a separate U-Net CNN model for heavy precipitation events, classified as those with mean West-WRF forecast accumulated precipitation above 2.5 mm. This corresponds to roughly all events above the 80th percentile in total accumulated precipitation, which was determined through the validation loss to be an effective separation for mitigating the class imbalance issue. For the remaining 80% of events, we train a separate U-Net model on the moderate cases. We thus obtain a tailored model for both heavy and moderate precipitation (hence, the dual-regression model approach). Note that in both the training and testing phases, the predetermined 2.5-mm threshold is applied to the West-WRF forecast input to determine which model trains or tests on a particular data point; the PRISM observations are never used to determine this threshold or the model assignment. The overall model architecture, including the dual-regression model approach and the binary classification model for binary masking, is shown in Fig. 2.
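The routing logic can be sketched as follows; the array layout (channel 0 holding the West-WRF forecast precipitation) and function names are illustrative assumptions.

```python
# Sketch of the dual-regression routing: the domain-mean West-WRF forecast
# precipitation decides which U-Net handles each sample. The array layout
# (channel 0 = forecast precipitation) is an assumption.
import numpy as np

THRESHOLD_MM = 2.5   # ~80th percentile of domain-mean forecast precipitation

def route_and_predict(x, heavy_model, moderate_model):
    # x: (batch, rows, cols, channels)
    domain_mean = x[..., 0].mean(axis=(1, 2))
    heavy = domain_mean > THRESHOLD_MM
    y = np.empty(x.shape[:3] + (1,), dtype=np.float32)
    if heavy.any():
        y[heavy] = heavy_model.predict(x[heavy])
    if (~heavy).any():
        y[~heavy] = moderate_model.predict(x[~heavy])
    return y
```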
While there exist deep learning techniques that resolve class imbalances in a more formal way, such as data augmentation, they rely on mutating the data (e.g., stretching or cropping), which may be less desirable for postprocessing problems with a numerical output (Perez and Wang 2017): these techniques produce an augmented input, but the numerical output (ground truth) then needs to be augmented as well. Other techniques, such as sampling the data through a weighted scheme to rebalance the ratio between heavy and moderate precipitation events, severely reduced accuracy on moderate events.
We leverage Keras, a standard machine learning library that supports the construction of convolutional neural networks, to develop our machine learning framework (Chollet et al. 2015). Parameter tuning for the learning rate, the initial number of filters per layer, and the loss function weights is accomplished through validation, and the chosen values are shown in Table 1. Using a grid search with 2–4 candidate values for each hyperparameter (learning rate, initial number of filters, kernel size, and pooling size), we selected the values that minimized the mean squared error on the validation set. Note that the number of filters was not constant across layers: the number of convolutional filters began at 64 and doubled at every following depth level. We did not use batch normalization, dropout, or L1/L2 regularization, and all other relevant hyperparameters and design decisions (such as the ReLU activation function, the number of filters doubling at each depth level, two convolutional layers at each depth level, etc.) are consistent with a standard implementation of the original U-Net architecture (Ronneberger et al. 2015).
Table 1. Considered and optimal (shown in bold) choices for each hyperparameter, based on the minimum validation loss, for the modified U-Net architecture used in this study. Note that all of these model architecture-related hyperparameters were constant across the models trained on all four lead times (1–4 days).
d. Testing, evaluation, and baseline methods
The CNN is evaluated over a chosen test set of four water years, which were selected based on categorical El Niño–Southern Oscillation (ENSO) years. We use one El Niño year (1997/98), one La Niña year (2011/12), and two ENSO-neutral years: one historically wet and one dry year (2016/17 and 2013/14, respectively). ENSO years have been shown to dramatically affect west coast precipitation regimes through large-scale pressure patterns that strongly alter precipitation predictability (Chapman et al. 2021; Kumar and Hoerling 1998). We also select particularly wet (2016/17) and dry (2013/14) years in which ENSO is in a neutral state, representing a surplus of precipitation and California drought conditions, respectively, without tropical sea surface temperature forcing. We choose these years in order to test the skill of our methods in varied climate regimes and on a variety of precipitation events. The years directly prior to the chosen years (i.e., 1996/97, 2010/11, 2015/16, 2012/13) serve as a validation set. The rest serve as the training set.
We use a testing process that closely mimics a production system, though it varies from the typical machine learning training/testing procedure in some ways. To test the model on one particular year of data drawn from the test set (e.g., 1997/98), we train our proposed model on all possible years except that testing year (e.g., 1997/98) and the corresponding validation year (the latter used to tune the hyperparameters; in this case, 1996/97). This process is repeated for all years in the test set, so we train four dual-regression models and binary masking models in total, none of which is trained on its corresponding test and validation years. Using the predictions of each of these four models on its corresponding test year (e.g., 1997/98 for the model untrained on the first year of the test set), we generate an overall test loss. This is similar to the cross-validation algorithm in machine learning applications. We refer to this as “one-shot” training, sketched below.
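In pseudocode form, the protocol can be sketched as follows; `build_model` and `load_years` are hypothetical placeholders for model construction and data loading.

```python
# Sketch of the "one-shot" training protocol; build_model and load_years
# are hypothetical placeholders, not functions from the actual codebase.
TEST_TO_VAL = {"1997/98": "1996/97", "2011/12": "2010/11",
               "2016/17": "2015/16", "2013/14": "2012/13"}
ALL_YEARS = [f"{y}/{str(y + 1)[-2:]}" for y in range(1985, 2019)]  # 34 water years

predictions = {}
for test_year, val_year in TEST_TO_VAL.items():
    # Train on every year except the test year and its validation year.
    train_years = [y for y in ALL_YEARS if y not in (test_year, val_year)]
    model = build_model()                      # hyperparameters tuned on val_year
    x_tr, y_tr = load_years(train_years)
    x_va, y_va = load_years([val_year])
    model.fit(x_tr, y_tr, validation_data=(x_va, y_va))
    x_te, _ = load_years([test_year])
    predictions[test_year] = model.predict(x_te)   # pooled into the test loss
```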
Traditional machine learning and dynamical postprocessing frameworks are compared to the proposed U-Net CNN to assess its predictive accuracy over the chosen test set; they also offer a baseline for the CNN’s forecasting skill. A prediction based on climatology is used as a consistency check for the CNN. It is constructed by averaging 30 days of observation data prior to any particular testing day over all years preceding it. The second comparison is with the West-WRF dynamical model, which is used as the input to the machine learning method. As such, any rectification of spatial or temporal biases relative to the West-WRF model is directly reflected in the CNN’s accuracy and errors.
Further, we implement a MOS based on L1-regularized multilinear regression (Tibshirani 1996) in scikit-learn. L1 regularization adds an absolute penalty term on the magnitude of the learned parameter vector θ in multilinear regression; typically, this results in feature selection through zeroing the coefficients of less salient features. The multiplier on the penalty term, λ, is selected through a grid search over the candidates {0.0001, 0.0005, 0.001} using the validation loss as the metric to minimize (the optimal value was 0.0001). The MOS presents a more traditional ML framework that serves as a baseline for the CNN. Like many other ML frameworks, the MOS leverages point-based learning, as opposed to the strategy adopted in a CNN. Note that the multilinear regression is configured to use the same predictors (precipitation, humidity, temperature) as the CNN and uses the same “one-shot” training for consistency.
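A sketch of this baseline with scikit-learn is shown below; the flattened per-grid-point feature layout and the stand-in data are assumptions for illustration.

```python
# Sketch of the MOS baseline: L1-regularized (lasso) multilinear regression.
# The per-grid-point feature layout and random stand-in data are assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_train = rng.random((1000, 7))   # per-point predictors: precip, humidity, temp
y_train = rng.random(1000)        # per-point PRISM precipitation

mos = Lasso(alpha=1e-4)           # optimal lambda from the grid search above
mos.fit(X_train, y_train)
print(mos.coef_)                  # the L1 penalty zeroes less salient features
```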
We evaluate the model using the following metrics: root-mean-square error (RMSE), mean absolute error (MAE), model bias (BIAS), critical success index (CSI), and Pearson correlation (PC). For CSI, we use a threshold of 0.1 mm. These metrics provide a comprehensive aggregated point-by-point analysis of the CNN’s performance with regard to the regression error and the categorical accuracy.
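For concreteness, these metrics can be computed on flattened forecast and observation arrays as in the following sketch (the array layout is an assumption).

```python
# Sketch of the evaluation metrics on flattened forecast/observation arrays;
# CSI uses the 0.1-mm event threshold stated above.
import numpy as np

def evaluate(pred, obs, csi_thresh=0.1):
    rmse = np.sqrt(np.mean((pred - obs) ** 2))
    mae = np.mean(np.abs(pred - obs))
    bias = np.mean(pred - obs)
    pc = np.corrcoef(pred, obs)[0, 1]
    hits = np.sum((pred >= csi_thresh) & (obs >= csi_thresh))
    misses = np.sum((pred < csi_thresh) & (obs >= csi_thresh))
    false_alarms = np.sum((pred >= csi_thresh) & (obs < csi_thresh))
    csi = hits / (hits + misses + false_alarms)
    return {"RMSE": rmse, "MAE": mae, "BIAS": bias, "CSI": csi, "PC": pc}
```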
Similarly to Sperati et al. (2017), to verify the spatial consistency of the predictions generated by each method, we also compare the pairwise correlations between all pairs of grid points in the predictions with those in the observations. When the pairwise correlations between a model’s grid points (e.g., the CNN’s) more closely match the pairwise correlations between the ground truth grid points, the model better reproduces the spatial relationships present in the ground truth.
3. Results and discussion
In this section we discuss quantitative, categorical, spatial, and temporal evaluations of the CNN in comparison with West-WRF and MOS, to assess the overall performance and the skill for extreme events. We also diagnose the interpretability of the CNN, illustrating its behavior by dissecting its internal hidden representations. We then further assess the method’s performance by analyzing the predictive skill as a function of lead time and verifying the spatial consistency of the produced precipitation fields.
a. Discussion of evaluation metrics
The results of an overall comparison between the CNN and all baseline models are shown in Table 2. The models are compared with respect to all the error metrics defined in section 2d: RMSE, MAE, PC, and CSI (with a threshold of 0.1 mm), aggregated over all lead times (1–4 days) and over all grid points. All metrics and improvements reported with intervals throughout the results section are bootstrap sampled (with 1000 samples) and presented with a 95% confidence interval to indicate whether the results are statistically significant.
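The bootstrap procedure can be sketched as follows; the choice of forecast days as the resampling unit is an assumption.

```python
# Sketch of the bootstrap confidence intervals (1000 samples, 95% level);
# resampling forecast days with replacement is an assumed choice of unit.
import numpy as np

def bootstrap_ci(pred, obs, metric, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = obs.shape[0]                            # first axis indexes days
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample days with replacement
        stats.append(metric(pred[idx], obs[idx]))
    return np.percentile(stats, [2.5, 97.5])    # 95% confidence interval
```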
Table 2. Overall comparison between chosen models aggregated over all lead times and grid points. Intervals indicate the 95% confidence level, and bold font indicates the best performing method for each metric.
Aggregated over the four lead times (1–4 days), the CNN’s RMSE consistently improves upon climatology by 34.1%–37.0%, West-WRF by 12.9%–15.9%, and MOS by 7.4%–8.9%, where intervals indicate 95% confidence intervals. Similarly, the CNN outperforms both West-WRF and MOS at all four lead times with respect to Pearson correlation (PC), by 2.7%–3.4% and 3.3%–4.2%, respectively. Over the same period, the CNN improves upon West-WRF’s CSI by 0.6%–1.5%, with greater improvements ranging from 2.7% to 5.6% for lead times of 24–48 h (not shown). Note that we do not provide a complete set of improvement statistics for the climatology baseline (CLIM) apart from RMSE, since the CNN consistently improves upon it by 40%–50% with regard to every metric.
Further, we present the results in terms of percentage improvements for heavy precipitation events. Table 3 shows the CNN’s improvement over all other models for all metrics for the top 10% heavy precipitation events in the test set. For reference, this corresponds to events with an average precipitation across the region of more than about 6 mm, in the right tail of the distribution. Note that the intervals indicate 95% confidence intervals.
Table 3. Error metric comparison for the top 10% heavy precipitation events between chosen models, aggregated over all lead times and grid points. Intervals indicate the 95% confidence level. Note that CSI is used with a threshold of 0.1 mm.
The performance of the models on the top 10% heaviest precipitation events again indicates that the CNN-based postprocessing yields the lowest forecast error. The CNN’s overall RMSE/MAE over these events is 19.8%–21.0%/17.7%–18.3% smaller than West-WRF’s and 8.8%–9.7%/5.4%–6.2% smaller than MOS’s over all lead times of 1–4 days. Further, the CNN’s PC over these events is improved by 4.9%–5.5% and 4.2%–4.7% compared to West-WRF and MOS, respectively. There is a slight, but statistically significant, improvement in CSI of 0.87%–0.96% by the CNN over West-WRF on these events.
The latter two metrics, PC and CSI, showcase the spatial and categorical accuracy of the methods, and RMSE summarizes the numerical accuracy. Thus, the CNN clearly outperformed the other postprocessing and dynamical methods over all lead times with respect to spatial, categorical, and numerical accuracy aggregated over heavy precipitation and all events. These improvements are all shown to be statistically significant over a 95% confidence interval.
We present a set of plots of observed versus predicted precipitation values in Fig. 3 for each model to visualize conditional biases in the models; we refer to these as reliability curves. Since there are approximately 100 million test points across lead times and grid points, we discretize the data for visualization purposes in all subfigures of Fig. 3. We construct 500 precipitation bins from 0 to 150 mm (the approximate minimum and maximum of the field in the test set) and aggregate the predicted and observed data within each bin using the median, as in the sketch below.
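The binning can be sketched as follows; whether points are binned by the forecast or by the observation is not specified above, so binning by the forecast, along with the stand-in data, is an assumption here.

```python
# Sketch of the reliability-curve binning: 500 bins over 0-150 mm, with the
# median of forecasts and observations in each bin. Binning by the forecast
# value and the gamma stand-in data are assumptions.
import numpy as np

rng = np.random.default_rng(0)
pred = rng.gamma(0.5, 4.0, 100_000)    # stand-in skewed forecast values (mm)
obs = rng.gamma(0.5, 4.0, 100_000)     # stand-in observations (mm)

edges = np.linspace(0.0, 150.0, 501)   # 500 bins
which = np.digitize(pred, edges)       # bin index for each point
occupied = np.unique(which)
pred_med = np.array([np.median(pred[which == k]) for k in occupied])
obs_med = np.array([np.median(obs[which == k]) for k in occupied])
# Plotting pred_med against obs_med on log-log axes reproduces the style
# of Figs. 3a-c.
```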
Since the data are heavily skewed, we visualize the relationship between the predicted and observed values using log–log axes in Figs. 3a–c. To further indicate the skew, we plot the 90th, 99th, and 99.9th percentiles of each model’s predicted values as dotted vertical lines. Figures 3a–c indicate that the CNN performs better for the vast majority of the data, up to the 99th percentile. In comparison, both West-WRF and MOS more heavily underpredict the moderate events, ranging from ground truth precipitation of 1 to 10 mm. However, all methods tend to underpredict rather than overpredict, especially for the heaviest precipitation beyond the 99th percentile. The underprediction is more consistent for the CNN and MOS than for West-WRF on these events. Note that, in contrast to Table 3, Figs. 3a–f plot precipitation at all individual grid points, whereas Table 3 shows the models’ improvement across the entire domain for the heaviest precipitation events (i.e., for the days with the greatest total accumulated precipitation).
Despite the more consistent underprediction by the CNN, its RMSE for precipitation values beyond the 99th percentile is still statistically significantly better. The CNN’s RMSE improvement over West-WRF on events with an observed precipitation value of over 30 mm (roughly corresponding to the 99th percentile in the observations) is 6.7%–7.0%, where the interval denotes a 95% confidence interval. Qualitatively, we explain this result through Figs. 3d–f, which do not use log–log axes. These figures make clear that despite less consistent underprediction by West-WRF, its variance (i.e., spread) around the ideal y = x line is greater, which leads to greater error for the heaviest precipitation events.
This performance is qualitatively consistent with other results reported in the literature on machine learning-based deterministic postprocessing methods. In a study postprocessing rainfall in a simulation of the 1994 Piedmont flood, Meech et al. (2020) reported a 32% improvement in RMSE over WRF using a CNN over a testing period of 2 months, with 3 forecast time steps and 169 WRF locations as input to their CNN. Those improvements were obtained over a smaller time frame and a more specialized dataset, but they are broadly comparable to the smaller 21% improvement achieved by our CNN over a longer and more varied testing period (4 years versus 2 months) and spatial extent (almost the entirety of California and a portion of Nevada versus the northwest of Italy).
b. Interpretability of CNN’s results
We present two avenues for interpreting the model’s behavior. The first explores the model’s postprocessing behavior on a case study in which the CNN achieves an RMSE improvement comparable to its overall RMSE improvement (i.e., an input on which the CNN’s improvement over other methods is average). The second displays intermediate output values from particular layers of the CNN on the test set to characterize its behavior.
Model behavior on a case study
We analyze the CNN’s behavior on a sample event compared to other models. Then, we decompose and visualize intermediate outputs from the CNN to make sense of its inner workings. This can reveal some physically based intuition that it has learned through training.
Figure 4 shows an example of a 96-h forecast of an extreme event in the test set that occurred on 10 February 2014. All methods predict a precipitation event centered at around 39°N, 121°W, with roughly the same shape and pattern. However, the West-WRF forecast shows a high bias. Both the CNN and MOS are able to correct this overestimated magnitude. Further, the CNN removes the incorrect “striped” patterns in West-WRF that are not present in the PRISM ground truth, which the MOS does not. Quantitatively, the CNN displays a 33.9% improvement in RMSE over West-WRF and a 5.6% improvement in RMSE over MOS. Therefore, compared to both methods, the CNN both qualitatively and quantitatively more closely resembles the observed patterns of the event as estimated by PRISM, especially within the heavy precipitation regions.
To further analyze the model behavior, we sample and visualize the output of the model at various stages of processing the input. This can provide a direct view into the intermediate representations that the model builds to postprocess precipitation. If these are physically meaningful in some way, it can build intuition as to how the model is successful in accomplishing the task and motivate a wider adoption of these types of methods. Note that this process was done for multiple such randomly sampled inputs and similar behavior was observed from the CNN.
For the same sample event on 10 February 2014, we sample the outputs of the third and ninth convolutional layers, which occur near the beginning and the end of the CNN’s processing, respectively (see Fig. 1). We rank the layer outputs by their sum of squares per filter and select the three filter outputs with the greatest squared norm. Note that the filter outputs correspond to the final dimension of the output of each convolutional layer, where the number of filters is controlled by the desired number of output channels, as described in Fig. 1.
Due to the size of the network, the layer outputs are sparse (i.e., most filter outputs are zero), so we only visualize the most information-dense filter outputs by ranking them as explained. We denote the visualized layer and filter outputs by the layer number and filter number, where L3 F146 describes the 146th filter output of the third convolutional layer.
We display the sampled layer outputs, along with the accumulated precipitation that served as model input, in Fig. 5. For all outputs from the ninth convolutional layer, we add ϵ = 0.01 to each element of the matrix and invert elementwise, because all mappings in layer 9 are observed to contain extremely high values in regions of no precipitation and near-zero values in regions of higher precipitation. The inversion is purely for visualization; the physical information represented within the matrix is unchanged. Note that the magnitudes of these intermediate outputs are not shown, since they are not physically meaningful. A sketch of this probing procedure follows.
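The probing can be sketched with Keras as below; `model` is the trained U-Net and `x_sample` the 10 February 2014 input, and the layer index and variable names are illustrative.

```python
# Sketch of the intermediate-output probing described above; `model`,
# `x_sample`, and the layer index are illustrative, not the exact setup.
import numpy as np
import tensorflow as tf

probe = tf.keras.Model(model.input, model.layers[9].output)  # a conv layer
act = probe.predict(x_sample)               # (1, rows, cols, n_filters)

energy = np.sum(act ** 2, axis=(0, 1, 2))   # squared norm per filter
top3 = np.argsort(energy)[::-1][:3]         # most information-dense filters

# Elementwise inversion with epsilon = 0.01 for visualizing layer-9 outputs.
inv = 1.0 / (act[..., top3] + 0.01)
```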
All of the intermediate layer outputs show a clear indication that the CNN is interpreting the same spatial region as the region of interest throughout all stages of processing. This is a preliminary indication that the inner workings of the CNN conform to physical constraints and logical expectation.
The visualizations of the feature space output by the third layer in Figs. 5b and 5c show that the CNN has picked up on the boundary of the region and the Channel Islands off the coast of California. This is reasonable, expected behavior, and it is consistent with past literature, which notes that CNNs initially pick up broad features (e.g., horizontal and vertical edges) before constructing representations of detailed structures in the image. In this case, the relevant edges are the boundaries denoting the region of interest.
For the ninth layer, we see more developed views and representations of the postprocessing task. Figure 5d shows the areas of highest precipitation, denoted by extremely high values of the inverted convolutional output (corresponding to extremely low values of the raw convolutional output), whereas Fig. 5e resembles its complement at first glance. If it were exactly the complement, the sum of the two filters would resemble a roughly uniform field. Instead, the inverted sum of the filters shows the contours of the precipitation pattern, which suggests that the CNN has learned to identify, represent, and bound regions of high precipitation.
The way in which the CNN featurizes the input and constructs the output can be seen through this evolution in the learned representations of precipitation fields in feature space from the third to the ninth layer. This results in its ability to postprocess accumulated precipitation qualitatively and quantitatively better than other methods analyzed in this study.
c. Temporal evaluation of models
We show some of the error metrics (RMSE, CRMSE, BIAS, PC) for each postprocessing and dynamical method as a function of the lead time in Fig. 6. This allows a more thorough examination of the propagation of error through increasing lead times.
Throughout the four lead times, the CNN consistently has the lowest CRMSE, as well as the highest Pearson correlation. The CNN consistently adds a day’s worth of predictive skill compared to West-WRF in terms of RMSE, CRMSE, and PC. For example, the CNN’s CRMSE and RMSE at day 4 are nearly equal to those of West-WRF at day 2 and are well below the corresponding values at day 3. We conclude that the CNN most effectively reduces the random component of the error and the conditional biases and increases the Pearson correlation, compared to both West-WRF and MOS.
The BIAS fluctuates for each postprocessing and forecasting method, but it is lower than the CRMSE by approximately an order of magnitude and therefore contributes only marginally to the RMSE. In other words, the average distance from any model’s prediction to the ground truth is less than 0.8 mm for any method; hence, the squared BIAS that contributes to the RMSE² (i.e., the MSE) is at most 0.64 mm². Using Eq. (3), this translates to less than 2% of the MSE being attributable to the BIAS. As such, it is reasonable not to draw conclusions from the magnitude of each model’s BIAS, as it is inconsequential in the decomposition of the error.
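For reference, assuming Eq. (3) is the standard decomposition of the mean squared error into its bias and centered components (an assumption, since Eq. (3) is not reproduced in this section), the relationship underlying this argument is

\[ \mathrm{MSE} = \mathrm{RMSE}^2 = \mathrm{BIAS}^2 + \mathrm{CRMSE}^2, \qquad \mathrm{BIAS} = \overline{f - o}, \qquad \mathrm{CRMSE}^2 = \overline{\left[(f - \bar{f}) - (o - \bar{o})\right]^2}, \]

where f and o denote the forecast and observation. With |BIAS| < 0.8 mm, it follows that BIAS² < 0.64 mm², the bound quoted above.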
In the study conducted by Meech et al. (2020) on postprocessing the 1994 Piedmont flood using a CNN, qualitatively similar results were shown, with RMSE and PC largely improved compared to WRF. They remarked that one issue with the model was its tendency to underestimate. While a similar problem is still observed here, as shown by the negative BIAS, it is markedly less pronounced: whereas they observed a BIAS of around −1.64 mm over a 2-month period, we report a typical BIAS of −0.3 mm across all lead times. In terms of contribution to the MSE, the CNN trained by Meech et al. (2020) contributed 2.69 mm², whereas our CNN contributed only around 0.09 mm².
Further, we evaluate the rate of growth of the RMSE to assess the CNN’s capability of producing longer-term forecasts and the scaling of the error as a function of lead time. A slower rate of error growth indicates a method that tracks better as a function of lead time: the less the error increases from one lead time to the next, the more predictive power the model retains. The average rate of growth of the RMSE over days 2–4 is far higher for West-WRF than for the CNN, which incurs 17.9% less error growth from day 3 to day 4. Similarly, the average rate of decay of PC over days 2–4 is reduced by 16.6% for the CNN compared to West-WRF.
d. Spatial evaluation of models
Many postprocessing schemes produce improved results, but they struggle with producing spatially consistent and physically meaningful fields (Clark et al. 2004). We test the ability of the CNN to generate a skillful prediction while preserving the spatial consistency of precipitation patterns as represented in PRISM throughout the region of interest (Sperati et al. 2017).
We first plot the spatial patterns of improvement in error metrics as compared to West-WRF, aggregated over lead times from day 1 to day 4. The result is shown in Fig. 7. Note that we do not perform statistical significance tests or compute confidence intervals for each grid point as it is computationally intractable across four metrics, 1000 bootstrap samples, and 57 600 total pixels.
The improvements in RMSE and MAE are consistently above 10%, with noteworthy improvements of around 30%–40% in the Sierra Nevada region. Similarly, the CNN improves upon West-WRF’s Pearson correlation coefficient by 5% or more, with improvements of 10%–15% in southern California. The sharp decrease in correlation in the northern region and throughout the California Channel Islands is likely attributable to the documented weakness of CNNs near domain boundaries, caused by spatial padding for convolution (Alsallakh et al. 2020).
The CNN’s improvements in CSI (with a threshold of 0.1 mm) are largely mixed, with coastal California showing around 10% improvement over West-WRF. In southern California and Nevada, the West-WRF model outperforms the CNN by 15%. However, it is important to note that regions in which the CNN more severely underperforms (the region highlighted in red) account for only 9.2% of the total precipitation in the region (i.e., they are dry areas).
The spatial consistency of the generated precipitation field is also examined using a pairwise correlation plot (Sperati et al. 2017). This is an important aspect of the forecast evaluation, because it explores the ability of the CNN to capture the spatial distribution of observed precipitation. In some instances, machine learning methods can perform well numerically, but the produced fields themselves are physically inconsistent, perhaps due to artifacts in the field, non-smooth patterns (i.e., sharpness), or extreme outliers (Clark et al. 2004). In those cases, the relationships between pairs of grid points in the model output would behave unlike those in the PRISM observations, and the pairwise correlations between the PRISM observations’ grid points would not match the pairwise correlations between the grid points of the model output. The less the relationships between grid points match those of the PRISM observations, the less spatially consistent the model output is.
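The diagnostic can be sketched as below; subsampling the grid to keep the correlation matrix tractable, along with the array names and stand-in data, is an assumption.

```python
# Sketch of the pairwise-correlation diagnostic: correlate every pair of
# (subsampled) grid points across time for the model and for PRISM, then
# compare. Subsampling stride and stand-in data are assumptions.
import numpy as np

def pairwise_corr(field, stride=10):
    # field: (n_days, rows, cols); keep every `stride`th grid point
    pts = field[:, ::stride, ::stride].reshape(field.shape[0], -1)
    return np.corrcoef(pts, rowvar=False)       # (n_pts, n_pts)

rng = np.random.default_rng(0)
prism_fields = rng.random((365, 100, 100))      # stand-in (n_days, rows, cols)
cnn_fields = prism_fields + 0.1 * rng.random((365, 100, 100))

corr_obs = pairwise_corr(prism_fields)
corr_cnn = pairwise_corr(cnn_fields)
iu = np.triu_indices_from(corr_obs, k=1)        # unique grid-point pairs
# A density plot of corr_cnn[iu] against corr_obs[iu] reproduces the style
# of Fig. 8; points near the 1:1 line indicate preserved spatial structure.
```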
The pairwise correlation plot is shown in Fig. 8 for both the CNN and West-WRF. For a perfect forecasting or postprocessing method, we would expect the correlation between each pair of grid cells to match that of the observation set, as shown by the 1:1 line in orange. The actual distribution of pairwise correlations of the CNN and West-WRF with respect to PRISM is shown as a density plot. Qualitatively, the CNN maintains the spatial attributes of the PRISM observations at least as well as West-WRF, as its spread is just as concentrated along the 1:1 line. The higher coefficient of determination (R²) of the CNN pairwise correlation plot indicates that its dispersion around the identity line is lower than that of West-WRF, demonstrating the CNN’s superior spatial consistency with the PRISM ground truth. Note that this analysis does not account for observational error.
4. Conclusions
The U-Net convolutional neural network (CNN) architecture, originally proposed by Ronneberger et al. (2015), and adapted in this study for precipitation prediction, provides a computationally efficient and consistently accurate postprocessing framework over different types of water years that outperforms competing machine learning and dynamical models.
It provides statistically significant, superior spatial consistency and numerical accuracy over all lead times, as summarized by the 12.9%–15.9% improvement in root-mean-square error (RMSE) over the Western Weather Research and Forecasting (West-WRF) Model and the 7.4%–8.9% improvement over model output statistics (MOS). Categorical metrics such as CSI (threshold of 0.1 mm) are improved by the CNN by 0.6%–1.5% and 82%–83% compared to West-WRF and MOS, respectively. Additionally, the CNN outperforms the other methods by a larger margin for the prediction of the top 10% heaviest precipitation events, showing a 19.8%–21.0% and 8.8%–9.7% improvement in RMSE over West-WRF and MOS, respectively, for these events. Despite the CNN’s superior predictive skill over the majority of events, we observe an underprediction issue beyond the 99th percentile of the heaviest precipitation events.
We examine the CNN’s performance across different forecast lead times and across the region of interest to verify its temporal and spatial consistency. The CNN displays a reduced rate of degradation in metrics such as RMSE and Pearson correlation as lead time increases, which effectively yields more than a day of additional predictive skill with respect to West-WRF. In terms of spatial consistency, the CNN demonstrates greater skill than West-WRF in preserving the ground-truth pairwise correlations between grid points. This demonstrates a consistent postprocessing framework that improves upon the spatial and temporal biases of dynamical models and other postprocessing methods over the western United States.
Further, the CNN method has considerable potential to improve operational forecasts. Given a dynamical model forecast, the CNN can postprocess it in less than 100 ms, owing to the optimized convolution operations implemented within Keras. With a large upfront cost for training the model and minimal cost associated with using it for operational postprocessing, it presents potential for future study of CNNs for operational forecasting.
Additional future work includes examining the temporal association between day-to-day forecasts, using recurrent neural networks or transformers along with an encoding convolutional neural network; the convolutional LSTM layer developed by Shi et al. (2015) provides a promising avenue to explore this further. Additionally, we propose refining the binary masking technique based on feature sharing with the regression network and class imbalance techniques such as tuning the probability threshold for binary classification (i.e., choosing a value other than 0.5). This may help remedy the CNN’s tendency to underpredict the heavier precipitation. To further address the underprediction issue, we propose following the bias correction technique for rare events proposed by Alessandrini et al. (2019). Finally, we propose using a broader feature set with predictors, such as IVT, that may be useful for precipitation prediction.
Acknowledgments.
This research was supported by USACE FIRO Grant W912HZ1520019 and CDWR AR Program Grant 4600013361.
Data availability statement.
The precipitation data from the West-WRF model and PRISM observational data and the model outputs to generate any of the results or models in this study are permanently archived in the UC San Diego Digital Library Collection (https://library.ucsd.edu/dc/). The data are around 600 GB in total. Please see Badrinath et al. (2023) for more details.
REFERENCES
Alessandrini, S., L. Delle Monache, S. Sperati, and J. N. Nissen, 2015: A novel application of an analog ensemble for short-term wind power forecasting. Renewable Energy, 76, 768–781, https://doi.org/10.1016/j.renene.2014.11.061.
Alessandrini, S., S. Sperati, and L. Delle Monache, 2019: Improving the analog ensemble wind speed forecasts for rare events. Mon. Wea. Rev., 147, 2677–2692, https://doi.org/10.1175/MWR-D-19-0006.1.
Alsallakh, B., N. Kokhlikyan, V. Miglani, J. Yuan, and O. Reblitz-Richardson, 2020: Mind the pad—CNNs can develop blind spots. arXiv, 2010.02178v1, https://doi.org/10.48550/arXiv.2010.02178.
American Meteorological Society, 2019: Atmospheric river. Glossary of Meteorology, http://glossary.ametsoc.org/wiki/atmospheric_river.
Anelli, V. W., T. Di Noia, E. Di Sciascio, C. Pomo, and A. Ragone, 2019: On the discriminative power of hyper-parameters in cross-validation and how to choose them. Proc. 13th ACM Conf. on Recommender Systems (RecSys’19), Copenhagen, Denmark, Association for Computing Machinery, 447–451, https://doi.org/10.1145/3298689.3347010.
Badrinath, A., L. Delle Monache, N. Hayatbini, W. Chapman, F. Cannon, and M. Ralph, 2023: Improving Precipitation Forecasts with Convolutional Neural Networks. UC San Diego Library Digital Collections, https://doi.org/10.6075/J0J103BG.
Cannon, F., and Coauthors, 2020: Observations and predictability of a high-impact narrow cold-frontal rainband over Southern California on 2 February 2019. Wea. Forecasting, 35, 2083–2097, https://doi.org/10.1175/WAF-D-20-0012.1.
Cervone, G., L. Clemente-Harding, S. Alessandrini, and L. Delle Monache, 2017: Short-term photovoltaic power forecasting using artificial neural networks and an analog ensemble. Renewable Energy, 108, 274–286, https://doi.org/10.1016/j.renene.2017.02.052.
Chapman, W. E., A. C. Subramanian, L. Delle Monache, S. P. Xie, and F. M. Ralph, 2019: Improving atmospheric river forecasts with machine learning. Geophys. Res. Lett., 46, 10 627–10 635, https://doi.org/10.1029/2019GL083662.
Chapman, W. E., A. C. Subramanian, S.-P. Xie, M. D. Sierks, F. M. Ralph, and Y. Kamae, 2021: Monthly modulations of ENSO teleconnections: Implications for potential predictability in North America. J. Climate, 34, 5899–5921, https://doi.org/10.1175/JCLI-D-20-0391.1.
Chapman, W. E., L. Delle Monache, S. Alessandrini, A. C. Subramanian, F. M. Ralph, S.-P. Xie, S. Lerch, and N. Hayatbini, 2022: Probabilistic predictions from deterministic atmospheric river forecasts with deep learning. Mon. Wea. Rev., 150, 215–234, https://doi.org/10.1175/MWR-D-21-0106.1.
Chollet, F., and Coauthors, 2015: Keras. GitHub, accessed 1 June 2020, https://github.com/fchollet/keras.
Clark, M., S. Gangopadhyay, L. Hay, B. Rajagopalan, and R. Wilby, 2004: The Schaake Shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields. J. Hydrometeor., 5, 243–262, https://doi.org/10.1175/1525-7541(2004)005<0243:TSSAMF>2.0.CO;2.
Collins, M., and M. R. Allen, 2002: Assessing the relative roles of initial and boundary conditions in interannual to decadal climate predictability. J. Climate, 15, 3104–3109, https://doi.org/10.1175/1520-0442(2002)015<3104:ATRROI>2.0.CO;2.
Corringham, T. W., F. M. Ralph, A. Gershunov, D. R. Cayan, and C. A. Talbot, 2019: Atmospheric rivers drive flood damages in the western United States. Sci. Adv., 5, eaax4631, https://doi.org/10.1126/sciadv.aax4631.
Daly, C., M. Halbleib, J. I. Smith, W. P. Gibson, M. K. Doggett, G. H. Taylor, J. Curtis, and P. P. Pasteris, 2008: Physiographically sensitive mapping of climatological temperature and precipitation across the conterminous United States. Int. J. Climatol., 28, 2031–2064, https://doi.org/10.1002/joc.1688.
Daly, C., M. E. Slater, J. A. Roberti, S. H. Laseter, and L. W. Swift Jr., 2017: High-resolution precipitation mapping in a mountainous watershed: Ground truth for evaluating uncertainty in a national precipitation dataset. Int. J. Climatol., 37, 124–137, https://doi.org/10.1002/joc.4986.
Delle Monache, L., F. A. Eckel, D. L. Rife, B. Nagarajan, and K. Searight, 2013: Probabilistic weather prediction with an analog ensemble. Mon. Wea. Rev., 141, 3498–3516, https://doi.org/10.1175/MWR-D-12-00281.1.
Ghazvinian, M., Y. Zhang, D.-J. Seo, M. He, and N. Fernando, 2021: A novel hybrid artificial neural network-parametric scheme for postprocessing medium-range precipitation forecasts. Adv. Water Resour., 151, 103907, https://doi.org/10.1016/j.advwatres.2021.103907.
Gibson, P. B., D. E. Waliser, H. Lee, B. Tian, and E. Massoud, 2019: Climate model evaluation in the presence of observational uncertainty: Precipitation indices over the contiguous United States. J. Hydrometeor., 20, 1339–1357, https://doi.org/10.1175/JHM-D-18-0230.1.
Glahn, H. R., and D. A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. J. Appl. Meteor. Climatol., 11, 1203–1211, https://doi.org/10.1175/1520-0450(1972)011<1203:TUOMOS>2.0.CO;2.
Grönquist, P., C. Yao, T. Ben-Nun, N. Dryden, P. Dueben, S. Li, and T. Hoefler, 2021: Deep learning for post-processing ensemble weather forecasts. Philos. Trans. Roy. Soc., A379, 20200092, https://doi.org/10.1098/rsta.2020.0092.
Hall, T., H. E. Brooks, and C. A. Doswell III, 1999: Precipitation forecasting using a neural network. Wea. Forecasting, 14, 338–345, https://doi.org/10.1175/1520-0434(1999)014<0338:PFUANN>2.0.CO;2.
Ham, Y.-G., J.-H. Kim, and J.-J. Luo, 2019: Deep learning for multi-year ENSO forecasts. Nature, 573, 568–572, https://doi.org/10.1038/s41586-019-1559-7.
Haupt, S. E., W. Chapman, S. V. Adams, C. Kirkwood, J. S. Hosking, N. H. Robinson, S. Lerch, and A. C. Subramanian, 2021: Towards implementing artificial intelligence post-processing in weather and climate: Proposed actions from the Oxford 2019 workshop. Philos. Trans. Roy. Soc., A379, 20200091, https://doi.org/10.1098/rsta.2020.0091.
Hayatbini, N., and Coauthors, 2019: Conditional generative adversarial networks (cGANS) for near real-time precipitation estimation from multispectral GOES-16 satellite imageries—PERSIANN-cGAN. Remote Sens., 11, 2193, https://doi.org/10.3390/rs11192193.
Jasperse, J., and Coauthors, 2020: Lake Mendocino forecast informed reservoir operations final viability assessment. Lake Mendocino FIRO Steering Committee, 141 pp., https://cw3e.ucsd.edu/FIRO_docs/LakeMendocino_FIRO_FVA.pdf.
Krizhevsky, A., I. Sutskever, and G. E. Hinton, 2012: ImageNet classification with deep convolutional neural networks. Proc. 25th Int. Conf. on Neural Information Processing Systems (NIPS’12), Vol. 25, Lake Tahoe, NV, Association for Computing Machinery, 1097–1105, https://dl.acm.org/doi/10.5555/2999134.2999257.
Kumar, A., and M. P. Hoerling, 1998: Annual cycle of Pacific–North American seasonal predictability associated with different phases of ENSO. J. Climate, 11, 3295–3308, https://doi.org/10.1175/1520-0442(1998)011<3295:ACOPNA>2.0.CO;2.
Louka, P., G. Galanis, N. Siebert, G. Kariniotakis, P. Katsafados, I. Pytharoulis, and G. Kallos, 2008: Improvements in wind speed forecasts for wind power prediction purposes using Kalman filtering. J. Wind Eng. Ind. Aerodyn., 96, 2348–2362, https://doi.org/10.1016/j.jweia.2008.03.013.
Martin, A. C., F. M. Ralph, A. Wilson, L. DeHaan, and B. Kawzenuk, 2019: Rapid cyclogenesis from a mesoscale frontal wave on an atmospheric river: Impacts on forecast skill and predictability during atmospheric river landfall. J. Hydrometeor., 20, 1779–1794, https://doi.org/10.1175/JHM-D-18-0239.1.
Meech, S., S. Alessandrini, W. Chapman, and L. Delle Monache, 2020: Post-processing rainfall in a high-resolution simulation of the 1994 Piedmont flood. Bull. Atmos. Sci. Technol., 1, 373–385, https://doi.org/10.1007/s42865-020-00028-z.
Nicolis, C., and S. C. Nicolis, 2007: Return time statistics of extreme events in deterministic dynamical systems. Europhys. Lett., 80, 40003, https://doi.org/10.1209/0295-5075/80/40003.
O’Donnell, A., and Coauthors, 2020: Estimating benefits of forecast-informed reservoir operations (FIRO): Lake Mendocino case-study and transferable decision support tool. 2020 Fall Meeting, online, Amer. Geophys. Union, Abstract SY015-0002.
Perez, L., and J. Wang, 2017: The effectiveness of data augmentation in image classification using deep learning. arXiv, 1712.04621v1, https://doi.org/10.48550/arXiv.1712.04621.
PRISM Climate Group, 2004: PRISM data. PRISM Climate Group, accessed 1 June 2020, http://prism.oregonstate.edu.
Ralph, F. M., J. J. Rutz, J. M. Cordeira, M. Dettinger, M. Anderson, D. Reynolds, L. J. Schick, and C. Smallcomb, 2019: A scale to characterize the strength and impacts of atmospheric rivers. Bull. Amer. Meteor. Soc., 100, 269–289, https://doi.org/10.1175/BAMS-D-18-0023.1.
Rasp, S., and S. Lerch, 2018: Neural networks for postprocessing ensemble weather forecasts. Mon. Wea. Rev., 146, 3885–3900, https://doi.org/10.1175/MWR-D-18-0187.1.
Ronneberger, O., P. Fischer, and T. Brox, 2015: U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention, N. Navab et al., Eds., Lecture Notes in Computer Science, Vol. 9351, Springer, 234–241.
Shi, X., Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo, 2015: Convolutional LSTM network: A machine learning approach for precipitation nowcasting. arXiv, 1506.04214v2, https://doi.org/10.48550/arXiv.1506.04214.
Sperati, S., S. Alessandrini, and L. Delle Monache, 2017: Gridded probabilistic weather forecasts with an analog ensemble. Quart. J. Roy. Meteor. Soc., 143, 2874–2885, https://doi.org/10.1002/qj.3137.
Tao, Y., X. Gao, K. Hsu, S. Sorooshian, and A. Ihler, 2016: A deep neural network modeling framework to reduce bias in satellite precipitation products. J. Hydrometeor., 17, 931–945, https://doi.org/10.1175/JHM-D-15-0075.1.
Tibshirani, R., 1996: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc., 58, 267–288, https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.
Vannitsem, S., and M. Ghil, 2017: Evidence of coupling in ocean-atmosphere dynamics over the North Atlantic. Geophys. Res. Lett., 44, 2016–2026, https://doi.org/10.1002/2016GL072229.
Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world. Bull. Amer. Meteor. Soc., 102, E681–E699, https://doi.org/10.1175/BAMS-D-19-0308.1.