## Abstract

This study utilizes Bayesian model averaging (BMA) as a framework to constrain the spread of uncertainty in climate projections of precipitation over the contiguous United States (CONUS). We use a subset of historical model simulations and future model projections (RCP8.5) from the Coupled Model Intercomparison Project phase 5 (CMIP5). We evaluate the representation of five precipitation summary metrics in the historical simulations using observations from the NASA Tropical Rainfall Measuring Mission (TRMM) satellites. The summary metrics include mean, annual and interannual variability, and maximum and minimum extremes of precipitation. The estimated model average produced with BMA is shown to have higher accuracy in simulating mean rainfall than the ensemble mean (RMSE of 0.49 for BMA versus 0.65 for ensemble mean), and a more constrained spread of uncertainty, roughly one-third of that produced with the multimodel ensemble. The results show that, by the end of the century, the mean daily rainfall is projected to increase for most of the East Coast and the Northwest, may decrease in the southern United States, and is expected to change little in the Southwest. For extremes, the wettest year on record is projected to become wetter for the majority of CONUS and the driest year to become drier. We show that BMA offers a framework to more accurately estimate and to constrain the spread of uncertainties of future climate, such as precipitation changes over CONUS.

## 1. Introduction

Annual precipitation averaged across the contiguous United States (CONUS) has increased approximately 4% from 1901 to 2015 (USGCRP 2017). Changes in precipitation are one of the most important potential outcomes of climate change because precipitation is a critical factor for the functioning of societies and ecosystems. Regional differences are apparent, as the Northeast, Midwest, and Great Plains have had increases while parts of the Southwest and Southeast have had decreases in precipitation (Walsh et al. 2014; Easterling et al. 2017). Seasonal differences are also apparent, as the northern United States received more precipitation in the spring relative to historical climate, parts of the southwestern United States received less in the winter and spring, and the northeast United States received more in the summer and fall (Peterson et al. 2013). Furthermore, heavy precipitation events in most parts of CONUS have increased in both intensity and frequency (Dettinger 2011; Janssen et al. 2014, 2016; Easterling et al. 2017). In particular, mesoscale convective systems, which are the main mechanism for warm season precipitation in the central part of the United States, have increased in occurrence and precipitation amounts since 1979 (Feng et al. 2016). Additionally, atmospheric rivers, which are narrow jets of integrated water vapor transport that have large impacts on local weather and regional hydrology, have also increased in frequency and intensity in recent decades (Dettinger 2011; Dettinger and Cayan 2014). While uncertainty remains in the magnitude, and in some places, the sign of projected change in mean precipitation, future increase in the frequency and intensity of extreme events is broadly projected across most parts of CONUS (Dettinger 2011; Pierce et al. 2013; Janssen et al. 2014, 2016; Payne and Magnusdottir 2015; Warner et al. 2015; Gao et al. 2015; Radić et al. 2015; Hagos et al. 2016; Shields and Kiehl, 2016a,b; Espinoza et al. 2018; Massoud et al. 
2019a).

Changes in precipitation in a warmer climate are governed by many factors. A primary physical mechanism for increases in precipitation is the enhanced water vapor content in a warmer atmosphere, which enhances moisture convergence into storms (Wang et al. 2015). Although energy constraints can be used to understand global changes in precipitation (Shepherd 2014), projecting regional changes is much more difficult because of uncertainty in projecting changes in the large-scale circulation. For CONUS, future changes in seasonal average precipitation will include a mix of increases, decreases, or little change, depending on location and season (Easterling et al. 2017; USGCRP 2017). Over the globe, high-latitude regions are generally projected to become wetter, whereas the subtropical regions are projected to become drier. Since CONUS lies between these two regions, there is significant uncertainty about the sign and magnitude of future changes to precipitation in much of the region, particularly in the middle latitudes of the nation. However, since atmospheric water vapor will increase with increasing temperatures, confidence is high that precipitation extremes in the form of heavy rainfall will increase in frequency and intensity in the future throughout CONUS.

Through the Coupled Model Intercomparison Project (CMIP), the climate modeling community provides a suite of model simulations and projections (e.g., CMIP5; Taylor et al. 2012). These models are used to characterize climate projection uncertainty arising from model differences and large ensemble simulations (cf. Massoud et al. 2019b, 2020), as well as to characterize uncertainty inherent in the climate system due to internal variability (e.g., Kay et al. 2015). These ensembles provide an important resource for examining and evaluating the models that cause uncertainties in future climate projections (Lee et al. 2018). Often, when creating multimodel averages, projections of the future from each model are considered to be equally likely, without accounting for model skill or for the fact that some models are very similar to other models in the archive, which could lead to a biased weighting (Collins et al. 2013; Espinoza et al. 2018; Massoud et al. 2018). Owing to different model performances with respect to observations (Knutti and Sedláček 2013; Hidalgo and Alfaro, 2015; Gibson et al. 2019) and the lack of independence among models (Annan and Hargreaves 2011; Sanderson and Knutti 2012; Sanderson et al. 2015), there is evidence that giving equal weight to each available model projection is suboptimal (Knutti et al. 2010; Wenzel et al. 2014; Abramowitz and Bishop 2015; Alexander and Easterbrook 2015; Sanderson et al. 2015, 2017; Knutti et al. 2017; Eyring et al. 2019; Massoud et al. 2019a). The concept of model dependence becomes an issue when there is “double counting” of models that are simulated with common parameterizations or tuning practices, and therefore the assumption is that models that demonstrate dependence with other models should be penalized. These underlying assumptions have been challenged by a number of studies over recent years (Masson and Knutti 2011; Pennell and Reichler 2011; Knutti and Sedláček 2013; Sanderson et al. 2015; Knutti et al. 
2017; Herger et al. 2018; Massoud et al. 2019a). Therefore, current advances in model averaging have focused on the concept of model skill as well as model independence (Knutti et al. 2010; Melillo et al. 2014; Herger et al. 2018; Massoud et al. 2019a).

There have been numerous studies that investigate the concept of model averaging based on model skill or independence. A number of studies have attempted to weight models accounting only for model skill. For example, Tebaldi and Knutti (2007) proposed an ensemble averaging scheme that increased the weight of models with low observational biases, but the method potentially discounts outlier projections and does not consider model dependence. Bishop and Abramowitz (2013) proposed a method that produced a set of statistically independent model averages from the original archive, and applied this method to CMIP5 projections in Abramowitz and Bishop (2015). While their method minimizes the error of the model average compared to an observed target and is by definition optimal, the coefficients of each model can be positive or negative, and negative weights can no longer be directly interpreted as physical entities that conserve mass or energy. Another study is Langenbrunner and Neelin (2017), which sampled equally weighted model combinations created by a mixture distribution strategy and defined Pareto-optimal solutions to project changes in precipitation over California. However, this method also did not consider model independence. Sanderson et al. (2015) presented a weighting strategy for use with climate model ensembles that considers both skill in the climatological performance of models and the interdependency of models. This strategy has since been implemented in several studies (e.g., Sanderson et al. 2015, 2017; Knutti et al. 2017; Massoud et al. 2019a), as well as in the Fourth National Climate Assessment report (Easterling et al. 2017; USGCRP 2017).

Eyring et al. (2019) thoroughly discuss the state of climate model evaluation and of constraining the spread of uncertainty in future projections, and they show there is a need for better understanding of advanced weighting methods to produce more tightly constrained climate projections. Therefore, there is a need for model averaging tools that more judiciously combine multimodel climate projections while considering the interdependency of models, such as the Bayesian model averaging (BMA) method (Hoeting et al. 1999). BMA is an approach that produces a multimodel average created from optimized model weights, which correspond to a distribution of weights for each model, such that the BMA-weighted model ensemble average for the historical simulation closely matches the observational reference constraint (see Fig. 1). In essence, the close fit to observations is a consequence of applying higher weights to more skillful models. Furthermore, since the BMA method estimates a distribution of model weights, various model combinations become possible, which implicitly takes care of the model dependence issue. To the best of our knowledge, BMA has not been widely used to constrain uncertainties in future projections from global climate models, with the exception of Massoud et al. (2019a), which used BMA to constrain the spread of uncertainty in future projections of global atmospheric rivers. Other studies have used similar methods: Olson et al. (2016) and Fan et al. (2017) used BMA to make probabilistic projections of temperature in southeast Australia from a suite of regional climate models, and Olson et al. (2019) used a quasi-Bayesian method to weight climate model projections of summer mean maximum temperature change over Korea. Thus, in our study, we use BMA as a framework for model weighting that assesses model skill and independence and, as a result, constrains the spread of uncertainty in future climate projections from a set of global climate models.

We implement BMA to estimate future precipitation changes over CONUS using a subset of CMIP5 models (historical and representative concentration pathway 8.5, or RCP8.5). The models are evaluated for fidelity with the Tropical Rainfall Measuring Mission (TRMM; Kummerow et al. 1998), which provides observed information on precipitation. In this study, we utilize five precipitation metrics, including mean daily rainfall, annual and interannual variability of the mean daily rainfall, and wet and dry extremes represented as the mean daily rainfall during the wettest and driest years on record, respectively (depicted in Fig. 2). These metrics are chosen to assure that we are not only rewarding models that have high skill in representing mean precipitation, but also other aspects of the precipitation distribution such as variability or extremes. This is one of the first studies to focus on and estimate future precipitation changes over CONUS using a weighted model strategy that is based on skill and independence (cf. Sanderson et al. 2017), and the first to use BMA as a framework to produce a multimodel average that constrains the spread of uncertainty in end-of-century precipitation projections over CONUS.

## 2. Data and methods

### a. Observations: TRMM and CPC precipitation

TRMM was a joint U.S.–Japan satellite mission for monitoring tropical and subtropical precipitation (Kummerow et al. 1998). The satellite carried a number of precipitation-related instruments, including a precipitation radar, a visible and infrared sensor, and a passive microwave imager (Kummerow et al. 2000; Huffman et al. 2007; Almazroui 2011). We used the “daily accumulated precipitation” variable from the “TRMM 3B42v7 daily” product for our study, for the years 1998–2016. All TRMM estimates are regridded to a common grid (0.5° × 0.5°) using bilinear interpolation. Then, all the precipitation summary metrics applied in this study are computed, i.e., long-term mean daily precipitation, annual cycle and interannual variability in daily precipitation, and the wettest and driest years on record representing the extremes at each grid cell (explained in Fig. 2). The estimates for CONUS computed for each of the metrics obtained from TRMM are shown in Fig. 3 (note that some parts of Canada and Mexico are also shown in the maps).
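
The regridding step can be illustrated with a minimal NumPy sketch of bilinear interpolation onto a regular target grid (the function and variable names here are ours; production workflows typically rely on dedicated regridding libraries):

```python
import numpy as np

def bilinear_regrid(src_lat, src_lon, field, dst_lat, dst_lon):
    """Bilinearly interpolate a (lat x lon) field onto a new regular grid.

    Source coordinate arrays must be in ascending order, and the
    destination points should lie within the source grid's extent.
    """
    # Fractional positions of the destination coordinates in source-index space
    fi = np.interp(dst_lat, src_lat, np.arange(src_lat.size))
    fj = np.interp(dst_lon, src_lon, np.arange(src_lon.size))
    i0 = np.clip(fi.astype(int), 0, src_lat.size - 2)
    j0 = np.clip(fj.astype(int), 0, src_lon.size - 2)
    di = (fi - i0)[:, None]   # fractional offsets within each source cell
    dj = (fj - j0)[None, :]
    I0, J0 = np.meshgrid(i0, j0, indexing="ij")
    # Weighted average of the four surrounding source grid cells
    return ((1 - di) * (1 - dj) * field[I0, J0]
            + (1 - di) * dj * field[I0, J0 + 1]
            + di * (1 - dj) * field[I0 + 1, J0]
            + di * dj * field[I0 + 1, J0 + 1])
```

Because the interpolation is linear in both directions, a field that varies linearly with latitude and longitude is reproduced exactly on the target grid.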

Gibson et al. (2019) showed that climate model evaluation for precipitation indices over CONUS can be highly sensitive to the reference observational products used, such as in situ, satellite, or reanalysis data. As a result, it was important to test the sensitivity of the BMA weights trained using TRMM as a reference product against the use of other reference observational products. For this, we use the Climate Prediction Center (CPC) station data from the U.S. unified rain gauge dataset, which is composed of multiple sources (Higgins et al. 2000). The data record of the CPC observations spans 1950 to 2005. We apply the BMA model weight estimation to various time periods, including 1998–2005 (8 years), 1991–2005 (15 years), 1977–2005 (29 years), and 1950–2005 (56 years). We use the various time periods to minimize the sensitivity of the results to climate variability.

### b. Model data: CMIP5 suite of models

Available climate models from the CMIP5 suite of models are used (Hibbard et al. 2007; Meehl and Hibbard 2007; Meehl et al. 2009) from the historical experiment and the future projected changes (RCP8.5). The models used in our study are listed in Table 1. All model simulations are regridded to a common grid (0.5° × 0.5°) using bilinear interpolation to match the resolution of the regridded TRMM estimates, which allows direct comparison and model weighting. Then, all the precipitation summary metrics investigated in this study are computed. Once the models are prepared for evaluation against the TRMM observations, the BMA model weights can be estimated. All CMIP5 models used in this study are the “r1i1p1” version of the model simulations for the years 1998–2005, which were determined as the years of the simulations that overlap the TRMM data (i.e., the TRMM data start in 1998 and the CMIP5 historical simulations end in 2005, so the period 1998–2005 is the maximum overlapping period that can be used for this analysis). The precipitation variable extracted from each model is “precipitation_flux,” which represents precipitation in units of kilograms per square meter per second (kg m^{−2} s^{−1}) but was converted to millimeters per day (mm day^{−1}) for this study.
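
The unit conversion is a simple scaling: 1 kg of water spread over 1 m² forms a layer 1 mm deep, so a flux in kg m^{−2} s^{−1} is numerically a depth rate in mm s^{−1}, and multiplying by the number of seconds in a day gives mm day^{−1}. A minimal sketch (the function name is ours):

```python
SECONDS_PER_DAY = 86400.0  # 24 h * 60 min * 60 s

def flux_to_mm_per_day(pr_flux):
    """Convert a precipitation flux (kg m^-2 s^-1) to a rate in mm day^-1.

    Works elementwise on scalars or NumPy arrays alike, since it is a
    pure multiplication.
    """
    return pr_flux * SECONDS_PER_DAY
```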

### c. Precipitation metrics

In this study, we utilize five precipitation metrics to apply the BMA weighting; all are depicted in Fig. 2. The first is mean daily rainfall, which is the long-term average of daily precipitation. The second metric is the annual variability of the mean daily rainfall, which represents the annual cycle, i.e., how much variance in precipitation occurs over the course of the year. The third metric is interannual variability, which describes the year-to-year variability. To describe precipitation extremes, the fourth metric is the mean daily rainfall during the wettest year on record, and the fifth metric is the mean daily rainfall during the driest year on record.
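
For a single grid cell, the five metrics can be sketched from a daily rainfall series as follows (a minimal NumPy sketch; the precise variance definitions, e.g., population variance of the monthly climatology and of the annual means, are our assumptions):

```python
import numpy as np

def precipitation_metrics(daily, years, months):
    """Five summary metrics from a daily rainfall series (mm/day) at one grid cell."""
    daily = np.asarray(daily, float)
    years = np.asarray(years)
    months = np.asarray(months)

    mean_daily = daily.mean()  # OF1: long-term mean daily rainfall

    # OF2: annual-cycle variability, here the variance of the monthly climatology
    monthly_clim = np.array([daily[months == m].mean() for m in range(1, 13)])
    annual_var = monthly_clim.var()

    # OF3: interannual variability, the variance of the annual mean rainfall
    annual_means = np.array([daily[years == y].mean() for y in np.unique(years)])
    interannual_var = annual_means.var()

    # OF4 / OF5: mean daily rainfall in the wettest and driest years on record
    wettest = annual_means.max()
    driest = annual_means.min()
    return mean_daily, annual_var, interannual_var, wettest, driest
```

In the full analysis these quantities would be computed independently at every 0.5° grid cell, producing the five maps evaluated against TRMM.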

### d. Bayesian model averaging

Bayesian model averaging has been widely used in previous literature, and Hoeting et al. (1999) provide a comprehensive overview of its different variants. BMA differs from other model averaging methods as it explicitly estimates the weights and associated uncertainties of the weights for each model by maximizing a specified likelihood function, i.e., BMA obtains model weights which produce model combinations that have the maximum likelihood of matching historical observations compared to other model combinations. Using these optimized weights, BMA constructs the mean and uncertainty distribution of the performance metrics (or objective function) of interest. Applications of BMA have been described in works such as Raftery et al. (2005), Gneiting and Raftery (2005), Duan et al. (2007), Vrugt and Robinson (2007), Vrugt et al. (2008), Bishop and Shanley (2008), Olson et al. (2016, 2019), Fan et al. (2017), and Massoud et al. (2019a). The BMA method offers an alternative to the selection of a single model from a number of candidate models, by weighting each candidate model according to its statistical evidence, which is proportional to the model’s skill and independence (see Fig. 1). Since the BMA method estimates a distribution of model weights, various model combinations become possible, which implicitly takes care of the model dependence issue. In other words, consider that in the BMA framework there is a hypothetical Model A and a Model B that are similar and therefore not independent; Model A may have higher weights in some combinations, and conversely, Model B might have higher weights in other combinations. Consequently, if both models are rewarded in the same set of weights, it is very likely that each model receives a reduced weight due to the fact that both models are providing information to the model average. 
Therefore, model dependence can play a role in the BMA scheme since both of the dependent models can affect each other’s weights, which can be portrayed in the posterior samples. See section 2 in the online supplemental material for additional details on how dependence is implicitly inferred with the BMA method.

To explain how BMA estimates the model weights, consider that at a given location we have the output of multiple models (e.g., Fig. 1). The goal is to weigh the different models such that the weighted estimate is a better predictor of the observed system behavior than any of the individual models of the ensemble. Thus, the estimated model weights using BMA are as follows:

$$\mathbf{w}_{m,\mathrm{BMA}} = \left[ w(m_1), w(m_2), \ldots, w(m_K) \right], \qquad \sum_{i=1}^{K} w(m_i) = 1, \qquad (1)$$

where *w*(*m*_{i}, *i* = 1, 2, 3, …, *K*) represents the optimized weights of *K* models after fitting to the observations using a chosen likelihood function. The range of *w*(*m*_{i}) is between 0 and 1, with a weight of 0 for models that do not contribute any information and a weight of 1 for models that fully contribute to the projection. The sum of the *K* model weights, $\sum_{i=1}^{K} w(m_i)$, is equal to 1. The BMA weights are estimated using a Markov chain Monte Carlo (MCMC) algorithm (Vrugt 2016), and the final estimated model weights are *K* distributions of weights, where each distribution is not required to follow any particular form, e.g., Gaussian, bimodal, etc. The final estimates of the BMA model weights, or **w**_{m,BMA} in Eq. (1), are utilized to constrain the spread of uncertainty in the projected end-of-century climate.
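
Given a set of weights satisfying the sum-to-one constraint, the weighted ensemble average is simply the linear combination of the model fields. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def bma_average(model_fields, weights):
    """Weighted ensemble average: sum over i of w(m_i) * X_i, with weights summing to 1."""
    w = np.asarray(weights, float)
    w = w / w.sum()                 # enforce the sum-to-one constraint
    stack = np.stack(model_fields)  # shape (K, nlat, nlon)
    # Contract the weight vector against the model axis
    return np.tensordot(w, stack, axes=1)
```

Equal weights reproduce the conventional ensemble mean, so the ensemble mean is one point inside the space of weight combinations that BMA explores.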

#### Setting up Bayes’ theorem

The likelihood function is a critical property of the BMA calculation. According to Bayes’ theorem, the probability of an event is estimated based on prior knowledge of conditions that might be related to the event. In equation form, this looks like

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}. \qquad (2)$$

In the BMA application of the current study, event *A* can be representative of the chosen model combination and event *B* can be representative of the observed data, *P*(*A*|*B*) is the conditional probability or the likelihood of event *A* occurring given that event *B* is true, *P*(*B*|*A*) is a conditional probability of event *B* occurring given that event *A* is true, and *P*(*A*) and *P*(*B*) are the probabilities of observing events *A* and *B* independently of each other. For the purposes of this study, we can express *P*(*A*) as the prior information of our calculation, which is an equally weighted distribution for each of the model weights and thus can be taken out of the calculation. Probability *P*(*B*) is the evidence of the observation or models, and is a normalizing constant and therefore taken out of the equation. This leaves us with

$$P(A \mid B) \propto P(B \mid A), \qquad (3)$$

where *P*(*A*|*B*) is the final distribution of the model weights, or **w**_{m,BMA} in Eq. (1), and *P*(*B*|*A*) is equivalent to the chosen likelihood function *L*(**w**_{m,BMA}) described in the next paragraph. Therefore, the BMA algorithm searches for model weight combinations **w**_{m,BMA} that will maximize the fit to the observed data, and thus will maximize the value of the likelihood function *L*(**w**_{m,BMA}).

In recent decades, Bayesian inference has emerged as a working paradigm for modern probability theory, parameter and state estimation, model selection, and hypothesis testing (Vrugt and Massoud 2019). According to Bayes’ theorem, the distribution of model weights *P*(*A*|*B*) depends upon the prior distribution *P*(*A*), which captures our initial beliefs about the values of the model weights, and a likelihood function *L*(**w**_{m,BMA}), which quantifies the confidence in the model weights **w**_{m,BMA} in light of the observed data $Y$. The observed data in this case are a spatial map of precipitation characteristics (as shown in Fig. 3), and our goal is to find the optimal set of model weights **w**_{m,BMA} that produces a model combination $X$, which maximizes the fit, or the likelihood, relative to one or more of the observations. Our likelihood function is set up in the simplest terms as

$$L(\mathbf{w}_{m,\mathrm{BMA}}) = -\sum_{i}\sum_{j} \left[ Y(i,j) - X(i,j) \right]^{2}, \qquad (4)$$

where *i*, *j* refer to the longitudinal and latitudinal indices of grids on the map; $Y$(*i*, *j*) is the observed precipitation metric at grid *i*, *j* obtained from TRMM; and $X$(*i*, *j*) is the BMA-weighted model ensemble average of the precipitation metric at grid *i*, *j*. We apply MCMC sampling on the model weights in an optimization framework until the likelihood function in Eq. (4) is maximized, which allows for the estimation of the optimized model weights, or **w**_{m,BMA} in Eq. (1). These optimized model weights will be used to inform, as well as constrain the spread of uncertainty in, the model projections of precipitation change at the end of the century.
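
Evaluating the fit of a candidate weight set can be sketched as follows, assuming a negative sum-of-squared-errors log likelihood over all grid cells (a minimal NumPy sketch; the function name and exact likelihood form are our assumptions):

```python
import numpy as np

def log_likelihood(weights, model_fields, obs):
    """Log likelihood of a candidate weight set: larger (closer to zero) when
    the BMA-weighted average X(i, j) is closer to the observed metric Y(i, j)."""
    w = np.asarray(weights, float)
    w = w / w.sum()                                      # sum-to-one constraint
    x = np.tensordot(w, np.stack(model_fields), axes=1)  # weighted average X
    return -np.nansum((obs - x) ** 2)                    # assumed -SSE form
```

The MCMC sampler then only needs this function: weight sets whose weighted average fits the TRMM map better receive higher likelihood and are retained more often.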

Successful use of MCMC depends on many input factors, such as the number of chains, the prior used for the parameters, the number of generations to sample, and the convergence criteria, among other things. For our application, we used *C* = 8 chains; the prior was a uniform distribution from 0 to 1 for each model weight, with each sampled set of weights normalized so that the weights sum to 1; the number of generations was set at *G* = 5000 for each metric being fit; and convergence of the chains was assessed with the Gelman and Rubin (1992) diagnostic, using the commonly applied convergence threshold of *R* = 1.2.
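
The sampling setup above can be illustrated with a simple random-walk Metropolis sampler and the Gelman–Rubin statistic. This is a minimal stand-in for the MCMC algorithm of Vrugt (2016), not a reimplementation of it; the proposal step size and the negative sum-of-squared-errors log likelihood are our assumptions:

```python
import numpy as np

def metropolis_bma(model_fields, obs, n_chains=8, n_gen=5000, step=0.05, seed=0):
    """Random-walk Metropolis sampling of K model weights on the unit simplex."""
    rng = np.random.default_rng(seed)
    stack = np.stack(model_fields)              # (K, ...) model metric fields
    K = stack.shape[0]

    def loglik(w):
        x = np.tensordot(w, stack, axes=1)      # BMA-weighted ensemble average
        return -np.sum((obs - x) ** 2)          # assumed -SSE log likelihood

    chains = np.empty((n_chains, n_gen, K))
    for c in range(n_chains):
        w = rng.dirichlet(np.ones(K))           # random normalized starting point
        ll = loglik(w)
        for g in range(n_gen):
            prop = np.abs(w + rng.normal(0.0, step, K))  # perturb the weights
            prop /= prop.sum()                  # renormalize to sum to one
            ll_prop = loglik(prop)
            if np.log(rng.uniform()) < ll_prop - ll:     # Metropolis acceptance
                w, ll = prop, ll_prop
            chains[c, g] = w
    return chains

def gelman_rubin(chain_1d):
    """Gelman-Rubin R for one weight; chain_1d has shape (n_chains, n_gen)."""
    C, G = chain_1d.shape
    W = chain_1d.var(axis=1, ddof=1).mean()     # mean within-chain variance
    B = G * chain_1d.mean(axis=1).var(ddof=1)   # between-chain variance
    return float(np.sqrt(((G - 1) / G * W + B / G) / W))
```

Running multiple chains from dispersed starting points and requiring *R* below a threshold such as 1.2 for every model weight is what justifies treating the pooled samples as the posterior weight distributions.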

## 3. Results

### a. Model evaluation and BMA weighting

First, an equally weighted multimodel average (“ens mean” in Table 1) is produced by averaging the 12 CMIP5 models; then the BMA-weighted model ensemble averages for the five summary metrics are produced based on the evaluation of each model against the TRMM observations. Table 1 lists the model weights for each strategy, including the mean value of the optimized model weight distributions estimated for each BMA strategy. Ens mean refers to the equal weighting strategy; for this strategy, the uncertainty of the projections spans the range of all models considered. In other words, the spread of simulations from all of the CMIP5 models used in this study produces the uncertainty for the ens mean strategy. The “mean weight” column of Table 1 lists the average weight each model receives across the five BMA weighting columns. Last, the “BMA-OFx” columns of Table 1 list the weights produced by optimizing the fit of the model average to each precipitation performance metric (i.e., those depicted in Fig. 2). BMA-OFx refers to the BMA weighting for the model performance metric, or “objective function,” of interest. So, for example, BMA-OF1 refers to the BMA-weighted model ensemble average for the first performance metric, the long-term mean daily precipitation.

The boxplots in Fig. 4 show the distribution for the estimated BMA model weights for each CMIP5 model. Figure 4a displays samples from the prior distribution *P*(*A*). The remaining panels show the distributions of each model weight for each precipitation metric. These distributions are the outcome of maximizing the likelihood function in Eq. (4) for each objective function. Figure 5 shows the model weights listed in Table 1, or the mean weights from the distributions in Fig. 4, but in a colored diagram. In Fig. 5, the red boxes indicate models with a higher weight than the ensemble mean weights, and blue boxes show models with lower weights.

### b. Historical simulations

In Fig. S1 in the online supplemental material, we show maps of the first precipitation performance metric, or the historical long-term mean daily precipitation, for each individual CMIP5 model as well as the model averages produced (i.e., the ensemble mean and BMA-OF1). These maps are compared with the observational reference from TRMM (see Fig. 3a), and we can analyze the bias from each model; these bias maps are shown in Fig. 6. To understand the magnitude of error from each model, Table 2 lists the RMSE of each model compared to TRMM data, for each of the precipitation performance metrics. We can see in Table 2 that the best performing model with the lowest RMSE is the BMA-weighted model ensemble average estimated for each metric. In other words, BMA-OF1 is the best performing model ensemble average for the first precipitation performance metric, or the long-term mean daily precipitation. BMA-OF2 is the best for the second precipitation performance metric, or the annual cycle variability. This is also the case with BMA-OF4 and BMA-OF5, which are the best performing model ensemble averages for the fourth and fifth precipitation performance metrics, respectively. The exception here is the third precipitation performance metric, since the best performing models for this are the MPI-ESM-LR and ACCESS1.0 models, yet the BMA-OF3 model ensemble average is also very skillful for this metric. Thus, the BMA-weighted model ensemble averages are generally some of the most skillful model candidates when compared to the TRMM data.
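
The RMSE scores of the kind listed in Table 2 can be computed by comparing each candidate field against the TRMM reference map (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def rmse(simulated, observed):
    """Root-mean-square error of a simulated metric field against observations.

    NaN cells (e.g., ocean points outside the evaluation mask) are ignored.
    """
    sim = np.asarray(simulated, float)
    obs = np.asarray(observed, float)
    return float(np.sqrt(np.nanmean((sim - obs) ** 2)))
```

Applying this to every individual model and every model average, for each of the five metric maps, yields one RMSE table per weighting strategy.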

### c. Trade-offs in model weights

Each model combination has a different level of performance skill for each chosen metric. For instance, the optimized BMA-OF1-weighted model ensemble average will theoretically perform the best for OF1 (mean precipitation), but its performance may degrade for the other metrics (variability or extremes in precipitation). We can check how each BMA-OFx-weighted model ensemble average performs for the other metrics using the values shown in Table 2. Interestingly, each of the BMA-weighted model ensemble averages has relatively strong performance skill regardless of the metric of interest; Table 2 shows that each has a relatively low RMSE for all the metrics. In contrast, the “ensemble mean” and the “mean skill weights” model averages perform well for mean precipitation; however, much of the variability in the simulation is lost for these model averages, and they are not as skillful in simulating metrics related to variability (i.e., OF2 and OF3). In fact, the ensemble mean and the mean skill weights are the least accurate model averages for simulating OF2 and OF3, or annual and interannual variability.

Figure 7 shows what the performance trade-offs look like for each metric. In Fig. 7, there are plots of the root-mean-square error space for triplet combinations of the BMA-weighted model ensemble averages, along with the ensemble mean model average, the mean weight model average, and various solutions from the prior samples. A perfect model would be located at the 0–0–0 point, which reflects a model with zero error for all objectives and is depicted with a red bull’s-eye in the figure. The goal is to move from the location of the prior samples (light blue dots) toward the red bull’s-eye, as noted with a large black arrow in each panel. The prior samples, along with the ensemble mean (red X) and the mean weight (black X) model averages, generally have a higher RMSE than the BMA-weighted model ensemble averages (shown in blue, magenta, and green in the plots). These panels show how the BMA-OF1 model has the best performance for OF1 (mean precipitation), BMA-OF2 performs the best for OF2 (annual cycle variance), and so on. Generally, the BMA-OFx models have higher skill for all precipitation summary metrics than the prior samples, the ens mean, or the mean weight model averages, which matches the values shown in Table 2.

One important point to note in Fig. 7 is that these plots indicate the Pareto trade-offs for each model average [see Langenbrunner and Neelin (2017) for more information on Pareto fronts]. In general, the larger the distance between the RMSE distributions in these figures, the larger the trade-offs, and thus the larger the differences in the future change estimates between the model combinations. Conversely, the closer the RMSE distributions, the more consistent the projected estimates are between the model combinations. This is important for assessing the uncertainty of the future projected changes from each model average.

## 4. Projected changes in precipitation over CONUS

### a. Future changes in precipitation

The motivation for this study is that climate models simulate different projections for future changes in precipitation, which results in significant uncertainty for most regions of CONUS. To illustrate this, we show in Fig. S2 the end-of-century (RCP8.5) mean daily precipitation (OF1) for each individual model and for the model averages, i.e., ensemble mean and BMA-OF1 models. In Fig. 8, we show the difference between the future precipitation and the historical estimate.

The maps in Fig. 8 can be used to compare each model’s estimate of future precipitation changes. The model averages, such as the ensemble mean and the BMA-weighted model ensemble averages, show a smaller magnitude of change than the individual models might indicate. However, spatial patterns of change are similar among most of the models, i.e., the East Coast and Northwest will experience increases in precipitation, while the remainder of the country will not experience much change. Some disagreements between models in the estimated future change can be seen; for example, the CanESM2 model simulates an increase of precipitation over the majority of the West Coast, including California, whereas other models, such as ACCESS1.0 and HadGEM2-CC, show decreases in precipitation for these areas. This results in significant uncertainty in estimated changes of future precipitation for these regions.

Changes in future precipitation may vary depending on the season. A number of studies (e.g., Collins et al. 2013) have shown that projections may increase in one season and decrease in the other, resulting in no change when looking at just annual precipitation change. For example, Fig. S3 shows that the east coast of CONUS is projected to experience significant increases in precipitation in the winter months (DJF), but not in other seasons. In winter (DJF) and spring (MAM), the northern part of the country is projected to become wetter as the global climate warms, but drier for southwestern CONUS during these seasons. During the summer (JJA), most of the country, with the exception of the northeast, is projected to experience decreases in precipitation. For the fall (SON), many regions of the country will not experience significant changes in average precipitation, except for the northwest, which is projected to become significantly wetter.

### b. Verifying results against different observational datasets and time periods

Climate model evaluation for precipitation indices over CONUS can be highly sensitive to the reference observational products (Gibson et al. 2019). We therefore compare the BMA-weighted model ensemble average trained using TRMM with BMA-weighted model ensemble averages trained using CPC station data, and we apply the BMA weight estimation over several time periods to minimize the influence of climate variability on the results. In Fig. S4, the first column shows the historical mean precipitation obtained from each set of BMA weights, the middle column shows the estimated BMA weights for the various data products and time periods, and the right column shows the projected future change (RCP8.5) in mean precipitation using each set of BMA weights. As shown in Fig. S4, similar BMA weights are produced when constrained by the TRMM satellite dataset or the CPC in situ dataset, regardless of the time period chosen. Furthermore, the maps of the first precipitation performance metric, i.e., the historical and the future long-term mean daily precipitation, are similar for all the model averages produced.
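The weight-estimation step above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the study's exact estimation procedure): each model's historical field is scored against an observational reference with a Gaussian likelihood, and the likelihoods are normalized into BMA-style weights. The function name, the synthetic fields, and the noise levels are all assumptions for demonstration.

```python
import numpy as np

def bma_weights(models, obs):
    """Illustrative BMA-style weights (hypothetical simplification).

    models: (n_models, n_cells) historical metric field per model
    obs:    (n_cells,) observed metric field (e.g., long-term mean daily precip)
    Each weight is proportional to a Gaussian likelihood of the observations
    given that model's field, with a per-model error variance.
    """
    err = models - obs                       # per-grid-cell bias of each model
    sigma2 = err.var(axis=1, keepdims=True)  # per-model error variance
    # Log-likelihood of obs under N(model, sigma2), summed over grid cells
    loglik = -0.5 * (err**2 / sigma2 + np.log(2 * np.pi * sigma2)).sum(axis=1)
    w = np.exp(loglik - loglik.max())        # subtract max for numerical stability
    return w / w.sum()                       # normalize so weights sum to one

# Synthetic demonstration: three "models" with increasing noise around "obs"
rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 1.5, size=500)
models = obs + rng.normal(0.0, [[0.2], [0.5], [1.0]], size=(3, 500))
w = bma_weights(models, obs)
# The least-biased model (smallest noise) receives the largest weight.
```

Because the weights depend only on the model-minus-reference errors, the same routine can be rerun with a different reference field (e.g., a station-based product in place of a satellite one) or a different time slice, which is the kind of cross-check Fig. S4 performs.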

There may be a concern about using the TRMM precipitation data product over CONUS when longer (arguably more reliable) in situ products, such as CPC, are available. However, TRMM offers two advantages worth mentioning: coverage over the oceans and higher temporal resolution. These advantages will enable future studies that apply other evaluation metrics with BMA in many different parts of the world. Since TRMM is shown to be viable for constraining models despite its short record over CONUS (i.e., Fig. S4), one could use TRMM in other regions of the world where long-term in situ measurements are sparse (e.g., parts of Africa or South America) to constrain future projections of precipitation from models.

### c. Reduction of uncertainty in the BMA model averages

A main goal of producing weighted model averages with the BMA framework is to reduce the spread of uncertainty in the historical and future estimates. Figure 9 shows the spread of uncertainty, calculated as the standard deviation across the ensemble, of the long-term mean daily precipitation for both the ensemble mean (Figs. 9a,b) and the BMA-OF1 (Figs. 9c,d) weighted model averages. The BMA method reduces the spread of uncertainty in the historical (Fig. 9c) and future (Fig. 9d) projections by about a third for most regions of CONUS compared to the original subset of CMIP5 models (Figs. 9a,b). The largest reduction in uncertainty is located near the western parts of CONUS, where uncertainty in precipitation is generally large. The next section discusses future precipitation changes more thoroughly and shows how the results can change depending on the choice of model average.
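The spread calculation behind Fig. 9 can be illustrated with a short sketch. The fields and weights below are synthetic and hypothetical (not the study's actual models or BMA weights); the point is that a weighted per-grid-cell standard deviation shrinks when outlier models are down-weighted, which is the mechanism behind the uncertainty reduction described above.

```python
import numpy as np

def ensemble_spread(fields, weights=None):
    """Per-grid-cell standard deviation of an ensemble of fields.

    fields:  (n_models, n_cells) array of model fields
    weights: optional (n_models,) weights; defaults to equal weighting,
             which corresponds to the plain multimodel ensemble spread.
    """
    n = fields.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights) / np.sum(weights)
    mean = w @ fields               # weighted ensemble mean per grid cell
    var = w @ (fields - mean) ** 2  # weighted ensemble variance per grid cell
    return np.sqrt(var)

# Synthetic demonstration: two "models" close to a reference field, three outliers
rng = np.random.default_rng(1)
truth = rng.gamma(2.0, 1.5, size=400)
fields = truth + rng.normal(0.0, [[0.2], [0.3], [1.0], [1.2], [1.5]], size=(5, 400))
w = np.array([0.45, 0.45, 0.04, 0.03, 0.03])  # hypothetical BMA-like weights

# Ratio of weighted to unweighted mean spread; < 1 means the weighting
# has constrained the ensemble spread.
reduction = ensemble_spread(fields, w).mean() / ensemble_spread(fields).mean()
```

With equal weights the routine reproduces the ordinary multimodel ensemble standard deviation, so the two panels of a comparison like Fig. 9 differ only in the weight vector passed in.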

### d. Sensitivity of future estimates to the choice of performance metric

The choice of ensemble average (i.e., which performance metric is optimized) can considerably impact the estimates of future precipitation changes. To examine how each ensemble average affects the projections, Fig. 10 presents a matrix of maps displaying the sensitivity of each precipitation metric to the choice of model average. The plots in Fig. 10 show that, by the end of the century, the mean rainfall (first column) is projected to increase for most of the East Coast and the Northwest by roughly 10%, may decrease in the High Plains area by about 10%, and shows very little expected change for the Southwest region, which matches results reported in Sanderson et al. (2017) and Langenbrunner and Neelin (2017). As for the annual cycle (second column), variance is projected to increase for the Northwest and the Southeast by about 10% but decrease for the High Plains area by roughly 20%. For interannual variability (third column), variance is projected to increase for most of the United States by 5%–10%, especially in the East, where it may increase by up to 30%. Finally, for extremes (fourth and fifth columns), the rainfall distributions are becoming wider: the wettest year on record is projected to be 20% wetter for most regions, and the driest year on record is projected to become drier by about 10%, except for the Northwest and Southeast regions, where the driest year may be slightly wetter in the future.

The uncertainty and sensitivity among the various BMA-produced ensemble model averages for each metric are shown in Fig. S5. These plots show the spread in possible outcomes for each precipitation metric across all the BMA-OFx model ensemble averages. This in essence provides a view of how the choice of constraint (i.e., performance metric) affects the future projections and of which regions of CONUS have uncertainty in the expected changes in precipitation. For mean precipitation (Fig. S5a), the highest uncertainty in the projected changes is in the West and the Great Lakes regions; this metric is somewhat sensitive to the choice of constraint relative to the other metrics. For the annual cycle (Fig. S5b), the greatest uncertainty in the projections is again in the West. There is little uncertainty between the various model averages for interannual variability (Fig. S5c), and the metrics related to variability (Figs. S5b,c) show low sensitivity to the choice of constraint relative to the other metrics. Last, for wet and dry extremes (Figs. S5d,e), there are significant uncertainties between the various BMA ensemble model averages, especially in the West. These extreme-related metrics are the most sensitive to the choice of constraint, with the projection of maximum precipitation (OF4) being the most sensitive of the precipitation metrics chosen for this study.

This study demonstrates the strength of BMA for constraining climate projections. Under a stationarity assumption, climate models that produce simulations closer to historical observations are likely to perform similarly well in the future. Yet growing evidence points to a more nonstationary future climate, and it is questionable whether climate models identified as better tools based on the relatively stationary historical period will better represent the future, particularly future climate extremes. However, it has been shown in the literature that trends in precipitation may not be as significant as trends in other climatic variables, such as temperature. This has been shown in Gibson et al. (2019) and in Figs. S1 and S2, where trends in various precipitation metrics from various products are shown for CONUS. The amplitude of the trends (~mm yr^{−1}) is orders of magnitude smaller than the variability itself or the magnitude of the change in future precipitation metrics (~mm day^{−1}). This supports the assumption that the climate models that receive higher weights based on historical fit may also better represent the future of precipitation.

## 5. Conclusions

This study showcased Bayesian model averaging (BMA) as a framework to achieve more accurate simulations and to constrain the spread of uncertainty in climate change projections. We provided an extensive look at end-of-century precipitation changes over CONUS (Fig. 10), including estimates of changes in long-term mean daily precipitation (OF1), annual cycle variability (OF2), interannual variability (OF3), and wet (OF4) and dry (OF5) precipitation extremes (as depicted schematically in Fig. 2, and shown in Fig. 3 for the TRMM satellite product). A suite of models from the CMIP5 archive was used for the model averaging, and the BMA weights (Table 1; Figs. 4–5) were trained to reduce the bias between the model ensemble averages and the TRMM satellite observations (Figs. 6 and 7). We found that the BMA-weighted model ensemble averages used in this study were generally more accurate than the individual CMIP5 models or other model averages, such as the ensemble mean, when compared to the TRMM satellite data (Table 2; Figs. 6 and 7), and had less uncertainty in the simulations (Fig. 9) compared to the original ensemble spread.

We presented a sensitivity experiment on the future precipitation projections, based on the various model weighting strategies used in this study (Fig. 10). Our results showed that, by the end of the century, the mean daily rainfall is projected to increase for most of the East Coast and the Northwest, may decrease in the High Plains area, with little change expected for the Southwest. For mean daily rainfall, the projected changes were consistent between the ensemble mean and the BMA-weighted model ensemble average. The strength of the annual and interannual cycles is expected to increase for most of the United States, with the exception of the High Plains area, where annual variability is projected to decrease. For annual and interannual variability, the projected changes in variability were more pronounced for the BMA-weighted model ensemble average than for the multimodel ensemble mean. As for extremes, the wettest year on record is projected to become wetter for the majority of CONUS and the driest year to become drier. For extremes, the projected changes were consistent between the multimodel ensemble mean and the BMA-weighted model ensemble average. It is important to note that different fitting metrics produce different sets of model weights. One must consider these differences when selecting a set of model weights for potential downscaling or other applications, and this consideration should depend on the precipitation metric of interest.

We used BMA as a method to average an ensemble of model projections, with the aim of constraining the spread of uncertainty that arises from using multimodel ensembles of highly varying climate simulations. We believe that the BMA approach can be used for other climate variables, including temperature. However, given the nonstationarity of temperature trends around the globe, we recommend including an additional fitting metric that represents the temperature trend in a given grid cell. Similar to how we used five precipitation metrics in our study to represent various aspects of the precipitation distribution, one could use comparable metrics to represent the distribution of other climate variables, and BMA can be applied to fit those metrics.

In our study, the BMA-weighted model reduced the uncertainty in future precipitation projections by about a third (Fig. 9). The BMA framework can be very useful for efforts that estimate changes in climate by the end of the century. With many versions of climate models being developed and with the preparation of models for CMIP6, it is important to answer calls in the literature (e.g., Eyring et al. 2019) to investigate advanced weighting methods that produce more tightly constrained climate projections (as we showed in Fig. 9) and to more rigorously quantify uncertainty sources. We think BMA shows promise as one of these methods and believe it can be used for climate model evaluation and for constraining the spread of uncertainty in climate model projections.

## Acknowledgments

This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. This research was funded by NCA Award 106232. All rights reserved. All data used in this study are publicly available. The TRMM satellite dataset is available via https://pmm.nasa.gov/data-access/downloads/trmm, and the CMIP5 model output data are available via https://cmip.llnl.gov/cmip5/data_portal.html.

## REFERENCES

*J. Climate*

*Geosci. Model Dev.*

*Atmos. Res.*

*Climatic Change*

*Mon. Wea. Rev.*

*Climate Dyn.*

*J. Amer. Water Resour. Assoc.*

*San Francisco Estuary Watershed Sci.*

*Water*

*Adv. Water Resour.*

*Climate Science Special Report: Fourth National Climate Assessment*, Vol. I, U.S. Global Change Research Program

*Geophys. Res. Lett.*

*Nat. Climate Change*

*Geosci. Model Dev.*

*Nat. Commun.*

*Geophys. Res. Lett.*

*Stat. Sci.*

*J. Hydrometeor.*

*Science*

*Geophys. Res. Lett.*

*Earth Syst. Dyn.*

*Eos, Trans. Amer. Geophys. Union*

*Int. J. Climatol.*

*Stat. Sci.*

*J. Hydrometeor.*

*Earth’s Future*

*Geophys. Res. Lett.*

*Bull. Amer. Meteor. Soc.*

*Nat. Climate Change*

*J. Climate*

*Geophys. Res. Lett.*

*J. Atmos. Oceanic Technol.*

*J. Appl. Meteor.*

*Geophys. Res. Lett.*

*Geosci. Model Dev.*

*Geophys. Res. Lett.*

*Sci. Rep.*

*Earth’s Future*

*Geosci. Model Dev.*

*Geosciences*

*Bull. Amer. Meteor. Soc.*

*Climate Change Impacts in the United States: The Third National Climate Assessment*. U.S. Global Change Research Program, 841 pp.

*Geophys. Res. Lett.*

*PLOS ONE*

*J. Geophys. Res. Atmos.*

*J. Climate*

*Bull. Amer. Meteor. Soc.*

*J. Climate*

*J. Geophys. Res. Atmos.*

*Mon. Wea. Rev.*

*Geosci. Model Dev.*

*Geophys. Res. Lett.*

*J. Climate*

*Nat. Geosci.*

*Geophys. Res. Lett.*

*Geophys. Res. Lett.*

*Bull. Amer. Meteor. Soc.*

*Philos. Trans. Roy. Soc.*

*Climate Science Special Report: Fourth National Climate Assessment (NCA4)*. Vol. I, D. J. Wuebbles et al., Eds., U.S Global Research Program, 470 pp.

*Environ. Modell. Software*

*Water Resour. Res.*

*Handbook of Hydrometeorological Ensemble Forecasting*

*Water Resour. Res.*

*Climate Change Impacts in the United States: The Third National Climate Assessment*, J. M. Melillo, T. C. Richmond, and G. W. Yohe, Eds., U.S. Global Change Research Program

*J. Climate*

*J. Hydrometeor.*

*J. Geophys. Res. Biogeosci.*


Climate Change 2013: The Physical Science Basis, T. F. Stocker et al., Eds., Cambridge University Press