1. Introduction
Precipitation is essential to society and the environment. Beyond supplying water for agriculture, industry, and personal use, precipitation interacts with other climate variables, ultimately shaping the world as we know it. At a given location, a first-order picture of the effects of precipitation is given by its temporal mean (i.e., the total precipitation that falls in a given period), but this leaves out important details, such as the variability of precipitation in the region. Because society is adapted to more than mean conditions, a complete assessment of local precipitation should characterize the whole temporal distribution of rainfall.
An important tool for projecting the local precipitation response to a variety of forcings, including different global warming scenarios, is the global climate model (GCM). These models aim to simulate credible realizations that can be plausibly compared to the actual evolution of weather and climate. Because of the complex interactions that give rise to rainfall, precipitation is one of the most challenging variables for GCMs to simulate (Flato et al. 2013). Indeed, different models often use different versions of large-scale and convective precipitation parameterizations (which represent subgrid-scale processes not explicitly simulated), and no approach is immune to modeling issues. Previous phases of the Coupled Model Intercomparison Project (CMIP) have revealed long-standing problems in simulating, for example, how often and how hard it rains (Stephens et al. 2010; Rosa and Collins 2013; Terai et al. 2018), the magnitude of extremes (O’Gorman and Schneider 2009; Gervais et al. 2014; Wehner et al. 2014; Abdelmoaty et al. 2021), and the shape of the probability density function (PDF) (Pendergrass and Hartmann 2014; Terai et al. 2018; Chen et al. 2021); simulated daily precipitation PDFs are often more complex than observed, including deviations in the low- to medium-intensity regime. The main goals of this paper are to introduce a set of metrics to evaluate the probability distributions of daily precipitation, to place them in the context of the literature on their physical interpretation, and to apply them in an initial evaluation of simulations from phase 6 of CMIP (CMIP6).
There is considerable discussion regarding biases in frequency and intensity of wet-day precipitation [e.g., Flato et al. (2013)], but for a number of reasons these metrics do not necessarily correspond to the fundamental physical processes on which simulations of precipitation are based. This may be why they have not led to substantial improvement over the generations of model development during which awareness of these issues has grown. More physically motivated metrics, and those that are robust to spatial and temporal resolution, may be able to break this deadlock.
Since the last generation of CMIP simulations, our understanding of how physical processes govern the shape of daily precipitation PDFs has improved. Martinez-Villalobos and Neelin (2019), using a model based on a simplified version of the moisture equation (Stechmann and Neelin 2014; Neelin et al. 2017), provide a first-order explanation of how the moisture budget controls the shape of daily precipitation PDFs and why they have shapes that are often approximated by gamma or similar distributions. A gamma distribution has historically been one of the most popular choices to empirically fit daily precipitation PDFs over wet days (Barger and Thom 1949; Thom 1958; Ropelewski et al. 1985; Groisman et al. 1999; Wilby and Wigley 2002; Watterson and Dix 2003; Husak et al. 2007; Martinez-Villalobos and Neelin 2018; Chang et al. 2021), although other gamma-like alternatives are also used (e.g., Wilks 1998; Wilson and Toumi 2005; Papalexiou and Koutsoyiannis 2012).
To leading order, the bulk of the PDF of observed rainfall contains two ranges, governed by different physical balances and characterized by different metrics: 1) a range with no dominant physical scale (“scale-free range”) at low-to-medium intensities, approximated by a power law with exponent −τP (τP < 1) that controls the probability of low and moderate daily precipitation values; and 2) a range governed by a dominant scale, namely the precipitation cutoff scale PL, that controls the probability of medium-to-large events. For present purposes, these two ranges can be captured to leading approximation by a gamma distribution. We emphasize that we are not relying on conformance to a particular distribution; rather, we use gamma-like distribution properties to inform the metrics, their interpretations, and the relationships among them. For applications to more subtle features, such as deviations from the approximate power-law scaling at low values (Papalexiou and Koutsoyiannis 2016) or accurately capturing the folding of the very extreme tail (Papalexiou and Koutsoyiannis 2013; Cavanaugh et al. 2015; Papalexiou and Koutsoyiannis 2016), distributions with an additional parameter (e.g., generalized gamma distribution, Burr type XII distribution) can be better suited (Papalexiou and Koutsoyiannis 2012). Considerations similar to those presented here can still apply, although with added complexity, as further discussed in section 2. The approximate power-law range arises from fluctuations across the threshold between raining and nonraining conditions. For daily average precipitation, a main control of the exponent τP is the number of individual precipitating events within wet days (Martinez-Villalobos and Neelin 2019): all else being equal, regions with fewer events per day tend to have steeper power-law ranges.
For the approximately exponential range governing large events, the cutoff scale PL, which is the main precipitation scale in observed PDFs, is given by a balance between the variability of moisture converging during precipitating events and a measure of moisture loss by precipitation in them (Stechmann and Neelin 2014; Neelin et al. 2017; Martinez-Villalobos and Neelin 2019).
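As an illustration of the two-range picture, the following minimal Python sketch writes a gamma density directly in terms of τP and PL, that is, f(P) = P^(−τP) exp(−P/PL) / [Γ(1 − τP) PL^(1 − τP)], and checks numerically that its local log-log slope is close to −τP for light precipitation and steepens once P exceeds PL. The function names and the parameter values (τP = 0.6, PL = 20 mm day−1) are illustrative assumptions, not values taken from the analysis.

```python
import math

def gamma_pdf(p, tau_p, p_l):
    """Gamma density with power-law exponent -tau_p and cutoff scale p_l.

    Equivalent to a gamma distribution with shape 1 - tau_p and scale p_l,
    so tau_p must be below 1 for the density to be normalizable.
    """
    if tau_p >= 1.0:
        raise ValueError("tau_p must be < 1 for a normalizable gamma PDF")
    k = 1.0 - tau_p
    return p ** (-tau_p) * math.exp(-p / p_l) / (math.gamma(k) * p_l ** k)

def local_log_slope(p, tau_p, p_l, eps=1e-4):
    """Centered estimate of d(log f)/d(log P); analytically -tau_p - p/p_l."""
    lo, hi = p * (1.0 - eps), p * (1.0 + eps)
    d_log_f = math.log(gamma_pdf(hi, tau_p, p_l)) - math.log(gamma_pdf(lo, tau_p, p_l))
    return d_log_f / (math.log(hi) - math.log(lo))

# Illustrative parameter values only (p_l in mm/day)
tau_p, p_l = 0.6, 20.0
slope_light = local_log_slope(0.1, tau_p, p_l)    # scale-free range: close to -tau_p
slope_heavy = local_log_slope(100.0, tau_p, p_l)  # beyond the cutoff: much steeper
```

In this form the power-law range and the cutoff are controlled by separate parameters, which is what allows the two metrics to be interpreted independently.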
We expect a variety of different model representations of the processes that yield PL and τP, so as a first-order picture we evaluate how well models simulate these parameters. However, models may sometimes deviate from the power-law and cutoff-scale picture that tends to hold in satellite-based precipitation products (see below) and station data (Schiro et al. 2016; Martinez-Villalobos and Neelin 2018, 2019; Chang et al. 2020). Simulated PDFs can thus be more complex; for instance, a bump can indicate an artificial scale introduced into the scale-free range. Thus, in addition to PL, τP and other commonly used scalar metrics (mean, standard deviation, fraction of wet days), we also employ metrics that evaluate the “shape” of simulated probability distributions and their probability distance compared to their observed counterparts.
The paper is organized as follows. Section 2 presents the data and introduces the metrics used. Section 3 gives an overview of the uncertainty among the observational products that are compared to models; quantifying this observational uncertainty is necessary to adequately evaluate model performance. Section 4 presents the model evaluation. Section 5 summarizes the study and discusses its implications.
2. Data and methods
a. CMIP6 models and observational datasets
We use daily precipitation from the first variant of 35 CMIP6 models (see Table 1) over the period 1990–2014. To estimate observational uncertainty bounds for meaningful comparison with models, we use six different daily precipitation products: TRMM-3B42 v7.0 (50°S–50°N, 1998–2016) and its microwave-calibrated (IR) and microwave-only (MW) variants (Huffman et al. 2007), CMORPH V1.0 CRT (60°S–60°N, 1998–2017) (Xie et al. 2017), PERSIANN CDR v1 r1 (50°S–50°N, 1983–2017) (Ashouri et al. 2015), and GPCP 1DD CDR v1.3 (90°S–90°N, 1997–2017) (Huffman et al. 2001), all taken from the Frequent Rainfall Observations on Grids (FROGS) database (Roca et al. 2019). These are satellite-based products that are corrected to gauges over land. Some models have relatively coarse native resolutions (see Table 1), so all models and datasets are coarsened, using the ESMF_regrid function in NCL under the “conserve” option, onto a 3° × 3° latitude–longitude grid prior to analysis.
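To make concrete what a conservative coarsening does, the sketch below computes an area-weighted block average in Python for the special case in which a regular fine grid nests exactly within the coarse cells (e.g., a 1° grid averaged onto 3° cells). It is not the NCL routine used in the analysis and does not handle partially overlapping cells; the function name, argument layout, and example values are assumptions made for illustration.

```python
import math

def coarsen_conservative(field, lat_edges, factor):
    """Area-weighted block average of a regular latitude-longitude field.

    field     : list of rows ordered south to north, each a list over longitude
    lat_edges : latitude cell edges in degrees (len(field) + 1 values)
    factor    : number of fine cells per coarse cell in each direction
    """
    nlat, nlon = len(field), len(field[0])
    if nlat % factor or nlon % factor:
        raise ValueError("fine grid must nest exactly within the coarse grid")
    # On a sphere, cell area is proportional to sin(north edge) - sin(south edge)
    w = [
        math.sin(math.radians(lat_edges[j + 1])) - math.sin(math.radians(lat_edges[j]))
        for j in range(nlat)
    ]
    coarse = []
    for j0 in range(0, nlat, factor):
        rows = range(j0, j0 + factor)
        area = factor * sum(w[j] for j in rows)  # total area of the coarse cell
        coarse_row = []
        for i0 in range(0, nlon, factor):
            total = sum(w[j] * field[j][i] for j in rows for i in range(i0, i0 + factor))
            coarse_row.append(total / area)
        coarse.append(coarse_row)
    return coarse

# Hypothetical 6 x 6 patch of 1-degree cells between the equator and 6N
lat_edges = [float(j) for j in range(7)]
fine = [[float(i + j) for i in range(6)] for j in range(6)]
coarse = coarsen_conservative(fine, lat_edges, 3)
```

Because each coarse value is the area-weighted mean of the fine cells it contains, the area integral of precipitation is unchanged by the coarsening, which is the property the "conserve" option is meant to guarantee.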
Table 1. List of models.
b. PDFs and contributions
c. Terminology and normalization
It is worth clarifying differences in nomenclature and in normalization between (linear) precipitation and log-precipitation variables that exist in the literature. Each approach is self-consistent, but confusion can arise especially when comparing and interpreting the different approaches. In one approach, the PDF normalized in precipitation (Fig. 1a), plotted here in log-P space, has a long history (e.g., Barger and Thom 1949; Thom 1958; Groisman et al. 1999; Katz 1999; Watterson and Dix 2003; Wehner et al. 2014). From this point of view, Figs. 1b and 1c (or Figs. 1e,f) give the quantities that integrate to the first and second moment, referred to here and elsewhere as contributions (to the relevant integral) (Karl and Knight 1998; Neelin et al. 2009; Klingaman et al. 2017; Kuo et al. 2018; Wang et al. 2021); in other cases, these are referred to more explicitly as the frequency density times the variable P (Watterson and Dix 2003). In a second approach, the PDF normalized in log-P has been termed the frequency distribution, and the log-P frequency density multiplied by P has been termed the amount distribution (Pendergrass and Hartmann 2014; Kooperman et al. 2016a,b; Pendergrass et al. 2017; Akinsanola et al. 2020). The translation between calculations in P and log-P coordinates is simply a factor of P. Specifically, if we denote a PDF calculated in P coordinates as fP(P) and its counterpart calculated in G(P) = log(P) coordinates as
This factor of P implies that the PDF or frequency density normalized in log-P coordinates (Pendergrass and Hartmann 2014) has the same shape as the precipitation amount contribution in P coordinates, and that the amount distribution normalized in log(P) coordinates has the same shape as the precipitation variance contribution in P coordinates.
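The factor of P follows from conservation of probability under the change of variable. A short worked version is given below in LaTeX, assuming a natural logarithm; the symbol f_G for the log-P density and the subscripted names for the contributions are notation adopted here for illustration.

```latex
% Change of variable G = \ln P, so that dG = dP / P
f_G(G)\,\mathrm{d}G = f_P(P)\,\mathrm{d}P
\quad\Longrightarrow\quad
f_G(G) = P\, f_P(P).

% Contributions to the first and second moments in P coordinates
C_{\mathrm{amount}}(P) = P\, f_P(P) = f_G(G),
\qquad
C_{\mathrm{var}}(P) = P^{2} f_P(P) = P\, f_G(G).
```

With a base-10 logarithm the same relations hold up to the constant factor ln 10, which rescales the curves but does not change their shape.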
We briefly summarize advantages of each approach, which are useful for different purposes, illustrating these in the context of three common axis choices (Fig. 1). Regardless of normalization, a log-log plot of the PDF (Fig. 1a) facilitates comparison of the light precipitation range to a power law. In the contributions to the first and second moments (Figs. 1b,c) the slope of this range is increased by 1 and 2, respectively. A log-linear plot facilitates examination of the degree to which the large-event range is approximately exponential (Figs. 1g–i). A linear-y axis with log-x axis (Figs. 1d–f) makes it easier to see differences among models in the low-medium precipitation range. Using a log-P normalization with a linear-y axis has the advantage of providing a log-P frequency density plot (Fig. 1e), with area visually proportional to its integral (Pendergrass and Hartmann 2014). The apparent difference in interpretation of frequency of light rain between the P and log-P normalizations is resolved by noting that an integral over a range of precipitation, such as over the interval (0–0.1 mm day−1) in P coordinates, is spread over a semi-infinite interval in log-P coordinates, with corresponding adjustment in the frequency density. Distributions expressed and normalized in terms of linear P are the traditional basis for statistically fit distributions such as the gamma distribution (discussed below), but we emphasize that the linear- and log-P approaches are equivalent as long as one is consistent in their use. The choice of P or log-P normalization also has some bearing on how we interpret the numerical value of the probability distance metrics in section 2d(2)(ii).
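The two normalizations can be compared directly on a binned estimate. The Python sketch below bins a synthetic wet-day sample in logarithmically spaced bins and returns both the density normalized per unit P and the density normalized per unit ln P. The bin limits, the number of bins per decade, and the gamma-distributed test sample are assumptions chosen for illustration only.

```python
import math
import random

def log_binned_densities(wet_values, p_min=0.1, p_max=1000.0, bins_per_decade=10):
    """Bin centers plus P-normalized and log-P-normalized frequency densities."""
    n_bins = round(bins_per_decade * math.log10(p_max / p_min))
    log_width = math.log(p_max / p_min) / n_bins  # constant bin width in ln P
    edges = [p_min * math.exp(i * log_width) for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for p in wet_values:
        if p_min <= p < p_max:
            idx = min(int(math.log(p / p_min) / log_width), n_bins - 1)
            counts[idx] += 1
    n = len(wet_values)
    centers = [math.sqrt(edges[i] * edges[i + 1]) for i in range(n_bins)]
    f_p = [counts[i] / (n * (edges[i + 1] - edges[i])) for i in range(n_bins)]
    f_logp = [counts[i] / (n * log_width) for i in range(n_bins)]
    return centers, f_p, f_logp

# Synthetic wet-day sample (gamma with shape 0.4 and scale 20 mm/day; illustrative)
random.seed(0)
sample = [random.gammavariate(0.4, 20.0) for _ in range(50000)]
wet = [p for p in sample if p >= 0.1]

centers, f_p, f_logp = log_binned_densities(wet)
c_amount = [c * f for c, f in zip(centers, f_p)]   # integrates to the first moment
c_var = [c * c * f for c, f in zip(centers, f_p)]  # integrates to the second moment
```

In each bin the ratio of the log-P density to the P density equals the bin width in P divided by the bin width in ln P, which for narrow bins is close to the bin center. The log-P frequency density therefore traces the same curve as the amount contribution, up to binning error.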
While contributions provide a generic way of referring to these quantities, their significance in model evaluation merits a more specific nomenclature, summarized in Table 2. In what follows, when discussing in terms of linear-P normalization, we equivalently use PDF or frequency density (Fig. 1a), precipitation amount contribution for its first moment (Fig. 1e), and precipitation variance contribution for its second moment (Fig. 1f). When discussing in terms of log-P normalization, we use log-P frequency density (Fig. 1e) and log-P amount (Fig. 1f).
Table 2. Nomenclature and mathematical relationship among the normalizations of the precipitation distribution and its moments; compare to Fig. 1.
d. Metrics used to evaluate models
1) Bulk measures of the PDF
The factor
Although different estimation methods such as maximum likelihood or linear regression (in log-log or log-linear coordinates for relevant ranges of the PDF) may provide different numerical values, these are generally spatially well correlated (Martinez-Villalobos and Neelin 2019). We consider a day wet when the daily precipitation is at least 0.1 mm. In some instances we plot
Alternatively, wet and dry days could be assessed jointly by considering a mixed-type PDF [conditional and unconditional moments are related as in Eqs. (18) and (19) in Papalexiou (2018)]. Here, we choose to analyze wet and dry times separately.
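As a minimal sketch of the bulk wet-day measures, the Python example below applies the 0.1 mm wet-day threshold and computes the fraction of wet days, the wet-day mean, and method-of-moments estimates of the cutoff scale and power-law exponent. The moment relations used (PL equal to the wet-day variance divided by the wet-day mean, and τP equal to one minus the squared mean over the variance) follow from a gamma distribution with shape 1 − τP and scale PL; using them here is an assumption for illustration, and other estimators will give somewhat different values. The synthetic series is hypothetical.

```python
import random
import statistics

WET_THRESHOLD = 0.1  # mm/day; a day counts as wet at or above this value

def wet_day_metrics(daily_precip):
    """Wet-day fraction, wet-day mean, and moment estimates of p_l and tau_p."""
    wet = [p for p in daily_precip if p >= WET_THRESHOLD]
    mean_wet = statistics.fmean(wet)
    var_wet = statistics.fmean((p - mean_wet) ** 2 for p in wet)
    return {
        "frac_wet": len(wet) / len(daily_precip),
        "mean_wet": mean_wet,
        "p_l": var_wet / mean_wet,               # gamma scale: variance / mean
        "tau_p": 1.0 - mean_wet ** 2 / var_wet,  # gamma shape is mean^2 / variance
    }

# Hypothetical series: half the days dry, wet days drawn from a gamma
# distribution with shape 0.8 (tau_p = 0.2) and scale 20 mm/day
random.seed(1)
series = [
    random.gammavariate(0.8, 20.0) if random.random() < 0.5 else 0.0
    for _ in range(100000)
]
metrics = wet_day_metrics(series)
```

Because days below the threshold are excluded, the recovered parameters are slightly biased relative to the generating values; the bias grows as τP approaches 1 and more of the probability sits in the light range.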
2) Evaluating the fit and shape
(i) Gamma distribution fit
The estimators for PL and τP can be expected to approach their actual values as long as the gamma distribution provides a good fit. We note that several other distributions produce gamma-like features over a range of their parameters (Cho et al. 2004; Kirchmeier-Young et al. 2016) and may account for some subtle features, such as deviation from a strict exponential decay of the extreme tail (Papalexiou and Koutsoyiannis 2013; Cavanaugh et al. 2015), that are unaccounted for by the gamma. In cases or regions where the gamma distribution fit is suboptimal, the interpretation of
A value of egamma close to 1 implies reasonably good fits while significant deviations from 1 point to progressively degraded ones.
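One way to build a fit diagnostic of this kind is to compare the sample ratio of the third to the second moment with the value a gamma distribution would imply given the sample mean and variance. For a gamma distribution that ratio is PL(3 − τP), which in terms of moments equals the mean plus twice the variance divided by the mean. The Python sketch below forms the quotient of the two; treating this quotient as the fit measure is an assumption made for illustration, and the function name and test samples are hypothetical.

```python
import random
import statistics

def gamma_fit_ratio(wet):
    """Sample third-to-second moment ratio divided by its gamma-implied value."""
    m1 = statistics.fmean(wet)
    m2 = statistics.fmean(p * p for p in wet)
    m3 = statistics.fmean(p ** 3 for p in wet)
    var = m2 - m1 * m1
    r3 = m3 / m2                      # observed ratio of third to second moment
    r3_gamma = m1 + 2.0 * var / m1    # p_l * (3 - tau_p) from moment estimates
    return r3 / r3_gamma

random.seed(2)
gamma_sample = [random.gammavariate(0.5, 10.0) for _ in range(100000)]
uniform_sample = [random.uniform(0.0, 50.0) for _ in range(100000)]  # non-gamma case

e_gamma_like = gamma_fit_ratio(gamma_sample)    # close to 1
e_not_gamma = gamma_fit_ratio(uniform_sample)   # about 0.9 for a uniform sample
```

Since the diagnostic relies on the third moment, it is sensitive to the tail and needs long records to be stable.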
(ii) Distance between observed and modeled PDFs and contributions
This distance is the simplest case of a family of more general probability distances (Zolotarev 1977; Korolev and Gorshenin 2020) and provides comparable results to other commonly used probability distance definitions (Martinez-Villalobos and Neelin 2021).
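For readers who want to experiment with probability distances on binned PDFs, the Python sketch below computes an integrated absolute difference between two densities and, for comparison, a Kullback–Leibler divergence. Both are generic textbook forms chosen for illustration; they are assumptions and should not be read as the exact definitions used in the analysis. The two gamma parameter sets standing in for "observed" and "modeled" PDFs are hypothetical.

```python
import math

def gamma_pdf(p, tau_p, p_l):
    """Gamma density with shape 1 - tau_p and scale p_l."""
    k = 1.0 - tau_p
    return p ** (-tau_p) * math.exp(-p / p_l) / (math.gamma(k) * p_l ** k)

def binned_pdf(tau_p, p_l, edges):
    """Gamma PDF sampled at bin centers and renormalized over the binned range."""
    widths = [b - a for a, b in zip(edges[:-1], edges[1:])]
    centers = [math.sqrt(a * b) for a, b in zip(edges[:-1], edges[1:])]
    f = [gamma_pdf(c, tau_p, p_l) for c in centers]
    total = sum(fi * w for fi, w in zip(f, widths))
    return [fi / total for fi in f], widths

def l1_distance(f_a, f_b, widths):
    """Integrated absolute difference between two binned densities (0 to 2)."""
    return sum(abs(a - b) * w for a, b, w in zip(f_a, f_b, widths))

def kl_divergence(f_obs, f_model, widths):
    """Kullback-Leibler divergence of the model from the observations.

    Assumes the model density is positive wherever the observed one is.
    """
    return sum(
        o * w * math.log(o / m)
        for o, m, w in zip(f_obs, f_model, widths)
        if o > 0.0
    )

edges = [0.1 * 10 ** (i / 10) for i in range(41)]  # 0.1 to 1000 mm/day
f_obs, widths = binned_pdf(0.6, 20.0, edges)       # stand-in for observations
f_mod, _ = binned_pdf(0.8, 12.0, edges)            # stand-in for a model
e_l1 = l1_distance(f_mod, f_obs, widths)
e_kl = kl_divergence(f_obs, f_mod, widths)
```

The absolute-difference form is bounded and symmetric in its arguments, whereas the Kullback–Leibler divergence is neither, which is one reason the numerical values of different distances are not directly interchangeable even when they rank models similarly.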
Similarly, we define
(iii) The shape of the PDF
A large probability distance between modeled and observed PDFs (epdf) may occur because the parameters of the PDFs (τP and PL) differ substantially (although the basic shape of the PDF may be well simulated) and/or because the modeled shape deviates significantly from the power-law range and cutoff-scale picture that holds in observational datasets. One example of such deviations is the presence of extra peaks in probability. To complement the information provided by egamma and epdf, we therefore also track the number of peaks in the PDF, Camount, and Cvar in models compared to observational products. We note that there are other, more subtle features that also imply a deviation from form, for example minima or maxima in derivatives of the PDF. For this paper, we limit ourselves to counting peaks as a proxy for deviations from the observed shape. The algorithm used to identify these peaks takes several precautions to avoid misidentifying them (Savitzky and Golay 1964). Details are given in Text S1 in the online supplemental material.
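A simplified peak counter can be sketched in Python as follows. It smooths the binned curve with a centered moving average (which is the lowest-order Savitzky–Golay filter) and then counts interior local maxima whose prominence exceeds a fraction of the curve maximum. The window length, the prominence threshold, and the synthetic curves are assumptions made for illustration; the precautions actually used in the analysis are those described in the supplemental text and are not reproduced here.

```python
import math

def smooth(values, half_window=2):
    """Centered moving average; the window shrinks symmetrically at the edges."""
    n = len(values)
    out = []
    for i in range(n):
        h = min(half_window, i, n - 1 - i)
        segment = values[i - h:i + h + 1]
        out.append(sum(segment) / len(segment))
    return out

def prominence(y, i):
    """Height of y[i] above the higher of the two surrounding low points."""
    def lowest_before_higher(step):
        lowest, j = y[i], i + step
        while 0 <= j < len(y) and y[j] <= y[i]:
            lowest = min(lowest, y[j])
            j += step
        return lowest
    return y[i] - max(lowest_before_higher(-1), lowest_before_higher(1))

def count_interior_peaks(values, half_window=2, min_rel_prominence=0.05):
    """Number of interior local maxima that stand out after smoothing."""
    y = smooth(values, half_window)
    threshold = min_rel_prominence * max(y)
    return sum(
        1
        for i in range(1, len(y) - 1)
        if y[i] > y[i - 1] and y[i] >= y[i + 1] and prominence(y, i) > threshold
    )

# Synthetic curves on a log-spaced grid from 0.1 to 1000 mm/day
grid = [0.1 * 10 ** (i / 10) for i in range(41)]
pdf = [p ** -0.6 * math.exp(-p / 20.0) for p in grid]   # gamma-like PDF shape
c_amount = [p * f for p, f in zip(grid, pdf)]           # amount contribution
bump = [math.exp(-0.5 * (math.log10(p / 200.0) / 0.1) ** 2) for p in grid]
c_bumpy = [c + b for c, b in zip(c_amount, bump)]       # artificial second scale
```

Applied to these curves, the counter is intended to report no interior peak for the gamma-like PDF, one for its amount contribution, and two once the artificial bump is added. The prominence test is what keeps small wiggles from sampling noise from being counted as additional scales.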
3) Model summary score for each metric
Finally, we condense the overall differences in probability peaks between models and observations by calculating the percentage of the 240 regions previously defined where models and observational products disagree in the number of PDF, Camount, and Cvar peaks.
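The percentage described above is a simple count over regions. A minimal Python sketch follows; the function name and the example peak counts are hypothetical.

```python
def percent_disagreement(model_peaks, obs_peaks):
    """Percentage of regions where model and observed peak counts differ."""
    if len(model_peaks) != len(obs_peaks):
        raise ValueError("model and observations must cover the same regions")
    differ = sum(1 for m, o in zip(model_peaks, obs_peaks) if m != o)
    return 100.0 * differ / len(obs_peaks)

# Hypothetical example over 240 regions: observations show one peak everywhere,
# while the model shows two peaks in 24 of the regions
obs_peaks = [1] * 240
model_peaks = [2] * 24 + [1] * 216
score = percent_disagreement(model_peaks, obs_peaks)
```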
For all metrics, the overall score shown and discussed in the rest of the paper is a weighted average of the model differences compared separately to TRMM-3B42 and GPCP. We chose these datasets because they tend to bracket the observational estimates of the other datasets in most metrics (see next section). To contextualize the difference between models and observations, we compare each metric against the difference between GPCP and TRMM-3B42 estimates, which provides a measure of the observational uncertainty. Given that TRMM-3B42 and GPCP share some input data (Huffman et al. 2007), this measure is admittedly a conservative estimate of the observational uncertainty (i.e., it likely understates the true uncertainty).
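The comparison against observational uncertainty can be sketched in Python as below. The use of a root-mean-square difference over regions, the equal default weights, and all names and numbers are assumptions for illustration; the text specifies only that model differences relative to TRMM-3B42 and GPCP are averaged and set against the GPCP minus TRMM-3B42 difference.

```python
import math

def rms_difference(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def summary_score(model, trmm, gpcp, w_trmm=0.5):
    """Averaged model error and its ratio to the spread between the two products."""
    err = w_trmm * rms_difference(model, trmm) + (1.0 - w_trmm) * rms_difference(model, gpcp)
    obs_spread = rms_difference(gpcp, trmm)
    return err, err / obs_spread

# Hypothetical regional values of some metric (e.g., a cutoff scale in mm/day)
trmm = [10.0, 20.0, 30.0]
gpcp = [12.0, 22.0, 32.0]
model = [11.0, 21.0, 31.0]
err, ratio = summary_score(model, trmm, gpcp)
```

With equal weights the triangle inequality implies that the ratio cannot fall below 0.5, a value reached only by a model lying between the two products in every region (as in the example above). Ratios well above 1 indicate model errors that exceed the spread between the two observational estimates.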
3. Comparison among observational products
a. PDFs and contributions and uncertainty quantification
Different daily precipitation observational datasets are known to have substantial differences (Donat et al. 2014; Pendergrass and Deser 2017; Klingaman et al. 2017; Sun et al. 2018; Rajulapati et al. 2020; Alexander et al. 2020; Martinez-Villalobos and Neelin 2021). Before evaluating models, it is important to be aware of these differences and to use them to provide a measure of observational uncertainty.
Figure 1 shows the daily precipitation PDF over the Niño-3.4 area using the six different observational datasets considered. In all cases the PDFs follow a similar shape: a power-law range followed by an approximately exponential drop in probability. The power-law range can be seen as a straight line in the log-log plot (Fig. 1a), occurring from the lowest value to approximately the location of the cutoff scale PL (shown in circles), and the drop in probability associated with the cutoff occurs for
This implies that the PDFs in Fig. 1a, the contribution to total precipitation in Fig. 1b, and the contribution to variance in Fig. 1c follow a similar shape in the large event range, with the main differences being in the power-law exponent (−τP for the PDF, 1 − τP for Camount, and 2 − τP for Cvar). The differences in power-law exponent imply a different shape for the low and moderate range, which results in Cvar preferentially weighted toward larger values, Camount weighted toward more moderate values, and the PDF having more of its weight in the light precipitation range. This implies that the extreme range contributes more to the second daily precipitation moment and the moderate range contributes preferentially to the total (or mean) precipitation.
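For a gamma-like form, the location of each contribution peak follows from setting the derivative of P^(n − τP) exp(−P/PL) to zero, which gives P = (n − τP) PL. The amount contribution (n = 1) therefore peaks at (1 − τP) PL, and the variance contribution (n = 2) at (2 − τP) PL. The Python sketch below checks this numerically on a fine grid; the parameter values and function names are illustrative assumptions.

```python
import math

def contribution(p, n, tau_p, p_l):
    """Unnormalized n-th moment contribution P**n * f(P) for a gamma-like PDF."""
    return p ** (n - tau_p) * math.exp(-p / p_l)

def argmax_on_grid(func, p_min=0.01, p_max=500.0, n=50000):
    """Grid point (linear spacing) at which func is largest."""
    grid = [p_min + (p_max - p_min) * i / n for i in range(n + 1)]
    return max(grid, key=func)

tau_p, p_l = 0.6, 20.0  # illustrative values only (p_l in mm/day)
peak_amount = argmax_on_grid(lambda p: contribution(p, 1, tau_p, p_l))  # near 8
peak_var = argmax_on_grid(lambda p: contribution(p, 2, tau_p, p_l))     # near 28

# With an exponent steeper than 1 the amount contribution decreases
# monotonically, so the maximum sits at the lowest grid value
peak_none = argmax_on_grid(lambda p: contribution(p, 1, 1.2, p_l))
```

The last case shows why a power-law range steeper than −1 removes the interior peak of the amount contribution while the variance contribution can remain single peaked.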
Figure 2 shows the zonal average of
b. Relationships among metrics
That is, the peak of Camount is given by the mean on wet days
4. Model evaluation
In this section we evaluate models according to the metrics defined in section 2. We exclude regions poleward of 50°, as TRMM-3B42 is available only between 50°S and 50°N. We start with the evaluation of the suitability of the gamma distribution in observations and models. Then, we evaluate the model representation of cutoff scales and power-law ranges and, subsequently, the probability distance between observed and modeled PDFs and contributions to precipitation amount and variance. These probability distances depend on how well models simulate the power-law exponent and cutoff scale parameters but also on how well models simulate the basic “shape” of the PDF. Accordingly, to end this section we evaluate model deviations from the observed shape in GPCP and TRMM-3B42 satellite products using the number of peaks in PDFs and contributions as a proxy.
a. Evaluation of the gamma distribution approximation
A global map evaluating the suitability of the gamma distribution to approximate PDFs in satellite products and in the multimodel mean is given in Fig. 3 (first and second row for TRMM-3B42 and GPCP, and third row for the multimodel mean). The first column shows the ratio between the third and second moment r3 [defined in section 2d(2)(i)], the second column shows the expected ratio if the gamma distribution held perfectly
b. Evaluating model simulation of PDF power-law exponent, cutoff scale, and fraction of wet days
Global maps of
As is the case in previous CMIP phases (Flato et al. 2013), the mean precipitation
While the CMIP6 ensemble provides credible spatial patterns of
Both the mean precipitation over wet days (Fig. 5e) and fraction of wet days (Fig. 5f) in models tend to follow the latitudinal pattern of observations, but the bias previously mentioned (models raining too frequently and too little) is evident. An exception is that the strength of precipitation over the ITCZ on wet days (
It is clear looking at Fig. 5 that some models provide substantially closer results compared to observations than others. Figure 6 provides an evaluation of their individual performance for
c. Evaluating the distance between modeled and observed PDFs
To illustrate how well models simulate daily precipitation probabilities, Fig. 7 shows PDFs and amount and variance contributions for the best and lowest performing model based on the epdf,
Contributions to variance Cvar (Fig. 7, third column) are single peaked in all cases, and are more robust in terms of shape, consistent with previous studies (Pendergrass and Deser 2017). While the shape of PDFs and contributions to amount and variance tend to be well simulated by these models in these regions, the main difference between the best and lowest performing models is in how well they simulate the power-law range and cutoff scale. Errors in these lead to deviations in probability weight (e.g., MPI-ESM-1-2-HAM puts too much probability weight in the light precipitation range in the Niño-3.4 region; Fig. 7b) and in where the contribution peaks are located.
In the examples in Fig. 7, we note that the largest difference in epdf occurs in the Niño-3.4 region, with the highest and lowest scoring models performing similarly close to satellite products in the midlatitude regions. This result tends to hold in general, with tropical regions having a larger model spread than midlatitudes (Fig. 8). While the model spread is large in tropical regions, on average the dry subtropics are where models differ most from satellite products (Figs. 8d,e), with the exception of Cvar, for which the entire subtropical and tropical regions are simulated less well than the midlatitudes (Fig. 8f). To put these results in context, we should note, however, that uncertainties between satellite products are large and tend to mirror model errors, with larger uncertainties over the ocean and tropical and subtropical regions (Figs. 8a–c), as highlighted in Pendergrass and Deser (2017). Overall, compared to the range between satellite products, model probability errors tend to be larger in the light and moderate range (as measured by epdf and
A ranking of the models in terms of their simulation of daily precipitation PDFs [based on the integrated epdf error; section 2d(2)(ii)] is given in Fig. 9a (Fig. S3 shows the corresponding ranking for Camount and Cvar, as well as a comparison to another metric, the Kullback–Leibler divergence; Kullback and Leibler 1951). Model performance in simulating power-law exponents and cutoff scales is a good predictor of how well models simulate PDFs—models with low RMS error in both
To quantify the extent to which model performance in simulating PDFs (Fig. 9a) can be explained by model performance in simulating cutoff scales (Fig. 6a) and power-law exponents (Fig. 6b), we calculate an epdf measure that can be attributed solely to errors in the simulation of
d. Counting the number of peaks
To a very good approximation, observed daily precipitation PDFs are characterized by a scale-free range (the power-law range, with exponent usually in the 0–1 range) and a single physical scale (the cutoff scale). This implies that daily precipitation PDFs have no interior peak (the most probable daily precipitation value is the lowest resolvable amount) and that contributions are single-peaked, with the Camount peak giving the daily precipitation intensity that most contributes to precipitation amount and the Cvar peak giving the scale that most contributes to the second moment (section 2b).
As illustrated in the bottom row of Fig. 10, important differences in the shape of observed and simulated PDFs and contributions may occur, which in the most severe cases may include additional peaks not present in observations. While we note that deviations from the power-law range and cutoff-scale shape may be more subtle, here we provide a first quantification of model differences in shape by counting the number of simulated peaks in the PDF and contributions versus observations. In contrast to other metrics, observational datasets tend to agree in these measures—both GPCP and TRMM-3B42 display zero interior peaks in the PDF and one peak in Cvar almost everywhere (Figs. 10a,c), with some differences for Camount (Fig. 10b). We should note, however, that these observational products miss light rain (Kay et al. 2018), so the existence of additional peaks in that range is not ruled out (see also section 5b). In the case of Camount, GPCP and TRMM-3B42 tend to display a single peak almost everywhere (97.5% of regions within 50°S–50°N in GPCP and 75% in TRMM-3B42); however, TRMM-3B42 tends to display no interior peaks in dry subtropical regions (22.9% of regions; see Fig. 10b), associated with a steeper power-law range there [τP tending to exceed one; see Eq. (14)].