Precipitation is a key component of the water cycle connecting processes across the atmosphere, land, ocean, and cryosphere (Trenberth et al. 2007). Through decades of development, the current generation of climate models uses increasingly sophisticated, physically based subgrid parameterizations of convection and cloud microphysics to simulate precipitation, although their horizontal resolutions are still typically much coarser than needed to explicitly resolve precipitation formation processes. When, where, how often, and how much precipitation falls has significant implications for the energy, water, and biogeochemical cycles of the Earth system. For example, biases in soil moisture can often be linked to biases in precipitation amount, frequency, and intensity, which influence the partitioning of precipitation into evapotranspiration, runoff, and soil moisture storage, with subsequent impact on surface temperature through evaporative cooling (Qian et al. 2006). Relatedly, biases in modeling the surface water and energy balance due to precipitation biases can influence clouds, convection, and precipitation through energetic constraints and land–atmosphere feedbacks. Because of the myriad Earth system interactions and feedbacks mediated by precipitation, skillful modeling of precipitation and understanding and attribution of precipitation biases are scientifically challenging (Dai 2006; Covey et al. 2016; Chen et al. 2021). As precipitation biases are among the most consequential in limiting the use of climate models for decision support, there is an urgent need to improve precipitation modeling across a wide range of spatial and temporal scales (Tapiador et al. 2019).
Quantifying and understanding model precipitation biases is an important step toward improving the overall quality of climate simulations and predictions. Metrics are objective measures for benchmarking model performance against observations and facilitating model intercomparison. Common metrics of precipitation have focused on aspects such as the spatial distribution of annual and seasonal mean precipitation, daily precipitation amount, frequency, and intensity, and the probability density function of precipitation rate (Deser et al. 2012; Chen and Dai 2018, 2019). Increasingly, metrics related to extremes such as annual maximum daily precipitation and consecutive dry days have also been used to evaluate precipitation characteristics connected more closely to societal impacts. These metrics have revealed multiple longstanding precipitation biases in climate models. For example, climate models tend to produce too frequent light daily precipitation, but not enough high-intensity daily precipitation compared to observations (Dai 2006; Stephens et al. 2010; Chen et al. 2021), while subdaily intensities can vary considerably between models (e.g., Klingaman et al. 2017). Most global climate models simulate a spurious intertropical convergence zone (ITCZ) in the southeastern Pacific and South Atlantic, resulting in a double-ITCZ bias that is most prominent during boreal winter (Mechoso et al. 1995; Lin 2007; Mapes and Neale 2011; Hwang and Frierson 2013; Oueslati and Bellon 2013; Hirota et al. 2014; Tian 2015; Tian and Dong 2020). Erroneous diurnal timing of precipitation over land is another common bias, which is most noticeable during boreal summer in regions such as the central United States featuring nocturnal peaks in precipitation (Dai et al. 1999; Tang et al. 2021). Precipitation biases have also been identified in regions with complex terrain such as the western United States (Mejia et al. 2018) and Europe (Mehran et al. 2014), in Amazonia (Yin et al. 2013), and in monsoon regions such as Asia (Sperber et al. 2013).
Although precipitation diagnostics and metrics have been incorporated in model evaluation and diagnostic packages such as ESMValTool (Eyring et al. 2020) and the PCMDI Metrics Package (PMP; Gleckler et al. 2016) used by climate modeling centers and the climate science community, they focus on limited aspects of precipitation for benchmarking global climate simulations. At the same time, over the past few years new precipitation diagnostics and metrics have been developed to deconvolve and better understand model precipitation biases. For example, Ma et al. (2013) proposed a set of metrics and diagnostics to evaluate and diagnose tropical precipitation biases and associated moist processes in climate models. Their proposed diagnostics include stratiform fraction of precipitation, probability density function of daily precipitation intensity, composites of column water vapor, column relative humidity, temperature, and specific humidity profiles as a function of precipitation intensity, and composites of stratiform rainfall fraction as a function of column relative humidity. Klingaman et al. (2017) developed a set of diagnostics and metrics for analyzing precipitation intensity and coherence on a range of time and space scales.
This study represents a collaborative effort as an outgrowth of a workshop on “Benchmarking Simulated Precipitation in Earth System Models” (Pendergrass et al. 2020) to develop more advanced precipitation metrics and demonstrate their use in benchmarking diverse aspects of precipitation from climate simulations. Three types of precipitation diagnostics and metrics are presented: 1) spatiotemporal characteristics metrics, such as diurnal variability, probability of extremes, duration of dry spells, spectral characteristics, and spatiotemporal coherence of precipitation; 2) process-oriented metrics, based on the rainfall–moisture coupling and temperature–water vapor environments of precipitation; and 3) phenomena-based metrics, focusing on precipitation associated with weather phenomena such as low pressure systems, mesoscale convective systems, frontal systems, and atmospheric rivers. These diagnostics and metrics take advantage of analysis building on advances in understanding the thermodynamic environments of precipitation (e.g., Bretherton et al. 2004; Neelin et al. 2009; Kuo et al. 2018; Chen et al. 2020) and their role in modes of variability (e.g., Wolding et al. 2020), and in tracking weather features such as atmospheric rivers (e.g., Shields et al. 2018).
While examples of the above metrics have been reported in recent literature (e.g., Klingaman et al. 2017; Ahmed et al. 2020; Feng et al. 2021a), they are deemed exploratory partly because they have not been widely used or implemented in standard metrics and diagnostics packages and partly because they allow deeper exploration of precipitation characteristics and associated processes. Some of these diagnostics and metrics require variables besides precipitation to evaluate relationships with environmental conditions, or to track weather features, so their data requirements go beyond the baseline precipitation metrics already implemented in widely used metrics and diagnostics packages (Pendergrass et al. 2020). Furthermore, additional research may be needed on interpretations of results from use of these metrics, to standardize their use, or to address technical or computational issues. Here, we apply the exploratory metrics to a common set of climate simulations from phases 5 (CMIP5; Taylor et al. 2012) and 6 (CMIP6; Eyring et al. 2016) of the Coupled Model Intercomparison Project. While our aim is not to provide an exhaustive study of the ability of these models to represent precipitation, we illustrate how such diagnostics and metrics may be used to evaluate broader aspects of precipitation in climate simulations and to explore insights that may be gained through comparative analysis of multiple metrics. With increasing model resolutions to better resolve weather and large-scale environments (e.g., Haarsma et al. 2016), the exploratory diagnostics and metrics may be even more relevant not only for benchmarking models but also for understanding the causes of model precipitation biases. They also provide useful information to support the growing and more diverse uses of precipitation from climate models and improve communications of climate model performance by connecting precipitation to commonly understood weather phenomena. 
A collection of such exploratory diagnostics and metrics is a valuable addition to the existing precipitation diagnostics and metrics packages that are used in the community.
We briefly summarize the observational data, climate model output, and the feature tracking methods in section 2. Key results are presented in sections 3, 4, and 5 for the spatiotemporal characteristics, process-oriented metrics, and phenomena-based metrics, respectively. Each area is presented as a module describing the diagnostics and metrics and the results of applying them to climate model outputs summarized in a multipanel figure. We conclude with discussion and summary in section 6.
2. Data and feature tracking methods
a. Observational data and climate model outputs
Several observational precipitation data products are used for benchmarking precipitation from climate simulations. These include 1) the Tropical Rainfall Measuring Mission (TRMM) Multisatellite Precipitation Analysis (TMPA-RT) (3B42; Huffman et al. 2007); 2) the Remote Sensing Systems TRMM Microwave Imager (TMI) Daily Environmental Suite on a 0.25° grid, version 7.1 (Wentz et al. 2015); 3) the TRMM Precipitation Radar (PR) Rainfall Rate and Profile L2 1.5 h V7 (2A25; TRMM 2011); 4) the monthly and daily Global Precipitation Climatology Project (GPCP) V3 combined precipitation dataset (Huffman et al. 2020); 5) CMORPH bias-corrected integrated satellite precipitation estimates (Joyce and Xie 2011); 6) Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks (PERSIANN) (Ashouri et al. 2015); and 7) Integrated Multi-satellitE Retrievals for Global Precipitation Measurement (GPM) (IMERG) precipitation data V06B (Tan et al. 2019). They represent a diverse set of precipitation data derived from satellite- and ground-based remote sensing retrievals. In addition, ground-based precipitation observations at the DOE Atmospheric Radiation Measurement (ARM) Program’s Southern Great Plains (SGP) and Manacapuru (MAO) sites are also used. The ARM data used in this study are from the ARM best estimate (ARMBE; Xie et al. 2010) data products and the ARM long-term continuous variational analysis (VARANAL; Xie et al. 2004). At these ARM sites, the available surface rain gauge measurements and/or radar retrievals provide additional information to validate satellite-based precipitation products. Table 1 summarizes the spatial and temporal resolution and domain coverage of these datasets. While the highest spatial resolution available for each dataset is given in Table 1, coarse-graining of the data for comparison to models is described with each metric. 
As different exploratory diagnostics and metrics have different requirements for precipitation data, we do not standardize the use of observational precipitation data in calculating the metrics, but recognize the need to address uncertainty in observed precipitation products in use and interpretation of metrics.
Table 1. Observational and reanalysis data for benchmarking models. Variables P, Q, U, V, T, CWV, and IR Tb are precipitation, specific humidity, zonal wind, meridional wind, temperature, column water vapor, and infrared brightness temperature, respectively.
Besides precipitation data, several global reanalysis products are used to provide gridded data of the atmospheric environments needed for calculation of some process-oriented metrics and identification and tracking of weather features for the phenomena-based metrics: 1) ERA-Interim (Dee et al. 2011), 2) ERA5 (Hersbach et al. 2020; Hoffmann et al. 2019), 3) MERRA-2 (Gelaro et al. 2017), and 4) CFSR (Saha et al. 2010). Last, the NASA Global Merged IR V1 infrared brightness temperature (Tb) data (Janowiak et al. 2017) are also used to track mesoscale convective systems (MCSs). The spatial and temporal resolutions of the reanalysis products and Tb data are also summarized in Table 1.
The exploratory metrics are applied to benchmark precipitation from the Coupled Model Intercomparison Project phase 5 (CMIP5; Taylor et al. 2012) and phase 6 (CMIP6; Eyring et al. 2016), with typical horizontal resolution of ∼1°. Two of the metrics on low pressure systems and mesoscale convective systems are applied to precipitation from several high-resolution simulations from HighResMIP (Haarsma et al. 2016) as these weather features are better defined and more reasonably resolved at higher resolution. In HighResMIP, high-resolution simulations have nominal resolutions ranging from 0.25° to 0.5°, with their low-resolution counterparts ranging from 1.0° to 1.4°. Table 2 summarizes the variables and their temporal frequency used to calculate the various metrics.
Table 2. Variables and their temporal frequency used to calculate various precipitation metrics and the objectives of the metrics.
b. Feature identification and tracking methods
The phenomena-based metrics require identification and tracking of weather features in observed and simulated precipitation. Brief descriptions of the methods used to track low pressure systems (LPS), mesoscale convective systems (MCS), frontal systems (FRT), and atmospheric rivers (AR) are provided below, while more detailed descriptions are provided in the cited references.
1) Low pressure systems
The TempestExtremes feature tracking algorithm (Ullrich and Zarzycki 2017) is used to track tropical low pressure systems by identifying extrema in candidate tracking variables. A systematic assessment of multiple candidate variables, hundreds of quantitative tracking criteria, and several vertical levels led to selection of the streamfunction of the 850-hPa horizontal wind (Vishnu et al. 2020) as the optimal tracking variable. Streamfunction minima were used to identify lower-tropospheric cyclonic vortices within 35° of the equator in the ERA5 reanalysis, four HighResMIP models, and the 0.25°-resolution E3SM model (Caldwell et al. 2019). The streamfunction was calculated from the horizontal wind for each dataset, with any wind velocities that were extrapolated below Earth’s surface (e.g., in ERA5) set to zero before solving the Poisson problem for the streamfunction (Vishnu et al. 2020). The resulting track dataset for ERA5, together with tracks for four other reanalyses, is available in a Zenodo repository (https://doi.org/10.5281/zenodo.3890646).
2) Mesoscale convective systems
The FLEXTRKR algorithm is used to track MCSs in observations and model simulations. An MCS is defined as a convective system with 1) a cold cloud shield (CCS) area > 4 × 10⁴ km² containing a precipitation feature (PF) with major axis length > 100 km and 2) PF area, mean rain rate, rain rate skewness, and heavy rain volume ratio larger than the corresponding lifetime-dependent thresholds, with 3) both conditions 1 and 2 lasting continuously for longer than 4 h. As in Feng et al. (2021b), the CCS is tracked using geostationary satellite Tb data and defined using a threshold of Tb < 241 K. For model simulations, Tb is derived from simulated outgoing longwave radiation following the empirical formulation provided by Yang and Slingo (2001). PFs are tracked using the IMERG hourly precipitation data and are defined as contiguous areas within the CCS with hourly rain rate > 2 mm h⁻¹.
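The conversion from simulated outgoing longwave radiation (OLR) to Tb, followed by the 241-K cold cloud test, can be sketched as below. This is a minimal illustration, not the FLEXTRKR code: it assumes the empirical relation takes the form Tf = Tb(a + b·Tb), with OLR = σTf⁴, and the coefficient values a = 1.228 and b = −1.106 × 10⁻³ K⁻¹ are quoted from memory and should be checked against Yang and Slingo (2001). The function names are hypothetical.

```python
import math

SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant (W m^-2 K^-4)
A_COEF = 1.228          # assumed empirical coefficient (dimensionless)
B_COEF = -1.106e-3      # assumed empirical coefficient (K^-1)


def olr_to_tb(olr):
    """Convert OLR (W m^-2) to an approximate IR brightness temperature (K).

    Assumes Tf = Tb * (A_COEF + B_COEF * Tb), where Tf is the
    flux-equivalent temperature defined by OLR = SIGMA * Tf**4.
    """
    tf = (olr / SIGMA) ** 0.25
    # Solve B*Tb^2 + A*Tb - Tf = 0 for the physically meaningful root.
    disc = A_COEF ** 2 + 4.0 * B_COEF * tf
    return (-A_COEF + math.sqrt(disc)) / (2.0 * B_COEF)


def is_cold_cloud(olr, tb_threshold=241.0):
    """True where the derived Tb falls below the cold cloud shield threshold."""
    return olr_to_tb(olr) < tb_threshold
```

The area, major-axis, rain-rate, and duration criteria would then be applied to the contiguous regions flagged by `is_cold_cloud`; those steps are not shown here.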
3) Frontal precipitation
4) Atmospheric rivers
With few exceptions, previous studies have utilized only a single AR detection tool (ARDT) in each study, whereas over 30 ARDTs currently exist (Shields et al. 2018; Rutz et al. 2019). Recent results from the Atmospheric River Tracking Method Intercomparison Project (ARTMIP) have demonstrated that different ARDTs can produce different scientific results, which suggests that multiple ARDTs may need to be used when evaluating climate models in order to gain a complete picture of model skill in simulating ARs (O’Brien et al. 2020b). ARs are detected globally using six independently developed ARDTs, which we refer to by the following code names: ARCONNECT v2 (Sellars et al. 2017), GuanWaliser v2 (Guan et al. 2018), Lora v2 (Skinner et al. 2020), Mundhenk v3 (Mundhenk et al. 2016), TECA BARD v1.0 (O’Brien et al. 2020b), and Tempest LR (McClenny et al. 2020; O’Brien et al. 2022). These ARDTs were run on output from the MERRA-2 reanalysis as part of the ARTMIP Tier 1 experiment (Shields et al. 2018) and on output from the CMIP5 and CMIP6 multimodel ensembles as part of the ARTMIP Tier 2 CMIP5/6 experiment (O’Brien et al. 2022). The methods use a variety of heuristic rules to objectively identify atmospheric rivers from integrated vapor transport (IVT; the vertical integral of horizontal moisture transport) and/or integrated water content. For example, the widely used GuanWaliser v2 algorithm identifies ARs as contiguous regions of integrated vapor transport exceeding the climatological 85th percentile, if the contiguous regions meet specific geometric thresholds indicative of long and narrow regions of intense poleward moisture transport. We employ multiple ARDTs because recent literature indicates that different ARDTs may, in some instances, lead to qualitatively different answers to the same question (O’Brien et al. 2020a,b; Zhou et al. 2021).
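To make the IVT-based detection step concrete, the sketch below computes the IVT magnitude for a single column and tests it against a climatological percentile. It illustrates only the thresholding idea; it is not the GuanWaliser v2 code and omits the geometric criteria entirely. The function names, the trapezoidal integration, and the linear-interpolation percentile are assumptions made for illustration.

```python
import math

G = 9.80665  # standard gravity (m s^-2)


def ivt_magnitude(q, u, v, p):
    """IVT magnitude (kg m^-1 s^-1) for one column.

    q: specific humidity (kg kg^-1), u/v: wind components (m s^-1),
    p: pressure levels (Pa); all lists share the same level ordering.
    """
    def integrate(f):
        # Trapezoidal rule in pressure; abs() makes level ordering irrelevant.
        return sum(0.5 * (f[i] + f[i + 1]) * abs(p[i + 1] - p[i])
                   for i in range(len(p) - 1))

    qu = [qi * ui for qi, ui in zip(q, u)]
    qv = [qi * vi for qi, vi in zip(q, v)]
    return math.hypot(integrate(qu), integrate(qv)) / G


def percentile(values, pct):
    """Percentile with linear interpolation between sorted samples."""
    s = sorted(values)
    rank = (len(s) - 1) * pct / 100.0
    lo = int(math.floor(rank))
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (rank - lo)


def exceeds_ivt_threshold(ivt_now, ivt_climatology, pct=85.0):
    """True if IVT exceeds the chosen climatological percentile at this point."""
    return ivt_now > percentile(ivt_climatology, pct)
```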
3. Spatiotemporal characteristics metrics
Precipitation variability at different spatial and temporal scales is associated with specific processes such as convection driven by diurnal solar heating at the land surface, seasonal moisture convergence related to monsoon systems, disturbances related to convectively coupled equatorial waves, and large-scale atmosphere–ocean interactions. Diagnostics and metrics of spatiotemporal precipitation characteristics are therefore useful for relating model biases to specific mechanisms of precipitation generation at relevant ranges of spatiotemporal scales. These metrics are also useful for informing use of climate model precipitation data at appropriate spatiotemporal scales. Four metrics to benchmark the diurnal cycle of precipitation, daily precipitation and duration of dry spells, fractional contribution to the total mean rainfall from different intensities, and spatial and temporal coherence of precipitation are discussed in this section.
a. Diurnal cycle of precipitation
Fourier analysis has been widely applied to quantify the diurnal cycle of precipitation in both observations and GCMs. However, model-simulated rainfall is often quite noisy and is therefore poorly fit by low-order Fourier harmonics at single grid points. Covey et al. (2016) proposed a summary metric that illustrates the model-simulated Fourier amplitude and phase, averaged separately over all land and all ocean areas, in a single two-dimensional map. This metric enables intercomparison of climate models with observations and with each other over different climate regimes, but it becomes problematic when the number of models increases. Here we extend the procedure of Covey et al. (2016) and propose a metric that clearly displays the Fourier amplitude and phase of each individual model from a large number of groups in one bar plot (C. Tao et al. 2021, unpublished manuscript).
Figure 1a shows an example of the composite diurnal harmonic amplitude and phase (in LST) of summertime precipitation from 24 CMIP6 models versus observations over land. To generate the figure, we first produce a composite diurnal time series of precipitation, averaged over many years, for each grid point. We then apply Fourier analysis to the composite diurnal cycle of precipitation and focus on the first harmonic component, following Dai (2001). Here, the diurnal harmonic amplitude and phase are averaged over all land points between 50°S and 50°N using a vector averaging method, which automatically downweights the areas with a weak diurnal cycle (Covey and Gleckler 2014; Covey et al. 2016). Model precipitation is evaluated for the period of 1996–2005. Previous studies (e.g., Dai et al. 2007) have indicated that a stable diurnal cycle can be obtained with just a few years of data. As shown, the two satellite-based observations (TRMM 3B42 v7 and GPM-IMERG) agree quite well with each other in terms of both diurnal amplitude and phase. Over land, the major deficiency of the models is a diurnal precipitation peak that occurs too early, consistent with previous studies (e.g., Dai 2006; Xie et al. 2019). The majority of the models show a diurnal harmonic phase peaking between 1200 and 1500 LST rather than in the early evening as observed. The observed early morning diurnal harmonic phase over the ocean is generally captured by most of the CMIP6 models (Fig. 1b), while the corresponding diurnal harmonic amplitude is somewhat underestimated in all 24 CMIP6 models. To highlight the models with the best performance, Figs. 1c and 1d show scatterplots of absolute model bias relative to TRMM observations in diurnal harmonic phase versus amplitude over land and over ocean, respectively. Over ocean, interestingly, models that perform better in the diurnal cycle phase tend to perform worse in amplitude (Fig. 1d). 
The relationship between model biases in precipitation diurnal phase and amplitude over land is less significant than that over ocean, but there is a tendency for models with smaller bias in phase to have correspondingly smaller bias in amplitude (Fig. 1c). In particular, EC-Earth3, EC-Earth3-Veg, and EC-Earth3-Veg-LR compare best with the observations over land in terms of both diurnal amplitude and phase. Similar results are found by interpolating the data to a common grid (not shown). Generally, the impact of model resolution on the simulated diurnal cycle of precipitation is minimal.
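The first-harmonic fit and the vector averaging described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used for Fig. 1: the composite is assumed to be sampled at equally spaced local times starting at 0000 LST, and the function names and optional area weights are hypothetical.

```python
import cmath
import math


def first_harmonic(composite, hours_per_day=24.0):
    """Amplitude and peak time (LST hours) of the diurnal (24-h) harmonic.

    composite: composite-mean precipitation at N equally spaced local
    times, with the first sample at 0000 LST.
    """
    n = len(composite)
    # Discrete Fourier coefficient at one cycle per day.
    coeff = sum(p * cmath.exp(-2j * math.pi * k / n)
                for k, p in enumerate(composite))
    amplitude = 2.0 * abs(coeff) / n
    # The fitted cosine peaks where its argument is zero.
    phase_hour = (-cmath.phase(coeff) * hours_per_day / (2.0 * math.pi)) % hours_per_day
    return amplitude, phase_hour


def vector_average(harmonics, weights=None, hours_per_day=24.0):
    """Average (amplitude, phase_hour) pairs as vectors on the 24-h clock.

    Points with a weak diurnal cycle contribute short vectors and are
    therefore downweighted automatically.
    """
    if weights is None:
        weights = [1.0] * len(harmonics)
    vec = sum(w * a * cmath.exp(2j * math.pi * ph / hours_per_day)
              for w, (a, ph) in zip(weights, harmonics)) / sum(weights)
    phase_hour = (cmath.phase(vec) * hours_per_day / (2.0 * math.pi)) % hours_per_day
    return abs(vec), phase_hour
```

For example, averaging a strong evening cycle (amplitude 2, peak at 1800 LST) with a weak morning cycle (amplitude 0.1, peak at 0600 LST) leaves the mean phase at 1800 LST, whereas a plain average of the two peak times would give noon.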
The metric diagram can also be easily computed for smaller scales and at different locations where rich ground-based high-frequency observations are available. Figures 1e and 1f compare the simulated diurnal harmonic amplitude and phase to observations at the two ARM sites (SGP and MAO), where precipitation shows distinct diurnal variability: SGP features a nocturnal peak, whereas MAO has an afternoon peak. Despite some discrepancies, the satellite-based products generally agree fairly well with the ground-based rain gauge and/or radar measurements. As shown in Fig. 1e, there is a large model spread in both diurnal amplitude and phase at SGP, with all but two of the models failing to capture the observed nocturnal peak around midnight and half of the models instead showing a diurnal precipitation peak in the afternoon. CNRM-CM6-1-HR and FGOALS-g3 simulate a diurnal phase much closer to that observed, but both significantly underestimate the diurnal harmonic amplitude. The majority of models show a diurnal precipitation peak around noon at MAO, a few hours earlier than observed (Fig. 1f). In general, the CMIP6 models show diverse results in simulating the diurnal amplitude, with some overestimating the observed value and some underestimating it, but they often show consistent biases in simulating the diurnal phase. Almost all the models peak too early during the day and miss the nocturnal peak in certain regions.
To summarize, the metric developed here provides a quick comparison with observations and among models, and reasonably summarizes the systematic model errors in reproducing the diurnal cycle of precipitation over both large areas and single point locations. In particular, by displaying the diurnal harmonic amplitude and phase from the Fourier analysis in one bar plot, this metric enables evaluation focused on individual model performance across a large number of models.
b. Extremes: Daily precipitation and duration of dry spells
Despite being seemingly contrasting variables, daily precipitation and the duration of dry spells share many features in the shape of their probability distributions. The probability density functions (PDFs) of both quantities are characterized by a power-law range, where the probability decreases slowly with each order of magnitude increase in precipitation rate or duration of dry spells, up to a cutoff scale (denoted PL for daily precipitation and tL for the duration of dry spells; see Figs. 2a,b) where the probability decreases roughly exponentially (Figs. 2a,b), ultimately controlling the size of extreme percentiles (Martinez-Villalobos and Neelin 2018, 2021; Chang et al. 2020). These quantities have connections with the moisture budget, with PL (and hence also extreme percentiles) scaling with the amplitude of moisture convergence fluctuations within precipitating events (Neelin et al. 2017; Martinez-Villalobos and Neelin 2019), and tL scaling with the balance between moisture convergence fluctuations at dry times and the mean moisture source tendency (Pierrehumbert et al. 2007; Stechmann and Neelin 2014).
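One functional form consistent with this description is a power law with an exponential cutoff, written below for both quantities. This is an illustrative assumption, not necessarily the form fitted in the cited studies. The exponents τ_P and τ_t are hypothetical symbols introduced here, and P_L and t_L correspond to PL and tL in the text.

```latex
p(P) \propto P^{-\tau_P} \exp\!\left(-\frac{P}{P_L}\right),
\qquad
p(t) \propto t^{-\tau_t} \exp\!\left(-\frac{t}{t_L}\right)
```

For values well below the cutoff scale the exponential factor is close to 1 and the power law dominates; for values well above it the exponential factor controls the tail, which is why the cutoff scale sets the size of the extreme percentiles.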
Recently, Martinez-Villalobos and Neelin (2021) showed that the shape of the large daily precipitation probability tail and the spatial pattern of the cutoff scale are well simulated by GCMs but there is a bias in the magnitude of PL compared to observational datasets (see also Fig. 2a). This suggests that two metrics can succinctly summarize the general model behavior of daily precipitation and dry-spell duration extremes. The first one is the spatial correlation coefficient over 50°S–50°N (the spatial extent of TRMM-3B42; see Table 1) between model-simulated PL and tL patterns (see Figs. 2c,d for their CMIP6 multimodel mean) and their observational counterparts (TRMM-3B42 in this case). The second metric is the scaling factor, defined as the model area-weighted mean PL or tL divided by the TRMM-3B42 observational estimate of the same quantity. The first metric tests whether extremes are well simulated spatially regardless of magnitude (values can range between −1 and 1, with 1 denoting a model that simulates the spatial pattern of TRMM-3B42), and the second tests the overall magnitude of the pattern (values can range between 0 and infinity, with 1 being the best). To gauge model behavior, we also calculate the same metrics comparing GPCP versus TRMM-3B42 as a measure of observational uncertainty. The differences between observational precipitation products can be large; thus, model results may be sensitive to the choice of target observational product. This sensitivity is discussed in section 4 and in Fig. S2 in the online supplemental material. We note the caveat that part of the differences between models and observational products noted below may be the result of sampling different internal variability realizations (Deser et al. 2012) due to the relatively short span in which precipitation observational products have been available. 
However, different realizations from models of the same family (e.g., GFDL models, CNRM models) tend to perform similarly, which suggests that sampling variability has only a minor effect on the results. More details on these metrics and methodology are given in the online supplemental material.
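The two metrics can be sketched for fields that have already been regridded to a common grid and flattened to lists of grid-point values. This is a minimal illustration under stated assumptions: cosine-of-latitude weights stand in for grid-cell area, and the same weights are applied to the pattern correlation, a choice the text does not specify. Function names are hypothetical.

```python
import math


def area_weights(lats_deg):
    """Cosine-of-latitude weights, one per grid point (assumed area proxy)."""
    return [math.cos(math.radians(lat)) for lat in lats_deg]


def weighted_mean(x, w):
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)


def pattern_correlation(model, obs, w):
    """Weighted spatial correlation between two fields (range -1 to 1)."""
    mm, mo = weighted_mean(model, w), weighted_mean(obs, w)
    cov = sum(wi * (m - mm) * (o - mo) for wi, m, o in zip(w, model, obs))
    var_m = sum(wi * (m - mm) ** 2 for wi, m in zip(w, model))
    var_o = sum(wi * (o - mo) ** 2 for wi, o in zip(w, obs))
    return cov / math.sqrt(var_m * var_o)


def scaling_factor(model, obs, w):
    """Ratio of area-weighted means (model over observations); 1 is best."""
    return weighted_mean(model, w) / weighted_mean(obs, w)
```

A model field equal to 0.7 times the observed field at every point would score a pattern correlation of 1 and a scaling factor of 0.7, which shows how the two metrics separate pattern from magnitude.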
Figures 2e and 2f show the results for 35 CMIP6 models and for the multimodel ensemble mean (MME) for PL and tL, respectively. We first find that there is substantial observational uncertainty for PL. The overall magnitude of PL in GPCP is about 70% (scaling factor of 0.68) of the TRMM-3B42 magnitude, and the correlation coefficient of the patterns is 0.81. Several models are closer to TRMM-3B42 than the observational uncertainty. Among these we highlight HadGEM3-GC31-MM as the model with the closest PL spatial pattern (r = 0.89) and GFDL-ESM4 as the model with the closest overall magnitude to TRMM-3B42 (scaling factor = 0.98). The MME benefits from the good performance of the best models in the spatial structure and averages the overall magnitude of PL in the different models. This results in a multimodel mean that is closer than GPCP to TRMM-3B42 in both PL spatial pattern and magnitude.
The model performance on the duration of dry spells is similarly encouraging. While all individual models and the MME simulate longer dry-spell durations than both TRMM-3B42 and GPCP (even after the models’ wet-day biases are greatly reduced; see the online supplemental material), the tL pattern correlation in almost all models is comparable to, although lower than, the pattern correlation between TRMM-3B42 and GPCP. Even though the magnitude of PL and tL (hence also extreme percentiles) differs from TRMM-3B42 in almost all models, the fact that the patterns are well correlated helps boost confidence in model projections of relative (i.e., percent) changes of daily precipitation and dry-spell duration extremes.
c. Spectral analysis
Following the method of Klingaman et al. (2017) implemented in Analyzing Scales of Precipitation (ASoP) version 1.0, we calculate the fractional contribution to the total mean rainfall from different intensities, at 3-h and daily time scales, sorted into 100 bins of varying width ranging from 0.005 to 2360 mm day⁻¹. This reveals the relative importance of precipitation events in a given intensity bin to the total precipitation. The calculation is performed at each grid box, using a horizontal resolution that is sufficiently coarse for at least some spatial averaging to be carried out for all of the models and the observations. To avoid removing important spatial detail, we limit this resolution to 2° × 2°, thereby requiring us to omit models whose resolution is similar to or coarser than this. Calculations are performed for the whole year (ANN) and for each season, over 25 years (1990–2014) of CMIP6 historical simulations.
To evaluate the models, we use a similarity index (Perkins et al. 2007) to compare the fractional histograms from each model with those obtained from 19 years of GPM-IMERG observations (2001–19) at each grid point between 60°S and 60°N. This measures the overlap between the model and observed histograms, with values closer to 1.0 indicating that the histograms match better and a value of 0.0 indicating they are entirely separated. Metrics are the spatial root-mean-square of these indices over selected regions. Any region could be chosen for metric evaluation; here we have used six regions: global (60°S–60°N), tropics (15°S–15°N), land-only (30°S–30°N), sea-only (30°S–30°N), Northern Hemisphere (NH) midlatitudes (30°–60°N), and Southern Hemisphere (SH) midlatitudes (30°–60°S).
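The histogram comparison can be sketched for a single grid point as below. This is a minimal illustration and not the ASoP implementation. Logarithmically spaced bin edges are an assumption made here (the text states only that the 100 bins vary in width), the overlap measure is written as the sum over bins of the smaller of the two fractional contributions, and all function names are hypothetical.

```python
import bisect
import math


def log_bin_edges(lo=0.005, hi=2360.0, nbins=100):
    """Assumed logarithmically spaced bin edges (mm per day)."""
    ratio = (hi / lo) ** (1.0 / nbins)
    return [lo * ratio ** i for i in range(nbins + 1)]


def fractional_contributions(rates, edges):
    """Fraction of total binned rainfall contributed by each intensity bin."""
    totals = [0.0] * (len(edges) - 1)
    for r in rates:
        if r < edges[0] or r >= edges[-1]:
            continue  # outside the binned range
        totals[bisect.bisect_right(edges, r) - 1] += r
    total = sum(totals)
    return [t / total for t in totals] if total > 0 else totals


def overlap_index(frac_model, frac_obs):
    """Histogram overlap: 1.0 for identical histograms, 0.0 if disjoint."""
    return sum(min(m, o) for m, o in zip(frac_model, frac_obs))


def regional_metric(indices):
    """Spatial root-mean-square of the per-grid-point indices in a region."""
    return math.sqrt(sum(i * i for i in indices) / len(indices))
```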
Figure 3a shows an example map of the indices from 3-h rainfall data from HadGEM3-GC31-MM versus GPM-IMERG. This suggests that performance is better over land than ocean, and over the midlatitudes than the tropics. Figure 3b shows the overall metric summary information for the 3-h time scale. This confirms that the pattern seen for HadGEM3-GC31-MM is similar for the other CMIP6 models and is consistent through the seasons. The stars indicate comparison of GPM-IMERG with other observation datasets, providing a measure of uncertainty. The metrics from the models nearly all lie outside this uncertainty range. Figure 3c provides additional information about the model–observation differences: the models are generally biased toward smaller rainfall accumulations, although there are a few models for which there is a greater than observed contribution from the largest rainfall accumulations. We find similar results for daily accumulations.
The metrics are a useful guide to the overall model performance, but the fact that the histograms are analyzed at each grid point, and that the calculations can be performed on any temporal or spatial scale, means there is much more information available from these diagnostics to users and model developers that could be used to understand model errors on a range of time scales (see, e.g., Martin et al. 2017). There is also the potential for subsampling of rainfall associated with organized systems or phenomena (such as tropical cyclones, fronts, MJO) prior to the histogram analysis, which could increase our understanding of these systems as well as providing information on model errors. The metrics could also be used to examine the influence of model resolution, ocean–atmosphere coupling, and the inclusion of Earth system processes on the spread of rainfall intensities.
d. Coherence analysis
The “Analyzing Scales of Precipitation” (ASoP) diagnostics (Klingaman et al. 2017) can measure, and compare, the spatial and temporal scales of precipitation across observations and GCMs. The “ASoP Coherence” package was designed to produce a single diagnostic or metric for a chosen region. Here, we extend the package to operate on gridded data. We measure spatial and temporal coherence in 3-hourly and daily precipitation in GPM-IMERG observations and CMIP6 historical simulations. We perform these calculations on a common 2° × 2° grid, a horizontal resolution that is sufficiently coarse for at least some spatial averaging to be carried out for all of the models and the observations while also avoiding removing important spatial detail. This requires us to omit models whose resolution is similar to or coarser than this. The calculations are performed between 60°S and 60°N, neglecting any point with annual-mean rainfall < 1 mm day−1 and, in the remaining points, any months in the dry season, defined as months that contribute, in the mean, less than 1/24 of the annual precipitation.
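The screening of dry points and dry-season months described above might be sketched as follows. This is one reading of the criteria, assuming a 12-month climatology of monthly-mean rates in mm day−1; the function name `wet_season_mask`, the fixed month lengths, and the use of climatological accumulations to define each month's contribution are assumptions rather than details of the ASoP Coherence package.

```python
import numpy as np

# Approximate month lengths in days (February averaged over leap years); an assumption.
DAYS_PER_MONTH = np.array([31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])


def wet_season_mask(monthly_rate, min_annual_mean=1.0, min_fraction=1.0 / 24.0):
    """Return a boolean (12, ...) mask of the months retained at each grid point.

    monthly_rate: climatological monthly-mean precipitation (mm/day), shape (12, ...).
    """
    rate = np.asarray(monthly_rate, dtype=float)
    days = DAYS_PER_MONTH.reshape((12,) + (1,) * (rate.ndim - 1))
    amount = rate * days                                  # monthly accumulation (mm)
    annual_total = amount.sum(axis=0)
    annual_mean = annual_total / DAYS_PER_MONTH.sum()     # mm/day
    # Fraction of the annual precipitation contributed by each month.
    fraction = np.divide(amount, annual_total,
                         out=np.zeros_like(amount), where=annual_total > 0)
    # Keep a month only if the point is wet enough and the month is not in the dry season.
    return (annual_mean >= min_annual_mean) & (fraction >= min_fraction)
```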
Figures 4a–c use 3-hourly data to show the temporal scale, defined as the first lag at which the temporal autocorrelation is <0.2, for the CMIP6 historical multimodel mean (Fig. 4a), GPM-IMERG (Fig. 4b), and the multimodel mean bias (Fig. 4c). Throughout much of the tropical and subtropical oceanic regions, the CMIP6 multimodel mean precipitation is too persistent, highlighting an area for model improvement. Figures 4d–f use daily-mean data to show the spatial scale, which is computed from the temporal correlation of the precipitation between each grid point and its surrounding grid points, using intervals of radii given in the color bar beneath the panel. The scale is defined as the first search radius at which the spatial correlation is <0.2. Daily precipitation spatial scales are larger in the CMIP6 multimodel mean (Fig. 4d) than in GPM (Fig. 4e), particularly in the eastern equatorial Pacific and Atlantic Oceans, and in near-equatorial regions of the Indian Ocean, as well as much of the subtropical oceans (Fig. 4f). Combined with the temporal scale results above, this suggests that CMIP6 models produce precipitation features that are too large and that last too long, particularly in the tropical oceans.
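The temporal scale defined above (the first lag at which the autocorrelation falls below 0.2) can be illustrated with a short sketch for a single grid point. The function name `first_lag_below` and the choice of autocorrelation estimator (lagged products normalized by the lag-0 sum of squares) are assumptions; the ASoP code may estimate the autocorrelation differently.

```python
import numpy as np


def first_lag_below(series, threshold=0.2, max_lag=None):
    """First lag (in time steps) at which the autocorrelation drops below threshold.

    Returns None if the series is constant or the threshold is never crossed.
    """
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x)
    if var == 0.0:
        return None
    max_lag = len(x) - 1 if max_lag is None else min(max_lag, len(x) - 1)
    for lag in range(1, max_lag + 1):
        # Lagged covariance normalized by the lag-0 value (an assumed estimator).
        r = np.dot(x[:-lag], x[lag:]) / var
        if r < threshold:
            return lag
    return None
```

For 3-hourly data, multiplying the returned lag by 3 h gives the temporal scale in hours. The spatial scale could be computed analogously by replacing lags with search radii and the autocorrelation with the correlation between a grid point and its neighbors.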
Klingaman et al. (2017) also define spatial and temporal coherence metrics. The spatial metric is derived from the likelihood of coincidence of upper-quartile and lower-quartile precipitation at neighboring grid points; the temporal metric is derived from the likelihood of consecutive time steps of the upper quartile and lower quartile at the same grid point. Quartiles are computed for each grid point and each month of the seasonal cycle. For the temporal coherence metric, we show the aggregated grid point metrics (computed 60°S–60°N) as Taylor diagrams for global land (Fig. 4g), ocean (Fig. 4h), and all points (Fig. 4i). The CMIP6 models show higher centered RMS difference and lower correlations, against GPM-IMERG, for land points than for ocean points, indicating that persistence of land precipitation is another area for model improvement. The spatial standard deviation values of the coherence metrics shown in the Taylor diagrams can provide further insights for model improvements: models that have a smaller standard deviation than GPM-IMERG are typically too persistent across all grid points, as the mean bias is positive for nearly all models (not shown), while models that have a greater standard deviation are typically too persistent in some regions and too intermittent in others. These standard deviations show stronger negative biases over land than over ocean, indicating that models show little spatial variability in temporal coherence over land and hence cannot distinguish regions dominated by longer-lived rain-bearing systems from regions dominated by shorter-lived systems.
Next, we demonstrate the ability to compare the spatial scale of precipitation (now restricted to the tropical Indian Ocean: 10°S–10°N, 60°–90°E; using daily data; determined as correlations at a distance of 800 km) with two metrics of the MJO in CMIP6 models, two satellite observation datasets, and ERA5 (Fig. 4j). The satellite observations and ERA5 have an average precipitation spatial coherence of 0.06–0.09, and the CMIP6 models cover the range −0.03 to +0.26. CMIP6 models have a relatively close relationship between precipitation spatial coherence and the MJO Maritime Continent propagation metric (Ahn et al. 2020; R² = 0.489). This suggests that those climate models with a higher spatial coherence of daily precipitation propagate the MJO more robustly east over the Maritime Continent. The relationship is weaker (R² = 0.114) between precipitation spatial coherence and the MJO east/west power ratio, which measures MJO spatiotemporal structure (e.g., Sperber and Kim 2012; Ahn et al. 2017). There is no relationship between precipitation temporal scale and either MJO metric. Comparisons between spatiotemporal characteristics metrics and process- or phenomena-based metrics may lead to greater insight into the origins of biases and model errors.
4. Process-oriented metrics
Although metrics of spatiotemporal characteristics are suggestive of the processes contributing to precipitation biases at different spatial and temporal scales, they do not by themselves represent processes related to precipitation. Process-oriented metrics reveal relationships between precipitation and its thermodynamic environment; they indicate how well models reproduce the observed relationships and how large-scale biases in the atmospheric environment may contribute to precipitation biases. Here, we discuss two metrics highlighting the coupling of precipitation with the thermodynamic environment.
a. Rainfall–moisture coupling
Latent heating from tropical rainfall formation forces large-scale circulation anomalies that affect weather patterns globally through the tropical–extratropical teleconnection response (Stan et al. 2017). The onset of tropical heavy rainfall is critically dependent upon the relative saturation of the atmosphere (Bretherton et al. 2004; Neelin et al. 2009), while the teleconnection response is sensitive to the spatial and temporal scale of the heating anomaly (Yadav and Straus 2017; Wang et al. 2020). The MJO is a prominent example of a large-scale tropical disturbance that is strongly governed by column moisture (Adames and Kim 2016) and is also a major driver of tropical–extratropical teleconnections (e.g., Henderson et al. 2017). With this section, we aim to understand how tropical rainfall and moisture are coupled and how this coupling affects MJO simulation in CMIP6 models.
Following Wolding et al. (2020), daily tendencies of precipitation (P) and column saturation fraction (CSF; i.e., vertically integrated column water vapor divided by vertically integrated saturation column water vapor) over the Indo-Pacific warm pool are averaged within conditionally sampled CSF and P bins. All data are first remapped onto a common 2.5° × 2.5° grid. In Figs. 5a–c, joint CSF and P (CSF–P for short) tendencies are shown with vectors, which indicate if CSF–P departures above or below the mean CSF–P line lead to column moistening or drying. In observations and in most CMIP6 models, the vectors rotate clockwise about the mode (red circles in Figs. 5a–f) that corresponds to the quasi-equilibrium state (Neelin et al. 2008; Wolding et al. 2020). This clockwise rotation indicates that anomalously high precipitation for a given CSF is associated with column moistening, while anomalously low precipitation is associated with column drying. The strength of this rotation in each CSF–P bin can be diagnosed using a vorticity-like metric based on nondimensionalized CSF and P tendencies where positive values denote clockwise rotation (Fig. 5b). A scalar rotation metric R is then computed as the frequency-weighted rotation in CSF–P space.
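To make the vorticity-like diagnostic concrete, the sketch below computes a frequency-weighted rotation from binned tendencies. It is a schematic under stated assumptions and not the formulation of Wolding et al. (2020): the function name `rotation_metric` is hypothetical, the tendencies and bin coordinates are assumed to be nondimensionalized beforehand (the scaling is not specified here), and the rotation in each bin is taken as minus the finite-difference curl so that positive values denote clockwise rotation.

```python
import numpy as np


def rotation_metric(dcsf_dt, dp_dt, counts, csf_centers, p_centers):
    """Frequency-weighted, clockwise-positive rotation in CSF-P space.

    All 2D arrays have shape (n_p, n_csf): rows are P bins, columns are CSF bins.
    Tendencies and bin centers are assumed to be nondimensional already.
    """
    u = np.asarray(dcsf_dt, dtype=float)   # motion along the CSF axis
    v = np.asarray(dp_dt, dtype=float)     # motion along the P axis
    dv_dcsf = np.gradient(v, csf_centers, axis=1)
    du_dp = np.gradient(u, p_centers, axis=0)
    # The standard curl is positive for counterclockwise flow, so flip the sign.
    clockwise = -(dv_dcsf - du_dp)
    w = np.asarray(counts, dtype=float)
    valid = np.isfinite(clockwise) & (w > 0)
    return float((w[valid] * clockwise[valid]).sum() / w[valid].sum())
```

With this sign convention, a field in which above-average P coincides with moistening and above-average CSF coincides with decreasing P yields R > 0.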
For models with R > 0, positive moistening and rainfall tendencies are largest during the dry-to-moist transition when P is much greater than its mean value for a given CSF (solid red line in Figs. 5a–c). Analysis of radar data collected over the tropical Indian Ocean indicates that this state is associated with a transition from trade wind cumulus to cumulus congestus (Wolding et al. 2020). Negative moistening and rainfall tendencies are largest when CSF is greater than its average value for a given P (red dashed line in Figs. 5a–c), a state associated with widespread stratiform rainfall with embedded convection. For models with R < 0, higher-than-average rainfall at intermediate CSF is associated with strong drying; positive P tendencies are only observed at high CSF. Rainfall–moisture coupling in R < 0 models suggests that exaggerated depletion of column water vapor by rainfall leads to excessive drying at intermediate CSF, thus reducing the likelihood of subsequent heavy precipitation. Heavy precipitation in these models is only observed at high CSF, where the environment cannot be rapidly dried by rainfall.
Correlations between the R metric and several MJO propagation “pattern correlation” metrics for a subset of CMIP6 models suggest that tropical rainfall–moisture coupling plays an important role in regulating MJO periodicity. Various MJO pattern correlation metrics have been used to assess MJO propagation in models by correlating simulated and observed rainfall lagged regressions over the warm pool. Jiang et al. (2015) computed pattern correlations of regression coefficients using the composite propagation plotted in Fig. 5h (i.e., the “full” metric). Wang et al. (2018) and DeMott et al. (2019) reduced the influence of MJO period on the pattern correlation by masking coefficients within ±15° longitude of the rainfall basepoint (the “masked” metric), while Ahn et al. (2020) completely removed periodicity effects by only considering positive coefficients in a small portion of the domain east of the Maritime Continent (the “MC-crossing” metric). Correlations between the R metric and the full, masked, and MC-crossing propagation metrics are 0.47, 0.23, and 0.11, respectively. The correlation is only statistically significant for the full pattern correlation metric, which measures the combined effects of MJO propagation and period.
b. Temperature–water vapor environment
The aim of this module is to create metrics that capture the typical range of moisture and temperature over which precipitation is produced by condensing information from prior diagnostics (which also provides information on sensitivity to sampling and resolution; Kuo et al. 2018, 2020). Here we use a thermodynamic space in which temperature is measured by the vertically integrated saturation humidity, qsat, and moisture is measured by column relative humidity, CRH = CWV/qsat, where CWV is column water vapor, for each qsat. Figure 6a shows, for qsat = 65.5 mm over tropical oceans, the conditional mean precipitation rate (circles) and precipitation contribution (lines) from observations and one model instance. For observations, we use precipitation from the TRMM PR, column water vapor from the TRMM Microwave Imager (TMI), and ERA5 temperature for computing qsat (for an alternative combination of observations, we use MERRA-2 temperature in Figs. 6c,d). The PR is coarse-grained to 0.25° × 0.25°, compatible with the CWV resolution; results are insensitive to resolution up to 1.5° (Kuo et al. 2018). The observed precipitation rate sharply picks up as CRH increases above a certain threshold. The precipitation contribution peaks near this value because the system spends less time at the high precipitation values and the many occurrences of low CRH contribute little to precipitation. The MIROC-ES2L model exhibits qualitatively similar behavior, although the precipitation pickup is too weak and begins at lower CRH than observed, as seen more clearly in the peak of the precipitation contribution. To characterize the moisture range over which precipitation is produced, we identify the CRH values associated with the 25th and 75th percentiles of precipitation contribution for each qsat. These CRH values for qsat (tropospheric temperature environment) between the 25th and 75th percentiles of qsat (blue lines) are shown in Fig. 6b, together with the precipitation contribution as a function of CRH and qsat (color contours). A notable feature is that the CRH values associated with the 25th and 75th percentiles as well as the peak of precipitation contribution decrease as qsat increases (i.e., precipitation is produced at lower CRH in a warmer environment).
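The CRH values bounding the middle half of the precipitation contribution can be found from a cumulative sum over CRH bins. The following sketch handles a single qsat bin and assumes that the precipitation contribution of each CRH bin is the conditional mean precipitation multiplied by the number of occurrences; the function name `contribution_percentiles` and the linear interpolation between bin edges are assumptions, not details taken from Kuo et al. (2018, 2020).

```python
import numpy as np


def contribution_percentiles(crh_edges, cond_mean_precip, counts, percentiles=(25.0, 75.0)):
    """CRH values at which the cumulative precipitation contribution reaches each percentile.

    crh_edges has one more element than cond_mean_precip and counts (bin edges vs. bins).
    """
    edges = np.asarray(crh_edges, dtype=float)
    # Contribution of each CRH bin: conditional mean rate times occurrence count.
    contrib = np.asarray(cond_mean_precip, dtype=float) * np.asarray(counts, dtype=float)
    # Cumulative fraction of total precipitation, evaluated at the bin edges.
    cdf = np.concatenate(([0.0], np.cumsum(contrib))) / contrib.sum()
    return np.interp(np.asarray(percentiles, dtype=float) / 100.0, cdf, edges)
```

Repeating the calculation for each qsat bin between the 25th and 75th percentiles of qsat would trace out the edges of a trapezoid like the one shown in Fig. 6b.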
The values associated with these percentiles provide a good summary of the observed thermodynamic range associated with precipitation, shown by the blue trapezoid in Fig. 6b. We choose a visual reference range (gray box) and repeat it in Figs. 6c and 6d. Figure 6c presents typical thermodynamic ranges associated with precipitation from a subset of CMIP6 historical simulations and two observational combinations. Deviations of the trapezoids from the observed along the qsat axis indicate cold/warm biases in the simulation, and deviations along the CRH axis indicate that models tend to produce precipitation outside the observed CRH range. Figure 6d exhibits the thermodynamic ranges as in Fig. 6c, but for the 17 available CMIP6 models, ranked by the precipitation contribution error defined as the L2 difference between the observed and model-simulated precipitation contribution (i.e., the mean square of the dotted area in Fig. 6a), averaged over the four most probable qsat bins. This scalar metric focuses on relative humidity rather than temperature bias. It is encouraging to see that some of the models can produce most of their precipitation in a thermodynamic environment close to the observed range, as judged by both the scalar metric and the trapezoid location. Other models fare poorly by these measures. Most models capture the decrease in the CRH for the 75th-percentile precipitation contribution with increasing temperature, but only about half capture this feature for the 25th percentile.
5. Phenomena-based metrics
Phenomena-based metrics emphasize weather features such as synoptic systems and different types of storms that generate precipitation. While synoptic systems such as fronts may be broadly resolved by GCMs at typical 1° resolution, storms such as tropical cyclones, LPS, and MCSs require higher-resolution modeling. Models’ ability to simulate these storms is critical as they are key contributors to extreme precipitation in many regions. Feature tracking (briefly summarized in section 2b) is used to identify and track the weather features, allowing precipitation associated with these features to be isolated and evaluated using different metrics that measure model–observation differences. Here, four examples of weather features and associated precipitation are discussed.
a. Low pressure systems
A wide variety of synoptic-scale disturbances that consist of balanced flow around a pressure minimum produce precipitation in Earth’s tropics and extratropics. Classic examples are midlatitude baroclinic waves, which often produce intense precipitation through semigeostrophic uplift in their frontal zones, and tropical cyclones, which produce precipitation through the radial, frictionally balanced component of their circulation. Understanding the mechanisms by which such systems amplify and generate precipitation requires tracking the systems from initial genesis; this can be a difficult task, requiring data of sufficiently fine resolution and algorithms of adequate robustness to unambiguously represent the weak and sometimes horizontally small low pressure center. Here we illustrate how a strategic choice of variables allows for improved tracking of low pressure systems (LPS) in the South Asian monsoon, which produce a large fraction of that region’s annual mean rainfall as well as many extreme precipitation events. This tracking exercise allows the relationship of circulation with precipitation to be characterized in observations and model ensembles.
Tropical LPS are most commonly identified and tracked using lower-tropospheric vorticity or sea level pressure. Even for strong tropical cyclones, ambiguities in the criteria used in the tracking algorithm can lead to large uncertainties in the number of storms identified in observationally constrained gridded data (e.g., Murakami 2014). This issue is even more problematic for weak LPS, where the noisiness of the vorticity field produces irregular, broken tracks for systems that seem to move smoothly when tracked subjectively using a standard suite of meteorological data (Fig. 7a). Sea level pressure, which is less noisy, is sometimes used to track LPS instead but is ill suited for South Asian LPS, which typically have winds that peak around 3 km above the surface; geopotential height near the level of maximum wind also does not capture the full rotational flow given the low latitude and high Rossby number of these storms. Physical reasoning, as well as systematic assessment of multiple candidate variables with hundreds of combinations of quantitative tracking criteria, showed that the streamfunction of the horizontal 850-hPa wind is an optimal variable to use for tracking these LPS (Fig. 7b; Vishnu et al. 2020). This streamfunction represents the full nondivergent wind, even when geostrophic balance does not hold, yet retains the smoothness of the geopotential or sea level pressure fields; it was inverted using a method to avoid contamination by any wind data extrapolated below Earth’s surface (Vishnu et al. 2020).
Precipitation in South Asian monsoon LPS is known to fall southwest of the storm center, where the interaction of the storm’s rotational flow with the background vertical shear produces quasigeostrophic uplift (Rao and Rajamani 1970; Sanders 1984). This placement of peak precipitation is well captured when compositing TRMM precipitation relative to ERA5 LPS tracks (Fig. 7c). ERA5 also accurately represents the well-known distribution of track density, with storm frequency peaking strongly over the northwest Bay of Bengal (Fig. 7b). Recent work has shown that LPS frequency likely peaks in that small region because the large-scale, low-level monsoon winds are barotropically unstable there (Diaz and Boos 2019) and vapor pressures are large with strong horizontal gradients (Ditchek et al. 2016; Adames and Ming 2018). Wind-enhanced evaporation from the Bay of Bengal may also enhance LPS intensity there (Murthy and Boos 2020; Fujinami et al. 2020; Diaz and Boos 2021).
By tracking LPS in ensembles of GCMs, we can create composites that allow model precipitation bias to be assessed in a phenomenon-based system rather than in a space- or time-based system that averages many types of atmospheric disturbances. One high-resolution GCM (E3SM integrated at 0.25° resolution) represents the track density of South Asian monsoon LPS well, in addition to the spatial structure of precipitation relative to the vortex center (Figs. 7b,c). This is notable given the poor ability of some coarse-resolution GCMs to simulate these LPS (Praveen et al. 2015). However, the E3SM model simulates monsoon LPS rainfall that is too weak, with the peak storm-centered composite precipitation being about half that observed (Fig. 7c). Other models exhibit a variety of biases in their representation of monsoon LPS precipitation with differing sensitivities to model resolution. Storm-centered composites in the CNRM models have overly strong precipitation with little sensitivity to model resolution, while the MRI models produce roughly the right amount of precipitation over the entire storm but with a spatial pattern that, unexpectedly, degrades at finer model resolution (Fig. 7d). These biases are large for some models, exceeding 50% of the system-averaged TRMM rain rate of 15 mm day−1; interannual variations in LPS activity and storm-centric rain rates are substantially more modest (e.g., Sikka 1980; Krishnamurthy and Ajayamohan 2010; Vishnu et al. 2020).
Such assessment of model skill in representing the synoptic systems that produce extreme rainfall, such as monsoon LPS, is an important step in producing reliable projections of future extreme rainfall. The LPS dataset used here, which is available for five modern reanalysis products, provides LPS tracks throughout the global tropics that can be used to better understand a variety of synoptic-scale phenomena, including the weak progenitors of tropical cyclones.
b. Mesoscale convective systems
Mesoscale convective systems (MCSs) are ubiquitous over the tropics year-round and in the midlatitudes during the warm season. Besides contributing to over 50% of the annual precipitation in most regions of the tropics and selected regions in the midlatitudes (Nesbitt et al. 2006; Feng et al. 2021b), MCSs are also key contributors to extreme precipitation, partly because of their larger size and longer lifetime compared to individual convective storms (Stevenson and Schumacher 2014). Because of the distinctive nocturnal timing of MCSs, erroneous diurnal timing of summer precipitation produced by models has been used to infer their failure in simulating MCSs. Recent efforts in developing algorithms to identify and track MCSs in observations (Feng et al. 2018) and model simulations (Feng et al. 2021a) have provided unprecedented opportunities to directly evaluate MCSs and their characteristics in weather and climate models using MCS-specific metrics.
Using FLEXTRKR, an algorithm developed to track MCSs using both infrared brightness temperature (Tb) and precipitation features (PFs) (Feng et al. 2018, 2019), a global (60°S–60°N) MCS tracking database has been developed at ∼10 km and hourly resolution (Feng et al. 2021b). Combining the track locations and precipitation, this database can be used to derive information on the MCS number, MCS precipitation and its fractional contribution to the total precipitation, MCS maximum precipitation rate, MCS lifetime, and MCS translation speed and direction. As MCSs are not well defined at coarser spatial resolution, we develop MCS metrics mainly for use in evaluating high-resolution weather and climate simulations with grid spacing < 50 km. Instead of coarse graining the observations and model outputs, which correspond to a range of grid spacing, to a common resolution, we use specific PF criteria derived for a given resolution for MCS tracking to facilitate comparison across datasets of different resolutions (Feng et al. 2021a).
Figures 8a and 8b compare the MCS number tracked using two algorithms, a more commonly used method that tracks MCSs using Tb only versus FLEXTRKR that tracks MCSs using both Tb and PF. These two methods produce similar observed total MCS number and spatial distribution in the tropics, but larger differences are noticeable in the midlatitudes. Including PF in MCS tracking noticeably reduces the number of MCSs in the midlatitudes by disqualifying large cold cloud systems (e.g., synoptically forced) with small area and/or low rainfall intensity PF as MCSs. Using only IR Tb, the model (E3SM) simulates too many MCSs (blue contours) except in a few locations. In contrast, using both IR Tb and PF, E3SM simulates too few MCSs (magenta contours) except in a few locations. These results show that large cold cloud systems are produced by the model too frequently but many of them fail to meet the PF thresholds. This is supported by the composited MCS rain rates shown in Fig. 8c for northeast moving MCSs in the central United States during spring (MAM) and summer (JJA). The simulated and observed rain rate composites have similar size, but the model produces much lower peak rain rates. A higher fraction (65%) of MCSs in the model have a northeast propagation than observed (44%).
Figure 8d summarizes the MCS precipitation metrics for four models in HighResMIP. The pattern correlation, root-mean-square error (RMSE), and bias are calculated based on comparison of the observed and simulated composited MCS rain rates over the central United States. Since hourly Tb is not available from the HighResMIP models except E3SM, MCSs are tracked using an algorithm that depends only on PF, trained using MCSs tracked using both Tb and PF (Feng et al. 2016). Note that E3SM is a free-running fully coupled simulation with constant 1950 forcing while other simulations are atmosphere-only simulations driven by observed sea surface temperature and sea ice distribution. The models exhibit a range of biases from large negative (E3SM) to large positive (NICAM), and the skills are generally lower during summer than spring. The seasonal difference is particularly large for NICAM. Unlike the other models that parameterized deep convection, no deep convection scheme was used in NICAM at 56-km grid spacing. Last, it is worth noting that metrics based on composited MCS precipitation can only reveal differences in PF qualified as MCS. All models evaluated here display a significant dry bias in the summer, consistent with the ubiquitous warm, dry bias noted in CMIP5 (Lin et al. 2017), as the models simulate much lower numbers of MCSs compared to observations. Therefore, we emphasize the importance of using multiple metrics for comprehensive evaluation of precipitation in models.
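The three summary statistics applied to the composited rain rates can be written compactly. This is a minimal sketch, assuming the simulated and observed composites are on the same storm-relative grid and that the pattern correlation is the centered (mean-removed) Pearson correlation; the function name `composite_metrics` is hypothetical.

```python
import numpy as np


def composite_metrics(model_composite, obs_composite):
    """Pattern correlation, RMSE, and mean bias between two composites on the same grid."""
    m = np.asarray(model_composite, dtype=float).ravel()
    o = np.asarray(obs_composite, dtype=float).ravel()
    diff = m - o
    return {
        # Centered pattern correlation (an assumed convention).
        "pattern_correlation": float(np.corrcoef(m, o)[0, 1]),
        "rmse": float(np.sqrt(np.mean(diff**2))),
        "bias": float(np.mean(diff)),
    }
```

A model composite with the right shape but a uniform offset scores a pattern correlation of 1 with a nonzero bias and RMSE, which is why the three statistics are reported together.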
c. Frontal precipitation
Fronts have been identified using the method described in section 2, applied to ERA-Interim and five CMIP6 models, giving gridded front objects on a 2.5° grid. The fronts are linked to daily precipitation, using GPCP 1DD as an observational precipitation estimate. The precipitation data are regridded to the same resolution as the fronts in order to make the linking simpler. We consider precipitation only if it is above a threshold of 1 mm, which is the minimum 24-h precipitation a gauge can measure, and this eliminates some of the “drizzle problem” that models tend to have (Stephens et al. 2010). The precipitation is associated with a front if it lies within the front area of influence (which is equivalent to being in the same grid box or the surrounding eight grid boxes) during any of the four 6-hourly reanalysis times in the 24-h precipitation period. From this association of fronts and precipitation, we can produce the diagnostics of frontal (and nonfrontal) precipitation frequency (Ff, Fnf), frontal (and nonfrontal) precipitation intensity (If, Inf), frontal amplification factor (Af = If/Inf), and fraction of total precipitation from fronts (Pf) [see Catto and Pfahl (2013) and Catto et al. (2015) for full details]. Comparing the model diagnostics to the observational estimates from ERA-Interim and GPCP, we can produce a number of metrics, including the correlation, RMSE, and bias of these values.
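For a single grid box, the diagnostics listed above could be computed as in the sketch below. It assumes a daily precipitation series and a matching boolean flag that is true when a front lies within the area of influence on that day; the function name `frontal_diagnostics`, the definition of frequency as the fraction of all days, and the definition of intensity as the mean over wet days are assumptions (Catto and Pfahl 2013 and Catto et al. 2015 give the actual definitions).

```python
import numpy as np


def frontal_diagnostics(precip, near_front, wet_threshold=1.0):
    """Frontal and nonfrontal precipitation diagnostics for one grid box.

    precip: daily precipitation (mm); near_front: True where a front is within
    the area of influence on that day. Days at or below wet_threshold are ignored.
    """
    precip = np.asarray(precip, dtype=float)
    near_front = np.asarray(near_front, dtype=bool)
    wet = precip > wet_threshold
    frontal = wet & near_front
    nonfrontal = wet & ~near_front
    n_days = precip.size
    total_f = precip[frontal].sum()
    total_nf = precip[nonfrontal].sum()
    # If a category has no wet days, its intensity is undefined (NaN with a warning).
    diag = {
        "Ff": frontal.sum() / n_days,       # frequency of frontal precipitation
        "Fnf": nonfrontal.sum() / n_days,   # frequency of nonfrontal precipitation
        "If": total_f / frontal.sum(),      # mean frontal intensity (mm/day)
        "Inf": total_nf / nonfrontal.sum(), # mean nonfrontal intensity (mm/day)
        "Pf": total_f / (total_f + total_nf),
    }
    diag["Af"] = diag["If"] / diag["Inf"]   # frontal amplification factor
    return diag
```

Under these definitions the total precipitation equals the number of days times (Ff × If + Fnf × Inf), which is the kind of identity that allows a total bias to be split into frequency and intensity contributions.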
Maps (Fig. 9a) of the error decomposition for term 1 (contribution from frequency of frontal precipitation) show that there are large regions of positive bias contribution. Errors are largely confined to the regions of maximum storm track activity and in the NH the largest positive bias contributions can be seen over the Kuroshio, over western Europe and parts of the North Atlantic, and at the end of the Pacific storm track into North America. In the SH the largest positive contributions are in a band between 30° and 40°S, particularly around the south coast of Australia. Term 2 errors (contribution from intensity of frontal precipitation) are generally largest in the same regions and indicate negative contributions to the total bias, with this being particularly notable over the North Atlantic region. The maps indicate a compensation of biases between terms 1 and 2, which is confirmed for each of the models in Fig. 9b and is consistent with the CMIP5 models (Catto et al. 2015). In the midlatitudes the contribution to the total precipitation error from the nonfrontal precipitation terms is small (Fig. 9b), as expected due to the high frequency of fronts.
The models all overestimate Af due to larger negative biases in the nonfrontal precipitation intensity than the negative biases in frontal precipitation intensity (not shown). These biases are large compared to the GPCP Af of 1.28 in NH DJF and 1.35 in SH JJA and are strongly correlated with the model biases in the intensity of the frontal precipitation (not shown). The spatial correlation is between 0.4 and 0.6 in the NH and between 0.3 and 0.4 in the SH, indicating a better representation in the NH.
The proportion of total precipitation associated with fronts in the winter seasons is 0.50 in the NH and 0.54 in the SH for GPCP and ERA-Interim. The biases in this quantity range between 0.02 and 0.27 (Fig. 9d), with most models showing a better representation in the SH. The models that perform better for the proportion do not necessarily show better performance in the Af metric, indicating the utility of looking at more than one metric.
Analyzing the ranks of the models using the various calculated metrics, we can see that some models that perform well in metrics that quantify magnitude differences (e.g., the decomposition terms and biases) also perform poorly in their spatial correlation (e.g., IPSL-CM6A-LR). Again, this points to the importance of considering a number of different metrics to investigate the model performance.
d. Atmospheric rivers
Atmospheric rivers (ARs) are long narrow bands of poleward vapor transport often associated with the warm sector in advance of midlatitude cyclone cold fronts (Ralph et al. 2018). They account for a large fraction of wet-season precipitation in a number of regions (Dettinger 2011; Rutz et al. 2014; Guan and Waliser 2015), and they account for a majority of the poleward moisture transport (Gimeno et al. 2014). Previous studies examining ARs in climate model simulations have assessed the ability of models to adequately simulate relevant characteristics of ARs, including global and landfalling frequency, intensity, precipitation, duration, life cycle, and so on (Dettinger 2011; Payne and Magnusdottir 2015; Shields and Kiehl 2016; Goldenson et al. 2018). In this module, we present two metrics aimed at answering the following questions: 1) Do models simulate AR-related precipitation in the correct locations? 2) Do models simulate enough contrast between regions with high AR precipitation and low AR precipitation? 3) Does the diversity of AR detection and tracking algorithms (ARDTs) affect the above conclusions?
We utilize output from six global ARDTs that participated in the ARTMIP Tier 1 experiment and Tier 2 CMIP5/-6 experiment (see section 2b); these ARDTs identified ARs in MERRA-2 and in historical simulations from nine members of the CMIP5 and CMIP6 multimodel ensembles. We quantitatively define “AR-precipitation” for each ARDT as precipitation occurring when AR conditions are identified by a given ARDT. We calculate AR-precipitation for MERRA-2 (using the precipitation field from MERRA-2) and for the CMIP5 and CMIP6 simulations. We calculate 30-yr averages of these quantities and regrid all to a common 2° × 2° grid to facilitate direct comparison of the fields between the simulations and the reanalysis. Additionally, we calculate AR-precipitation for ERA-20C (1900–2010) to provide a combined estimate of observational uncertainty and natural variability (since we use a different time period than with MERRA-2). Figures 10a and 10b show the bias in AR-precipitation for two CMIP6 models, with one model’s bias field indicating some regional biases in AR-precipitation (Fig. 10a) and another model’s bias field indicating systematically too little AR-precipitation (Fig. 10b).
The spatial correlation coefficient of AR-precipitation between each model simulation and MERRA-2 is used to answer question 1 above, and the ratio of the spatial standard deviation of AR-precipitation between each model and MERRA-2 is used to assess question 2. These quantities are calculated for all available model–ARDT pairs in order to assess question 3. Figure 10c shows a Taylor diagram constructed by plotting the spatial correlations on the azimuthal axis and the ratio of the standard deviations on the radial axis.
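The two Taylor-diagram coordinates, together with one commonly used scalar skill score, can be sketched as follows. The function names and the optional area weights are hypothetical, and the skill formula shown is one of the scores proposed by Taylor (2001); the text does not state which definition of Taylor skill is used, so this choice is an assumption.

```python
import numpy as np


def taylor_statistics(model, reference, weights=None):
    """Spatial correlation and ratio of spatial standard deviations (model / reference)."""
    m = np.asarray(model, dtype=float).ravel()
    r = np.asarray(reference, dtype=float).ravel()
    # Optional weights (e.g., grid-cell area) are an assumption; default is uniform.
    w = np.ones_like(m) if weights is None else np.asarray(weights, dtype=float).ravel()
    w = w / w.sum()
    m_anom = m - (w * m).sum()
    r_anom = r - (w * r).sum()
    std_m = np.sqrt((w * m_anom**2).sum())
    std_r = np.sqrt((w * r_anom**2).sum())
    corr = (w * m_anom * r_anom).sum() / (std_m * std_r)
    return float(corr), float(std_m / std_r)


def taylor_skill(corr, std_ratio, r0=1.0):
    """Skill score S = 4(1 + R) / [(s + 1/s)^2 (1 + R0)], with R0 the attainable correlation."""
    return 4.0 * (1.0 + corr) / ((std_ratio + 1.0 / std_ratio) ** 2 * (1.0 + r0))
```

A field that is perfectly correlated with the reference but has half its spatial variability receives a skill of 0.64 under this score, illustrating how too little spatial contrast lowers skill even when precipitation falls in the correct regions.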
Models appear to generally produce AR-precipitation in the correct regions, but they do not have enough spatial variability in AR-precipitation. The models have relatively high spatial correlation coefficients (regardless of which ARDT is used), with most models having coefficients between 0.8 and 0.95. It is notable, however, that the value of the spatial correlation coefficient can depend strongly on which ARDT is used. Consider results from the CMIP5 CCSM4 simulation (navy blue markers), which range from about 0.7 when evaluated with the GuanWaliser v2 ARDT to over 0.9 with the ARCONNECT v2, Lora v2, and TECA-BARD v1.0 ARDTs. In contrast to the high spatial correlations, all models have less spatial variability than MERRA-2, and models exhibit a wide range of skill in this metric.
Across the ARDTs used, some models form distinct clusters in the Taylor diagram, with the CMIP6 MRI-ESM2-0 and CMIP5 CCSM4 simulations having systematically low Taylor skill values and the CMIP5 CanESM2 simulation having systematically high Taylor skill values. These distinct clusters indicate consensus among the ARDTs about the model skill. In contrast, some models span the Taylor diagram; for example, the skill of the CMIP5 IPSL-CM5A-LR simulation depends strongly on which ARDT is used, with TECA-BARD v1.0 giving a Taylor skill score of approximately 0.87 and ARCONNECT v2 giving a skill score of only about 0.32. Comparing between generations, the CMIP6 IPSL-CM6A-LR simulation has systematically higher Taylor skill scores than either of the CMIP5 IPSL simulations. Further, the CMIP6 IPSL-CM6A-LR simulation forms a distinct cluster in the Taylor diagram, suggesting a consensus among ARDTs that the CMIP6 version of the IPSL model is superior to the CMIP5 versions.
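For readers unfamiliar with how a correlation and a standard deviation ratio are combined into a single score, the sketch below uses one commonly used form of the Taylor skill score (Taylor 2001). Which form is used for the values quoted here is not stated in the text, so this choice, the function name, and the default maximum attainable correlation of 1 are all assumptions.

```python
def taylor_skill_score(corr, std_ratio, max_corr=1.0):
    """One common form of the Taylor (2001) skill score.

    corr      : spatial correlation between model and reference
    std_ratio : model spatial std divided by reference spatial std
    max_corr  : maximum attainable correlation (assumed to be 1 here)

    The score is 1 for a perfect match and decreases as the correlation
    drops or as the std ratio departs from 1 in either direction.
    """
    if std_ratio <= 0:
        raise ValueError("std_ratio must be positive")
    return 4.0 * (1.0 + corr) / ((std_ratio + 1.0 / std_ratio) ** 2 * (1.0 + max_corr))

perfect = taylor_skill_score(1.0, 1.0)
# High correlation but only half the observed spatial variability.
weak_contrast = taylor_skill_score(0.9, 0.5)
```

Under this form, a simulation with a high spatial correlation can still receive a modest score when its spatial variability is too low, which is the situation described above for several models.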
The ARDTs exhibit distinctive differences in model evaluation. Metrics calculated with the TECA-BARD v1.0 ARDT (star markers in Fig. 10c) are systematically higher than those from any other ARDT, and most models evaluated by TECA-BARD v1.0 appear skillful at simulating AR-precipitation. The notable exceptions are the CMIP6 MRI-ESM2-0 and CMIP5 CCSM4 simulations, which, as noted previously, have low metric scores no matter which ARDT is used; this is due to a systematic low bias in AR-precipitation in these simulations (e.g., Fig. 10b). Other ARDTs, such as ARCONNECT v2, have a wide spread in the AR-precipitation metrics.
These differences among ARDTs are partly related to their designs. ARCONNECT v2 utilizes an absolute threshold in IVT when identifying ARs, which would make the ARDT much more sensitive to biases in model humidity and/or winds. If a simulation has a systematic low bias in IVT, for example, then the ARCONNECT v2 ARDT will detect systematically fewer ARs in that simulation. Other ARDTs, such as Lora v2 and TECA-BARD v1.0, utilize relative IVT thresholds, which may be less sensitive to model bias.
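The sensitivity argument above can be illustrated with a toy experiment. The sketch below is not an implementation of any of the ARDTs named here: the 250 kg m⁻¹ s⁻¹ absolute threshold, the 85th-percentile relative threshold, the synthetic gamma-distributed IVT sample, and the 20% multiplicative low bias are all illustrative assumptions, and real ARDTs also apply geometric and other criteria.

```python
import numpy as np

def detect_absolute(ivt, threshold=250.0):
    """Flag points whose IVT exceeds a fixed value (illustrative threshold)."""
    return ivt >= threshold

def detect_relative(ivt, percentile=85.0):
    """Flag points whose IVT exceeds a percentile of the same dataset."""
    return ivt >= np.percentile(ivt, percentile)

rng = np.random.default_rng(0)
# Synthetic "reference" IVT magnitudes (kg m-1 s-1), for illustration only.
ivt_ref = rng.gamma(shape=2.0, scale=100.0, size=10_000)
# The same sample with a uniform 20% low bias, standing in for a biased model.
ivt_low = 0.8 * ivt_ref

abs_ref = int(detect_absolute(ivt_ref).sum())
abs_low = int(detect_absolute(ivt_low).sum())
rel_ref = int(detect_relative(ivt_ref).sum())
rel_low = int(detect_relative(ivt_low).sum())

print(f"absolute threshold: {abs_ref} -> {abs_low} detections")
print(f"relative threshold: {rel_ref} -> {rel_low} detections")
```

With a purely multiplicative bias, the percentile threshold scales with the data, so the relative method flags the same points in both samples, whereas the fixed threshold flags fewer points in the low-biased sample. Biases that change the shape of the IVT distribution would affect relative thresholds as well.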
6. Discussion and summary
With a primary goal of introducing a suite of exploratory precipitation metrics and demonstrating their use in evaluating precipitation in climate models, we lowered the barrier to participation by allowing different groups to apply their diagnostics and metrics to readily available model outputs using their preferred or readily available benchmark datasets. Although most of the metrics were applied to CMIP6 simulations including HighResMIP, the number of models evaluated ranges between 4 and 35. Because feature tracking generally requires more variables and higher temporal frequency data, the LPS, MCS, FRT, and AR metrics were demonstrated using only 4–9 simulations. Although all other metrics were applied to a much larger number of CMIP6 simulations (17–35), differences in the specific simulations used and whether a single or multiple members of a model family were used make comparison across models and metrics difficult.
Despite the difficulty in drawing broad conclusions, some general observations can be made for each metric and by comparing across models and metrics. For precipitation diurnal cycle, models generally perform much better over ocean than over land, as models have a tendency to produce peak precipitation in the afternoon over land while the observed peak precipitation occurs in the late afternoon/early evening. There is a relatively strong negative intermodel correlation between biases in the diurnal amplitude and phase over ocean but such correlation is positive and weaker over land. Almost all the examined models fail to capture the nocturnal peak observed at the ARM SGP site. For precipitation and dry spells, models perform well in simulating the spatial patterns of the cutoff scales for both daily precipitation and dry-spell duration, which means that models would also do well in simulating the spatial distribution of extremes. However, there is a larger spread in terms of scaling factor (i.e., the overall magnitude of the patterns), with the daily precipitation cutoff scale closer to observations than the dry-spell duration cutoff scale. Pattern correlation and scaling factor are largely independent metrics as their intermodel correlations are relatively low. In contrast with the precipitation diurnal cycle, spectral analysis shows that models perform better over land than ocean (between 30°S and 30°N) and better over the NH midlatitudes (30°–60°N) than the tropics (15°S–15°N). The majority of the models analyzed have their spectra overlapping with observations by more than 60% in all of the regions and seasons, but the metrics from the models nearly all lie outside the spread of the observation datasets used. The temporal and spatial coherence analysis highlights that the CMIP6 models generally produce precipitation features that are too large and that last too long, particularly in the tropical oceans.
Despite these general tendencies, models have a wide range of abilities, with some producing good spatial and temporal variability while others perform poorly at both. There are stronger negative biases over land than over ocean, indicating that models show little spatial variability in temporal coherence over land and hence cannot distinguish regions dominated by longer-lived rain-bearing systems from regions dominated by shorter-lived systems. In the tropical Indian Ocean there are some relationships between the precipitation coherence and MJO metrics (Maritime Continent propagation).
For the process-oriented metrics, coupling of rainfall tendencies and CSF tendencies over the Indo-Pacific warm pool (Fig. 5) is well simulated in 5 of the 20 models analyzed for that metric, and poorly simulated in 8 models; the remaining models with neutral skill may either overestimate or underestimate the rainfall-moisture “rotation” metric derived from this diagnostic. While the rotation metric is modestly correlated with the MJO pattern correlation metric (r = 0.41), several models may perform well in one metric, but poorly in another, indicating that rainfall–moisture coupling alone is not a good predictor of a model’s ability to simulate the MJO. For the temperature–water vapor environment, almost half of the models produce most of their precipitation over tropical oceans in a temperature–moisture environment that is reasonably close to the observed range (using twice the distance between the two observational estimates as the reference range). This reflects that the deep-convective parameterizations in these models have included a substantial dependence of convective updrafts on lower free-tropospheric humidity (Kuo et al. 2017). Such a precipitation–temperature–water vapor relationship, however, is not perfectly aligned with other metrics related to precipitation and atmospheric moisture, as will be discussed further below.
In the category of phenomena-based metrics, all HighResMIP models examined here simulated synoptic-scale vortices (i.e., LPS) over South Asia with the qualitatively correct spatial structure of rainfall, with no improvement in model skill at finer horizontal resolution in the two models for which low- and high-resolution versions were examined. This contrasts with prior studies that found LPS were simulated more accurately at finer resolutions; the different result may be due to the use of a range of coarser resolutions than examined here (Praveen et al. 2015) or the use of only one model (Sabin et al. 2013). In contrast to the general skill in simulating the spatial structure of precipitation within LPS, models exhibited a wide range of biases in representing the amplitude of LPS precipitation, with the three main models examined showing large negative bias, large positive bias, and low bias, with the bias magnitude changing little or, unexpectedly, even degrading at finer resolution. For MCS metrics, the four HighResMIP models evaluated show varying skill in reproducing the observed composited MCS rainfall in the central United States, with model ability to simulate intense convective precipitation a distinguishing factor. Skill scores are worse in summer than spring in all models, consistent with the more dominant frontal large-scale environments of MCS in spring, which are more skillfully simulated by global models (Song et al. 2019). The precipitation error decomposition into frontal precipitation frequency and intensity indicates that all the models evaluated have compensating biases. They produce frontal precipitation too frequently, with intensity that is too low. This is consistent with the results from CMIP5 in Catto et al. (2015), although the CMIP6 models so far seem to have smaller errors. The total precipitation coming from fronts is well represented in the models, including the spatial patterns, indicating good representation of fronts themselves.
For the AR precipitation metric, ERA-20C has a Taylor skill score of 0.96 relative to MERRA-2 when assessed using the TECA-BARD v1.0 ARDT, which provides a measure of observational uncertainty in the metric. Considering the inter-ARDT spread in the Taylor skill score, no models perform well in simulating AR precipitation, as none is within one standard deviation of the ERA-20C score.
As our diagnostic analysis has been summarized succinctly using scalar metrics, meta-analysis of model skill can be facilitated by developing a matrix of skill scores for models versus metrics to reveal possible relationships among metrics and models. Comparing across metrics and models, it is clear that model skill varies substantially. To help reveal potential relationships among metrics and models, we identified the top-5 and bottom-5 simulations evaluated by each category of metrics (e.g., diurnal precipitation) and its subcategories (e.g., amplitude and phase of diurnal precipitation). The results of this relative model ranking are not shown, as we focus on insights that can be gained from the comparative analysis rather than highlighting the performance of specific models. Consistent with the diverse model skill exhibited across metrics and models, only two model families are in the top-5 group for more than three different categories of metrics and are not in the bottom-5 group in any metrics. Similarly, only one model family is in the bottom-5 group for more than three categories of metrics and is not in the top-5 group in any metrics. Many models perform well in some metrics but poorly in other metrics. There is a general tendency for simulations produced by the same model family but using different resolutions, model versions, or model configurations, to perform similarly, although some exceptions can also be found.
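The relative ranking described above can be sketched as follows for a single metric. The function name and the convention that a higher score means better skill are assumptions for illustration; models not evaluated for a metric are represented by NaN and skipped.

```python
import numpy as np

def top_bottom_groups(models, scores, n=5):
    """Return (top, bottom) model names for one metric.

    models : sequence of model names
    scores : sequence of skill scores, higher = better (assumption);
             NaN marks a model that was not evaluated for this metric
    n      : group size (5 in the text)

    If fewer than 2 * n models were evaluated, the two groups overlap.
    """
    scores = np.asarray(scores, dtype=float)
    evaluated = [(s, m) for s, m in zip(scores, models) if not np.isnan(s)]
    # Sort by score only, best first; ties keep their input order.
    ranked = [m for _, m in sorted(evaluated, key=lambda pair: pair[0], reverse=True)]
    return ranked[:n], ranked[-n:]

# Toy example with hypothetical model names and scores.
toy_models = ["A", "B", "C", "D"]
toy_scores = [0.9, np.nan, 0.2, 0.5]
toy_top, toy_bottom = top_bottom_groups(toy_models, toy_scores, n=1)
```

Repeating this for every metric category and counting how often each model family appears in the top and bottom groups gives the kind of tally summarized in this paragraph.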
Focusing on the actual model skill for each metric, we also identified the good and poor performing models in an absolute sense to determine how well models perform for each metric, and subsequently ranked the metrics according to those in which most models performed well or poorly. This absolute skill and ranking were determined by the developers of each metric based on their own judgement, which generally involved comparing model skill relative to some uncertainty related to observation data; for ARs, uncertainty in tracking methods is also considered. A few metrics stand out because most models perform either well or poorly on them, and these are highlighted here. Notably, more than 50% of the simulations evaluated based on the diurnal amplitude and phase of precipitation over ocean are considered skillful, and the same is true for the evaluation of spectral characteristics over land and the NH midlatitudes, and for the scaling factor of daily precipitation and the pattern correlation of the cutoff scale between the simulated and observed duration of dry spells. In contrast, two metrics stand out as more challenging for models, with more than 50% of the simulations considered to be performing poorly. These are correlation coefficients of the Taylor skill score for spatial coherence over both land and ocean and the AR precipitation Taylor skill score. Last, more than 50% of the simulations are considered neutral (neither skillful nor poor) with respect to several metrics including diurnal amplitude and phase over land; spectral analysis over ocean, tropics, and SH midlatitudes; and MJO pattern correlation. For other metrics, models are more mixed in how well they represent the specific precipitation characteristics evaluated.
Based on the relative and absolute ranking, additional insights can be gained with regard to the potential relationships among the metrics by calculating the correlation coefficients between the model ranking based on different metrics for the overlapping models, although not all metrics should be connected (e.g., due to geographical differences). For illustrative purposes, we calculated the correlation coefficients between the model ranking based on the temperature–water vapor environment and the model ranking based on other metrics for the overlapping models. We found relatively strong correlations (r > 0.5) of model skill in temperature–water vapor environment with model skill in precipitation cutoff scale (both pattern correlation and scaling factor), spectral analysis, temporal and spatial coherence, and MJO propagation based on the relative ranking. On the other hand, model skill in temperature–water vapor environment has very low (r < 0.2) or negative correlations with model skill in diurnal precipitation over land (both amplitude and phase), diurnal precipitation over ocean (amplitude only), and dry spell cutoff scale (pattern correlation). Notably, correlations with the rotation metric and MJO east/west power ratio metric are also rather low (<0.3).
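A correlation between model rankings from two metrics, restricted to the models that both metrics evaluated, can be sketched as below. This is one plausible reading of the procedure rather than the exact calculation used here: the function name is hypothetical, models are re-ranked within the overlapping subset, and ties are not treated specially.

```python
import numpy as np

def rank_correlation_on_overlap(scores_a, scores_b):
    """Rank correlation of two metrics over the models common to both.

    scores_a, scores_b : dict mapping model name -> skill score
    Returns (correlation, list of overlapping model names).
    """
    common = sorted(set(scores_a) & set(scores_b))
    if len(common) < 3:
        raise ValueError("need at least three overlapping models")
    a = np.array([scores_a[m] for m in common], dtype=float)
    b = np.array([scores_b[m] for m in common], dtype=float)
    # Rank within the overlapping subset (0 = lowest score); no tie handling.
    rank_a = a.argsort().argsort()
    rank_b = b.argsort().argsort()
    # Pearson correlation of ranks is the Spearman rank correlation.
    return float(np.corrcoef(rank_a, rank_b)[0, 1]), common

# Toy example with hypothetical model names and scores.
metric_1 = {"M1": 0.9, "M2": 0.5, "M3": 0.1, "M4": 0.7}
metric_2 = {"M1": 0.8, "M2": 0.4, "M3": 0.2, "M5": 0.6}
toy_r, toy_common = rank_correlation_on_overlap(metric_1, metric_2)
```

Because the number of overlapping models differs between metric pairs and is often small, correlations computed this way are best treated as suggestive rather than as statistically robust estimates.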
The above analysis is suggestive of some predictive power of the model skill in the temperature–water vapor environment on the model skill in several other precipitation characteristics. This motivates future work to understand these relationships by performing additional diagnostic analysis, and also to apply the exploratory metrics more systematically to the same set of model simulations using comparable benchmark datasets in order to support quantitative analysis of skill across models and metrics. This may reveal less obvious relationships among metrics and models, reflecting relationships among processes and/or weather phenomena highlighted by the metrics, or relationships among models due to commonality such as parameterization schemes. Such information is useful for guiding model development and model tuning. Machine learning approaches may be used to develop predictive models of the relationships among the different metrics presented here, or between those metrics and others such as metrics for the modes of climate variability (e.g., MJO, ENSO), circulation indices (e.g., monsoon), sea surface temperature pattern, etc. Such mapping of model skill scores across metrics may help focus efforts on improving model prediction skill given th