Bias Correcting Climate Model Simulations Using Unpaired Image-to-Image Translation Networks

Abstract: We assess the suitability of unpaired image-to-image translation networks for bias correcting data simulated by global atmospheric circulation models. We use the Unsupervised Image-to-Image Translation (UNIT) neural network architecture to map between data from the HadGEM3-A-N216 model and ERA5 reanalysis data in a geographical area centered on the South Asian monsoon, which has well-documented serious biases in this model. The UNIT network corrects cross-variable correlations and spatial structures but creates bias corrections with less extreme values than the target distribution. By combining the UNIT neural network with the classical technique of quantile mapping (QM), we can produce bias corrections that are better than either alone. The UNIT+QM scheme is shown to correct cross-variable correlations, spatial patterns, and all marginal distributions of single variables. The careful correction of such joint distributions is of high importance for compound extremes research.


Introduction
A large portion of research into the physical Earth system relies on general circulation models (GCMs). GCMs are used for weather prediction and for climate studies on time scales of days to years, although their seamless use for both is still rare. They have been central to informing policy makers through the Intergovernmental Panel on Climate Change (IPCC 2013).
The IPCC report relies on a multiplicity of different GCMs, developed by research centers spread across the world. The latest Coupled Model Intercomparison Project (CMIP6; Eyring et al. 2016) includes output from tens of GCMs. These differ in how the Earth system is discretized, in their approximations for and inclusion of subgrid-scale processes, and even in their integration schemes, all of which introduce uncertainty that is reflected in the spread of results across models. The climate is highly chaotic, and so these differences can lead to detectable differences in the GCM outputs (Wang et al. 2014; Maher et al. 2018).
GCM outputs are used to assess the risks associated with different weather events, such as droughts, heatwaves, floods, and wildfires. To quantify these risks, researchers must decide which GCMs are fit for this purpose. Persistent biases in climate models exist (Eyring et al. 2021) and need to be addressed either by selecting the best models or by correcting biases. The choice of models can make a quantitative and qualitative difference to estimated climate risks (e.g., Kirchmeier-Young et al. 2017; Herger et al. 2018). A GCM that has a low bias relative to observations in one variable and one geographical location may have a large bias in another (Ridder et al. 2021). This makes compound risks (Leonard et al. 2014), such as simultaneous heatwaves and drought, even harder to assess, as they involve multiple variables and may involve multiple geographical areas. Simultaneous crop failure in multiple regions of high agricultural output (Gaupp et al. 2020) is one such risk of recent study. Particularly persistent biases exist in climate model simulated rainfall patterns, with many models exhibiting a double intertropical convergence zone (ITCZ) and misplaced monsoons (Wang et al. 2020; Tian and Dong 2020).
GCMs are improving, but in the interim, we must use their outputs effectively to understand our current climate and the potential effects of global warming. Therefore, we must devise methods to optimally correct biases in the output of GCMs (Bellprat et al. 2019).
Recently, modern artificial neural network architectures have been developed that can be used for bias correction and statistical downscaling (Moghim and Bras 2017; Steininger et al. 2020; Le et al. 2020; Han et al. 2021; Wang et al. 2021) and that offer some theoretical advantages over classical techniques. In this paper, we focus on unpaired image-to-image translation networks to perform bias correction between a GCM and observations. These neural networks use layers of convolutional filters that are applied across multiple climate variables simultaneously. This gives these architectures the capacity to "see," and therefore correct, both spatial and cross-variable relations simultaneously, which is an advantage over current methods. These networks are presented in more detail later.
The simplest classical method of bias correction is to adjust the climatological mean, yet such a method may leave extremes still biased (e.g., Hanlon et al. 2015). Another very commonly used method is quantile mapping (QM) (Cannon et al. 2015). This is a simple method in which a single value of a variable x_GCM, obtained from the GCM at a spatial location and time of year denoted by coordinate u, is converted into a percentile using the estimated cumulative distribution function F_GCM. Then an equivalent observation value x_obs is obtained using the inverse cumulative distribution function F_obs^{-1}:

x_obs = F_obs^{-1}[F_GCM(x_GCM; u); u].  (1)

This approach does not capture conditional relationships. The QM-predicted value of x_obs does not use the values of x_GCM in neighboring locations in space. This means, for example, that if some weather event in a GCM has a different characteristic shape than in observations, then QM cannot reshape it coherently. This could be the size and shape of cyclones or the position of storms along the polar front. Further, QM does not use the values of other variables at the same spatial location. Therefore, relationships between variables are severed and may become physically unrealistic. For example, if a pixel is translated from a dry day in the GCM to a wet day in equivalent observations, the relationships between surface temperature and precipitation (Trenberth and Shea 2005) may not be preserved.
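The mapping in Eq. (1) can be sketched with empirical quantiles. The following is an illustrative stand-in (all function and variable names are our own), not the exact implementation used in this paper:

```python
import numpy as np

def quantile_map(x_gcm, gcm_sample, obs_sample, n_quantiles=100):
    """Map GCM values onto the observed distribution via empirical CDFs.

    A minimal sketch of Eq. (1): x_obs = F_obs^{-1}[F_GCM(x_gcm)].
    `gcm_sample` and `obs_sample` are 1D training samples for one grid
    point, month, and variable.
    """
    q = np.linspace(0.0, 1.0, n_quantiles)
    f_gcm = np.quantile(gcm_sample, q)   # empirical CDF of the GCM
    f_obs = np.quantile(obs_sample, q)   # empirical CDF of observations
    # percentile of x_gcm under F_GCM, then invert F_obs at that percentile
    p = np.interp(x_gcm, f_gcm, q)
    return np.interp(p, q, f_obs)

rng = np.random.default_rng(0)
gcm = rng.normal(2.0, 1.0, 5000)   # toy "biased" simulation, mean shifted
obs = rng.normal(0.0, 1.0, 5000)   # toy "observations"
corrected = quantile_map(gcm, gcm, obs)
```

After mapping, the corrected sample follows the observed marginal distribution closely, but (as argued above) each value is transformed independently of its spatial and cross-variable context.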
These are limitations that modern neural network architectures could be ideally suited to improve upon. Both of these features would be required to accurately correct for the presence and strength of teleconnections (Yuan et al. 2018; Stan et al. 2017). A specific example is shown in Maraun et al. (2017), where extreme precipitation in Piura, Peru, only occurs during El Niño events in observations. These extreme rainfall events did not occur in the GCM outputs, so when QM was applied to the GCM data, the extreme rainfall events no longer co-occurred with El Niño. El Niño events are primarily characterized by warm sea surface temperatures in the equatorial Pacific Ocean (Timmermann et al. 2018), and so conditional bias correction that takes spatial and cross-variable information into account is required.
Previous work has attempted to solve these issues with classical techniques, but incompletely. In Levy et al. (2013), the authors propose optimally stretching simulated precipitation fields to match precipitation patterns in observations. They found that when precipitation features were stretched onto the correct places, human-attributable change in precipitation could be more easily detected, while misplaced features led to poor fingerprints of the expected climate change. However, this technique uses monthly average precipitation and does not allow the use of daily data, which is important for studying extreme risks. It also cannot easily be extended to multiple variables. In Cannon (2018), the authors propose a way to generalize QM to N dimensions. This allows the user to transform multiple variables in multiple spatial locations using daily data. However, as they note, their method cannot be extended to many spatial points in many variables before it becomes computationally limited and prone to overfitting.
We note that this problem of bias correcting GCM outputs is, more generally, the problem of mapping between two empirical distributions of multichannel images without having any one-to-one corresponding pairs. This is precisely the problem description of unpaired image-to-image translation (Liu and Tuzel 2016).
Some previous work has applied similar neural network architectures to this problem. François et al. (2021) apply an architecture based on a cycle-consistent generative adversarial network (CycleGAN; Zhu et al. 2017) to bias correct temperature and precipitation data in a region over Paris, but only for the winter season. Pan et al. (2021) develop a similar network architecture to bias correct precipitation data over the contiguous United States. They use dynamical variables (sea level pressure, geopotential height, and specific humidity at 500 hPa) to aid the translations but do not bias correct these variables themselves.
In this paper, we extend and complement these previous studies. We first introduce a different image translation method that has not been applied in the climate domain. We apply this method to bias correct simulations from a GCM of the South Asian monsoon across five variables, using a GCM that has substantial biases in the spatial pattern of the monsoon and using reanalysis data as a target for the correction. This covers a larger geographical region and uses more variables than previous studies, and the region has a known and very significant physical bias in the GCM. We evaluate the method by comparing its performance with quantile mapping, and we explore using these two techniques in conjunction to better represent the monsoon and other extreme events in the region. We analyze the translation results focusing on relationships between different variables as well as spatial relationships. Additional details on various parts of the paper are provided in the appendix.

A prime use for unpaired image-to-image translation networks
These architectures, such as Unsupervised Image-to-Image Translation (UNIT; Liu et al. 2017), CycleGAN (Zhu et al. 2017), and AlignFlow (Grover et al. 2020), are neural networks that incorporate the architecture of GANs (Goodfellow et al. 2014). The aim of these techniques is to translate between images {x_i}_{i=1}^N in domain X and {y_j}_{j=1}^M in domain Y without requiring corresponding pairs {x_i, y_i}.
These networks were initially used to translate between images of summer and winter driving scenes and between simulated and real city driving scenes (Liu et al. 2017; Zhu et al. 2017). Hao et al. (2021) use these networks to make video game images look more photorealistic. In Shrivastava et al. (2017), the authors train a related refiner model to create more realistic-looking images of eyes from simulated 3D models. These are all examples of statistically bias correcting spatial fields with multiple variables, as we aim to do with GCMs.
To map between GCM outputs and observations, the ability to translate without pairs is absolutely necessary. Imagine an idealized GCM that captures all of the physics of the Earth system almost perfectly. However, due to discretization, it accumulates errors when integrating a climate state forward in time. If it were initiated with the exact observed climate state c_GCM(0) = c_obs(0), we should expect that after approximately two weeks (Lorenz 1969; Zhang et al. 2019) the simulation will have diverged from the observations as a result of chaos. Therefore, image pairs from the simulation and observations collected beyond two weeks have no relation to each other. If the GCM is not perfect and hence has a bias, then the initial state matched to observations, c_GCM(0) = c_obs(0), may be one that the GCM does not visit often, and the simulation will drift toward its own preferred states. This means we cannot use the image pairs collected during the first two weeks to bias correct the rest of the GCM simulation; that is, we cannot use the pairs {c_GCM(t_i), c_obs(t_i)} with t_i < 2 weeks. This would be predicting outside the limits of our training data.
This lack of corresponding pairs makes bias correction distinctly different from perfect prognosis and statistical downscaling (Maraun et al. 2010), so we cannot use the machine learning techniques employed therein. In these settings, predictions are made for short lead times, and so gathering corresponding pairs of GCM predictions and observations is possible. Although the application of deep learning to these tasks and to statistical nowcasting (Steininger et al. 2020; Vaughan et al. 2022; Ravuri et al. 2021; Sønderby et al. 2020) has been developing recently, it would not be valid to simply use this learned mapping for times after around two weeks.
We note that Wang and Tian (2022) attempt to bias correct temperatures from GCMs by assuming synchronized pairs between observations and simulations over long time frames. However, they admit this to be a limitation of their proposed method and that the dynamics of the GCM may be distorted. They also do not thoroughly test their assumption.

a. A brief overview of unpaired image-to-image translation networks
The UNIT network is composed of multiple subcomponents. A more precise breakdown and diagram of the components in terms of layers is available in section b of the appendix and in the original paper (Liu et al. 2017). The main backbone of the network is composed of two variational autoencoders (VAEs) (Kingma and Welling 2013). The encoder E_X is a convolutional neural network that maps images from domain X to a region in latent space Z, that is, p(z|x) = E_X(x). There is a similar encoder E_Y that maps images in domain Y into a latent space that, if trained correctly, should be the same as Z. The main purpose of this network is to learn this shared latent space of the two image domains and to learn functions that map into and out of it from the two domains. Decoders G_X and G_Y map vectors from the shared latent space Z to images in the domains X and Y, respectively, that is, x = G_X(z). After training is completed, the composition I_X(x) = G_X[E_X(x)] should approximate the identity function (because of the VAE bottleneck, some information will be lost in the mapping, and so the identity cannot be exact). Also after training, the mapping F_{X→Y}(x) = G_Y[E_X(x)] maps an image from domain X into its predicted equivalent image in domain Y. Similar statements hold for the functions I_Y and F_{Y→X}.
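These compositions can be made concrete with a toy example, where fixed invertible linear maps stand in for the trained encoders and decoders (every name here is a hypothetical stand-in, not the real convolutional network):

```python
import numpy as np

# Toy stand-ins for the trained UNIT components: each "encoder" and
# "decoder" is a fixed linear map, so the compositions can be checked
# exactly. In the real network these are deep convolutional (V)AEs.
A = np.array([[2.0, 0.0], [0.0, 0.5]])   # domain X -> latent Z
B = np.array([[1.0, 1.0], [0.0, 1.0]])   # domain Y -> latent Z

E_X = lambda x: A @ x                     # encoder for domain X
G_X = lambda z: np.linalg.inv(A) @ z      # decoder back to domain X
E_Y = lambda y: B @ y
G_Y = lambda z: np.linalg.inv(B) @ z

I_X = lambda x: G_X(E_X(x))               # should approximate the identity
F_X_to_Y = lambda x: G_Y(E_X(x))          # translate X -> Y via shared Z
F_Y_to_X = lambda y: G_X(E_Y(y))

x = np.array([1.0, 2.0])
assert np.allclose(I_X(x), x)                 # identity (exact in this toy)
assert np.allclose(F_Y_to_X(F_X_to_Y(x)), x)  # cycle consistency
```

In the trained network neither property is exact: the VAE bottleneck loses information, and the cycle consistency is only encouraged by the loss terms described next.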
In training this network, we minimize the value of a compound loss function with multiple components. The loss components L_recon^X and L_recon^Y are the cyclic reconstruction losses associated with translating from one domain to the other and back again. These losses are defined as

L_recon^X = E_{x∼X} ||x − F_{Y→X}[F_{X→Y}(x)]||_1 and L_recon^Y = E_{y∼Y} ||y − F_{X→Y}[F_{Y→X}(y)]||_1.  (2)

Here, E_{x∼X}(l) means the mean of the component l for data samples x sampled from X. Practically, this is estimated as the average of l over samples in a training batch. These losses encourage the latent space representation to encode as much information as possible about the images. They also ensure that the encoding is consistent across domains and that translating an input from one domain to the other and then back again reconstructs the same input. We use the L1 norm for these losses, as it has been shown to support sharper translations (Zhu et al. 2017) in similar networks. We modify the L1 loss to weight the different channels with respect to each other (see section c in the appendix).
The following loss components are also used. The components L_KL^X and L_KL^Y are the Kullback–Leibler (KL) divergences associated with the samples from domains X and Y encoded into the shared latent space Z. These are standard loss components used to train VAEs (Kingma and Welling 2013). The function KL(z) computes the KL divergence between a multivariate Gaussian distribution with mean z and another with mean 0, both with variance of 1. The components L_KL-cyc^X and L_KL-cyc^Y are associated with the KL divergence of samples translated to the opposite domain and then encoded back into the shared latent space.
These KL divergence losses encourage the images to have a Gaussian distribution when encoded into the shared latent space. During training, a random perturbation is added to the latent space encoding, so that z = E_X(x) + η, where η is a random vector sampled from a multivariate Gaussian distribution with unit diagonal variance. This so-called reparameterization trick is required to train VAEs (Kingma and Welling 2013) and ensures that the learned latent space Z is smooth, that is, that images that are similar in domain X will be encoded to similar regions of Z.
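The perturbation step itself is only a few lines; a minimal numpy sketch (names illustrative, with the encoder output replaced by a fixed array):

```python
import numpy as np

rng = np.random.default_rng(1)

def encode_with_noise(mu):
    """Reparameterization trick: z = E_X(x) + eta, with eta ~ N(0, I).

    `mu` stands in for the deterministic encoder output E_X(x); the unit-
    variance Gaussian perturbation makes nearby inputs map to overlapping
    latent regions, which is what keeps the latent space Z smooth.
    """
    eta = rng.standard_normal(mu.shape)
    return mu + eta

mu = np.zeros((10000, 8))   # pretend batch of encoder outputs
z = encode_with_noise(mu)   # noisy latent codes, distributed N(mu, I)
```

Sampling the noise outside the network keeps the mapping differentiable with respect to the encoder output, which is why the trick is needed for training.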
To train the translation network, we also train two adversarial discriminator networks. Such discriminator networks have been used in recent work on short-term weather forecasting (Ravuri et al. 2021). A discriminator D_X is trained to predict the probability that an image comes from the domain X or whether it was created by the conditional generator F_{Y→X}. Similarly, D_Y is trained to predict whether images in domain Y are real. These discriminator networks are trained simultaneously with the translation network. The following loss components are used to train the translation network functions F_{Y→X} and F_{X→Y}:

L_GAN^X = E_{y∼Y}{[D_X(F_{Y→X}(y)) − 1]^2} and L_GAN^Y = E_{x∼X}{[D_Y(F_{X→Y}(x)) − 1]^2}.

These loss components encourage the translations of the fields to look realistic in each domain. Any obvious imperfections or distortion of the distribution of images {F_{Y→X}(y)}_{y∼Y} should be penalized by the discriminator (Goodfellow et al. 2014). The discriminator D_X is itself trained to minimize the loss function

L_{D_X} = E_{x∼X}{[D_X(x) − 1]^2} + E_{y∼Y}{[D_X(F_{Y→X}(y))]^2},

and the loss for D_Y is similar. This is the same loss function as is used in least squares GANs (Mao et al. 2017) and has been shown to be more stable in training than traditional GANs. The full loss function for the translation network is the weighted sum of all of the above components, where the scalars λ_rec, λ_cyc, λ_KL-rec, λ_KL-cyc, and λ_GAN were set to the values used in the original work introducing UNIT. More details about these losses and the UNIT hyperparameters are presented in section c of the appendix.
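The least-squares adversarial terms reduce to simple expressions over the discriminator's scores. A hedged numpy sketch, with arrays of scalar scores standing in for the convolutional discriminators (all names are ours):

```python
import numpy as np

def lsgan_generator_loss(d_on_fake):
    # Least-squares GAN loss for the translator: push the discriminator's
    # scores on translated fields toward the "real" label of 1.
    return np.mean((d_on_fake - 1.0) ** 2)

def lsgan_discriminator_loss(d_on_real, d_on_fake):
    # The discriminator is rewarded for scoring real fields as 1 and
    # translated (fake) fields as 0.
    return np.mean((d_on_real - 1.0) ** 2) + np.mean(d_on_fake ** 2)

# A perfectly fooled discriminator gives the translator zero loss.
assert lsgan_generator_loss(np.ones(8)) == 0.0
```

Relative to the original GAN's log-likelihood loss, these quadratic penalties give nonsaturating gradients, which is the stability advantage noted above.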

b. What do bias corrected simulations represent?
It is important to consider what the translated sequence F_{GCM→obs}[c_GCM(t)] represents, and we refer to Ehret et al. (2012) for a more thorough discussion. The sequence c_GCM(t), and thus F_{GCM→obs}[c_GCM(t)], is driven dynamically by the time evolution operator of the GCM, but each image F_{GCM→obs}[c_GCM(t)] is mapped individually to look like it came from the observations. If we set c_GCM(0) = F_{obs→GCM}[c_obs(0)] and evolve each of these states forward in time with their own time evolution operators, then we would still expect that F_{GCM→obs}[c_GCM(t)] ≠ c_obs(t). Although each image from F_{GCM→obs}[c_GCM(t)] should be realistic in comparison with observations, the entire sequence need not be. An example of this inconsistency could arise when bias correcting wind velocities and precipitation: the wind velocities may be debiased in a way that should change the velocity of a precipitation system, but in consecutive frames the precipitation pattern may not reflect this.
So, when using GCM output, we must assume that the GCM has realistic time dynamics, which would not be drastically affected by our moderate bias correction.This also emphasizes that bias correction is no cure for a poorly performing GCM and that capturing the physics of the climate system is key.

Bias correcting the South Asian monsoon
We use GCM data from the Climate of the Twentieth Century Plus (C20C+) Project (Folland et al. 2014), particularly from the HadGEM3-A-N216 GCM (Ciavarella et al. 2018), run under a historical recreation scenario (All-Hist/est1). In this dataset, the ocean temperatures are prescribed to their observational estimates, and emissions are set to the historical record. Therefore, only the atmosphere component of the model is run.
We attempt to bias correct the HadGEM3 historical recreation data with respect to the ERA5 reanalysis data (Hersbach et al. 2020). Both of these datasets are daily data fields. We choose daily over monthly data because extreme events like heatwaves and floods occur on time scales of days to weeks.
We limit the geographical extent to the area bounded by 8°S–30°N, 44°–121°E. This region was chosen to capture the South Asian monsoon. Many GCMs, including HadGEM3, have a large bias in simulating the South Asian monsoon (Bollasina and Ming 2013; Ashfaq et al. 2017), which is usually placed too far south over the Indian Ocean, leaving the landmass drier than reality (shown in Fig. 1). This bias remains when the model is run with prescribed sea surface temperatures. We chose this region as a hard case for bias correction. We consider the daily 2-m mean, minimum, and maximum temperature; accumulated precipitation; and mean 500-hPa geopotential height at all grid locations. The first four of these variables are important for climate impact studies and are of primary interest. The geopotential height is a dynamical variable that can aid the accuracy of bias correction of the other variables (Pan et al. 2021), although we allow the network to correct this variable as well.
Before bias correction, we conservatively regrid (Jones 1999) the ERA5 data onto the coarser grid of the HadGEM3 data, which has a resolution of 0.56° latitude and 0.83° longitude. This gave a region of size 68 × 92 grid points. We also limit the two datasets to the time period in which they overlap, 1979–2013 inclusive, giving us 35 complete years and approximately 13 000 daily fields. We split these data into train and test sets, using the odd-numbered years for training each method and the even-numbered years for testing. This extreme split of using nearly 50% of the data for testing was required to perform adequate analysis of the spatial and cross-variable statistics of the results. All figures to follow are based on the test set, which was not used in training or in choosing parameters for any method.
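The odd/even-year split can be expressed directly on the time axis. A small illustrative sketch, where the synthetic 365-day years are a simplification that ignores leap days:

```python
import numpy as np

# Illustrative split of daily fields into train/test by alternating years,
# as described above; in practice `years` would come from the dataset's
# real time coordinate.
years = np.repeat(np.arange(1979, 2014), 365)   # ~35 yr of daily steps
train_mask = (years % 2 == 1)                   # odd-numbered years -> train
test_mask = ~train_mask                         # even-numbered years -> test

n_train, n_test = train_mask.sum(), test_mask.sum()
```

Splitting by whole years (rather than random days) keeps the strong day-to-day autocorrelation of weather from leaking information between the train and test sets.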
We applied quantile mapping [Eq. (1)] to both datasets by fitting a 100-point empirical cumulative distribution function to each grid point for each month of the year and for each variable. This equates to ~75 million points estimated from the data in order to perform the bias correction with quantile mapping. For comparison, the UNIT translation network had ~38 million parameters. The form of QM we use is generally applied in situations in which the data are stationary, that is, where there is no distribution shift due to global warming. The neural network approach we choose similarly assumes stationarity. Since we limit our datasets to only a 35-yr period and split alternating years into train and test data, we do not expect this assumption to have a significant impact on the results we show, especially as we are considering daily variability rather than monthly averages, so changes in the mean are relatively small in comparison with the variability. The extensions that have been added to quantile mapping to allow it to approximately debias nonstationary data (Cannon et al. 2015) could be applied to the UNIT neural network approach with little modification.
In the following section, we compare these translations and the original datasets. In particular, we study in detail how they address the large biases in simulations of the South Asian monsoon. To test this more broadly, we also consider their performance in correcting a variety of other extreme events of societal relevance to the region.

Results
Early in our experimentation with the UNIT network, we noticed that it was biased against producing extreme values in its translations. This limitation may arise because UNIT inherits features of its architecture from GANs, and GANs are known to underrepresent the boundaries of the data distribution (Dionelis et al. 2020; Bau et al. 2019; Arora and Zhang 2017). Notably, Ravuri et al. (2021) found that using a conditional GAN for precipitation nowcasting also reduced extreme values in precipitation fields. UNIT also inherits features of VAEs, which blur the image reconstructions (Snell et al. 2017) and thus reduce the extreme values; this blurring is a direct result of the VAE architecture. These extreme values are important for climate research, so we propose to improve the UNIT network performance by following it with QM. In the combined UNIT+QM method, we take the trained UNIT network and use it to translate the entire HadGEM3 dataset; then we train QM between this translated dataset and the ERA5 dataset. The same trained UNIT instance is used in both the UNIT and the UNIT+QM results presented.
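The UNIT+QM composition can be sketched as follows, with a toy `unit_translate` standing in for the trained network. All names, and the simple variance-halving "translation" used to mimic UNIT's underdispersion, are illustrative only:

```python
import numpy as np

def quantile_map(x, src_sample, tgt_sample, n=100):
    # empirical quantile mapping, as in Eq. (1)
    q = np.linspace(0.0, 1.0, n)
    p = np.interp(x, np.quantile(src_sample, q), q)
    return np.interp(p, q, np.quantile(tgt_sample, q))

def unit_then_qm(x_gcm, unit_translate, gcm_train, era5_train):
    """UNIT+QM: apply the trained UNIT translation first, then fit QM
    between the UNIT-translated training data and ERA5."""
    unit_train = unit_translate(gcm_train)
    return quantile_map(unit_translate(x_gcm), unit_train, era5_train)

# Toy check: a "UNIT" that halves the spread (underdispersion), with QM
# restoring the target marginal distribution afterwards.
rng = np.random.default_rng(2)
gcm_train = rng.normal(0.0, 2.0, 4000)
era5_train = rng.normal(0.0, 1.0, 4000)
toy_unit = lambda x: 0.5 * x          # stand-in underdispersed translation
corrected = unit_then_qm(gcm_train, toy_unit, gcm_train, era5_train)
```

The key point of the ordering is that QM is fitted to the UNIT output, not to the raw GCM, so it repairs exactly the marginal distortions that the network introduces.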

a. South Asian monsoon
Figure 1 shows the means and biases of precipitation in the peak monsoon months (June–September) in ERA5, HadGEM3, and the UNIT-corrected data. This figure shows the key bias present in HadGEM3's simulated monsoon. It also shows the effects of the reduction in extremes generated by UNIT, where the average precipitation is reduced everywhere. The results for QM and UNIT+QM are not shown. By the definition of QM, their biases in this plot would be near zero (nonzero only because of sampling error between train and test data), and their means would be the same as the ERA5 field. We confirmed this by plotting.
Figure 2 shows three nonconsecutive days from the HadGEM3 validation data and their bias corrections using the three different methods. These examples show that UNIT can coherently bias correct the structure of the fields and show the consequences of QM's lack of this ability. In example day 1, UNIT removes the precipitation in HadGEM3 that is characteristic of the HadGEM3 monsoon bias. It also removes the associated minor depression in geopotential height. QM maintains the spatial structure of the fields and simply adjusts the intensity; hence, the QM field is left with a muted form of the HadGEM3 monsoon precipitation: the biased spatial structure is still intact, and the field produced is unrealistic. In example day 2, UNIT removes the monsoon precipitation feature and a stronger geopotential height anomaly. In this case, the dynamical conditions simulated by the GCM are truly biased and do not occur in the observations, so it makes sense to bias correct the dynamical feature of geopotential height as well as the precipitation. This is an advantage over previous studies that assume the dynamical features are correct and use them without scrutiny to aid bias correcting other variables. Example day 3 shows a case where UNIT adds a precipitation system over the Bay of Bengal, showing that the UNIT network can add meteorological features as well as remove them.
In Fig. 3, we show the distribution of daily accumulated precipitation from a single grid point located on the southern tip of India, for the three bias correction methods of UNIT, UNIT+QM, and QM. The UNIT translation performs reasonably well on the marginal distribution, although it reduces the occurrence of both high and low extremes and shifts more of the distribution toward ERA5's central peak. Note that QM is explicitly designed to match these one-dimensional distributions at each grid point, while UNIT matches the distributions only as an emergent feature of optimizing its loss function.
Figure 4 shows the joint distribution between temperature and precipitation at the same grid location at the southern tip of India (we also sampled several others and found similar qualitative results). This plot shows that although UNIT underdisperses the data, it can capture the shape of the joint distribution well. We see that quantile mapping does not correct the joint distribution between precipitation and temperature well. UNIT+QM performs the best of the three techniques and captures both the joint distribution and the dispersion of the data. The UNIT step in UNIT+QM corrects the correlation structure, while the QM step corrects the marginal distributions.
To quantitatively assess the similarities of these joint distributions, we use the Jensen–Shannon (JS) divergence. This is a common statistical measure of how different two distributions are; a lower value is better. We estimate it empirically by binning the 2D (temperature and precipitation at a grid point) data into bins {j_i} and counting the frequency in each bin. The JS divergence is computed as

JS(P‖Q) = (1/2) D_KL(P‖M) + (1/2) D_KL(Q‖M),

where P and Q are the two distributions in question and M = (P + Q)/2 is their mean. Here, D_KL(P_1‖P_2) is the KL divergence between distributions P_1 and P_2 and is estimated empirically via

D_KL(P_1‖P_2) = Σ_i P_1(j_i) log[P_1(j_i)/P_2(j_i)].

We compute the JS divergence between each of the distributions in Fig. 4 and the ERA5 distribution. We form a 2D histogram by splitting the data into 20 bins of equal width in both the temperature and precipitation^(1/4) dimensions. We calculate JS divergences of 0.189, 0.053, 0.020, and 0.025 for the HadGEM3, UNIT, UNIT+QM, and QM distributions with respect to the ERA5 dataset. The results were comparatively similar for a range of reasonable choices of the number of bins (see section d in the appendix), and these JS divergence values are all significantly distinct from each other (see section e in the appendix). Continuing this analysis, Fig. 5 shows maps of the JS divergence of precipitation and temperature at all grid points. UNIT performed poorly as a result of how it underdisperses the data, whereas UNIT+QM performed the best. UNIT+QM appears to adjust for the correlations in the data while also maintaining the dispersion.
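The binned estimator can be written compactly. A sketch assuming 2D histogram counts as input (function and argument names are our own):

```python
import numpy as np

def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence between two binned 2D distributions.

    Normalize the histogram counts, form the mixture M = (P + Q)/2,
    and average the two KL terms, as in the estimator described above.
    """
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0            # empty bins of `a` contribute zero
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Because M is nonzero wherever either distribution has mass, the JS divergence stays finite even when the two histograms have disjoint support (unlike the raw KL divergence), which makes it robust for comparing empirical joint distributions.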
To assess the spatial structures of the data and translations, we start locally, looking at the joint distribution of precipitation at two neighboring grid points. We choose the same familiar grid point on the southern tip of India and the cell directly eastward of it. Again, in Fig. 6, we see that UNIT+QM performs the best, improving the structure of the joint distribution and also the dispersion of the data. UNIT reduces the dispersion, and QM does not adequately correct the correlation structure.
We wish to assess whether the daily fields produced by each method of bias correction are realistic and whether the spatial structures observed are biased. To do so, we use the mean structural similarity index measure (SSIM) (Wang et al. 2004), a metric designed to measure how structurally similar two images are. SSIM takes into account the difference in mean values, the contrast in the image between high and low values, and the structure in the image, that is, whether high and low values are in the same locations. As is common, we use the mean SSIM: the comparisons are made for each grid point using an 11 × 11 pixel Gaussian sliding window, and then the average is taken over the image. A larger value shows a closer match between images, with a maximum possible similarity of 1.
To assess whether the datasets produce fields that are spatially realistic, we propose the following computation: take one daily field of a single variable from a dataset x ∈ {HadGEM3, UNIT, UNIT+QM, QM} and find the daily field from the ERA5 test dataset that is most similar using SSIM. This is a matching algorithm similar to the flow analogs developed in Yiou et al. (2007). Store this optimal value of SSIM and repeat for all days in dataset x.
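The matching procedure above can be sketched with a pluggable similarity function. The paper's choice is mean SSIM (available, e.g., as `skimage.metrics.structural_similarity`); this illustrative toy instead plugs in negative mean-square difference, one of the alternative metrics the authors also tested, and all names here are our own:

```python
import numpy as np

def best_match_scores(fields_x, era5_fields, similarity):
    """For each daily field in `fields_x`, find the most similar field in
    `era5_fields` and record that best similarity score."""
    scores = []
    for f in fields_x:
        scores.append(max(similarity(f, g) for g in era5_fields))
    return np.array(scores)

# Stand-in similarity: negative mean-square difference (higher is better).
neg_msd = lambda a, b: -np.mean((a - b) ** 2)

rng = np.random.default_rng(3)
era5 = rng.normal(size=(50, 8, 8))                  # toy "ERA5 test" fields
x = era5[:5] + 0.01 * rng.normal(size=(5, 8, 8))    # near-duplicate fields
scores = best_match_scores(x, era5, neg_msd)        # close to 0 (best)
```

The distribution of these best-match scores is then what gets plotted: datasets whose fields resemble ERA5 concentrate near the high end.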
Then we plot the distribution of these best-match SSIM values. Figure 7 shows the distributions calculated by matching on geopotential height, mean daily temperature, and the fourth root of precipitation. We chose the fourth root of precipitation because the raw marginal distribution has a very long tail, and we wish to avoid extreme values dominating the comparison; instead, we are trying to focus on the overall spatial structure. In the figure, the farther the distribution lies to the right, the more similar the fields in the dataset were to the ERA5 fields. In the figure, we also include the results of comparing fields from the ERA5 train dataset with the ERA5 test dataset. There are some limitations to this matching method that should be noted and that can explain why UNIT outperforms UNIT+QM here. The UNIT translations are underdispersed, as we have seen in previous figures. This means each UNIT field lies more toward the center of the ERA5 distribution. Fields from any dataset x that lie toward the center of the ERA5 distribution are more likely to find an ERA5 field that matches them closely than fields from x that are toward the tails of the ERA5 distribution, simply because there are more ERA5 fields to choose from in dense regions. The best-match SSIM between a field x and the ERA5 data is positively associated with the density of the ERA5 data p_ERA(x) around the location x. This explains why UNIT performs best in this analysis. We note that UNIT+QM performed second best and that the distribution of UNIT+QM is not underdispersed like UNIT's. In section i of the appendix, we perform extra analysis that confirms that UNIT's exaggerated performance in this analysis is likely due to underdispersion.
We repeated the computation above using mean absolute difference and mean square difference as alternative metrics to SSIM and achieved qualitatively similar results.
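The matching procedure above can be sketched in a few lines. This is our own minimal illustration, not the code used in the study, and it uses mean absolute difference (one of the alternative metrics we report) rather than SSIM; the function and array names are ours.

```python
import numpy as np

def best_match_scores(fields_x, fields_era5):
    """For each daily field in fields_x, find the score of its most
    similar field in fields_era5.

    Sketch using mean absolute difference (lower = more similar) in
    place of SSIM; arrays have shape (n_days, n_lat, n_lon).
    """
    scores = np.empty(len(fields_x))
    for i, field in enumerate(fields_x):
        # distance from this field to every ERA5 field at once
        mad = np.abs(fields_era5 - field).mean(axis=(1, 2))
        scores[i] = mad.min()  # keep only the best match
    return scores

rng = np.random.default_rng(0)
x_fields = rng.normal(size=(5, 4, 4))      # stand-in for a dataset x
era5_fields = rng.normal(size=(20, 4, 4))  # stand-in for the ERA5 test set
scores = best_match_scores(x_fields, era5_fields)
```

Plotting the distribution of `scores` for each dataset then reproduces the kind of comparison shown in Fig. 7 (with the sign of "better" flipped, since smaller differences mean closer matches).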
Furthermore, we examine whether the characteristics of these methods hold when we spatially aggregate the data. We aggregate to a few river basins in this region, using data from the World Bank data catalog (World Bank 2019) to define the basin boundaries. We aggregate to the portion of each basin that lies within the spatial extent of our data; therefore, two of the basins are clipped (see section f of the appendix for plotted basin masks).
Figure 8 shows the joint distribution between the spatial mean precipitation and temperature in the Mahanadi River basin. This basin was chosen because there is a large difference between the HadGEM3 and ERA5 joint distributions, so the bias correction method has much to correct. Debiasing this distribution involves correcting spatial correlations (since it is a spatial aggregate) and cross-variable correlations simultaneously. Once again, we see that UNIT+QM performs well, correcting both the cross-variable correlations and the dispersion.
In Fig. 9, we plot the joint distribution between the spatial means of precipitation in the Ganges-Brahmaputra and Indus basins. These are two important basins, both affected by the South Asian monsoon and geographically separated. This kind of long-range bias correction could be important for assessing the risks of multiple breadbasket failure. None of the methods performed particularly well in this analysis.

b. Other extremes
In the second part of the results section, we study the performance of these bias correction techniques for other extreme events of societal relevance in the South Asia region. In particular, we study whether the combined UNIT+QM improves the representation of physical extremes. We consider three cases, testing the cross correlation of different variables and spatial areas.
First, we analyze the relationship between temperature and pressure on the hottest day of each year over a region in central India (8°-28°N, 72°-85°E). We select the day with the highest average maximum daily temperature over this region in each year. South Asia experiences some of the most extreme humid heat on the planet (Raymond et al. 2020), which, combined with high population density and vulnerability to such hazards, poses a severe threat to human health (Im et al. 2017). Furthermore, despite the cooling influence of anthropogenic aerosols over the region, recent events such as the heatwave of 2015 have been amplified by climate change (Wehner et al. 2016). Thus, the understanding of such events is becoming increasingly pertinent.
Figure 10 shows the joint distributions between daily maximum temperature and 500-hPa geopotential height at all grid points across all annual hottest days. In this case, UNIT+QM appears to perform best, though with only 17 annual hottest days to analyze, the result could be quite sensitive to noise.
Results of the JS divergence for all of the various extreme distributions are presented in Table 1. In this first case, UNIT+QM performs best, though not significantly better than QM.
Second, we analyze the relationship between temperature and precipitation over the Ganges-Brahmaputra River basin during the wettest continuous 30-day period of each year.
Flooding, driven partly by rainfall excesses or deficits, has severe impacts on the region. Between 2000 and 2020, floods caused over USD 100 billion in damages and the deaths of more than 49 000 people, almost one-half of global flood mortality in the period, according to the disaster database EM-DAT (Guha-Sapir et al. 2014). Although this is largely driven by the behavior of the monsoon, impactful rainfall extremes also occur outside of this season and are also influenced by anthropogenic climate change (Rimi et al. 2019). When widespread flooding occurs, simultaneous extreme heat may result in compounded impacts, such as through disrupted water supplies and water- and insect-borne disease (Levy et al. 2016; Moors et al. 2013).

FIG. 6. Joint distributions of daily accumulated precipitation at a grid point located at the southern tip of India (centered at 8.6°N, 77.9°E) and its neighboring cell directly eastward (centered at 78.8°E) for the five different datasets. Each point is from a single day, and the contours show the joint density as estimated by kernel density estimation. The x and y scales in all subplots are the fourth root of precipitation, chosen so that the differences in the distributions can be seen, and are clipped at an upper value of 75 mm day⁻¹.
Figure 11 shows the joint distributions between the 30-day averages of daily maximum temperature and precipitation, for the wettest 30-day period of each year, at all grid points in the Ganges-Brahmaputra basin and for all years.
Across the basin there are two temperature regimes due to the change in altitude from low-lying Bangladesh and northeast India to the Tibetan Plateau. Over the low-lying, warmer region, HadGEM3 shows severe discrepancies in rainfall, likely owing to its poor monsoon representation and its tendency to drizzle. UNIT underdisperses the translated data, while both QM and UNIT+QM perform well. The JS divergence results (Table 1) show that combining the two techniques captures the reanalysis data most effectively, with UNIT+QM significantly outperforming the other methods.
Third, we analyze the relationship between precipitation in the Indus and Ganges-Brahmaputra River basins, following on from earlier analysis (Fig. 9), to measure not only spatial but also temporal variation between the basins. Combined, these rivers are of crucial importance to agriculture and thus food security in the region. Accurately simulating the possibility of co-occurring flooding or flash drought in the two is, therefore, pivotal. To measure excesses and deficits in rainfall over time, we use the standardized precipitation index (SPI), which is commonly used to monitor drought and flood hazards (Chandrasekara et al. 2021; Aadhar and Mishra 2017; Tirivarombo et al. 2018). This index is defined by

SPI = (P_T − P*) / σ(P_T),

where P_T is the mean precipitation over a time period of length T, P* is the mean value of all P_T across the dataset, and σ(P_T) is its standard deviation. Thirty days is used as the baseline time period to represent medium-term extremes in precipitation, relevant for both flooding and subseasonal flash drought (Christian et al. 2021; Mishra et al. 2021).

FIG. 7. Distributions of the SSIM between each dataset field and its most similar ERA5 field, shown for the mean 2-m temperature, mean 500-hPa geopotential height, and the fourth root of daily precipitation.
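The SPI computation, as defined here, can be sketched as follows. This is our own illustration of the definition above (operational SPI implementations often fit a gamma distribution and transform to a standard normal; here we follow the plain standardization given in the text):

```python
import numpy as np

def spi(precip, T=30):
    """Standardized precipitation index per the definition above: the
    rolling T-day mean precipitation P_T, standardized by the mean and
    standard deviation of all P_T values across the dataset."""
    p_t = np.convolve(precip, np.ones(T) / T, mode="valid")  # rolling means
    return (p_t - p_t.mean()) / p_t.std()

rng = np.random.default_rng(1)
daily_precip = rng.gamma(shape=2.0, scale=3.0, size=365)  # synthetic daily data
index = spi(daily_precip)
```

Note that consecutive SPI values share 29 of their 30 days, which is the rolling-window oversampling discussed below for Fig. 12.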
Figure 12 shows the joint distributions between co-occurring SPI values in each river basin for each model and bias correction system. We note that the data points in this plot originate from a 30-day rolling time window rather than from independent 30-day periods. Turning our rolling-window data into independent samples would mean selecting 1/30 of the time indices, that is, indices {30i + s} for i ∈ {0, 1, 2, 3, …}, where s is the starting index, 0 ≤ s ≤ 29. However, we are only interested in estimating the joint density, and all starting indices s are equally valid. So, we decided not to filter to independent periods in the density plot and effectively marginalize over s when performing kernel density estimation in the figure. However, we note that the data points are oversampled, so we must avoid overinterpreting point clusters. An independently sampled version of this plot is presented in the appendix (Fig. A8) and shows the same overall structure in each case.

FIG. 8. Joint distributions of the mean daily accumulated precipitation and daily mean temperature aggregated over the Mahanadi River basin for the five different datasets. Each point is from a single day, and the contours show the joint density as estimated by kernel density estimation. The x axis has power-law scaling, chosen so that the differences in the distributions can be seen, and is clipped at an upper value of 75 mm day⁻¹.
HadGEM3 underestimates the co-occurrence of very wet events where the Ganges-Brahmaputra SPI is greater than 1 and the Indus SPI is greater than 2. All translations perform reasonably well at correcting for these edge cases, but again UNIT+QM performs best. This is evident both from the JS divergence values and visually; while UNIT underdisperses the data, UNIT+QM provides an accurate correction for more extreme events.

FIG. 9. Joint distributions of mean daily accumulated precipitation across the Ganges-Brahmaputra and the Indus basins for the five different datasets. Each point is from a single day, and the contours show the joint density as estimated by kernel density estimation. The x and y axes have power-law scaling, chosen so that the differences in the distributions can be seen, and are clipped at an upper value of 75 mm day⁻¹.

Conclusions
In this study, we examined the suitability of unpaired image-to-image translation networks for bias correcting climate data. We found that the UNIT neural network architecture was not sufficient by itself for bias correcting data, because it reduced the dispersion of the data and, therefore, produced less extreme values than expected. We showed that the bias correction produced by UNIT did have desirable properties, such as coherently bias correcting cross-variable correlations and spatial structures. We proposed combining the UNIT translation with quantile mapping, a more traditional bias correction technique, a combination similar to previous work (François et al. 2021). We found that applying these techniques in sequence made up for the shortcomings of each: UNIT was able to bias correct the spatial and cross-variable correlations, while QM corrected UNIT's tendency to underdisperse the data.
When applied to bias correct the daily minimum, mean, and maximum temperature, the precipitation, and the geopotential height, we found that UNIT+QM was able to simultaneously correct all variables. This is an advantage over previous work in which the dynamical feature of geopotential height is assumed to be correct and unbiased and thus is used as a key part of regularizing the bias correction (Pan et al. 2021). In our method, we leave it to the discriminator network to ensure that the temperatures and precipitation are consistent with each other and with the dynamical variable. This allows the network to bias correct the dynamical variable where appropriate.

TABLE 1. JS divergence values for each model relative to ERA5 across three joint distributions of extreme events: temperature and pressure on the hottest days of the year over central India (Fig. 10), temperature and precipitation over the wettest continuous 30 days of each year over the Ganges-Brahmaputra basin (Fig. 11), and the standardized precipitation index in the Ganges-Brahmaputra and Indus River basins (Fig. 12). These JS divergence values are all significantly distinct from each other; see section e of the appendix.

Model | Hot days (Fig. 10) | Wet periods (Fig. 11) | Two basins SPI (Fig. 12)
Designing bias correction methods such as UNIT+QM, which can bias correct many-variable joint distributions, is crucial for studying and making risk assessments of compound extreme events. Such cross-variable corrections are also important when the output of GCMs is used as boundary conditions for regional climate models (White and Toumi 2013). We have shown that incorporating modern machine learning methods alongside classical techniques can provide more powerful tools for bias correction.
Further work on this topic is needed in several respects: 1) to consider the performance of this approach relative to newer developments in statistical bias correction, such as multivariate quantile mapping (Cannon 2018), alongside computational demands, as well as how the approaches could be further combined to address the issues inherent in both; and 2) to consider bias correction of nonstationary data. We note that the latter is also an ongoing challenge for classical techniques. Many of the extensions to classical techniques used for nonstationary data, such as detrending either additively or multiplicatively (Cannon et al. 2015), could also be used with UNIT+QM.
Acknowledgments. Author Fulton was supported by NERC Doctoral Training Partnership Grant NE/L002558/1. Author Clarke was supported by NERC Doctoral Training Partnership Grant NE/L002612/1. Author Hegerl was funded by NERC Large Grant GloSAT (NE/S015698/1). Computing resources for this work were provided by a Microsoft AI for Earth grant. We acknowledge the World Climate Research Programme, which, through its Working Group on Coupled Modelling, coordinated and promoted CMIP6. We thank the climate modelling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the data and providing access, and the multiple funding agencies who support CMIP6 and ESGF. We also thank Simon Tett, Massimo Bollasina, and Friederike Otto for helpful discussions.
Data availability statement. All data used in this work come from the Climate of the 20th Century Plus (C20C+) Project (Folland et al. 2014) and are freely available. The code is available online (https://github.com/dfulu/climateTranslation).

Network Details and Supporting Analysis

a. Preprocessing the data
As is ubiquitous in training machine learning (ML) models, we preprocessed the data before training the UNIT network.
The daily mean temperature fields, originally in kelvins, were converted to degrees Celsius and then divided by the mean across all latitudes and longitudes of the standard deviation of temperature. This maintained the physically significant value of 0°C while avoiding excessive gradients when updating the model parameters. The daily maximum and minimum temperatures were converted to degrees Celsius above or below the daily mean temperature. This allowed us to choose ReLU activation functions that enforced T_min < T_mean < T_max. They were also divided by the spatially averaged standard deviation of temperature.
Precipitation values are extremely skewed and non-Gaussian, which can be challenging for an ML model to learn. In keeping with ML best practice, we transformed the data so that the distribution became closer to Gaussian. We found that the fourth root of the daily accumulated precipitation appeared reasonably dispersed. We then divided the data by its standard deviation. We note that the transforms we used need not make the data exactly Gaussian, merely less extremely distributed.
We simply standardized the 500-hPa geopotential height by subtracting the mean and dividing by the standard deviation.The means and standard deviations used were not calculated pointwise; instead, a single value was used globally.
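The preprocessing described in this subsection can be sketched as a single function. This is our own minimal illustration of the transforms above (array names and the exact domain shape are ours, not from the original code):

```python
import numpy as np

def preprocess(t_mean_k, t_max_k, t_min_k, precip, z500):
    """Sketch of the preprocessing described above; each input has shape
    (time, lat, lon).

    - mean temperature: K -> degC, divided by the spatial mean of the
      pointwise temperature standard deviations
    - max/min temperature: offsets from the daily mean, same scaling
    - precipitation: fourth root, then divided by its standard deviation
    - geopotential height: standardized with a single global mean/std
    """
    t_mean_c = t_mean_k - 273.15
    t_std = t_mean_c.std(axis=0).mean()       # one scale for the whole domain
    t_mean = t_mean_c / t_std
    t_max_off = (t_max_k - t_mean_k) / t_std  # non-negative, suits ReLU output
    t_min_off = (t_mean_k - t_min_k) / t_std  # non-negative, suits ReLU output
    p4 = precip ** 0.25
    p = p4 / p4.std()
    z = (z500 - z500.mean()) / z500.std()
    return t_mean, t_max_off, t_min_off, p, z

rng = np.random.default_rng(2)
shape = (30, 6, 8)
t_mean_k = 290 + rng.normal(0, 3, size=shape)
t_max_k = t_mean_k + rng.uniform(0, 5, size=shape)
t_min_k = t_mean_k - rng.uniform(0, 5, size=shape)
precip = rng.gamma(2.0, 3.0, size=shape)
z500 = 5800 + rng.normal(0, 50, size=shape)
t_mean, t_max_off, t_min_off, p, z = preprocess(t_mean_k, t_max_k, t_min_k, precip, z500)
```

Expressing the maximum and minimum temperatures as non-negative offsets is what lets the ReLU output activations enforce T_min < T_mean < T_max by construction.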

b. UNIT network structure
Figure A1 shows a diagram of the UNIT architecture used here.We use color and shape to represent different layers and subcomponents of the network.
The encoders E(x) are composed of seven convolutional layers with convolutional downsampling to reduce the spatial dimension. The decoders G(z) are composed of nine convolutional layers with nearest-neighbor upsampling to increase the spatial dimension. Nearest-neighbor upsampling was used instead of transposed convolutions to avoid the checkerboard artifacts (Odena et al. 2016) present in the original UNIT network. Some of these layers are organized into residual skip connections (He et al. 2016), which allow more routes for gradient backpropagation to earlier layers.
We use a multiscale discriminator in each domain. Isola et al. (2017) found that such multiscale discriminators produce images that have more realistic large-scale features and sharper small-scale features. The multiscale discriminator is composed of three discriminator components that assess the plausibility of the translations at separate spatial scales. We use spatial scales that are 1×, 2×, and 4× as coarse as the input image. Each discriminator component had six convolutional layers. For the discriminators coarser than the original image, the input image is coarsened by mean pooling.
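The mean-pooling coarsening used to feed the 2× and 4× discriminator components can be sketched as block averaging (our own illustration; the function name is ours):

```python
import numpy as np

def mean_pool(field, factor):
    """Coarsen a 2-D field by block averaging, as used to feed the
    coarser components of the multiscale discriminator."""
    h, w = field.shape
    return field.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

field = np.arange(16.0).reshape(4, 4)
coarse = mean_pool(field, 2)  # the 2x-coarse view of the input
```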
Batch normalization and leaky ReLU activation functions were used throughout these components, as shown in the diagram. The final layer of the decoder G(z) used different activation functions for the different channels: the mean temperature and geopotential height used none (i.e., the identity), while the precipitation and the maximum and minimum temperatures (expressed as differences from the mean temperature) used ReLU activation functions. Where padding was used, it was replication padding. The number of filters used in each layer is shown in the diagram.
The bottom panel of the diagram shows the overall structure of the UNIT network. It also shows the path of an image x from domain X through the network components to generate its reconstruction (x̂_x), its translation to domain Y (ŷ_x), and its cyclic reconstruction (x̂_xy). This panel also shows where in the path the translation is passed into the multiscale discriminator.
This panel also shows the use of the land mask. Early in the study, we noticed that the UNIT network was producing unphysical translations at land-sea boundaries, such as negative daily temperature ranges (this was noticed before we added some of the preprocessing and activation constraints mentioned in Fig. A1). This occurred exclusively on land-sea border pixels. To help the network translate these border regions effectively, the encoder E(x) and discriminator D(x) networks were given the concatenated weather fields and binary land-sea mask [x, mask] as input. The decoder G(z) was only trained to reconstruct the GCM and observational images from the latent encoding z and ignored the land-sea mask.

c. Training and hyperparameters
The hyperparameters used in training this network were taken from those used in the original UNIT network and manually adjusted slightly using only the results on the training data. They were not highly optimized, so there may be room for improvement.
We trained the network from scratch using a batch size of 8 and the Adam optimizer (Kingma and Ba 2014) with a learning rate of 5 × 10⁻⁴ for both the translation network and the discriminators. In the Adam optimizer, we used weight decay, β₁, and β₂ values of 0.0001, 0.5, and 0.999, respectively. No explicit regularization was used in training; the only regularization present is the implicit regularization associated with the reparameterization step in the VAEs.
We trained the network for 208 000 iterations. As with many adversarial networks, training is very unstable, so the model was checkpointed regularly (every 2000 iterations) throughout training. We chose the 208 000-iteration checkpoint manually, although training continued to 570 000 iterations; at this checkpoint the loss was reasonably low and stable. The network took a few days to train on a single NVIDIA Tesla T4 graphics processing unit (GPU).
The loss coefficients used in Eq. (6) are shown in Table A1. These are similar to those used in the original UNIT network.
As noted in the main text, we use a weighted L1-norm loss function for the image reconstructions in Eq. (2). This was motivated by a desire to emphasize the precipitation fields in the translation, as these have the most complex distributions. We weight the L1 loss calculated separately for each channel such that

L_1 = Σ_c w_c L_1,c,

where L_1 is the total weighted L1 loss for a sample, L_1,c is the L1 loss for channel c, and w_c is the weight for that channel. We set the weight of the precipitation channel to 5 and the weights of the other four channels to 1.
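This weighted loss can be sketched as follows. This is our own illustration of the combination above, not the training code; the exact normalization of the weighted sum is our assumption.

```python
import numpy as np

def weighted_l1(pred, target, weights):
    """Weighted L1 reconstruction loss: the L1 loss is computed per
    channel and combined with the channel weights.

    pred, target: arrays of shape (channels, lat, lon).
    weights: one weight per channel; precipitation gets 5, others 1.
    """
    per_channel = np.abs(pred - target).mean(axis=(1, 2))  # L_1,c per channel
    return float(np.sum(np.asarray(weights) * per_channel))

pred = np.zeros((5, 2, 2))
target = np.ones((5, 2, 2))
loss = weighted_l1(pred, target, weights=[5, 1, 1, 1, 1])  # precip channel first
```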

d. Empirically calculated JS divergence
In this study, we have used an approximation of the JS divergence [Eq. (7)] that can be applied to binned data. Figure A2 shows the sensitivity to the number of bins of the spatial mean of the JS divergence between precipitation and temperature calculated at each grid point (i.e., the mean value across Fig. 5 for each dataset). This shows that our results are qualitatively robust to the choice of the number of bins in this estimation.
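The binned JS divergence estimate for two-dimensional samples can be sketched as follows (our own minimal implementation of the standard estimator; the function name is ours):

```python
import numpy as np

def js_divergence_binned(x, y, bins=20):
    """JS divergence between two 2-D samples, estimated on a shared
    histogram with `bins` bins per dimension (as in the text)."""
    lo = np.minimum(x.min(axis=0), y.min(axis=0))
    hi = np.maximum(x.max(axis=0), y.max(axis=0))
    edges = [np.linspace(lo[d], hi[d], bins + 1) for d in range(2)]
    p, _, _ = np.histogram2d(x[:, 0], x[:, 1], bins=edges)
    q, _, _ = np.histogram2d(y[:, 0], y[:, 1], bins=edges)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        nz = a > 0  # wherever a > 0, the mixture b = (a + other)/2 is > 0 too
        return float(np.sum(a[nz] * np.log(a[nz] / b[nz])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(3)
sample_a = rng.normal(size=(2000, 2))          # e.g. (precipitation, temperature)
sample_b = rng.normal(loc=2.0, size=(2000, 2))
d_same = js_divergence_binned(sample_a, sample_a)
d_diff = js_divergence_binned(sample_a, sample_b)
```

In natural logarithms, the JS divergence is bounded above by ln 2, which makes values from different variable pairs directly comparable.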

e. Significance of JS divergence values
In the main text, we quote the values of the JS divergence calculated between each of the distributions in Fig. 4. We estimate the significance of these values via bootstrapping.
We randomly sample N fields with replacement from the ERA5 dataset, which is itself of length N. We sample from each dataset x ∈ {HadGEM3, UNIT, UNIT+QM, QM} similarly (for the results in Table A2, we also sample again independently from ERA5). We then calculate the JS divergence between these bootstrapped samples. We repeat this bootstrapping calculation 200 times. The 5th, 50th, and 95th percentiles, as well as the values quoted in the main text that did not use bootstrapping, are collated in Table A2.
We note that the JS value calculated using all of the data is consistently lower than even the 5% value from the bootstrapped samples. This is simply because the bootstrapped samples are less diverse than the full dataset, so the match between the distributions can only be expected to worsen. We consider one data source to be significantly better than another if its 50% bootstrapped JS divergence is lower than the 5% value of the other. By this criterion, UNIT+QM is a significant improvement on the other methods in this statistic.
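The bootstrap procedure can be sketched generically (our own illustration; here a toy statistic stands in for the JS divergence so the sketch is self-contained):

```python
import numpy as np

def bootstrap_percentiles(data_x, data_era5, stat, n_boot=200, seed=0):
    """Bootstrap the 5th/50th/95th percentiles of a divergence statistic.

    Resamples with replacement from both datasets (N samples each, where
    N is the ERA5 length) and recomputes `stat` each time.
    """
    rng = np.random.default_rng(seed)
    n = len(data_era5)
    vals = [
        stat(
            data_x[rng.integers(0, len(data_x), size=n)],
            data_era5[rng.integers(0, n, size=n)],
        )
        for _ in range(n_boot)
    ]
    return np.percentile(vals, [5, 50, 95])

# toy statistic standing in for the JS divergence: absolute mean difference
stat = lambda a, b: abs(a.mean() - b.mean())
rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, size=500)
era5 = rng.normal(loc=0.0, size=400)
pcts = bootstrap_percentiles(x, era5, stat, n_boot=50)
```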
Table A3 shows confidence intervals, calculated in the same way, for the JS divergences presented in Table 1.

f. Basins used
Figure A3 shows the area of each basin used in this study. The mask was created by checking whether the center of each grid point is contained within the boundaries of the basin shape downloaded from the World Bank data catalog (World Bank 2019).
As noted, two of these basins are clipped because of the domain boundaries chosen (8°S-30°N, 44°-121°E). In this study, we only use the results of aggregating over these basins as a demonstration of how each bias correction method performs when aggregated over a spatial area. Therefore, clipping these basins to our domain does not affect the conclusions of the work presented here.

g. Plots repeated as copulas
To assess the joint distribution of the translations without the influence of the marginal distributions, we recreate some of our figures from the main text after transforming the data values into quantiles.This produces an empirical estimate of the copula of the two-dimensional datasets plotted.Figures A4-A7 are copula alternatives to Figs. 4, 6, 8, and 9, respectively.
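The quantile transform underlying these empirical copula plots can be sketched in a few lines (our own illustration; the function name is ours):

```python
import numpy as np

def to_quantiles(values):
    """Map each value to its empirical quantile in (0, 1] within its own
    dataset; applying this to each margin of a 2-D sample yields the
    empirical copula. Ties are broken by order of appearance."""
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (ranks + 1) / len(values)

precip = np.array([3.0, 1.0, 2.0])
quantiles = to_quantiles(precip)  # the copula margin for these values
```

Because the transform is computed with respect to each dataset's own values, the resulting plots are insensitive to the marginal distributions by construction.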
The results of Figs. A4-A7 support what we observed in the figures in the main text. The UNIT+QM method performs best across the four figures: its copula is more similar to the ERA5 copula than those of the other methods in every case. QM does not adequately correct the copula of the HadGEM3 data in any of the figures.
When a dataset is transformed via regular QM, its copula is unaffected. Therefore, we might expect the copulas of the HadGEM3 and QM datasets to be identical in each of the figures and, similarly, the UNIT and UNIT+QM data to have identical copulas. However, we apply quantile mapping to each month separately, which makes our QM a conditional QM. This method can, therefore, modify the copula, not just the marginal distributions, when considered over the whole year.
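The distinction can be sketched as follows. This is our own minimal empirical quantile mapping (assuming equal model and observation sample sizes, and names of our choosing), with a per-month wrapper illustrating the conditional variant described above:

```python
import numpy as np

def quantile_map(model_vals, obs_vals, new_vals):
    """Basic empirical quantile mapping: interpolate each new model
    value from the model's sorted sample to the observations' sorted
    sample at matching empirical quantiles (equal sample sizes assumed)."""
    return np.interp(new_vals, np.sort(model_vals), np.sort(obs_vals))

def monthly_quantile_map(model_vals, obs_vals, months):
    """Conditional QM fitted separately for each calendar month, as in
    the text; fitting per month is why the yearly copula can change."""
    corrected = np.empty_like(model_vals, dtype=float)
    for m in np.unique(months):
        sel = months == m
        corrected[sel] = quantile_map(model_vals[sel], obs_vals[sel], model_vals[sel])
    return corrected

model = np.array([1.0, 2.0, 3.0, 4.0])
obs = np.array([10.0, 20.0, 30.0, 40.0])
months = np.array([1, 1, 2, 2])
mapped = quantile_map(model, obs, np.array([1.0, 2.5, 4.0]))
corrected = monthly_quantile_map(model, obs, months)
```

Because each month's mapping is monotonic only within that month, the pooled yearly ranks of the corrected data can differ from those of the input, which is how the copula changes.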
In Figs. A6 and A7, there are no data points in the gap between 0 and around 0.1-0.2 on the x axis. This is because around 10%-20% of the basin-average precipitation values for the Mahanadi and Indus basins are exactly zero, and hence these 10%-20% of data points are assigned a quantile of zero. The Ganges-Brahmaputra basin is large enough that almost no days have exactly zero precipitation across the basin, so no gap is observed on the y axis in Fig. A7.

h. Independently sampled SPI events
Figure A8 shows independently sampled values of SPI from the Ganges-Brahmaputra and Indus River basins. The overall structure of the distributions is very similar to Fig. 12, in which values are calculated from a rolling time window.

i. Alternative matching algorithm
We examine the opposite matching algorithm to the one used to produce Fig. 7. Instead of taking fields from a dataset x ∈ {HadGEM3, UNIT, UNIT+QM, QM} and finding the closest match in ERA5, we match the opposite way: we take each ERA5 field and find its closest match within each dataset x; the resulting SSIM distributions are shown in Fig. A9. From the SSIM match distributions, we may be left asking how often each individual sample is chosen as the best match. If the translation is done well, we would expect many samples to be chosen as the best match, rather than the same sample always being matched. Figure A10 (based on matching the fourth root of precipitation) and Fig. A11 (based on matching temperature) show the results of this analysis. We count how many times each sample is chosen as the best match and then plot the frequencies (y axis) of these best-match counts (x axis). For example, in Fig. A10, we can see that when we take HadGEM3 fields and select the best match among the ERA5 data, around 1500 of the ERA5 fields were chosen as the best match exactly once. A little over 400 ERA5 fields were chosen as the best match exactly twice, and so on. In the figures, the black lines marked ERA5 → ERA5 are the results of matching the training ERA5 data to the validation ERA5 data. Each panel of these figures also shows the fraction of data that was matched to at least once.
In Fig. A10, we see evidence of the underdispersion of UNIT. When UNIT is matched to ERA5, a few of the ERA5 fields are chosen many times, with one ERA5 field being the best match to around 400 UNIT fields. In the UNIT+QM panel, we see that the match frequencies are similar to the ERA5-to-ERA5 baseline regardless of whether we match ERA5 to UNIT+QM or UNIT+QM to ERA5.

FIG. A10. Distribution of the number of times matched-to fields were chosen as the best match, using SSIM to compare the fourth root of precipitation. The x axis is the number of times a field was chosen as the best match (the match count). The y axis is the number of fields with this match count. The figure shows the results of matching in both directions between ERA5 and the bias correction methods. The black line is the result of performing the same SSIM matching between the ERA5 training set and the ERA5 test set and is included for comparison.
Figure A11 shows features similar to those of Fig. A10, but less pronounced. This may be because temperature has a less complex spatial joint distribution than precipitation.

FIG. 1. Mean precipitation (mm day⁻¹) during peak monsoon months (June-September) for each dataset and their bias with respect to ERA5.

FIG. 3. Distributions of daily accumulated precipitation from a single grid point located at the southern tip of India (centered at 8.6°N, 77.9°E). Five distributions for the same location are plotted from the five different data sources. The normalized histogram uses an x scale of the fourth root of precipitation so that the differences in the distributions can be seen, clipped at an upper value of 75 mm day⁻¹.

FIG. 4. Joint distributions of daily accumulated precipitation and daily mean temperature from a single grid point located at the southern tip of India (centered at 8.6°N, 77.9°E) for the five different datasets. Each point is from a single day, and the contours show the joint density as estimated by kernel density estimation. The x scale in all subplots is the fourth root of precipitation, chosen so that the differences in the distributions can be seen, and is clipped at an upper value of 75 mm day⁻¹.

FIG. 5. Maps of estimated JS divergence [Eq. (7)] of the HadGEM3, UNIT, UNIT+QM, and QM datasets with respect to the ERA5 dataset. The JS divergence is calculated at each grid point between the two-dimensional precipitation and temperature distributions using 20 bins in each dimension.

FIG. 10. Joint distributions of the daily maximum temperature and 500-hPa geopotential height at each grid point over central India, bounded by 8°-28°N, 72°-85°E, on the hottest day of each year (defined by the spatially averaged daily maximum temperature) for the five different datasets. Each point is from a single day and grid point, and the contours show the joint density as estimated by kernel density estimation.
FIG. 11. Joint distributions of the daily maximum temperature and accumulated precipitation at each grid point over the Ganges-Brahmaputra River basin over the wettest 30-day period of each year, defined by the spatially averaged accumulated precipitation, for the five different datasets. Each point is from a single grid point, and the contours show the joint density as estimated by kernel density estimation.

FIG. 12. Joint distributions of the standardized precipitation index at each grid point over the Ganges-Brahmaputra and Indus River basins using a 30-day time window for the five different datasets. Each point is from a single grid point, and the contours show the joint density as estimated by kernel density estimation.
FIG. A3. Geographical extent of the key areas and basins used in this study. The red outline denotes the overall region used in this study. For the basins that extend outside this red-outlined box, we have aggregated only the area of the basin that lies inside this region. The black bounding box denotes the area bounded by 8°-28°N, 72°-85°E, used to study extreme heat events as in Fig. 10.

FIG. A6. Copula decomposition of Fig. 8: Joint distributions of the mean daily accumulated precipitation and daily mean temperature aggregated over the Mahanadi River basin for the five different datasets. Each point is from a single day, and the contours show the joint density as estimated by kernel density estimation. The x and y scales show the precipitation and temperature expressed as quantiles calculated with respect to each dataset.

FIG. A5. Copula decomposition of Fig. 6: Joint distributions of daily accumulated precipitation at a grid point located at the southern tip of India (centered at 8.6°N, 77.9°E) and its neighboring cell directly eastward (centered at 78.8°E) for the five different datasets. Each point is from a single day, and the contours show the joint density as estimated by kernel density estimation. The x and y scales show the precipitation at the two grid points expressed as quantiles calculated with respect to each dataset.
FIG. A8. Joint distributions of the standardized precipitation index at each grid point over the Ganges-Brahmaputra and Indus River basins using a 30-day time window for the five different datasets. Each point is from a single grid point and a completely independent 30-day time window.

FIG. A7. Copula decomposition of Fig. 9: Joint distributions of mean daily accumulated precipitation across the Ganges-Brahmaputra and the Indus basins for the five different datasets. Each point is from a single day, and the contours show the joint density as estimated by kernel density estimation. The x and y scales show the precipitation in the two basins expressed as quantiles calculated with respect to each dataset.
FIG. A9. Distributions of the SSIM between each ERA5 field and its best-matched dataset field, shown for the mean 2-m temperature, mean 500-hPa geopotential height, and the fourth root of daily precipitation.

TABLE A1. The coefficients used in the UNIT loss function.
FIG. A2. The spatial mean of the JS divergence between temperature and precipitation at all grid points.

TABLE A2. The values of the JS divergence calculated between each of the distributions in Fig. 4, as mentioned in the main text, and their bootstrapped percentile intervals.

TABLE A3. Bootstrapped confidence percentile intervals calculated using the same method as in Table A2, but for the JS divergence values presented in Table 1.