1. Introduction
As society tries to adapt to Earth’s changing climate, access to accurate, local-scale climate information is essential. Earth system models (ESMs) provide state-of-the-art projections on a global scale but provide insufficient spatial resolution for regional analyses. Thus, downscaling—creating high-resolution climate information from low-resolution, large-scale data—is an important practical tool. Spatially and temporally high-resolution fields of climate data over large scales are important for many applications, including simulation of extremes at local scales (e.g., fires, floods, storms; Fischer et al. 2021), local planning, and making climate-informed ecological decisions such as tree species selection (MacKenzie and Mahony 2021). Hence, accurate and computationally efficient downscaling methods are crucial for local climate adaptation.
Strategies for downscaling coarse-resolution climate model simulations to regional scales can be broadly classified into dynamic downscaling and statistical downscaling. Dynamic downscaling employs a limited-area numerical climate model to resolve fine-scale features, driven by large-scale weather patterns from the low-resolution ESM (Skamarock et al. 2001). Dynamic downscaling can capture complex spatial patterns at smaller scales and can provide high-accuracy downscaling. However, it is highly computationally intensive and can only be used for relatively short time periods. Statistical downscaling develops statistical relationships between low-resolution (LR) and high-resolution (HR) climate variables. It can either be applied at individual points, often using a combination of bias correction and transformation of statistical moments (Maraun 2013), or be used to downscale entire fields. Common techniques for the latter include parametric approaches (e.g., Gaussian process/kriging), where covariances are specified to allow analytic solutions. While these approaches have shown some success, such climate-field downscaling methods make strong assumptions about the distribution and homogeneity of statistics, which are often not satisfied (Chen et al. 2014). Also, many current methodologies struggle to accurately downscale spatially complex variables (i.e., variables with nonlinear dependence on elevation or spatially heterogeneous dependence structures) and to capture extremes (Harris et al. 2022). Recently, a new strategy for statistical downscaling of climate fields has been developed that uses deep learning algorithms to learn a mapping from LR to HR paired climate fields (Annau et al. 2023). This strategy has been found to produce downscaled fields with much higher accuracy than traditional statistical downscaling, without the prohibitive computational cost of dynamic downscaling.
Downscaling is intrinsically an underdetermined problem, with a distribution of possible HR realizations physically consistent with any given LR input (Afshari et al. 2023). This is especially true since weather is sensitive to initial conditions, small differences in which can result in drastically different outcomes through the development of internal variability (Lucas-Picher et al. 2021). Stochastic weather generators, which attempt to sample from the distribution of weather states, have been used to account for this variability (Wilks 2010). Being able to capture the full variability of a downscaling problem is crucial for quantifying uncertainty and for characterizing extremes. Ideally, downscaling techniques would allow sampling from the HR distribution, conditioned on the LR input.
Deep learning methods are a promising approach for such distributional downscaling problems. Generative adversarial networks (GANs) have been successful in various generative artificial intelligence (AI) fields, especially computer vision applications (Goodfellow et al. 2014). GANs generally use two separate deep neural networks during training: a generator network, which is given input and attempts to create plausible counterfeits of the training data, and a discriminator or critic network, which is provided training data mixed with generator output and attempts to distinguish between the counterfeits and the real data. During training, the two networks play a minimax game: The generator tries to improve its output to “fool” the discriminator, and the discriminator tries to improve its ability to distinguish between real and generated samples. In the last few years, GANs have been introduced to deep learning–based downscaling and have shown success in drawing realizations from high-dimensional non-Gaussian distributions with complicated dependence structures. Conditional GANs, developed by Mirza and Osindero (2014), allow the GAN to draw realizations from distributions conditioned on covariates.
Much of the development of GANs for climate downscaling builds on work from the computer vision field of super resolution. Most studies in computer vision use conditional GANs, where the networks are provided LR information and learn to sample from the HR conditional distribution (Ledig et al. 2017). With the introduction of GANs to climate downscaling, difficulties with instability during training (Wang et al. 2020) were addressed by the introduction of the Wasserstein GAN (WGAN; Arjovsky et al. 2017). In the WGAN, instead of using a discriminator network to estimate the probability of individual realizations being real, a critic network estimates the Wasserstein distance between the true HR distribution and the generated distribution. Intuitively, the Wasserstein distance is a critic score, specifying the quality of the downscaled fields compared to the training data. Not only does this approach substantially improve stability during training, but for downscaling, it conceptually makes sense to focus on convergence in distribution of generated and truth fields.
The initial formulation of (unconditional) GANs used a stochastic approach where the only input to the generator was Gaussian noise, generating different realizations for each different noise input. With the development of conditional GANs (Mirza and Osindero 2014; Ledig et al. 2017) and the subsequent super-resolution GAN (SRGAN) and enhanced super-resolution GAN (ESRGAN) frameworks from the field of super resolution, the noise input was replaced by the conditioning fields, leading to a semideterministic network. In this setting, each trained generator would still draw a realization from the conditional distribution, but it would always draw the same realization for each set of conditioning fields (in principle, one could draw a different realization by training a new model). A few studies successfully used variations of this architecture for downscaling climate fields; for example, Stengel et al. (2020) adapted the SRGAN model to create very high-resolution downscaled wind fields by first downscaling to a moderate resolution and then further downscaling to the final HR.
Recently, studies have investigated methods for allowing explicit stochasticity in conditional GANs, usually using variations of adding noise covariates stacked with the LR conditioning fields (e.g., Miralles et al. 2022). Price and Rasp (2022) concatenated a noise layer partway through their generator network, while Harris et al. (2022) concatenated multiple noise inputs with the conditioning information at the beginning of the network. Both studies found that the stochastic results were underdispersive: Trained models were unable to capture the full range of variability, often only sampling from the center of the conditional distribution. Recent advancements in super resolution (e.g., nESRGAN+; Rakotonirina and Rasoanaivo 2020) have improved stochastic calibration but have not yet been adapted to climate downscaling. Furthermore, many downscaling studies using stochastic GANs have focused their analyses on image quality. While Harris et al. (2022) performed a comprehensive evaluation of stochastic calibration of downscaled precipitation using continuous ranked probability score (CRPS) and rank histograms, the general issue of correcting underdispersion in generated ensembles remains an open question.
We aim to fill this gap by improving stochastic GAN frameworks for climate downscaling. Most of this work builds on Annau et al. (2023). While their model showed success for downscaling wind fields, it was not stochastic; for each set of conditioning fields, only a single realization of the HR field was generated. We use a model architecture similar to that of Annau et al. (2023), adapted for full stochasticity. An obvious challenge with testing distributional quality in a downscaling setting is the lack of a truth conditional distribution, as in most applications we only have access to a single truth realization for each time step. To address this challenge, we first consider an idealized approach based on synthetic data with known distributional properties. Based on these experiments, we test a “noise injection” method, where hundreds of noise fields, at different spatial resolutions, are injected into the latent layers of the network. This approach provides excellent stochastic calibration on the synthetic data. We then test our modification on a real-world downscaling problem, predicting HR wind components from LR conditioning data. Challenges with underdispersion on the wind data lead to the development of an updated loss function using a probabilistic error function and modification of the training method to fully utilize the stochasticity. Our final model is successful at capturing variability and improves estimates of moderate extremes.
2. Methods
All models in this paper use the same basic super-resolution structure. We train the models on paired sets of LR conditioning fields (covariates) and HR truth fields. The GAN then learns a mapping to the HR fields from the input covariates. For consistency, we keep the same resolution and size of fields across all models: HR fields are 128 × 128 pixels, and LR fields are 16 × 16 pixels, resulting in a downscaling factor of 8.
a. Data
1) Synthetic data
To create the LR input fields, we spatially averaged 8 × 8 regions of the HR fields.
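The coarsening step can be sketched in NumPy as follows. This is a minimal illustration, assuming fields whose dimensions are divisible by the downscaling factor; the function name `coarsen` is hypothetical and is not taken from the original code.

```python
import numpy as np

def coarsen(hr_field, factor=8):
    """Average non-overlapping factor x factor blocks of a 2D field."""
    ny, nx = hr_field.shape
    if ny % factor or nx % factor:
        raise ValueError("field dimensions must be divisible by factor")
    # Split each axis into (block index, position within block), then average within blocks
    blocks = hr_field.reshape(ny // factor, factor, nx // factor, factor)
    return blocks.mean(axis=(1, 3))

# Toy HR field (not the synthetic data of this study)
hr = np.arange(128 * 128, dtype=float).reshape(128, 128)
lr = coarsen(hr)  # 128 x 128 -> 16 x 16
```

Because each LR pixel is an exact block mean of the HR field, the domain mean is preserved, which is why this pairing needs no bias correction.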
For all synthetic data experiments, training used 5000 pairs of fields; 2000 additional pairs were reserved for testing. To compare marginal distributions, we created a set of 500 fields with the same large-scale spatial pattern (i.e., the same x and y scale and rotation), so that the only difference was the added Gaussian/mixture field. These could then be interpreted as ensembles of truth realizations given the same conditioning field and were used to test generated pixelwise marginal distributions. Note that since all LR input fields were created by coarsening the HR fields, these datasets result in pure super-resolution setups, with no need to bias correct the LR inputs.
2) Convection-permitting regional model case study
Since this research aims to improve deep learning–based downscaling of climate data, it is important to test results in more realistic settings. Here, we modeled HR zonal u and meridional υ wind components using LR wind components, pressure, temperature, and HR topography. Our architecture follows that of Annau et al. (2023) with the exceptions that (i) HR topography is included as a covariate in the generator and (ii) all covariates are also passed to the critic. We did not explicitly include seasonal or diurnal variables, as the necessary information should be available to the model through the suite of covariates used (wind components, temperature, and pressure). Wind is an important climate variable for various applications, but it is often challenging to model because of its complex mesoscale patterns. We considered a square region covering the coastal mountains in southwestern Canada (49°–53°N, 122°–126°W), as its high degree of topographic complexity represents a realistically challenging downscaling scenario. HR wind fields were obtained from WRF runs produced for the western Canada (WCA) simulation driven by ERA-Interim (Li et al. 2019), which contains hourly data for a 14-year period from 2001 to 2015. LR covariates were from ERA5 (Hersbach et al. 2020), a state-of-the-art global reanalysis product. While ERA-Interim was used to drive the WRF model at the boundary conditions, both ERA5 and ERA-Interim represent the same realization of the climate system, so it is reasonable to use ERA5 in the paired LR data.
This HR and LR pairing represents a practical application of downscaling, where the LR and HR fields are from different models (as in Annau et al. 2023). Many previous studies in deep learning–based downscaling have used idealized pairings, where the LR fields are created by coarsening the HR fields, resulting in perfectly matched pairing (note that this was the approach we used for the synthetic data experiments). As a consequence of natural internal variability, some of the meteorological features on scales common to both resolutions will differ between the LR and HR fields, so our model has to account for such differences as part of the downscaling process.
To preprocess the data, we first transformed the WRF fields to the ERA5 projection and then remapped the HR fields to the specified downscaling factor. We then standardized the data to mean zero and unit variance across time and space, and within each covariate, as a standard normalization technique in machine learning studies (Annau et al. 2023). Standardization is uniform across the domain and across all time steps; that is, a single scale and offset factor are used for each variable. Finally, we selected three apparently unexceptional years [no El Niño–Southern Oscillation (ENSO) or evident seasonal extreme wind events] for training (2003, 2008, and 2013), 2 years for validation during training to check convergence, and used the remaining 9 years of data as an unseen test set for running analyses. Hourly data over 3 years resulted in 26 304 samples; this training sample number was limited by computational constraints. Data were randomly shuffled during the training process. All models were trained on a single NVIDIA RTX 4090 GPU with 24 GB VRAM.
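The standardization and the training sample count can be checked with a short sketch. The array below is random toy data, and the function name `standardize` is an assumption made for illustration only.

```python
import calendar
import numpy as np

def standardize(x):
    """Standardize one variable with a single offset and scale over all time steps and grid points."""
    offset, scale = x.mean(), x.std()
    return (x - offset) / scale, offset, scale

# Hourly samples in the three training years (2008 is a leap year)
n_train = sum((366 if calendar.isleap(y) else 365) * 24 for y in (2003, 2008, 2013))

rng = np.random.default_rng(0)
u = 5.0 + 3.0 * rng.standard_normal((48, 16, 16))  # toy (time, y, x) array, not real data
u_std, offset, scale = standardize(u)
```

Keeping `offset` and `scale` allows generated fields to be transformed back to physical units with `u_std * scale + offset`.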
b. Model
This work utilizes conditional GANs (Mirza and Osindero 2014), which have shown success at learning the mapping between low-resolution variables and the desired output variables. Specifically, we use the Wasserstein conditional GAN formulation (Arjovsky et al. 2017), where the critic network learns to estimate the Wasserstein distance between the high-dimensional distributions of the generated and true fields (
1) Architecture
Most GAN network architectures employ dense convolutional blocks, which have been shown to be effective at extracting representative features from images. Our network architecture is based on the ESRGAN (Wang et al. 2018) setup, using Residual-in-Residual Dense Blocks (RRDBs) as the main convolutional blocks in the generator. RRDBs, introduced by Zhang et al. (2018), employ densely connected convolutional layers to extract important features while also maintaining direct connections from the previous layers to all convolutional layers to create a contiguous memory. Such an architecture provides freedom for the network to extract complex features while also ensuring consistency with previous layers. Specifically, we adapt the architecture employed in Annau et al. (2023) to allow noise input and multiple covariate streams.
Unlike other applications of super resolution, climate downscaling often has access to pertinent HR information during training. A common example of such HR information is topography, which strongly influences local climate. While many previous studies have included topography as a covariate, it has often been input at low resolution with climate covariates, discarding potentially useful information. Harris et al. (2022) included HR covariates by first convolving them down to the shape of the LR covariates. We developed a different strategy, based on depthwise separable networks (Jiang et al. 2020), in which the generator contains two input streams, one for each resolution. Each stream has equivalent convolutional blocks applied in parallel, and after the LR stream has passed through the upsampling blocks to increase the resolution, the two streams are concatenated and passed through a final convolutional block (Fig. 1).
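At the level of array shapes, the merging of the two streams can be sketched as below. This is only an illustration: nearest-neighbor repetition stands in for the learned upsampling blocks, the channel count of 64 is arbitrary, and `merge_streams` is a hypothetical name.

```python
import numpy as np

def merge_streams(lr_features, hr_features, factor=8):
    """Bring LR-stream features to HR size and stack them with HR-stream features along channels."""
    # Nearest-neighbor repetition is a stand-in for the learned upsampling blocks
    upsampled = lr_features.repeat(factor, axis=1).repeat(factor, axis=2)
    return np.concatenate([upsampled, hr_features], axis=0)

lr_features = np.zeros((64, 16, 16))    # (channels, y, x); 64 channels is an arbitrary choice
hr_features = np.ones((64, 128, 128))
merged = merge_streams(lr_features, hr_features)  # input to the final convolutional block
```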
Architecture of the GAN networks, showing RRDB with noise injection. Green denotes locations where noise is added into the network. Rectified linear units (ReLUs) are used to introduce nonlinearity; the negative slope parameter is shown in parentheses. Labels on the convolutional blocks show parameter values, with k, n, and s denoting the size of the kernel, the number of filters, and the stride, respectively. Note that in the final convolutional block of the generator, npred corresponds to the number of predictands.
Citation: Artificial Intelligence for the Earth Systems 4, 2; 10.1175/AIES-D-24-0044.1
In the standard super-resolution formulation, the critic network is only given samples of the predictands (either generated or from the training data). However, Harris et al. (2022) found that passing all available covariates to the critic network can improve its predictions, and in the Wasserstein GAN formulation, it is important that the critic network is able to make relevant estimates of the Wasserstein distance. Given that we want the generator to sample from a conditional distribution, we thus want the critic to make use of conditioning information when quantifying this distribution. Similarly to the generator, we adapted the critic network to include a separate input stream for the LR covariates, which is concatenated after the HR stream has been downsampled via strided convolution (Fig. 1).
2) Noise injection
To improve the model’s ability to sample across the entire range of the conditional distribution, we adjusted the generator architecture to inject noise directly into the latent representations produced by the convolutional layers. Specifically, we based our approach on nESRGAN+ (Rakotonirina and Rasoanaivo 2020) and concatenated uncorrelated, mean zero, unit variance Gaussian noise fields with the latent layers inside each Dense Block (Fig. 1). With our architecture, this leads to six noise injection instances in each Residual Dense Block and 18 noise injections in each RRDB. Our full-noise generator comprises 14 RRDBs in the LR input stream and two RRDBs in the HR input stream, followed by a single HR RRDB after concatenation. In total, this results in 252 LR and 54 HR noise layers.
To test the effect of the number of noise injection layers, we replaced some of the stochastic RRDBs with deterministic RRDBs (i.e., RRDBs without noise injection). Altogether, we tested three different levels of noise injection: low (two stochastic LR RRDBs and no stochastic HR RRDBs), moderate (seven stochastic LR RRDBs and no stochastic HR RRDBs), and full noise (14 stochastic LR RRDBs and two HR RRDBs). As a baseline model, we also considered the more standard noise-covariate approach (similar to that used in Leinonen et al. 2021), for which a Gaussian noise field is concatenated with the LR input covariates before passing through the generator network.
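The noise layer counts for the full-noise model follow directly from the block counts, as the sketch below checks. The helper `inject_noise` illustrates the concatenation of a Gaussian noise field with a latent array; adding a single noise channel per injection and the array sizes used are assumptions made for illustration.

```python
import numpy as np

NOISE_PER_RDB = 6     # noise injections per Residual Dense Block (from the text)
NOISE_PER_RRDB = 18   # noise injections per RRDB (from the text)
RDB_PER_RRDB = NOISE_PER_RRDB // NOISE_PER_RDB  # implies three dense blocks per RRDB

def noise_layers(n_stochastic_rrdb):
    """Total number of noise injections for a given number of stochastic RRDBs."""
    return n_stochastic_rrdb * NOISE_PER_RRDB

full_lr = noise_layers(14)      # 14 LR-stream RRDBs
full_hr = noise_layers(2 + 1)   # two HR-stream RRDBs plus one after concatenation

def inject_noise(latent, rng):
    """Concatenate one N(0, 1) noise channel to a (channels, y, x) latent array."""
    noise = rng.standard_normal((1,) + latent.shape[1:])
    return np.concatenate([latent, noise], axis=0)

rng = np.random.default_rng(0)
latent = np.zeros((32, 16, 16))  # toy latent representation; 32 channels is arbitrary
noisy = inject_noise(latent, rng)
```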
3) Losses
In Wasserstein GAN training, the critic tries to maximize the difference in the distributional distance between true fields and generated fields, while the generator attempts to minimize this distance (i.e., it attempts to make it difficult for the critic to distinguish the generated fields from the ground truth data). Previous studies have found that solely relying on the critic loss (adversarial loss) for the generator training can lead to instability, such that the training process does not converge (Wang et al. 2018). Here, we use the common approach of adding an extra content loss term to the generator—a pixelwise error metric between the training data and the generated fields, intended to aid convergence in realization.
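The two objectives can be sketched as simple functions of critic scores. This is a minimal illustration, not the training code: it assumes an unweighted sum of the adversarial and content terms (as written in Algorithm 1), uses MAE as the content loss, and omits the Lipschitz constraint on the critic that Wasserstein GANs require. The function names are hypothetical.

```python
import numpy as np

def critic_loss(scores_real, scores_fake):
    """Loss minimized by the critic: the negative of the score gap between true and generated fields."""
    return np.mean(scores_fake) - np.mean(scores_real)

def generator_loss(scores_fake, generated, truth):
    """Adversarial term plus a pixelwise content term (MAE here)."""
    adversarial = -np.mean(scores_fake)
    content = np.mean(np.abs(generated - truth))
    return adversarial + content

# Toy critic scores and fields
c_loss = critic_loss(np.array([1.0, 3.0]), np.array([0.0, 0.0]))
g_loss = generator_loss(np.array([2.0, 4.0]), np.ones((2, 2)), np.zeros((2, 2)))
```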
4) Training
Since our goal is to sample realizations across the conditional distribution of HR fields, we do not want the loss function to force the generator to create copies of the training data; we consider the ground truth as one realization of the conditional distribution. Ideally, we would expect generated realizations to have statistics and large-scale features similar to those of the training data, but not to be identical. Overreliance on content loss (particularly deterministic measures) can degrade model performance, as it overly penalizes small deviations in feature location/presence. In situations where features are spatially shifted, pixelwise error metrics such as the content loss will penalize the model twice: once for the feature not occurring where it does in the ground truth, and once for the feature occurring where it is not in the ground truth. This double-penalty problem is a well-known issue with pixel-based losses in generative networks (Rossa et al. 2008) and results in overly blurry output, where the model converges on the conditional median (or mean, depending on the form of the content loss), thus suppressing small-scale features and extremes (Annau et al. 2023). If there is a strong association between LR and HR scales, then there is less freedom for variability in the conditional HR distribution. Where the LR and HR scales are broadly separated, the conditional variability should be greater and could be more heavily penalized by the double-penalty problem. While the adversarial loss in GANs (in our case, the Wasserstein distance) is not a pixelwise metric and does not constrain the network in the same way, the use of content losses can suppress variability.
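A one-dimensional toy example makes the double penalty concrete. The fields below are invented for illustration: a prediction containing the correct feature displaced by two pixels receives twice the MAE of a prediction with no feature at all.

```python
import numpy as np

truth = np.zeros(10)
truth[3] = 1.0        # a single feature at position 3

shifted = np.zeros(10)
shifted[5] = 1.0      # same feature, displaced by two pixels

blank = np.zeros(10)  # no feature at all

def mae(a, b):
    """Pixelwise mean absolute error."""
    return np.mean(np.abs(a - b))

mae_shifted = mae(shifted, truth)  # penalized at position 3 and again at position 5
mae_blank = mae(blank, truth)      # penalized only at position 3
```

Under this metric the featureless prediction scores better, which is the incentive toward blurry output described above.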
We considered two training techniques in our models to address the double-penalty problem while rewarding convergence at large scales common to both resolutions: frequency separation and stochastic sampling. Frequency separation, introduced by Annau et al. (2023), coarsens HR fields to only provide low frequencies to the content loss, whereas the full HR fields are used for the adversarial loss. The motivation of this approach is to allow the model to more freely develop high-frequency patterns by removing some of the content loss constraints. Stochastic sampling is an approach modified from Harris et al. (2022), where multiple stochastic realizations are passed to the content loss, which uses an ensemble metric to assess calibration (algorithm 1). In each generator training step, we generate six stochastic realizations of each field in the batch and pass these to the content loss. For the MAE loss, we take the pixelwise average across the stochastic realizations before computing the loss. Both frequency separation and stochastic sampling allow generated variability at small scales but encourage convergence at large scales.
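The two content-loss variants for MAE can be sketched as follows. The block-average low-pass filter is an assumption standing in for the coarsening used in frequency separation, and all function names are hypothetical.

```python
import numpy as np

def low_pass(field, factor=8):
    """Keep only large scales by averaging factor x factor blocks (an assumed filter)."""
    ny, nx = field.shape[-2:]
    blocks = field.reshape(field.shape[:-2] + (ny // factor, factor, nx // factor, factor))
    return blocks.mean(axis=(-3, -1))

def freq_sep_mae(generated, truth, factor=8):
    """MAE between low-pass-filtered generated and truth fields."""
    return np.mean(np.abs(low_pass(generated, factor) - low_pass(truth, factor)))

def ensemble_mean_mae(realizations, truth):
    """MAE of the pixelwise mean over realizations; realizations has shape (n_real, y, x)."""
    return np.mean(np.abs(realizations.mean(axis=0) - truth))

truth = np.zeros((16, 16))
# Checkerboard of +1/-1: purely small-scale variability with zero block means
fine = np.indices((16, 16)).sum(axis=0) % 2 * 2.0 - 1.0
plain = np.mean(np.abs(fine - truth))      # ordinary MAE sees the small-scale pattern
separated = freq_sep_mae(fine, truth)      # low-pass MAE does not
ens = ensemble_mean_mae(np.stack([truth + 1.0, truth - 1.0]), truth)
```

In both cases, small-scale differences from the single truth realization go unpenalized as long as the large-scale pattern, or the ensemble mean, matches.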
Algorithm 1
Pseudocode for stochastic sampling algorithm using the CRPS metric. Note that we chose six stochastic realizations as the maximum number that did not exceed our graphics processing unit (GPU) memory.
nRealisation ← 6
for i ∈ 1:nBatch do
    LRBatch ← LRBatches_i
    HRBatch ← HRBatches_i
    adversarialLoss ← −mean[Critic(LRBatch)]
    for j ∈ 1:nSample do
        subBatch ← repeat(LRBatch_j, nRealisation)
        subRealisations ← Generator(subBatch, invariant)
        CRPSBatch_j ← mean[pixelwiseCRPS(subRealisations, HRBatch_j)]
    end for
    contentLoss ← mean(CRPSBatch)
    loss ← adversarialLoss + contentLoss
end for
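The content-loss part of Algorithm 1 can be sketched in NumPy. This sketch assumes the standard ensemble estimator of the CRPS, E|X − y| − ½E|X − X′|, with expectations taken over ensemble members; the estimator actually used in training is not stated here. `toy_generator` is a hypothetical stand-in for the trained generator, and the adversarial term is omitted.

```python
import numpy as np

def pixelwise_crps(realizations, truth):
    """Ensemble CRPS estimate per pixel; realizations is (n_real, y, x), truth is (y, x)."""
    skill = np.abs(realizations - truth).mean(axis=0)
    # Mean absolute difference over all pairs of ensemble members
    spread = np.abs(realizations[:, None] - realizations[None, :]).mean(axis=(0, 1))
    return skill - 0.5 * spread

def crps_content_loss(generator, lr_batch, hr_batch, invariant, n_real=6, rng=None):
    """Mean CRPS over a batch, drawing n_real realizations for each LR sample."""
    rng = np.random.default_rng() if rng is None else rng
    per_sample = []
    for lr, hr in zip(lr_batch, hr_batch):
        sub_batch = np.repeat(lr[None], n_real, axis=0)
        realizations = generator(sub_batch, invariant, rng)
        per_sample.append(pixelwise_crps(realizations, hr).mean())
    return float(np.mean(per_sample))

def toy_generator(sub_batch, invariant, rng):
    """Hypothetical generator: invariant field plus unit Gaussian noise."""
    return invariant + rng.standard_normal((sub_batch.shape[0],) + invariant.shape)

rng = np.random.default_rng(0)
lr_batch = np.zeros((2, 4, 4))
hr_batch = np.zeros((2, 32, 32))
invariant = np.zeros((32, 32))
loss = crps_content_loss(toy_generator, lr_batch, hr_batch, invariant, rng=rng)
```

With a single realization the spread term vanishes and the CRPS reduces to the absolute error, which is why several realizations per sample are needed for the loss to reward ensemble spread.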
c. Validation
Quality assessment in image generation problems often poses a challenge because there are multiple, often competing, metrics that could be used. Potential metric priorities include convergence in realization or convergence in statistical features (such as spatial covariance or pixelwise marginal distributions). In general, we will consider a combination of these factors, depending on the problem. As noted earlier, while deterministic pixelwise error metrics are important, they should not be relied on too heavily because of the double-penalty problem. Following Harris et al. (2022), Annau et al. (2023), and Ravuri et al. (2021), we used a radially averaged spectral power (RASP) metric for comparing spatial variance at different scales (or, alternatively, the covariance structure) between the generated fields and the ground truth. We calculated RASP by first performing a two-dimensional Fourier transform on each field, averaging the amplitudes within annular rings centered at wavenumber 0, and then averaging power densities across at least 1000 fields. Ideally, spectral power at each spatial scale should be the same in the generated and truth fields; to aid in visual comparison, we standardized amplitudes at each wavenumber by the amplitude of the ground truth field in the corresponding bin. This quantity allows assessment of biases across spatial scales.
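The RASP calculation can be sketched with NumPy's FFT routines. This is a minimal illustration for square fields: it bins Fourier amplitudes by rounded integer wavenumber, which is one reasonable reading of the annular averaging described above, and the function names are hypothetical.

```python
import numpy as np

def radial_spectrum(field):
    """Radially averaged Fourier amplitude of a square 2D field, for wavenumbers 0 to n/2 - 1."""
    n = field.shape[0]
    amp = np.abs(np.fft.fftshift(np.fft.fft2(field)))
    # Integer wavenumber of each Fourier coefficient along one axis
    freqs = np.fft.fftshift(np.fft.fftfreq(n, d=1.0 / n))
    kx, ky = np.meshgrid(freqs, freqs)
    ring = np.rint(np.hypot(kx, ky)).astype(int)
    # Mean amplitude within each annular ring
    sums = np.bincount(ring.ravel(), weights=amp.ravel())
    counts = np.bincount(ring.ravel())
    return sums[: n // 2] / counts[: n // 2]

def standardized_rasp(generated, truth):
    """Ratio of mean generated to mean truth spectra; inputs have shape (n_fields, n, n)."""
    gen = np.mean([radial_spectrum(f) for f in generated], axis=0)
    ref = np.mean([radial_spectrum(f) for f in truth], axis=0)
    return gen / ref

# Toy check: a plane wave with wavenumber 8 across a 32 x 32 field
x = np.arange(32)
wave = np.cos(2 * np.pi * 8 * x / 32)[None, :] * np.ones((32, 1))
peak = int(np.argmax(radial_spectrum(wave)))

rng = np.random.default_rng(0)
fields = rng.standard_normal((5, 32, 32))
ratio_same = standardized_rasp(fields, fields)
ratio_half = standardized_rasp(0.5 * fields, fields)
```

A ratio of one at every wavenumber indicates matched spectra; values below one at high wavenumbers indicate overly smooth generated fields.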
Predicting extremes is crucial, as extreme events usually cause the most damage, both to human society and the environment. There are different possible definitions of extremes; in this paper, we investigate relative extremes, defined as values beyond certain quantiles on the tails of the distribution (IPCC 2023). This approach allows us to analyze near-tail extremes without needing to employ extreme value theory. For analyses investigating distributions of extremes, we tested multiple percentile pairs (0.1 and 99.9; 0.01 and 99.99; and 0.001 and 99.999). Unless otherwise specified, results were similar across all extreme percentiles. Generally, we show the 0.01th and 99.99th percentiles as a representation of nearer and relatively well-sampled extremes.
3. Results
a. Synthetic data
Noise injection
Models with full noise injection performed substantially better overall than the baseline models at matching the true marginal distribution of the synthetic data. A representative example of pixelwise marginal distributions for the truth and generated fields is shown in Fig. 2a. The baseline
(a) KDEs of marginal distributions of p(HR|LR) for the unimodal synthetic dataset for one example pixel (i = 5, j = 5) for the true distribution and generated distributions. KDEs are based on 500 realizations for a single conditioning field for each distribution. Dashed line shows the true marginal distribution. The inset figure shows the full y-axis range. (b) Violin plot showing KS statistic values comparing generated marginal conditional distributions to ground truth distributions for all pixels. Statistics are calculated for each pixel individually, using 500 realizations of a single conditioning field. Lines show 0.25, 0.5, and 0.75 quantiles, respectively. (c) CDF of the rank histogram on unimodal synthetic data, with four models, showing calibration of conditional distributions. The dashed line shows the reference uniform distribution. Rank histograms were calculated across 100 randomly selected conditioning fields, with 96 HR realizations generated for each. (d) KDEs of marginal conditional distributions for one example pixel of a bimodal dataset, comparing true (dashed line) and generated distributions. Distributions were estimated using the same approach as in (a).
Decreasing the number of noise injection layers in the generator decreased the performance of conditional distributions (Fig. 2b). Low noise injection (
Rank histograms provide another tool for investigating the calibration of conditional distributions, averaged across multiple conditioning fields. Rank histograms showed good calibration for all models with full noise injection and severe underdispersion for the baseline model (Fig. 2c). The
Learning multimodal distributions is challenging for GANs; they tend to show mode collapse, in which distributions are collapsed to the conditional mean (Saatci and Wilson 2017). We found that using the bimodal dataset [Eqs. (6)–(10)], the
Investigating results of the full distribution p(HR), statistics of generated fields had fewer artifacts and were closer to the ground truth statistics using the
Spatial fields of median and 99.9th percentiles of the full distributions across samples for ground truth and generated data from two models, using the unimodal synthetic dataset [Eqs. (1)–(3)].
As well as better capturing pixelwise marginal variability, all noise injection models performed better at representing covariance patterns. The RASP metrics (Fig. 4) demonstrate that the baseline model showed too little power for a range of low wavenumbers and then spurious spikes at other wavenumbers. The
RASP for four models using the unimodal synthetic dataset [Eqs. (1)–(3)]. Values are standardized to amplitudes of ground truth wavenumbers, so perfectly matched spectral power occurs at a value of one. Solid lines and shaded regions, respectively, show the mean and ± one standard deviation across 1200 randomly selected samples. The dashed line indicates the wavenumber corresponding to LR pixel size.
Overall, synthetic data experiments showed that models using full noise injection were substantially better at capturing conditional distributions than the baseline model. In general, all models with noise injection performed comparably well. There was also noticeable improvement in the quality of the full distributions using noise injection, and the S class models showed improved ability to represent spatial dependence.
b. Wind downscaling case study
As above, we first present results for distributions conditioned on a single set of LR fields before moving to the full distributions across the test set. Note that while all models predicted both zonal (eastward) and meridional (northward) wind components, the results were generally similar, and unless otherwise stated, we only show results for meridional components.
Even with noise injection, the F class models applied to a realistic downscaling problem produced underdispersed results (Fig. 5). While the
CDFs of rank histograms for meridional wind components, using four different models. Rank histograms were calculated across 100 randomly selected conditioning fields, with 96 HR realizations generated for each. The dashed line shows the reference uniform CDF.
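A rank histogram and its CDF can be computed as sketched below. This is a minimal illustration, assuming the rank of each truth value is the number of ensemble members below it; ties are not treated specially, and the function names are hypothetical.

```python
import numpy as np

def rank_histogram(ensemble, truth):
    """Counts of the truth's rank within the ensemble.

    ensemble: (n_members, n_points); truth: (n_points,).
    Returns n_members + 1 counts, one for each possible rank.
    """
    ranks = (ensemble < truth).sum(axis=0)
    return np.bincount(ranks, minlength=ensemble.shape[0] + 1)

def rank_cdf(counts):
    """Cumulative distribution of ranks, for comparison with the uniform CDF."""
    return np.cumsum(counts) / counts.sum()

# Toy example: three members and three truth values
ensemble = np.array([[0.0, 0.0, 0.0],
                     [1.0, 1.0, 1.0],
                     [2.0, 2.0, 2.0]])
truth = np.array([-1.0, 1.5, 5.0])
counts = rank_histogram(ensemble, truth)
cdf = rank_cdf(counts)
```

For a calibrated ensemble the ranks are uniformly distributed, so the CDF follows the diagonal. An underdispersed ensemble places the truth in the lowest and highest ranks too often, so the CDF rises steeply at both ends and is flat in between.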
With the
Example meridional and zonal wind fields for coastal British Columbia using the
Looking at the full distribution, spectral power was also better calibrated with the S class models (Fig. 7). The baseline model produced low spectral power at both low and high wavenumbers; the performance of the
RASP metric (median and IQR) standardized to ground truth values for zonal and meridional wind fields. Spectral powers are calculated across 1200 randomly selected fields. The dashed line shows the wavenumber corresponding to LR grid size.
Pixelwise marginal statistics of full distributions differed slightly between models (Fig. 8). Differences in median and interquartile range were spatially smoother with the S class models, suggesting these models were able to capture fine spatial patterns better. The baseline model substantially underestimated interquartile range (IQR) in many locations. This bias was improved with the
Pixelwise median and IQR of the full distribution of the test dataset for meridional wind fields. The first column shows truth statistics, followed by difference fields for each of the four models (truth − model).
Investigating the tail behavior of the full distributions, we found that models that were better at learning conditional distributions performed better at predicting accurate extremes (Fig. 9). Comparing moderately large marginal extremes (99.99th and 0.01th percentiles) across time steps, the
Calibration of moderate extremes for meridional wind fields over full distributions. (a) Boxplots of distributions of difference in the 99.99th and 0.01th percentiles of ground truth and generated realizations for four models, based on 500 realizations for each of 350 randomly selected conditioning fields. Values below zero represent model overestimation; values above zero represent underestimation. (b) Difference maps of pixelwise 0.01th percentiles of ground truth and generated realizations for four models.
Calibration of spatial extremes was also improved in the S class models (Fig. 10). Here, rank histograms represent the stochastic calibration of models with regard to spatial extreme values (i.e., large or small quantiles of values across the domain). Rank histograms of 0.01th and 99.99th percentiles across wind fields showed that the
CDFs of rank histograms based on the 0.01th and 99.99th percentiles of meridional wind fields over 400 conditioning fields, with 96 realizations of each field. Dashed lines represent the CDF of a uniform distribution.
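The rank-histogram diagnostic can be sketched as follows. This is a minimal illustration on synthetic scalars rather than the study's evaluation code: each "truth" stands in for a scalar summary of a true field (e.g., a domain-wide percentile), and each ensemble holds the same summary for a set of generated realizations. The function names, the Gaussian toy data, and the ensemble spreads are assumptions made for the example.

```python
import random

def rank_of_truth(truth, ensemble):
    """Number of ensemble members below the truth (0 .. len(ensemble))."""
    return sum(member < truth for member in ensemble)

def rank_histogram_cdf(truths, ensembles):
    """Empirical CDF of the truth's rank over many (truth, ensemble) pairs."""
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for truth, ensemble in zip(truths, ensembles):
        counts[rank_of_truth(truth, ensemble)] += 1
    cdf, running = [], 0
    for count in counts:
        running += count
        cdf.append(running / len(truths))
    return cdf

def max_deviation_from_uniform(cdf):
    """Largest gap between the rank CDF and the uniform reference CDF."""
    n_bins = len(cdf)
    return max(abs(p - (i + 1) / n_bins) for i, p in enumerate(cdf))

def simulate(member_sd, n_cases=1000, n_members=96, seed=0):
    """Toy data: truth ~ N(0, 1); ensemble members ~ N(0, member_sd)."""
    rng = random.Random(seed)
    truths = [rng.gauss(0.0, 1.0) for _ in range(n_cases)]
    ensembles = [[rng.gauss(0.0, member_sd) for _ in range(n_members)]
                 for _ in range(n_cases)]
    return truths, ensembles

calibrated = max_deviation_from_uniform(rank_histogram_cdf(*simulate(1.0)))
underdispersed = max_deviation_from_uniform(rank_histogram_cdf(*simulate(0.3)))
print(f"calibrated: {calibrated:.3f}, underdispersed: {underdispersed:.3f}")
```

A calibrated ensemble gives a rank CDF close to the uniform reference, whereas an underdispersed ensemble places the truth in the lowest or highest rank too often, so its CDF departs from the dashed line at both ends.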
To investigate why the
Comparison of stochastic calibration of the
4. Discussion
This paper discusses three main classes of GAN-based downscaling models distinguished by noise type (noise covariate vs noise injection), training method (frequency separation vs stochastic sampling), and content loss type (deterministic MAE vs probabilistic CRPS). We aim to improve stochastic calibration, creating models that can sample successfully from the full range of the conditional distribution. We first present a novel network architecture, where many layers of noise at different resolutions are injected into the generator. Compared to the baseline
Most conditional GANs in super resolution are deterministic, and recent attempts at reintroducing stochasticity have added noise fields as additional covariates (e.g., Leinonen et al. 2021). Our approach of injecting noise directly into the convolutional layers fundamentally differs in that it adds noise to latent representations deep inside the network, instead of to the input. When noise is introduced as a covariate at the beginning of the network, we hypothesize that the network will learn low weights for the noise layers to optimize the loss function. By adding noise to the latent representations, we slightly alter features inside the network, leading to better representation of conditional and full distributions. Our approach is similar to the nESRGAN+ architecture (Rakotonirina and Rasoanaivo 2020) which injects noise inside the Residual-in-Residual Dense Blocks, but we inject noise one level deeper, inside the Dense Blocks, to alter the output of the basic convolutional layers. In contrast to the nESRGAN+, we also use noise injection at both the low and high resolution, allowing for more scales of stochasticity. Interestingly, we never experienced problems with overdispersion as we increased the number of noise injection layers; marginal distributions were closest with the maximum possible number of noise injection layers for a given architecture.
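The distinction between the two noise mechanisms can be illustrated with a toy single-layer example. The sketch below uses NumPy rather than the PyTorch implementation of the study, and the layer functions, the leaky ReLU activation, the per-channel noise scale, and all numerical values are hypothetical choices made for illustration. It shows only that injected noise perturbs the latent features directly, whereas noise supplied as an input covariate has no effect if the network assigns it zero weight.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def conv3x3(x, w):
    """Zero-padded 3x3 convolution. x: (C_in, H, W); w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + wd]
            out += np.einsum("oi,ihw->ohw", w[:, :, dy, dx], patch)
    return out

def noise_injection_layer(x, w, noise_scale, rng):
    """Noise is added to the latent features produced by the convolution."""
    features = conv3x3(x, w)
    noise = rng.standard_normal(features.shape)
    return leaky_relu(features + noise_scale[:, None, None] * noise)

def noise_covariate_layer(x, w_aug, rng):
    """Noise enters only as one extra input channel."""
    noise = rng.standard_normal((1,) + x.shape[1:])
    return leaky_relu(conv3x3(np.concatenate([x, noise], axis=0), w_aug))

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 8))            # toy input features: 2 channels, 8x8
w = 0.1 * rng.standard_normal((4, 2, 3, 3))   # 4 output channels
scale = np.full(4, 0.5)                       # hypothetical per-channel noise scale

a = noise_injection_layer(x, w, scale, rng)
b = noise_injection_layer(x, w, scale, rng)
print("injection, realizations differ:", not np.allclose(a, b))

# Extreme case of the hypothesized failure mode: zero weights on the noise channel
w_aug = np.concatenate([w, np.zeros((4, 1, 3, 3))], axis=1)
c = noise_covariate_layer(x, w_aug, rng)
d = noise_covariate_layer(x, w_aug, rng)
print("covariate with zero noise weights, realizations identical:", np.allclose(c, d))
```

The zero-weight case is an idealization of the hypothesis stated above, not a result from trained networks; in practice the learned weights would be small rather than exactly zero.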
Noise injection is certainly not the only method of creating stochastic output using deep learning methods. Getter et al. (2024) investigated using convolutional neural networks to predict parameters of Gumbel distributions from which wind speed values could be sampled. This is an appealing approach, in that it allows relatively simple distributional analysis, especially for extremes. However, it requires the data to match the selected distributions and thus is less flexible than our nonparametric approach. Also, it is unclear how spatial dependence of realizations would be accounted for in this setup. Another theoretically interesting setup, developed by Saatci and Wilson (2017), involves a Bayesian training framework in which model weights are sampled from learned distributions. However, this method is computationally expensive and likely impractical for realistic applications. Future work could investigate and compare these approaches.
Stochastic sampling is an approach adapted from Harris et al. (2022), in which the content loss function is calculated on a set of stochastic realizations. While the
All models considered in this study use a separation of spatial scales in the content loss in an effort to address the double-penalty problem inherent in the ill-conditioned nature of climate downscaling. In our synthetic data experiments, we found that partial frequency separation (PFS), as described in Annau et al. (2023), resulted in well-calibrated output. However, this method did not perform as well in the realistic setting of wind component downscaling, motivating consideration of the stochastic sampling approach. Fundamentally, PFS and stochastic sampling have similar goals: allowing the adversarial loss freedom to create small-scale features while rewarding consistency between generated and conditioning fields at large scales. While PFS achieves this by only providing low-frequency information to the content loss, the stochastic sampling approach applies the content loss function across an ensemble of stochastic realizations, thus “smoothing out” the smaller-scale features of the generated fields. The stochastic sampling approach is likely more accurate than PFS: instead of arbitrarily choosing a frequency for separation, the sample conditional means define the transition from conditioning scales to sampling scales. Indeed, we found that for downscaling wind components, models trained with stochastic sampling always outperformed those using PFS. A practical challenge with stochastic sampling is that it uses more computational resources during training than PFS, as each of the stochastic realizations has to be used during backpropagation. Most notably, stochastic sampling nearly doubled the amount of memory required during training, and it increased training time by about 50% (using a stochastic batch size of six). In practice, the choice of training approach will likely depend on the desired outcome and the computational resources available.
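To make the ensemble-based content loss concrete, the following minimal sketch evaluates the CRPS over a small set of realizations using the standard ensemble estimator, CRPS = mean|x_i − y| − ½ mean|x_i − x_j|. It is written in NumPy on synthetic data; the function name, the Gaussian toy fields, and the exact form of the estimator are assumptions for illustration and may differ from the loss used in the study's implementation.

```python
import numpy as np

def ensemble_crps(ensemble, truth):
    """Domain-mean CRPS. ensemble: (M, H, W) realizations; truth: (H, W)."""
    # Mean absolute error of the members against the truth, per pixel
    accuracy = np.abs(ensemble - truth).mean(axis=0)
    # Mean absolute difference between all pairs of members, per pixel
    spread = np.abs(ensemble[:, None] - ensemble[None, :]).mean(axis=(0, 1))
    return float(np.mean(accuracy - 0.5 * spread))

rng = np.random.default_rng(0)
truth = rng.standard_normal((32, 32))            # toy "true" field
calibrated = rng.standard_normal((6, 32, 32))    # six members with the correct spread
collapsed = np.zeros((6, 32, 32))                # every member equals the conditional mean

print("CRPS, calibrated ensemble:", ensemble_crps(calibrated, truth))
print("CRPS, collapsed ensemble: ", ensemble_crps(collapsed, truth))
```

With a single member the spread term vanishes and the score reduces to the MAE, which is why a deterministic content loss can be viewed as a special case. In this toy setting the ensemble with realistic spread scores lower than the ensemble collapsed onto the mean, so the loss rewards variability between realizations rather than penalizing it.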
While the stochastic sampling models performed substantially better at capturing conditional and full distributions, they performed only slightly better at capturing the spatial patterns of single fields. Thus, if the goal is to produce downscaled fields without needing to capture conditional variability, it may be prudent to use PFS and reduce training requirements. It is, of course, possible that the improvement in capturing the conditional distributions gained from stochastic sampling will depend strongly on the fields being considered.
Wind fields show a high level of spatial heterogeneity, which we expect is responsible for the difficulties experienced by the
a. Future work
This study presents a framework for stochastic downscaling with GANs. However, there are multiple areas of uncertainty that should be addressed prior to operational use. We only investigated wind fields in a single geographic location; it will be important to test this stochastic GAN framework on a suite of important variables across multiple geographic areas with different degrees of spatial heterogeneity. It may also be useful to investigate potential transformations of data with nonnormal distributions.
It will be necessary to properly account for temporal dependence of the generated realizations. In the current framework, correlations between time steps are only introduced through the LR conditioning fields. Since substantial stochasticity occurs at subgrid scales, there is no mechanism for enforcing consistency of small features between time steps. Future research could adapt the network to include a recurrent architecture, such as the convolutional gated recurrent unit used by Leinonen et al. (2021).
Our study investigates near-tail extremes (0.01th and 99.99th percentiles), and a more detailed analysis of extreme tail behavior using our model is currently being undertaken. While the use of extreme value theory is outside the scope of this paper due to the strong nonstationarity of variables and the relatively short dataset, analyses using extreme value theory would be a valuable extension.
GANs are not the only class of deep learning model suitable for stochastic downscaling, and other models should be investigated. In particular, diffusion models have recently been introduced to many applications, and recent research by Mardani et al. (2025) has shown impressive results using diffusion models for stochastic atmospheric downscaling of four variables. While diffusion models have been favored for being more stable to train than GANs, the Wasserstein GAN is a conceptually different approach to training and in our experience shows excellent stability. Diffusion models may have benefits over Wasserstein GANs but are substantially more computationally expensive to train. For reference, the model developed in Mardani et al. (2025) took 21 504 GPU hours to train, while our GAN models only required about 36 h. Depending on the situation, Wasserstein GANs may be more suitable due to their relative efficiency. Other recent studies have investigated the use of vision transformer models for downscaling instead of convolutional neural networks (CNNs). While transformers are more computationally intensive to train than CNNs, they can automatically account for temporal dependence (Zhong et al. 2024). An interesting avenue of future research would be to compare transformer models to CNN-based models. An optimal model may combine aspects of both architectures.
b. Implications
Modeling of extremes is of utmost importance to climate adaptation, but extremes are often more challenging to model than averages (Thompson et al. 2013). Infrastructure needs to handle precipitation and wind extremes; most heat-related human health issues occur during extreme heat waves (Kephart et al. 2022). Generally, statistical downscaling has not been successful at capturing extremes, and while dynamical downscaling can perform better, it is too computationally intensive for some practical applications. Our study has shown that by improving the ability of GANs to make distributional estimates, we are able to obtain better estimates of near-tail extremes, both spatially and temporally, often with a marginal increase in computational cost. Hence, deep learning–based downscaling shows promise as a statistical downscaling strategy with the ability to more accurately capture extremes. Further research will be required to determine whether these results generalize to a nonstationary climate (e.g., across time periods). If so, deep learning downscaling could become an essential part of climate adaptation for estimating future extremes.
Acknowledgments.
We thank Nicolaas Annau for providing code and for valuable discussion. We also acknowledge Dr. Colin Mahony, Dr. Karen Price, and Dr. Alex Cannon for helpful comments and discussion. The manuscript was improved by the thorough reviews of Dr. Julie Bessac and two anonymous reviewers. Finally, KD thanks Chloe Leroy and Lucy for their constant support. Funding for this work was provided by the British Columbia Ministry of Forests through the ClimatEx program and by the British Columbia Graduate Fellowship.
Data availability statement.
Our fully stochastic GAN with CRPS loss is implemented with PyTorch at https://github.com/nannau/ClimatExML/tree/stochastic. LR conditioning fields are available from ERA5 (https://cds.climate.copernicus.eu/) and HR WRF training data are available at https://www.gwfnet.net/Metadata/Record/T-2020-05-28-Q1KtfEjVdz0aBmauuvG9r9w.
REFERENCES
Afshari, A., J. Vogel, and G. Chockalingam, 2023: Statistical downscaling of SEVIRI land surface temperature to WRF near-surface air temperature using a deep learning model. Remote Sens., 15, 4447, https://doi.org/10.3390/rs15184447.
Annau, N. J., A. J. Cannon, and A. H. Monahan, 2023: Algorithmic hallucinations of near-surface winds: Statistical downscaling with generative adversarial networks to convection-permitting scales. Artif. Intell. Earth Syst., 2, e230015, https://doi.org/10.1175/AIES-D-23-0015.1.
Arjovsky, M., S. Chintala, and L. Bottou, 2017: Wasserstein generative adversarial networks. ICML’17: Proc. 34th Int. Conf. on Machine Learning—Vol. 70, Sydney, NSW, Australia, PMLR, 214–223, https://dl.acm.org/doi/abs/10.5555/3305381.3305404.
Chen, F., Y. Liu, Q. Liu, and X. Li, 2014: Spatial downscaling of TRMM 3B43 precipitation considering spatial heterogeneity. Int. J. Remote Sens., 35, 3074–3093, https://doi.org/10.1080/01431161.2014.902550.
Fischer, E. M., S. Sippel, and R. Knutti, 2021: Increasing probability of record-shattering climate extremes. Nat. Climate Change, 11, 689–695, https://doi.org/10.1038/s41558-021-01092-9.
Getter, D., J. Bessac, J. Rudi, and Y. Feng, 2024: Statistical treatment of convolutional neural network superresolution of inland surface wind for subgrid-scale variability quantification. Artif. Intell. Earth Syst., 3, e230009, https://doi.org/10.1175/AIES-D-23-0009.1.
Goodfellow, I. J., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, 2014: Generative adversarial networks. arXiv, 1406.2661v1, https://doi.org/10.48550/arXiv.1406.2661.
Harris, L., A. T. T. McRae, M. Chantry, P. D. Dueben, and T. N. Palmer, 2022: A generative deep learning approach to stochastic downscaling of precipitation forecasts. arXiv, 2204.02028v2, https://doi.org/10.48550/arXiv.2204.02028.
Hersbach, H., and Coauthors, 2020: The ERA5 global reanalysis. Quart. J. Roy. Meteor. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803.
IPCC, 2023: Weather and climate extreme events in a changing climate. Climate Change 2021: The Physical Science Basis, V. Masson-Delmotte et al., Eds., Cambridge University Press, 1513–1766.
Jiang, Z., Y. Huang, and L. Hu, 2020: Single image super-resolution: Depthwise separable convolution super-resolution generative adversarial network. Appl. Sci., 10, 375, https://doi.org/10.3390/app10010375.
Kephart, J. L., and Coauthors, 2022: City-level impact of extreme temperatures and mortality in Latin America. Nat. Med., 28, 1700–1705, https://doi.org/10.1038/s41591-022-01872-6.
Ledig, C., and Coauthors, 2017: Photo-realistic single image super-resolution using a generative adversarial network. 2017 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, Institute of Electrical and Electronics Engineers, 105–114, https://doi.org/10.1109/CVPR.2017.19.
Leinonen, J., D. Nerini, and A. Berne, 2021: Stochastic super-resolution for downscaling time-evolving atmospheric fields with a generative adversarial network. IEEE Trans. Geosci. Remote Sens., 59, 7211–7223, https://doi.org/10.1109/TGRS.2020.3032790.
Li, Y., Z. Li, Z. Zhang, L. Chen, S. Kurkute, L. Scaff, and X. Pan, 2019: High-resolution regional climate modeling and projection over western Canada using a weather research forecasting model with a pseudo-global warming approach. Hydrol. Earth Syst. Sci., 23, 4635–4659, https://doi.org/10.5194/hess-23-4635-2019.
Lucas-Picher, P., D. Argüeso, E. Brisson, Y. Tramblay, P. Berg, A. Lemonsu, S. Kotlarski, and C. Caillaud, 2021: Convection-permitting modeling with regional climate models: Latest developments and next steps. Wiley Interdiscip. Rev.: Climate Change, 12, e731, https://doi.org/10.1002/wcc.731.
MacKenzie, W. H., and C. R. Mahony, 2021: An ecological approach to climate change-informed tree species selection for reforestation. For. Ecol. Manage., 481, 118705, https://doi.org/10.1016/j.foreco.2020.118705.
Maraun, D., 2013: Bias correction, quantile mapping, and downscaling: Revisiting the inflation issue. J. Climate, 26, 2137–2143, https://doi.org/10.1175/JCLI-D-12-00821.1.
Mardani, M., and Coauthors, 2025: Residual corrective diffusion modeling for km-scale atmospheric downscaling. Commun. Earth Environ., 6, 124, https://doi.org/10.1038/s43247-025-02042-5.
Miralles, O., D. Steinfeld, O. Martius, and A. C. Davison, 2022: Downscaling of historical wind fields over Switzerland using generative adversarial networks. Artif. Intell. Earth Syst., 1, e220018, https://doi.org/10.1175/AIES-D-22-0018.1.
Mirza, M., and S. Osindero, 2014: Conditional generative adversarial nets. arXiv, 1411.1784v1, https://doi.org/10.48550/arXiv.1411.1784.
Price, I., and S. Rasp, 2022: Increasing the accuracy and resolution of precipitation forecasts using deep generative models. Proc. 25th Int. Conf. on Artificial Intelligence and Statistics, Valencia, Spain, PMLR, 10 555–10 571, https://proceedings.mlr.press/v151/price22a/price22a.pdf.
Rakotonirina, N. C., and A. Rasoanaivo, 2020: ESRGAN+: Further improving enhanced super-resolution generative adversarial network. ICASSP 2020—2020 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, Institute of Electrical and Electronics Engineers, 3637–3641, https://doi.org/10.1109/ICASSP40776.2020.9054071.
Ravuri, S., and Coauthors, 2021: Skilful precipitation nowcasting using deep generative models of radar. Nature, 597, 672–677, https://doi.org/10.1038/s41586-021-03854-z.
Rossa, A., P. Nurmi, and E. Ebert, 2008: Overview of methods for the verification of quantitative precipitation forecasts. Precipitation: Advances in Measurement, Estimation and Prediction, 1st ed. S. Michaelides, Ed., Springer, 419–452.
Saatci, Y., and A. G. Wilson, 2017: Bayesian GAN. NIPS’17: Proc. 31st Int. Conf. on Neural Information Processing Systems, Long Beach, CA, Curran Associates Inc., 3625–3634, https://dl.acm.org/doi/10.5555/3294996.3295120.
Skamarock, W. C., J. B. Klemp, and J. Dudhia, 2001: Prototypes for the WRF (Weather Research and Forecasting) model. Ninth Conf. Mesoscale Processes, Fort Lauderdale, FL, Amer. Meteor. Soc., J1.5, https://ams.confex.com/ams/pdfpapers/23297.pdf.
Stengel, K., A. Glaws, D. Hettinger, and R. N. King, 2020: Adversarial super-resolution of climatological wind and solar data. Proc. Natl. Acad. Sci. USA, 117, 16 805–16 815, https://doi.org/10.1073/pnas.1918964117.
Thompson, R. M., J. Beardall, J. Beringer, M. Grace, and P. Sardina, 2013: Means and extremes: Building variability into community-level climate change experiments. Ecol. Lett., 16, 799–806, https://doi.org/10.1111/ele.12095.
Wang, L., W. Chen, W. Yang, F. Bi, and F. R. Yu, 2020: A state-of-the-art review on image synthesis with Generative Adversarial Networks. IEEE Access, 8, 63 514–63 537, https://doi.org/10.1109/ACCESS.2020.2982224.
Wang, X., K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. C. Loy, 2018: ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. Computer Vision—ECCV 2018 Workshops, L. Leal-Taixé and S. Roth, Eds., Lecture Notes in Computer Science, Vol. 11133, Springer, 63–79.
Wilks, D. S., 2010: Use of stochastic weather generators for precipitation downscaling. Wiley Interdiscip. Rev.: Climate Change, 1, 898–907, https://doi.org/10.1002/wcc.85.
Zhang, Y., Y. Tian, Y. Kong, B. Zhong, and Y. Fu, 2018: Residual dense network for image super-resolution. 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Salt Lake City, UT, Institute of Electrical and Electronics Engineers, 2472–2481, https://doi.org/10.1109/CVPR.2018.00262.
Zhong, X., F. Du, L. Chen, Z. Wang, and H. Li, 2024: Investigating transformer-based models for spatial downscaling and correcting biases of near-surface temperature and wind-speed forecasts. Quart. J. Roy. Meteor. Soc., 150, 275–289, https://doi.org/10.1002/qj.4596.