1. Introduction
Having access to large sets of weather forecasts or reforecasts is of prime importance in many applications. For instance, fundamental and applied studies in weather science rely on large reforecasts of events, for example, to detect climatological trends in specific patterns such as extratropical depressions (Pantillon et al. 2017) or heavy precipitating events (Ponzano et al. 2020). Such reforecasts, like operational forecasts in many centers, are usually based on ensemble prediction systems (EPSs). However, the high computing and storage costs of these systems can limit their configuration to a few dozen members for convection-permitting, kilometer-scale ensembles. This is often not enough to accurately sample the future distributions of weather variables. As a consequence, the emulation of forecasts at different scales has potential applications for both climatological studies and operational weather forecasting.
Increasing the ensemble size without resorting to the costly option of running additional EPS members remains an open challenge. Existing solutions mainly rely on different flavors of neighborhood approaches (Roberts and Lean 2008; Ebert 2008) that are based on the assumption of locally homogeneous weather. An original approach has been proposed by Vincendon et al. (2011) to design perturbed precipitation forecasts by applying location and intensity perturbations to the deterministic forecast. In recent years, generative deep learning has emerged as a novel approach that can produce accurate synthetic data, and it has recently seen a broadening use by the NWP community. In particular, generative adversarial networks (GANs; Goodfellow et al. 2014) and variational autoencoders (VAEs; Kingma and Welling 2014) have already been used for several NWP applications (Bihlo 2020; Ravuri et al. 2021; Leinonen et al. 2021; Bhatia et al. 2021; Harris et al. 2022). Of particular interest is the recent work of Besombes et al. (2021) that demonstrates the ability of a GAN to emulate realistic atmospheric states (accounting for several variables at different atmospheric levels) when trained on outputs of a simple climate model at a relatively coarse resolution (≈300 km).
Generative techniques allow sample draws from a simple, latent distribution, which are then mapped into higher-dimensional spaces (e.g., the space of NWP model outputs). VAEs have an encoder-latent-decoder structure; their latent space is used both to embed the input samples and then to generate maximum-likelihood samples from a parameterized distribution. This setup is prone to creating noisy samples, as exemplified by Dumoulin et al. (2016), and the denoised samples can be blurry (Kingma and Welling 2019). GANs, on the other hand, are composed of two competing networks (named the generator and the discriminator). Once trained, the generator of a GAN usually produces highly detailed images (Radford et al. 2015; Karras et al. 2018).
Although the performance of GANs can be appealing, it can be challenging to train a GAN model stably (Goodfellow et al. 2014; Arjovsky et al. 2017), as training is commonly affected by several obstacles. The first one is mode collapse, the concentration of the produced samples on a small portion of the training distribution and, in extreme cases, on a single sample. This can happen if the generator begins to reproduce a specific subset of the training set to anomalously sharp numerical precision (Radford et al. 2015); it is a form of overfitting. A second difficulty is a sudden loss of quality of the generated samples. This can be due to inefficient feature extraction by the discriminator or to the discriminator’s overfitting (Brock et al. 2018). This tension between sample quality and distribution recovery has been termed the quality-diversity trade-off (Brock et al. 2018) or, more recently, the perception-distortion trade-off (Blau and Michaeli 2018). Evaluating a GAN should therefore focus on two main aspects, the intrinsic quality of the samples and the recovery of the main features of the training dataset, which requires specific metrics.
Building on the work of Besombes et al. (2021), the objective of the present article is to examine the ability of GANs to emulate multivariate outputs of NWP models at a kilometric resolution, close to the one studied by Ravuri et al. (2021). To the authors’ knowledge, this aspect has not been evaluated yet, and the sensitivity of GAN training advocates for a dedicated study. Two important questions will be addressed: Are GANs effectively able to emulate multivariate outputs with a proper representation of every spatial scale? How can one evaluate the diversity and realism of the outputs of a GAN trained on such data? This is a preliminary step before using GANs to enhance EPSs, although such a task is left for future work.
This study proposes the training of a residual, spectrally normalized Wasserstein–GAN (Miyato et al. 2018), using kilometer-scale model outputs from the AROME-EPS. The AROME-EPS dataset involves several fields exhibiting fine-scale variations, such as 10-m wind speed components and 2-m temperature. To analyze what effects different variables have on training, several configurations will be examined with distinct sets of variables. Borrowing from weather science and computer vision, a comprehensive set of metrics is considered to assess different aspects of the GAN’s outputs. The spatial structure of emulated fields is evaluated with spectral transforms, correlation length scales, and scattering coefficients. These metrics are complemented with distributional distances, using pixelwise Wasserstein distance (Besombes et al. 2021) and sliced Wasserstein distance (Rabin et al. 2011; Karras et al. 2018). With this set of metrics, a detailed view of the GAN’s capabilities and weaknesses is provided, and we assess its sensitivity to the choice of hyperparameters and to the chosen architecture and setup.
The outline of the paper is as follows. The dataset and choices made for the setup are detailed in section 2; this includes choice of data, network architecture, and implementation of the GAN training algorithm. Section 3 details the whole set of metrics to be used in evaluation, and the evaluation strategy. Section 4 presents the main results obtained for the joint emulation of three AROME variables with a GAN. Section 5 compares the results obtained when varying the number and nature of the AROME variables used as predictors. Section 6 discusses the results. Section 7 provides conclusions and opens some perspectives for future work.
2. Generating AROME forecasts with a GAN: Setup choices
a. Dataset and problem formulation
The dataset used is made of forecasts from the 16-member, 1.3-km resolution AROME-EPS (Bouttier et al. 2012; Raynaud and Bouttier 2016), covering about 17 continuous months, from 15 June 2020 up to 12 November 2021. AROME-EPS is the ensemble version of the AROME limited-area, convection-permitting model (Seity et al. 2011) used operationally at Météo-France. The version of AROME-EPS considered uses a 1.3-km grid resolution and produces outputs on a regular latitude–longitude grid at 0.025° resolution. Initial conditions are built using the AROME 3D-Var analysis (Brousseau et al. 2011) and perturbations from the AROME Ensemble Data Assimilation (Montmerle et al. 2018), while lateral boundary conditions are given by forecasts of the global ARPEGE-EPS model (Descamps et al. 2015). AROME-EPS also uses the stochastically perturbed parameterization tendencies (SPPTs) scheme (Bouttier et al. 2012) and surface perturbations (Bouttier et al. 2016).
AROME-EPS 16-member forecasts are launched daily at 2100 UTC, and the first 24 h of prediction with 3-hourly outputs are used for training. The fields considered are the two horizontal components of the 10-m wind, referred to as u (the zonal component) and υ (the meridional component), and the 2-m temperature (t2m). Additionally, as will be detailed in section 5, orography is used in some experiments as a constant field for the GAN to generate.
To provide a flexible experimental setup and to keep the training runs within reasonable time windows, a small subregion of the AROME domain is selected. It corresponds to the Mediterranean coastal region and the Rhône valley (see Fig. 1). The subdomain considered thus spans 128 × 128 grid points, with an approximate 330-km side length. This choice of localization is motivated by the variable terrain features and identifiable weather patterns known to occur in this region. The joint presence of the French Alps and the Mediterranean Sea, as well as marked episodes of strong northerly winds (mistral events) and heavy precipitating events characterized by localized strong gradients, makes this region an interesting test case. Moreover, the high resolution of the samples allows for investigation of several scales of variability, from the regional scale down to the typical grid scale of state-of-the-art convection-permitting models.
In the baseline configuration, fields of u, υ, and t2m for a given lead time, date, and ensemble member are learned jointly as part of the same sample. One single day of forecasts hence yields 8 × 16 = 128 distinct data samples. Altogether, the usable dataset is then composed of 66 048 samples. The shape of the GAN output tensors is then 3 × 128 × 128, with 3 being the number of variables and 128 × 128 the domain size.
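As an illustration of this bookkeeping, the sample count and tensor shapes can be sketched as follows (names and array layout here are hypothetical, not taken from the authors' code):

```python
import numpy as np

# 8 three-hourly lead times x 16 members per daily forecast, each sample
# being a (3, 128, 128) tensor (u, v, t2m on the 128 x 128 subdomain).
N_LEAD_TIMES, N_MEMBERS = 8, 16
N_VARS, NY, NX = 3, 128, 128

def samples_per_day():
    """Number of distinct training samples produced by one daily forecast."""
    return N_LEAD_TIMES * N_MEMBERS

def flatten_forecast(forecast):
    """Reshape one day of forecasts, stored as
    (lead_time, member, variable, y, x), into a stack of training samples."""
    assert forecast.shape == (N_LEAD_TIMES, N_MEMBERS, N_VARS, NY, NX)
    return forecast.reshape(-1, N_VARS, NY, NX)

day = np.zeros((N_LEAD_TIMES, N_MEMBERS, N_VARS, NY, NX), dtype=np.float32)
assert samples_per_day() == 128
assert flatten_forecast(day).shape == (128, 3, 128, 128)
```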
Using a dataset from AROME-EPS increases the volume of training data compared to using the deterministic AROME forecasts over the same period. It is important to note that many of the samples are correlated, whether they correspond to close lead times or to different members of the same forecast. However, for a given forecast, samples corresponding to different lead times and different ensemble members are physically distinct, notably because of the fine-scale variability of wind fields and the diurnal cycle of temperature. The ensemble-based dataset thus exhibits an increased small-scale diversity compared to a deterministic dataset. The ensemble can therefore be viewed as an NWP-specific data augmentation strategy built on top of a deterministic forecast system over a given period. This study will thus assess to what extent a GAN can recover this fixed distribution of high-resolution samples enriched by the EPS.
b. Choices for the GAN architecture
The GAN framework has been thoroughly investigated in recent years, and some guidelines have emerged to design efficient and reliable GAN training algorithms. Let us denote with G the generator, with D the discriminator, with pdata the distribution of the training samples, and with pG the distribution of the samples produced by G.
The objective function must ensure that D assigns high scores to the “real” distribution while maintaining low scores on outputs from G. With D being optimized, G then aims at producing samples that confuse D, that is, samples obtaining high scores from D. An optimal training then results in D being unable to separate real samples from fake ones, though being optimally designed to separate them. Ideally, then, the distribution pG produced by G completely and correctly recovers pdata.
The GAN training framework exhibits convergence and stability issues. Notably, the well-known “mode collapse” problem consists of G concentrating the mass of pG on a small subset of the modes of pdata.
To the authors’ knowledge, this study is one of the first to propose a WGAN–SN framework for geophysical applications. Among the studies dealing with the generation of atmospheric fields with GANs, only Ravuri et al. (2021) use SN, while others, such as Besombes et al. (2021) and Harris et al. (2022), keep using the WGAN–GP formulation. Miyato et al. (2018) only apply spectral normalization to the discriminator layers, but the literature has since acknowledged the positive effect of using SN on both the generator and the discriminator (Brock et al. 2018). This double regularization is implemented in the present setup.
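For illustration, spectral normalization constrains a layer to be approximately 1-Lipschitz by dividing its weight by an estimate of the largest singular value, usually obtained by power iteration. The following minimal numpy sketch shows the principle only; it is not the paper's implementation, which applies the normalization to network layers during training with a single cached power iteration per step:

```python
import numpy as np

def spectral_norm(w, n_iters=200, eps=1e-12):
    """Estimate the largest singular value of a weight matrix by power
    iteration; convolution kernels are flattened to 2D first."""
    w = w.reshape(w.shape[0], -1)
    u = np.random.default_rng(0).standard_normal(w.shape[0])
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v) + eps
        u = w @ v
        u /= np.linalg.norm(u) + eps
    return float(u @ w @ v)

def normalize_weight(w):
    """Divide the weight by its spectral norm so the layer is 1-Lipschitz."""
    return w / spectral_norm(w)

w = np.random.default_rng(1).standard_normal((16, 8))
sigma = spectral_norm(w)
# Power iteration should agree with the top singular value from a full SVD.
assert np.isclose(sigma, np.linalg.svd(w, compute_uv=False)[0], rtol=1e-3)
assert np.isclose(spectral_norm(normalize_weight(w)), 1.0, rtol=1e-3)
```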
GANs can also produce unstructured or strongly corrupted samples, depending on what features from the dataset the discriminator is able to identify as crucial. This failure mode can happen in the absence of mode collapse or concomitantly to it (Brock et al. 2018; Mescheder et al. 2018). This makes the training of GANs difficult, even within the Wasserstein–GAN framework, and special care has to be devoted to the networks’ hyperparameters choice.
A residual architecture is chosen, consisting of two residual networks directly taken from Miyato et al. (2018), as shown in Fig. 2. Starting from a 64-dimensional latent space, samples are shaped into feature maps whose resolution gradually increases as they pass through the generator, and they are finally output through a tanh layer. The discriminator follows a symmetric downscaling pattern, though it is one layer deeper and involves more channels before the last dense output layer. It should be emphasized that although the networks are relatively shallow, their estimated receptive fields (i.e., the size of the input region that contributes to a given output pixel) are large enough to model long-range correlations. For example, a single residual block of the generator involves two 3 × 3 convolutions and one ×2 upsampling layer. This configuration gives a maximal local receptive field of 10 pixels for each residual block. Since five of these blocks are stacked, the final, global receptive field spans beyond 128 pixels, which is the size of our samples. Therefore, each output sample is able to take into account every degree of freedom of the random input.
Before training, the samples are rescaled so that the global minimum and maximum values of the dataset fit within the [−0.95, 0.95] range. This is to ensure that the hyperbolic tangent output of the generator can reach the dataset’s extremes and even go beyond these limits. The mean, minimum, and maximum values are precomputed over all grid points and all data samples. Supplementary training parameters and procedures (floating precision, warm-up, initialization) are detailed in appendix A. These models are trained for a fixed number of 60 000 update steps, on a cluster of four NVIDIA V100 graphics processing units (GPUs) with 32 GB of RAM. Models are thus trained for 4–12-h wall-clock time, depending on the batch size. A step is equivalent to one update of both networks after forward and backward pass. Thus, for two different batch sizes, this fixed number of steps allows for different numbers of epochs (i.e., sets of steps corresponding to 100% of the dataset seen by both networks).
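A minimal sketch of this rescaling, with hypothetical function names, assuming the global extremes have been precomputed over all grid points and samples:

```python
import numpy as np

def rescale(data, vmin, vmax, bound=0.95):
    """Map the global dataset range [vmin, vmax] onto [-bound, bound], so the
    generator's tanh output can reach (and slightly exceed) the extremes."""
    return (2.0 * (data - vmin) / (vmax - vmin) - 1.0) * bound

def unscale(scaled, vmin, vmax, bound=0.95):
    """Inverse transform back to physical units."""
    return (scaled / bound + 1.0) / 2.0 * (vmax - vmin) + vmin

x = np.array([250.0, 275.0, 300.0])  # e.g., temperatures in kelvins
y = rescale(x, x.min(), x.max())
assert np.allclose(y, [-0.95, 0.0, 0.95])
assert np.allclose(unscale(y, x.min(), x.max()), x)
```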
3. Evaluation metrics
a. Distributional metrics
Following the strategy of Besombes et al. (2021), two 1D-EMD estimates are computed. At each test step, a random set of pixels is sampled to evaluate the average of per-pixel, per-variable 1D EMDs; another average of 1D EMDs is also evaluated on a fixed set of pixels covering the central 64 × 64 crop of the domain (and averaged over variables). These estimates are hereafter termed W1,r (random pixels) and W1,c (central crop); they provide a global measure of the quality of the generated marginal (per-pixel, per-variable) distributions.
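The per-pixel estimate can be sketched as follows: for two equal-size empirical samples, the 1D Wasserstein-1 (EMD) distance reduces to the mean absolute difference of the sorted values. Function names and the pixel-sampling loop are illustrative, not the authors' code:

```python
import numpy as np

def wasserstein_1d(x, y):
    """W1 (earth mover's) distance between two equal-size 1D empirical
    samples: mean absolute difference of the sorted values."""
    return np.abs(np.sort(x) - np.sort(y)).mean()

def pixelwise_emd(real, fake, n_pixels=100, seed=0):
    """Average per-pixel, per-variable 1D EMD over randomly drawn pixels,
    in the spirit of the W1,r estimate."""
    rng = np.random.default_rng(seed)
    n, c, h, w = real.shape
    dists = []
    for _ in range(n_pixels):
        v, i, j = rng.integers(c), rng.integers(h), rng.integers(w)
        dists.append(wasserstein_1d(real[:, v, i, j], fake[:, v, i, j]))
    return float(np.mean(dists))

rng = np.random.default_rng(1)
a = rng.standard_normal((256, 3, 8, 8))
b = rng.standard_normal((256, 3, 8, 8))
assert pixelwise_emd(a, a) == 0.0   # identical datasets: zero distance
assert pixelwise_emd(a, b) > 0.0    # finite sampling keeps this positive
```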
A third estimate of EMD is taken from Karras et al. (2018) and termed multilevel sliced Wasserstein distance (SWDmulti). This estimate measures multidimensional EMDs of images at different resolutions. Precisely, it decomposes the image signal into four resolution levels forming a Laplacian pyramid: starting from the finest level, each image is obtained from the previous one by Gaussian filtering, differencing, and subsampling. One then measures EMD on each of the levels obtained, for the joint distribution of all variables. These estimates go from the fine-grained level (where images have 128 × 128 dimensions and conserve small-scale fluctuations) to the coarse-grained level (where images only have 16 × 16 dimensions and conserve low-frequency fluctuations). They are used to compare the distribution of patterns at each level. The name sliced Wasserstein distance refers to the way the estimates of EMD on multidimensional spaces are performed. This estimation procedure is unbiased (Rabin et al. 2011; Kolouri et al. 2018) and robust as long as the number of samples is large enough. This yields a four-component distance (one component for each level): SWDmulti = (SWD128, SWD64, SWD32, SWD16). Following the intuition of Rabin et al. (2011), the SWDmulti metric follows a kind of “wavelet decomposition” approach, as it measures the discrepancy between two image distributions at several scales of fluctuations for local neighborhoods, merging the contributions of different variables. A visual explanation of this metric is shown in Fig. 3, and a detailed description is given in appendix B.
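The slicing step can be illustrated as follows: the multidimensional EMD is approximated by averaging 1D EMDs of the data projected onto random unit directions. This sketch omits the Laplacian pyramid and neighborhood extraction and only shows the projection principle, assuming two point clouds of equal size:

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=128, seed=0):
    """Sliced Wasserstein distance between two point clouds in R^d:
    average 1D W1 distance over random unit-vector projections."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)               # random direction on the sphere
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += np.abs(px - py).mean()              # 1D W1 on the projections
    return total / n_proj

rng = np.random.default_rng(2)
x = rng.standard_normal((512, 16))        # e.g., flattened local patches
y = rng.standard_normal((512, 16)) + 1.0  # shifted distribution
assert sliced_wasserstein(x, x) == 0.0
assert sliced_wasserstein(x, y) > 0.0
```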
A lower bound for these distances is set by their estimate from the AROME-EPS to itself. While the theoretical distance is 0, estimating it through finite batches from the AROME-EPS dataset yields a positive result. This distance is estimated by a bootstrap procedure. We select several batches of 16 384 samples (with no replacement within one batch but with replacement from one batch to the other) and then compute the EMD estimates between pairs of these batches. The average value of the pair-EMD series is kept as the distance estimate (see Table 1). The 16 384 samples represent nearly 25% of the full dataset; thus, we deem that this procedure represents correctly the diversity of the dataset. The values obtained correspond to lower bounds for EMDs, as they reflect the internal variability of the dataset and account for finite sampling effects. If an EMD estimate reaches the lower-bound values, it can be said that the GAN dataset would be completely indistinguishable from the AROME-EPS dataset regarding this estimate.
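A sketch of this bootstrap procedure (with much smaller batches than the 16 384-sample batches used in the paper, and a generic metric argument in place of the actual EMD estimators):

```python
import numpy as np

def emd_lower_bound(dataset, metric, batch_size=1024, n_pairs=8, seed=0):
    """Dataset-to-itself metric estimate: draw pairs of batches (without
    replacement within a batch, with replacement across batches) and average
    the metric over the pairs, giving a finite-sampling lower bound."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_pairs):
        i = rng.choice(len(dataset), size=batch_size, replace=False)
        j = rng.choice(len(dataset), size=batch_size, replace=False)
        vals.append(metric(dataset[i], dataset[j]))
    return float(np.mean(vals)), float(np.std(vals))

def w1_sorted(a, b):
    """Toy 1D W1 metric on flattened batches (equal sizes assumed)."""
    return np.abs(np.sort(a.ravel()) - np.sort(b.ravel())).mean()

data = np.random.default_rng(3).standard_normal((10000, 4))
mean, std = emd_lower_bound(data, w1_sorted)
assert mean > 0.0   # finite sampling keeps the self-distance above zero
assert std >= 0.0
```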
Estimates of the EMD from the AROME-EPS dataset to itself. These are estimated from 32 independent selections of 16 384-sample batch pairs. Shown are the average and standard deviation of the series of tested pairs.
b. Power spectral density error
Going further requires us to consider supplementary diagnostics in the evaluation procedure, which are presented in the remainder of this section.
c. Correlation lengths
d. Second-order scattering coefficients
Here, we present scattering coefficient approaches (Andén and Mallat 2014; Bruna and Mallat 2013; Cheng et al. 2020) and show how they can be used to extract useful information from our meteorological dataset. Scattering coefficients are derived from recursive wavelet filtering of the fields. They have already been applied to meteorological information classification with satisfactory results, for example, in Garcia et al. (2015). The calculation of first- and second-order scattering coefficients is detailed in appendix B. Let λ1 and λ2 denote scales with λ1 < λ2. The scales considered can go from two grid points (∼2.5 km) to 64 grid points (∼160 km), and wavelet filters ψ for first- and second-order coefficients have different orientations θ1, θ2.
First-order scattering coefficients S1(λ1) relate to the intensity of signal energy at scale λ1. As indicated by Cheng et al. (2020), they play a similar role to spectral decomposition.
Figure 4 shows the chain of transforms: starting from a full AROME image, successive scattering steps are applied. Regions with sharp wind gradients are highlighted by the first pass of wavelet at scale λ1 = 2 pixels. The second-order pass enhances clusters of such regions at scale λ2 = 4 pixels. Such clusters contribute significantly to the S2 coefficient (e.g., the top-right and center-left green frames in Fig. 4). Regions with sharp variations organization at larger scales (central green frame in Fig. 4) contribute to a lesser extent, while quasi-uniform regions are smoothed out and have only negligible contribution (bottom-left white frames in Fig. 4).
Both the s21 and s22 estimators provide a set of coefficients (one for each (λ1, λ2) pair with λ1 < λ2). One can estimate the discrepancy between the GAN and AROME-EPS data for both the s21 and s22 distributions as a measure of how well atmospheric structures are reproduced. The sets of average s21 and s22 coefficients can be used to calculate RMSE distances between AROME-EPS and the GAN, yielding two new metrics per variable. These distances are evaluated during training on batches of 16 384 samples and serve as quality estimators, similar to the PSD errors.
e. Complementary metrics
Cross-variable correlations will be assessed using bidimensional histograms similar to those used by Gagne et al. (2020). For a given pair of variables and a given dataset, these histograms provide the empirical density function for the values taken by the pair of variables. Ideally, the densities outputted by the GAN should overlap the densities extracted from the AROME-EPS. This is a graphical illustration of the precision-recall metrics used, for example, in Kynkäänniemi et al. (2019). Among others, this enables the identification of systematic biases in the GAN-produced distribution. Finally, maps of 10th and 90th percentiles and interpercentile range will be examined in order to focus on the representation of distribution tails.
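Such bivariate histograms can be computed, for example, with numpy (a sketch, not the plotting code used in the paper; the logarithmic scale mentioned later is applied at plot time):

```python
import numpy as np

def bivariate_density(var_a, var_b, bins=64, ranges=None):
    """Empirical joint density of two variables over all pixels and samples,
    as used for cross-variable histograms."""
    h, xe, ye = np.histogram2d(var_a.ravel(), var_b.ravel(),
                               bins=bins, range=ranges, density=True)
    return h, xe, ye

rng = np.random.default_rng(4)
u = rng.standard_normal((100, 32, 32))                    # e.g., zonal wind samples
t = 0.5 * u + rng.standard_normal((100, 32, 32))          # correlated "t2m"
h, xe, ye = bivariate_density(u, t)
assert h.shape == (64, 64)
binw = np.outer(np.diff(xe), np.diff(ye))
assert np.isclose((h * binw).sum(), 1.0)                  # density integrates to 1
```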
The whole set of metrics used is summarized in Table 2, detailing attributes for each. As explained above, the metrics are used to measure either diversity or quality. They make use of several types of information: pixelwise information does not take any spatial correlation into account, neighborhood information represents the aggregation of information over a limited range of pixels, and, finally, nonlocal information aggregates information over the full sample scale, either by Fourier transform, or random sampling of neighborhoods. Finally, metrics measure different features of the signal, notably scale-by-scale information, multiscale organization, positional information (when the metric is plotted on a map), or anisotropy.
Summary of the metrics used in the text. Checkmarks indicate which attributes a given metric possesses; crosses indicate the absence of such an attribute. Correlation is abbreviated as corr., err. means error, and est. indicates estimate.
f. Evaluation strategy
We chose not to separate the dataset a priori into training, validation, and testing subsets. Though similar to the methodology of Besombes et al. (2021), this is arguably not a common practice in machine learning. Nevertheless, it makes sense in our setup for the following reasons:
- The generator is unconditional and never takes any inputs other than latent vectors. Its ability to generate good-quality, high-resolution samples and a correct distribution, once trained, only depends on the mapping it makes between the Gaussian latent distribution and the distribution of samples in the “physical space.”
- The aim of this study is to assess what features of the data distribution the GAN is able to produce. Comparing the output distribution of the GAN to the training distribution thus avoids having to take into account the distribution shift that necessarily occurs when splitting the dataset between train, test, and validation.
- Detecting mode collapse or, more generally, a loss of diversity can be done through the combined use of EMD distances, since these metrics assign large scores to underdispersive distributions. Cross-variable correlations can also help detect biased or nonoverlapping distributions.
- Loss of quality can be examined through the average PSD error, the average correlation length-scale estimates, and the average scattering coefficient errors, each metric having its own scope.
To compare different hyperparameter sets, we focus exclusively on EMD distances and PSD error. Once a satisfactory set is selected, we examine the samples produced by the GAN with the other metrics.
To be consistent with the lower-bound estimation procedures detailed in section 3a, each metric is applied to 16 384 random samples from the AROME-EPS dataset and to the same number of random GAN outputs. Especially for SWDmulti distances, this rather large number of samples reduces the estimator’s variance, as in Odena et al. (2017) and Karras et al. (2018).
4. Results
a. Training stability and convergence
Even with widely used regularization strategies, training a GAN requires careful tuning of parameters to ensure local convergence.
Initializing with a learning rate lr0 = 4 × 10−3 and using an exponential learning-rate decay (lr = lr0γt, with t the number of epochs and γ = 0.9) avoids mode collapse and produces realistic-looking samples for all batch sizes except the largest (512). Setting different learning rates or different decay rates γ for D and G was detrimental to quality and convergence, even from a mere visual perspective. In agreement with Mescheder et al. (2018), removing the learning-rate decay severely degraded performance, leading to severe mode collapse that forced each pixel to the global dataset average value (close to 0). Keeping γ = 0.9, we select the best-performing configuration among several batch sizes and learning rates according to estimates of W1,r/c, SWDmulti, and PSD errors, keeping the rest of the metrics for posttraining evaluation. The configuration with a batch size (BS) of 32 and lr0 = 4 × 10−3 is selected, as it simultaneously has the lowest distributional distances and PSD errors overall. Details of the hyperparameter selection are shown in appendix C.
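The decay schedule described above can be written as a one-line function, with the values taken from the text:

```python
def decayed_lr(epoch, lr0=4e-3, gamma=0.9):
    """Exponential learning-rate schedule lr = lr0 * gamma**t, with t counted
    in epochs, applied to both networks."""
    return lr0 * gamma ** epoch

assert decayed_lr(0) == 4e-3
assert abs(decayed_lr(1) - 3.6e-3) < 1e-12
assert decayed_lr(10) < decayed_lr(5) < decayed_lr(0)  # monotonically decaying
```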
Table 3 shows the scores obtained at the end of the run (i.e., when the loss curve plateaus). Figure 5 compares the average spectrum produced by the GAN to the AROME-EPS spectrum, showing good agreement and low error (below or around 1 dB for each variable), for all scales. Even the sharp variations of temperature spectra (mostly due to topographic variations) are correctly reconstructed despite slightly higher PSD errors. Notably, for the smallest scales, no significant drop of PSD can be observed, meaning that even large wavenumbers are given a correct amount of energy.
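A common way to compute such average spectra is to radially average the 2D FFT power. The following numpy sketch illustrates the idea behind the PSD error in decibels; it is a simplification (it averages spectra already converted to dB), not necessarily the paper's exact estimator:

```python
import numpy as np

def radial_psd(field):
    """Radially averaged power spectral density of a square 2D field,
    returned in decibels (10 log10)."""
    f = np.fft.fftshift(np.fft.fft2(field))
    power = np.abs(f) ** 2
    n = field.shape[0]
    ky, kx = np.indices(power.shape)
    r = np.hypot(kx - n // 2, ky - n // 2).astype(int)   # integer wavenumber bins
    psd = np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())
    return 10.0 * np.log10(psd[: n // 2] + 1e-12)

def psd_error(real_fields, fake_fields):
    """Mean absolute PSD difference (dB) between the average spectra of two
    datasets: a scale-by-scale error of the kind quoted in the text."""
    pr = np.mean([radial_psd(x) for x in real_fields], axis=0)
    pf = np.mean([radial_psd(x) for x in fake_fields], axis=0)
    return float(np.abs(pr - pf).mean())

rng = np.random.default_rng(5)
a = rng.standard_normal((8, 64, 64))
assert psd_error(a, a) == 0.0
assert radial_psd(a[0]).shape == (32,)   # one value per resolved wavenumber
```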
Scores obtained with the different metrics used by the best-performing configuration of hyperparameters: PSD errors, SWD estimates, and RMSE of scattering estimators with respect to AROME. Each configuration was run three times to account for training variability; the scores presented are the best obtained among these runs. Appendix B (especially Figs. B1 and B2) provides means to interpret the absolute values herein provided.
One noticeable feature of Table 3 is the large difference observed from one SWD component to another: the scores improve as the component’s scale decreases. Such behavior was observed with several hyperparameter configurations. Small-scale components such as SWD128 and SWD64 get the lowest (i.e., best) scores: small-scale distributions of local patterns are hence more correctly fitted, at least by the best configurations. On the other hand, even these configurations have noticeably higher SWD32 and SWD16 values. The floor values given in Table 1 are indeed larger for SWD16 than for the other SWD components, indicating a larger intrinsic variability of the dataset at these scales. However, the gap between the GAN scores and the AROME lower bound is larger for SWD32 and SWD16: this observation might then indicate a poorer fit of large-scale pattern diversity. This result was observed for all hyperparameter configurations (cf. appendix C). It is thus reasonable to think that it is related to the network architecture or to the training setup.
In the remainder of this section, we provide an in-depth analysis of the GAN performances using the set of metrics previously introduced.
b. Validation of results
1) Visual examination
As a first step, the quality of the GAN generations can be assessed subjectively. Some samples are presented in Figs. 6 and 7 for visual comparison.
First, the resolution of GAN-produced samples appears to be correct when compared to AROME-EPS outputs. As can be seen especially on the temperature maps, fine-grained mountainous regions are correctly generated by the GAN, with a visible cooling with altitude. More detailed comparisons can be made with the help of the geographical features given in Fig. 1. Elongated wind structures are preferentially created over sea, while they are more granular over land. Specific weather patterns are also present in the GAN’s samples, such as a strong northerly wind running down the Rhône valley, an event locally known as mistral. Figure 7 also shows that the GAN is capable of producing consistent wind direction and speed at the highest detail level. The wind map especially confirms the ability of the GAN to generate not only events such as mistral but also a rather large diversity of wind patterns. Qualitatively speaking, our GAN is thus arguably devoid of mode collapse. What is more, appendix D shows that the GAN does not simply memorize exact samples from the dataset but, indeed, produces unseen, distinct data samples.
On the other hand, the GAN often produces blurry wind patterns over sea, while AROME-EPS samples are significantly more structured. The clear wind fronts present in the AROME-EPS dataset also appear in the GAN’s samples, but, except for the mistral case, they lack a long-range consistency and the filamentary aspect of AROME-EPS wind.
2) EMD maps
Pixelwise maps of EMD for chosen iterations indicate which regions exhibit the largest divergence. Figure 8 shows a comparison between the original dataset variance for each variable and two maps of pixelwise EMD at two different steps of training. Regions with the highest variance are globally less easily learned by the GAN than their low-variance counterparts, especially at the beginning of training. The land–sea contrast is clearly visible here, as can be expected from physical arguments. Indeed, wind variability over sea is higher: it depends more on the global weather situation (e.g., the presence of fronts) and is not forced by the topography. It coincides with long-range structures with a broad range of directions and intensities. This is not the case over land, where the surface plays a major role in reducing the range of correlations. The northerly mistral path is clearly highlighted, as well as the easternmost and westernmost wind variability poles (roughly corresponding to tramontane wind episodes in the west). On the other hand, sea temperature is relatively stable because of the thermal inertia of water, while the diurnal cycle is far more pronounced over land. In particular, temperatures at mountainous summits are difficult to reproduce because the cold extremes of the distribution are likely to occur there.
This suggests that variance-related error is a strong learning signal for the GAN, especially at the beginning of training. It is consistent with the observations made in the previous subsection about position-related distributions being the easiest to fit.
3) Correlation length scales
The average spectrum of our dataset is almost perfectly fitted by our GAN, but a human observer can still quickly distinguish between AROME-EPS and GAN samples. To further analyze the spatial structures of the AROME and GAN fields, maps of correlation lengths Lcorr are shown in Fig. 9. These maps show a correct reconstruction of length scales on land, where the location of high and low correlation areas is accurate and the order of magnitude of the length scales is right. On the contrary, length scales over sea are noisy and exhibit artifacts (checkerboard patterns and border effects), showing a clear gap in quality with respect to land. Note that these artifacts only appear when inspecting this specific metric; they are either difficult or impossible to spot on individual samples. As will be discussed in section 6, this is linked to positional information given by specific land patterns. As this information vanishes over sea, nonoptimized gradients may show up on this subdomain.
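One simple proxy for such maps (not the exact Lcorr estimator of the paper, whose definition is given in section 3c) is, for each pixel, the smallest spatial lag at which the inter-sample correlation with the shifted field drops below 1/e:

```python
import numpy as np

def correlation_length_map(samples, max_lag=16):
    """Per-pixel zonal correlation length (in grid points): smallest lag at
    which the inter-sample correlation with the zonally shifted field drops
    below 1/e. A simple proxy, not the paper's exact estimator."""
    n, h, w = samples.shape
    anom = samples - samples.mean(axis=0)
    std = anom.std(axis=0) + 1e-12
    # Pixels that never decorrelate within max_lag keep the censored value.
    lengths = np.full((h, w - max_lag), float(max_lag))
    for lag in range(1, max_lag + 1):
        corr = (anom[:, :, :-max_lag] * anom[:, :, lag:w - max_lag + lag]).mean(axis=0)
        corr /= std[:, :-max_lag] * std[:, lag:w - max_lag + lag]
        below = (corr < np.exp(-1.0)) & (lengths == max_lag)
        lengths[below] = lag
    return lengths

rng = np.random.default_rng(6)
white = rng.standard_normal((200, 8, 48))
assert correlation_length_map(white).mean() < 2.0   # white noise decorrelates at lag 1
smooth = np.repeat(rng.standard_normal((200, 8, 1)), 48, axis=2)
assert (correlation_length_map(smooth) == 16).all() # uniform field never decorrelates
```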
4) Second-order scattering metrics
The s21 and s22 coefficients are plotted in Fig. 10. One can thus compare the distributions of these estimators for GAN samples with those of AROME-EPS. At least for small scales, the AROME-EPS dataset is significantly sparser than its GAN counterpart (higher s21). Moreover, AROME-EPS presents noticeably higher s22, indicating that it contains more anisotropic, filamentary structures than the GAN samples. The s22 “shape” estimators are better fitted by the GAN than the sparsity estimators s21. These observations are consistent with visual inspection of the samples. While the average spectrograms are nearly indistinguishable for this run, most s21 estimators differ significantly. Indeed, the average coefficients are at least one standard deviation away from one another for small λ1. This difference weakens with larger λ1, showing that large-scale organization is better recovered by the GAN.
Both the s21 and s22 distances decrease with training, consistent with the increasing quality of GAN outputs (not shown). However, the training dynamics differ from one variable to another: while the sparsity (s21) distance for t2m is higher than for u and υ, the s22 distance is lower for t2m than for the wind variables. Globally, this indicates that both s21 and s22 are reasonable estimators of the GAN's ability to reproduce the structures of the AROME-EPS fields.
5) Bivariate histograms
Figure 11 shows bivariate histograms of AROME-EPS and GAN samples. A first observation is that the mean and variance of all variables are adequately captured by the GAN. The GAN also, surprisingly, extrapolates beyond the AROME-EPS data, putting significant probability mass on regions closer to the dataset extremes, while removing mass from other parts of the AROME-EPS distribution. Nevertheless, the logarithmic density scale of the histogram shows that the main modes of the distributions overlap, strengthening the assessment of a correct global behavior.
6) Percentiles and interpercentile range
To complete the overview of generation performance, a comparison of percentiles is performed over 66 048 samples (i.e., the exact size of the dataset to avoid sampling-related bias). The quantities considered are the 90th and 10th percentiles (Q90, Q10), as well as the 10–90 interpercentile range (ΔQ). Figure 12 compares the GAN and AROME-EPS statistics. The maximum percentile error of the GAN is limited but can go up to 3–4 m s−1 for wind data and 4 K for temperature.
Some regions also show a larger interpercentile range for the GAN, mostly over land for wind and over sea for t2m. Others show a narrower range, mostly the Rhône valley for u and the mountains for t2m. The interpercentile range of the GAN is closer to that of AROME-EPS for temperature than for the wind components, for which the relative error can be as large as 100%. This supports the fact that the GAN is probably influenced by the positional nature of temperature data.
The average bias of the GAN over all grid points is close to zero for all variables and statistics, except for Q10 on t2m. Localized, stronger biases exist, however, and they depend on the location as well as on the variable considered. This further supports that the GAN fits the distribution of values, including relatively extreme ones, in an unbiased manner but can locally exhibit strong deviations from the AROME-EPS distribution, as shown in cross-variable sections.
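The percentile statistics used above are straightforward to compute from a set of samples. The sketch below is a minimal numpy version, assuming samples stacked along the first axis; the function names are illustrative, not the authors' code.

```python
import numpy as np

def percentile_stats(samples, low=10, high=90):
    """Gridpoint-wise percentiles and interpercentile range.

    samples: (n_samples, H, W) array for one variable.
    Returns Q_low, Q_high, and the interpercentile range, each of
    shape (H, W).
    """
    q_low, q_high = np.percentile(samples, [low, high], axis=0)
    return q_low, q_high, q_high - q_low

def percentile_error(gan, arome, q=90):
    """Absolute gridpoint-wise error between two sample sets for
    percentile q, e.g. GAN vs. AROME-EPS."""
    return np.abs(np.percentile(gan, q, axis=0) - np.percentile(arome, q, axis=0))
```

Maps such as those of Fig. 12 then follow from `percentile_error` applied to the 10th and 90th percentiles, and from the difference of the two interpercentile-range maps.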
As a partial conclusion for this section, it has been shown that the GAN has achieved very good quality in terms of sample realism and diversity, power spectrum reproduction, and joint distribution recovery. Moreover, the GAN can generate thousands of samples in a time range of seconds (inference time is around 60 s for 16 384 samples), making the approach interesting in an ensemble generation framework. Finally, the set of metrics used has been shown to provide a detailed, complementary view of the GAN’s capabilities and weaknesses.
5. Multivariate configurations: A comparison
The impact of multivariate generation on training is now estimated in order to assess whether adding variables helps the GAN identify useful correlations or simply makes the task more difficult. The experiment is conducted with four different configurations (summarized in Table 4):
Baseline configuration using (u, υ, t2m) as generated fields.
Configuration (Config.) 1: Removing the t2m field and keeping only the generation of the (u, υ) couple.
Config. 2: Removing the (u, υ) couple and keeping the generation of t2m field.
Config. 3: Adding the generation of orography to Config. 2. The constant field of orography is generated by the GAN and also taken into account by the discriminator. This is done to test whether explicitly adding information related to position adds value for the generation of temperature (which is more correlated to position than the wind).
Table 4. Summary of the hyperparameters selected for the multivariate experiments.
For each configuration, the best hyperparameters are selected within the previously used parameter ranges for lr0 and BS. This selection is made by averaging the scores of three runs with on-the-fly validation metrics (W1,r/c and SWDmulti) computed on 4096-sample batches, complemented by visual inspection of samples. This allows fast and reliable selection of hyperparameters, which are summarized in Table 4. Once these are selected, evaluation is performed on another set of three runs for each selected configuration. This final evaluation is performed on batches of 16 384 samples, and all the previously described metrics are used to yield the most extensive evaluation. The results are reported in Tables 5 and 6.
Table 5. Global score card to compare multivariate experiments. Reported scores correspond to the average best score obtained after training saturation for the three runs. For all metrics considered, lower is better. Better scores with respect to the baseline are shown in bold; worse scores are in italic. A — is used when the metric is not applicable to the configuration.
Table 6. Mean absolute error for correlation length maps. Reported scores correspond to the best average score obtained after training saturation for the three runs. For all metrics considered, lower is better. Better scores with respect to the baseline are shown in bold; worse scores are in italic. A — is used when the metric is not applicable to the configuration.
Tables 5 and 6 show that a general effect of reducing the number of generated variables is improved performance on most metrics related to spatial consistency (PSD, correlation lengths, scattering metrics). Another interesting pattern is that the scores of Config. 3 (t2m and orography) are generally worse than those of Config. 2 (t2m only), and sometimes even worse than the baseline. Adding orography information thus seems to have a mixed effect. On the one hand, it degrades the synthesis of temperature spatial structures, as emphasized by the PSD error, the correlation length error, and the scattering metrics. On the other hand, the W1,r/c scores, as well as the SWD16 scores, are appreciably improved when orography is added. Removing temperature and orography while keeping the wind variables has the opposite effect: Config. 1 shows improved spectral, scattering, and correlation length metrics, while the W1,r/c scores slightly degrade and the SWDmulti scores dramatically degrade. This shows a lack of ability of the GAN to capture the diversity of patterns in the dataset, even while the quality of individual samples improves. The large distributional discrepancy of Config. 1 even hints at some form of mode collapse.
Altogether, the quality of samples also improves when comparing Configs. 1 and 2 to the baseline (Fig. 13), consistent with the majority of metrics involving spatial consistency. In particular, scattering metrics show a drastic improvement when the number of variables is reduced. The GAN is therefore much better able to identify and generate multiscale organization in the samples, albeit at the expense of pattern diversity. This points to the GAN using cross-variable correlations to improve the diversity, rather than the quality, of samples.
6. Learning absolute gridpoint position: Analysis and consequences
The above experimental results provide a set of observations that can be exploited to diagnose the strengths and flaws of the training design and their interaction with the neural network architecture.
Let us summarize some of them:
- The error signal at the beginning of the training is strongly correlated to the pixelwise variance of the dataset (section 4).
- Large-scale EMDs are far worse than small-scale EMDs in all configurations (sections 4 and 5).
- Performance for correlation length scales is far better over land than over sea (section 4).
Given that learning is performed on a fixed spatial domain, it is very likely that the main source of information for learning in our setup is the implicit encoding of absolute gridpoint position. This phenomenon is already acknowledged in the literature for convolutional networks (Alsallakh et al. 2021; Zhang 2019) and has been extensively studied in the case of GANs by Xu et al. (2020). It is usually explained by the use of padding in convolutional layers: adding rows and columns of zeros in the intermediate layers allows the network to detect the boundaries of the feature maps and, thus, implicitly infer the position of each pixel.
The present setup goes a step further by adding variables that are more or less directly correlated to surface state and, thus, to absolute gridpoint position. Temperature variability is mostly position related over land, as is obviously the case for orography and, to a lesser extent, for 10-m wind. Over sea, this position-related information fades out while transient features, such as wind fronts, become prominent: the bias is weaker and the GAN struggles to generate correct correlation structures.
This positional bias probably plays a key role by dedicating most of the networks' capacity to extracting and fitting position-related features. This analysis is hence a plausible explanation for another set of observations:
- Right from the baseline configuration (where no explicit position is given to the network), the reconstruction of the correlation of temperature with altitude is very accurate. In this case, learning position-related features with fine-grained spatial detail is largely helpful.
- Adding orography as a constant field to generate is largely detrimental to sample quality according to the scores used, but improves the largest scales of the multiscale SWD (section 5). Positional information at large scales describes the overall structure of the field and is thus most useful for generating the right distribution of patterns. Conversely, reinforcing this bias through orography accelerates position-based overfitting.
- The GAN is not able to use cross-variable correlations to improve individual sample quality but maintains more diversity (section 5): the information added by each variable in the baseline configuration is likely redundant if it is only position related. This prevents the GAN from improving sample quality by exploiting the less redundant, transient features that differentiate the three variables.
- The GAN trades off diversity for quality in the wind-only configuration (section 5). This configuration is the one where positional information has the least weight. It is thus probable that reducing the positional bias makes the GAN focus on the quality of transient features, which play a larger role in discrimination, while relaxing the diversity constraint and narrowing the distribution of patterns. This is likely driven by large-scale pattern detection being harder without positional bias, as evoked in the second point above.
These explanations imply that the setup is prone to overfitting and that, contrary to the common assumption, increasing the batch size will degrade the performance of the GAN. We thus conduct a final analysis using different initial learning rates and batch sizes (BS ∈ {32, 64, 128, 256, 512}, lr0 ∈ {4 × 10−4, 2 × 10−3, 4 × 10−3}). Using the baseline configuration, we train the GAN from scratch for each pair of learning rate and batch size, up to loss saturation. We first observe that loss saturation occurs earlier as the batch size increases (cf. appendix C), supporting the above hypothesis. Figure 14 shows the relative degradation or improvement of the metrics with respect to the BS = 32 configuration for each learning rate. Once saturated, W1,c only slightly increases with batch size at all learning rates. On the other hand, quality metrics such as the PSD and s21 and s22 errors drastically degrade as batch size increases (up to ×5 and ×10 degradation for PSD). SWDavg behaves in between. The effect is less pronounced at smaller learning rates, but it was observed that smaller learning rates degrade scores globally (cf. appendix C).
Reproducing pixelwise, variable-wise distributions is thus rather an “easy” mode of convergence, achievable for most GAN configurations. Increasing batch size then likely strengthens the position-learning dynamics. The GAN rapidly memorizes position-related features and more or less forgets about the transient structures, which are smoothed out by large batches.
This set of explanations is consistent with the hypothesis of absolute gridpoint position learning. It also echoes classical quality-diversity trade-offs encountered in GANs (Radford et al. 2015; Zhang et al. 2019; Brock et al. 2018) and the general perception-distortion trade-off of generative models (Blau and Michaeli 2018); in our case, the trade-off is balanced by the variables fed to the GAN and the amount of positional information they contain. Learning on a fixed domain is a crucial component of the present setup, which explains both the good quality of individual samples and gridpoint distributions, as well as the scale-dependent diversity fit.
Whether such positional bias should be learned is an open question in general. As was shown by the correlation of temperature with altitude, it is very helpful in weather-related tasks, where many features depend on absolute position. The way it is added to the training process, through architectural features and the data themselves, can thus be clearly framed and controlled with a priori heuristics. As an example, the positional bias could be strongly attenuated with variables that are much less dependent on absolute position (e.g., temperature at 850 hPa). This could also be the case if the networks were trained on random domain patches while conditioning both networks on orography. In this case, the task is strictly more difficult as the diversity of the dataset increases. We performed some preliminary experiments in this randomized setup: while it seems that one, indeed, gets rid of positional bias (notably, increasing batch size improves SWDmulti and PSD errors), this remains to be detailed and confirmed in future work.
It is also possible that more sophisticated architectures such as ProGAN (Karras et al. 2018) and StyleGAN (Karras et al. 2019, 2020), which explicitly handle scale-dependent pattern generation and disentangle features, perform notably better on the same setup.
7. Conclusions and perspectives
In this paper, meaningful metrics have been developed and applied to assess the ability of a GAN to emulate outputs of the kilometer-scale AROME-EPS weather forecasting system. From the above evaluation, the following conclusions can be drawn:
- Multiple metrics, borrowed either from weather science or computer vision, are necessary to diagnose the ability of a GAN to consistently generate weather states. Namely, going beyond a mere spectral analysis to describe the quality of generated samples proved useful. Multiscale SWD was successful in characterizing the diversity evolution with scale; scattering coefficients were used to assess the consistency of structures, while the local correlation length scales enabled a positionwise analysis of correlation reconstruction.
- A residual GAN architecture is plainly able to generate multivariate distributions of NWP models at kilometer scale. In particular, it can reproduce detailed textures as well as long-range events with a good diversity. State-of-the-art regularization techniques and networks are necessary for this task, and their training parameters must be carefully set to avoid divergence.
- A study of multivariate generation was performed to probe the effects of adding and removing variables to the training setup. An important phenomenon happening in our GAN was characterized: the positional bias, induced both by padding and by surface variables such as 2-m temperature, is a prominent driver of the learning process. This is a double-edged sword: it enables fast convergence and emulation of crucial features such as the correlation of temperature with altitude, while it degrades the ability of the GAN to generate high-quality transient structures and accelerates the occurrence of overfitting. This seems to be part of the wider quality-diversity trade-off, whose major component in the present case is given by the specific weather variables used.
Precipitation has not been addressed in this work because of the extremely skewed nature of its distribution, specifically the overwhelming class of zero-precipitation days. Resampling techniques (Sha et al. 2020; Ravuri et al. 2021) would arguably tackle this point, but the relatively small size of the database discouraged us from going further in this direction on a first trial. This is a natural path for future work.
Another promising path is the generation of states based on the current weather situation. This framework has been used by many downscaling studies (Leinonen et al. 2021; Harris et al. 2022) that take low-resolution data as inputs to generate ensembles of high-resolution outputs. One could then ask whether a GAN framework could be used to increase the size of operational ensemble forecasts at a minimal computing cost. Open challenges would then be the precise way to condition the GAN with ensemble outputs at the same resolution, as well as the ability to control GAN-produced outliers. We believe the results shown in this study are encouraging enough to go further in this direction.
Finally, a largely unexplored path is the production of temporal sequences of forecasts at the lead times usually covered by the operational NWP models (up to 48–72 h). While it remains open whether the GAN framework is adapted to such a task, this would be a necessary step in order to use data-driven, high-resolution, real-time ensemble emulation.
Acknowledgments.
We deeply thank Léa Berthomier and Bruno Pradel for their thorough technical support with Météo-France’s computing infrastructure. We thank Camille Besombes, Olivier Pannekoucke, Ronan Fablet, Arnaud Mounier, and Thomas Rieutord for insightful discussions. This work was performed as part of the ANR Project 21-CE46-007 “Probabilistic Prediction of Extreme Weather Events Based on AI/Physics Synergy (POESY)” led by one of the authors (L.R.). One of the authors (C.B.) performed this work during his Ph.D. program, which was funded by the French Ministère de la Transition Ecologique as part of the Ph.D. program of the Ingénieurs des Ponts, Eaux et Forêts Civil Corps.
Data availability statement.
The code used to train networks and analyze data through all metrics can be found at https://github.com/flyIchtus/multivariate-GAN. The AROME-EPS dataset used in this study will be made available at the end of the Agence Nationale de la Recherche (ANR) project.
APPENDIX A
Implementation Details
The implementation of our GAN makes use of several techniques acknowledged to either facilitate convergence or accelerate computation. We report here the ones that were helpful. The code is written in PyTorch (Paszke et al. 2019), using the multi-GPU Horovod API (Sergeev and Balso 2018).
- The residual blocks we use follow the usual guidelines of the literature (Miyato et al. 2018; Besombes et al. 2021; Ravuri et al. 2021). The main block consists of two stacked 3 × 3 convolutions, each followed by LeakyReLU and BatchNorm, with a bilinear upscale or downscale layer. Either a 1 × 1 convolution or a direct sum is used as the residual shortcut.
- We use automatic mixed precision (AMP), casting most operations to half precision. This leads to a dramatic acceleration of training and slashes memory consumption, keeping all runs shorter than 12 h and leaving room for later development of architectures. Unfortunately, it also comes with stability issues: some of the runs produced not-a-number (NaN) values at their very beginning, with specific hyperparameter configurations being completely hampered by AMP while running smoothly in single precision.
- It was found that the discriminator's gradients in the failed cases were oscillating violently at the beginning of training. We therefore introduced a small warm-up procedure whereby the discriminator is updated several times for one update of the generator. Choosing an update ratio of 5, as in Gulrajani et al. (2017), on the single first generator step considerably reduced the oscillations of gradients and made training stable for most of the 180 runs conducted for this study.
- Initialization of the neural network weights has been shown to be an important factor for training convergence (Bengio and Glorot 2010). Here we keep the default random initialization for all linear layers while using orthogonal initialization for all convolutional layers. Besides generally having a beneficial impact on training (Saxe et al. 2014), this naturally complements the spectral normalization regularization we adopted by starting with already spectrally normalized (random) weights.
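The residual block and initialization points above can be sketched in PyTorch as follows. This is a minimal, hypothetical version, not the authors' code: the exact ordering of normalization and activation, the LeakyReLU slope, and the class and function names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Sketch of a residual block: two 3x3 convolutions with BatchNorm and
    LeakyReLU, an optional bilinear up/downscale, and a 1x1-conv shortcut
    when channel counts differ (a direct sum otherwise)."""

    def __init__(self, in_ch, out_ch, scale=None):  # scale in {None, "up", "down"}
        super().__init__()
        self.scale = {"up": 2.0, "down": 0.5}.get(scale)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def _resample(self, x):
        if self.scale is None:
            return x
        return F.interpolate(x, scale_factor=self.scale, mode="bilinear",
                             align_corners=False)

    def forward(self, x):
        h = self.conv1(F.leaky_relu(self.bn1(x), 0.2))
        h = self._resample(h)
        h = self.conv2(F.leaky_relu(self.bn2(h), 0.2))
        return h + self.skip(self._resample(x))

def init_weights(module):
    """Orthogonal initialization for convolutional layers only; linear
    layers keep PyTorch's default initialization."""
    if isinstance(module, nn.Conv2d):
        nn.init.orthogonal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# usage: an upsampling block, as would appear in a generator
block = ResBlock(64, 32, scale="up")
block.apply(init_weights)
y = block(torch.randn(2, 64, 16, 16))   # output shape: (2, 32, 32, 32)
```

Orthogonally initialized convolution kernels have unit spectral norm when flattened, which is why this choice dovetails with spectral normalization, as noted above.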
APPENDIX B
Detailed Description of Metrics
a. Pixelwise Earth-mover distance
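At each grid point and for each variable, this metric compares the two marginal (1D) distributions of values with the earth-mover (Wasserstein-1) distance. A minimal numpy sketch, assuming equal sample counts; the function name is illustrative:

```python
import numpy as np

def pixelwise_w1(real, fake):
    """Gridpoint-wise 1D Wasserstein-1 (earth-mover) distance.

    real, fake: (n_samples, H, W) arrays with equal n_samples.
    For 1D empirical distributions of the same size, W1 reduces to the
    mean absolute difference between the sorted samples (i.e., between
    matched empirical quantiles).
    Returns an (H, W) map; averaging it gives a scalar score.
    """
    return np.abs(np.sort(real, axis=0) - np.sort(fake, axis=0)).mean(axis=0)
```

Averaging the map over rows or columns, or over the whole domain, yields aggregate scores such as the W1,r/c values used for on-the-fly validation.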
b. Sliced Wasserstein distance
Here, we draw extensively from the work of Rabin et al. (2011) and Karras et al. (2018). The Wasserstein distance is known to be an informative metric for distributions (Arjovsky et al. 2017), yet it is computationally intractable and exhibits large variance in high-dimensional spaces (Ramdas et al. 2015). The Monte Carlo approximation of this distance defined in Rabin et al. (2011) is an unbiased and robust way to cope with the burden of W1 estimation in high-dimensional spaces (Kolouri et al. 2018).
For multilevel estimation, we first decompose the original field into a Laplacian pyramid, from finest to coarsest scales. The process then consists of selecting a batch of random neighborhoods of 7 × 7 pixels and normalizing each variable of the samples with respect to batch and spatial mean and standard deviation. The distributions of neighborhoods from the GAN and from AROME-EPS samples are finally compared with the help of SWD. Since neighborhoods include several pixels, they have several degrees of freedom; thus, SWD is a way to estimate the multidimensional EMD on these neighborhoods. We use the parameters of Karras et al. (2018) without modification, with 512 unit directions for SWD and 128 random neighborhoods for each level.
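The core SWD computation on such neighborhood descriptors can be sketched as follows. This is a minimal numpy version under the assumption of equal-size point clouds; the function name and arguments are illustrative.

```python
import numpy as np

def sliced_wasserstein(x, y, n_dirs=512, rng=None):
    """Sliced Wasserstein distance between two point clouds.

    x, y: (n_points, d) arrays of descriptors (e.g., flattened 7x7
    neighborhoods) with the same n_points. Each random unit direction
    projects both clouds to 1D, where W1 is the mean absolute difference
    of the sorted projections; the SWD is the average over directions.
    """
    if rng is None:
        rng = np.random.default_rng()
    d = x.shape[1]
    dirs = rng.standard_normal((n_dirs, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    px = np.sort(x @ dirs.T, axis=0)                     # (n_points, n_dirs)
    py = np.sort(y @ dirs.T, axis=0)
    return np.abs(px - py).mean()
```

Each 1D projection admits a closed-form W1 via sorting, which is what makes the sliced estimator tractable where the full multidimensional EMD is not.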
c. Scattering coefficients
Obtaining scattering coefficients consists of successive convolutions with wavelet filter banks, each followed by a nonlinear modulus operation and a final spatial averaging.
To provide estimates of absolute values for the RMSE on scattering coefficients and for SWDmulti, Table B1 summarizes the scores obtained by the Gaussian field maps and the baseline configuration. Note that since the samples are normalized before applying the RMSE, all variables share a common scale of values. This table shows that the GAN performs appreciably better than a Gaussian field in any configuration and that a Gaussian field does not recover the correct distribution of patterns as measured by SWDmulti.
Table B1. Comparison of scores against the AROME-EPS dataset for the Gaussian field and the baseline GAN configuration. Scattering estimators have been averaged over all three variables for the baseline configuration.
APPENDIX C
Hyperparameter Search
Experiments were carried out on five different batch sizes (BS ∈ {32, 64, 128, 256, 512}) and three different initial learning rates (lr0 ∈ {4 × 10−4, 2 × 10−3, 4 × 10−3}). Each configuration was run three times to account for training variability.
For all configurations, the discriminator loss curves exhibit a deep trough followed by a slower ascent, up to a value below 2.0, after which the loss plateaus; meanwhile, the generator loss produces bumps after an abrupt decrease, before oscillating around 0.0 once the D loss reaches its maximum level. Figure C1 shows some examples of this behavior. The converged regime corresponds to situations where D is unable to separate the AROME-EPS and GAN samples, likely indicating convergence of the algorithm to a local minimum. Using BS ≥ 256 provokes rapid saturation of the losses, indicating early stagnation of learning. Reducing the learning rate attenuates this effect and lengthens the ascent phase. The BS = 32 experiment did not reach the plateau for any learning rate but the highest, indicating that batch size and learning rate both control the learning speed. Another direct effect of increasing the batch size is the reduction of loss oscillations.
At the point where the D loss reaches saturation, training is complete for most runs, as our control metrics often saturate as well (not shown). In some cases, some components of SWDmulti tend to increase slightly after the plateau, indicating possible overfitting. A score card for all the tested hyperparameter configurations can then be drawn. For each configuration and each metric, the best (i.e., lowest) score obtained after D loss saturation is selected, or the best score overall when saturation does not occur. Results are reported in Table C1.
Table C1. Scores obtained by each configuration for our metrics panel. Reported scores correspond to the best score obtained after training saturation for the three runs and are given in the order average/best/worst. For all metrics considered, lower is better. For a given configuration and a given run, the "best scores after saturation" do not necessarily correspond to the same step for all metrics. Overall best scores are in bold; worst scores are in italic.
Trying to select the best-performing configuration from this table, one can rule out the BS = 512 configurations, mainly because of their high PSD errors, which correspond to very low visual sample quality (blurry images with checkerboard artifacts). Small batch sizes seem to perform better than the others, especially with respect to the EMD and SWD metrics, even if they do not always perform best on PSD errors. Moreover, the score spread for each configuration is rather wide, and it is not rare for different configurations to produce scores in overlapping ranges. Altogether, lr0 = 4 × 10−3 seems to perform better than any other learning rate, with the BS = 32 configuration scoring best on a wide range of metrics.
Only the most successful configurations of Table C1 are kept to examine their EMD-scattering scores and check their ranking. These observations are summarized in Table C2. The ranking slightly changes: while increasing the batch size degrades the scores as in the previous experiments, an intermediate learning rate of 2 × 10−3 now produces the best scores. All experiments not reported in Table C2 show worse scores than the ones shown.
Table C2. Scattering RMSE estimators for each variable, averaged over the three runs. Best results are in bold. Only the most successful configurations of Table C1 are shown. The score ranking previously obtained with PSD and SWD slightly changes with this metric.
APPENDIX D
Does the GAN Copy the Dataset?
An important consistency check consists of verifying that the GAN does not memorize the training samples and is able to generate sufficiently different samples. To examine this aspect, we use the mean square error (MSE) and look for the pair of GAN and AROME-EPS samples with the lowest global (i.e., including all variables and grid points) MSE distance across the GAN-generated and AROME-EPS datasets. Such samples are shown in Fig. D1 and exhibit a noticeable visual difference. The distribution of the global MSE distance between this specific GAN sample and the whole AROME-EPS dataset is also plotted. As can be seen in Fig. D2, this distribution peaks at about 0.1, which is approximately the variance of the normalized AROME-EPS dataset. The MSE minimum is thus clearly distinct from any AROME-EPS sample while being at a consistent average MSE distance from the whole dataset, further confirming the absence of mode collapse.
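This check can be sketched as follows; the code is a hypothetical numpy implementation assuming normalized samples stacked along the first axis, not the authors' actual procedure.

```python
import numpy as np

def nearest_mse(generated, dataset):
    """Minimum global MSE between each generated sample and the dataset.

    generated: (n_gen, ...), dataset: (n_data, ...) arrays with identical
    trailing shapes (all variables and grid points flattened together).
    Returns, for each generated sample, its lowest MSE to any dataset
    sample; a distribution of these values peaking near zero would
    indicate memorization of training samples.
    """
    g = generated.reshape(len(generated), -1)
    d = dataset.reshape(len(dataset), -1)
    # expand ||g - d||^2 = ||g||^2 + ||d||^2 - 2 g.d to avoid building a
    # huge (n_gen, n_data, n_features) intermediate array
    sq = (g**2).sum(1)[:, None] + (d**2).sum(1)[None, :] - 2.0 * g @ d.T
    return np.maximum(sq, 0.0).min(axis=1) / g.shape[1]
```

The squared-distance expansion keeps memory usage at O(n_gen × n_data), which matters when scanning tens of thousands of samples as done here.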
REFERENCES
Alsallakh, B., N. Kokhlikyan, V. Miglani, J. Yuan, and O. Reblitz-Richardson, 2021: Mind the pad—CNNs can develop blind spots. ICLR 2021: Int. Conf. on Learning Representations, Vienna, Austria, ICLR, https://openreview.net/pdf?id=m1CD7tPubNy.
Andén, J., and S. Mallat, 2014: Deep scattering spectrum. IEEE Trans. Signal Process., 62, 4114–4128, https://doi.org/10.1109/TSP.2014.2326991.
Andreux, M., and Coauthors, 2018: Kymatio: Scattering transforms in Python. arXiv, 1812.11214v3, https://doi.org/10.48550/ARXIV.1812.11214.
Arjovsky, M., S. Chintala, and L. Bottou, 2017: Wasserstein generative adversarial networks. Proc. 34th Int. Conf. on Machine Learning, Sydney, Australia, PMLR, 214–223, https://proceedings.mlr.press/v70/arjovsky17a.html.
Bengio, Y., and X. Glorot, 2010: Understanding the difficulty of training deep feed forward neural networks. Proc. 13th Int. Conf. on Artificial Intelligence and Statistics, Sardinia, Italy, PMLR, 249–256, https://proceedings.mlr.press/v9/glorot10a.html.
Besombes, C., O. Pannekoucke, C. Lapeyre, B. Sanderson, and O. Thual, 2021: Producing realistic climate data with generative adversarial networks. Nonlinear Processes Geophys., 28, 347–370, https://doi.org/10.5194/npg-28-347-2021.
Bhatia, S., A. Jain, and B. Hooi, 2021: ExGAN: Adversarial generation of extreme samples. arXiv, 2009.08454v3, https://doi.org/10.48550/arXiv.2009.08454.
Bihlo, A., 2020: A generative adversarial network approach to (ensemble) weather prediction. arXiv, 2006.07718v1, https://doi.org/10.48550/arXiv.2006.07718.
Blau, Y., and T. Michaeli, 2018: The perception-distortion tradeoff. 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Salt Lake City, UT, IEEE, 6228–6237, https://doi.org/10.1109/CVPR.2018.00652.
Bouttier, F., B. Vié, O. Nuissier, and L. Raynaud, 2012: Impact of stochastic physics in a convection-permitting ensemble. Mon. Wea. Rev., 140, 3706–3721, https://doi.org/10.1175/MWR-D-12-00031.1.
Bouttier, F., L. Raynaud, O. Nuissier, and B. Ménétrier, 2016: Sensitivity of the AROME ensemble to initial and surface perturbations during HyMeX. Quart. J. Roy. Meteor. Soc., 142, 390–403, https://doi.org/10.1002/qj.2622.
Brock, A., J. Donahue, and K. Simonyan, 2018: Large scale GAN training for high fidelity natural image synthesis. arXiv, 1809.11096v2, https://doi.org/10.48550/arXiv.1809.11096.
Brousseau, P., L. Berre, F. Bouttier, and G. Desroziers, 2011: Background-error covariances for a convective-scale data-assimilation system: Arome–France 3D-Var. Quart. J. Roy. Meteor. Soc., 137, 409–422, https://doi.org/10.1002/qj.750.
Bruna, J., and S. Mallat, 2013: Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell., 35, 1872–1886, https://doi.org/10.1109/TPAMI.2012.230.
Cheng, S., and B. Ménard, 2021: How to quantify fields or textures? A guide to the scattering transform. arXiv, 2112.01288v1, https://doi.org/10.48550/ARXIV.2112.01288.
Cheng, S., Y.-S. Ting, B. Ménard, and J. Bruna, 2020: A new approach to observational cosmology using the scattering transform. Mon. Not. Roy. Astron. Soc., 499, 5902–5914, https://doi.org/10.1093/mnras/staa3165.
Denis, B., J. Côté, and R. Laprise, 2002: Spectral decomposition of two-dimensional atmospheric fields on limited-area domains using the discrete cosine transform (DCT). Mon. Wea. Rev., 130, 1812–1829, https://doi.org/10.1175/1520-0493(2002)130<1812:SDOTDA>2.0.CO;2.
Descamps, L., C. Labadie, A. Joly, E. Bazile, P. Arbogast, and P. Cébron, 2015: PEARP, the Météo-France short-range ensemble prediction system. Quart. J. Roy. Meteor. Soc., 141, 1671–1685, https://doi.org/10.1002/qj.2469.
Dumoulin, V., I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, 2016: Adversarially learned inference. arXiv, 1606.00704v3, https://doi.org/10.48550/ARXIV.1606.00704.
Ebert, E. E., 2008: Fuzzy verification of high-resolution gridded forecasts: A review and proposed framework. Meteor. Appl., 15, 51–64, https://doi.org/10.1002/met.25.
Gagne, D. J., II, H. M. Christensen, A. C. Subramanian, and A. H. Monahan, 2020: Machine learning for stochastic parameterization: Generative adversarial networks in the Lorenz ’96 model. J. Adv. Model. Earth Syst., 12, e2019MS001896, https://doi.org/10.1029/2019MS001896.
Garcia, G. B., M. Lagrange, I. Emmanuel, and H. Andrieu, 2015: Classification of rainfall radar images using the scattering transform. 2015 23rd European Signal Processing Conf. (EUSIPCO), Nice, France, IEEE, 1940–1944, https://doi.org/10.1109/EUSIPCO.2015.7362722.
Goodfellow, I. J., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, 2014: Generative adversarial networks. arXiv, 1406.2661v1, https://doi.org/10.48550/arXiv.1406.2661.
Gulrajani, I., F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, 2017: Improved training of Wasserstein GANs. arXiv, 1704.00028v3, https://doi.org/10.48550/arXiv.1704.00028.
Harris, L., A. T. T. McRae, M. Chantry, P. D. Dueben, and T. N. Palmer, 2022: A generative deep learning approach to stochastic downscaling of precipitation forecasts. arXiv, 2204.02028v2, https://doi.org/10.48550/arXiv.2204.02028.
Karras, T., T. Aila, S. Laine, and J. Lehtinen, 2018: Progressive growing of GANs for improved quality, stability, and variation. arXiv, 1710.10196v3, https://doi.org/10.48550/arXiv.1710.10196.
Karras, T., S. Laine, and T. Aila, 2019: A style-based generator architecture for generative adversarial networks. arXiv, 1812.04948v3, https://doi.org/10.48550/arXiv.1812.04948.
Karras, T., S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, 2020: Analyzing and improving the image quality of StyleGAN. arXiv, 1912.04958v2, https://doi.org/10.48550/arXiv.1912.04958.
Kingma, D. P., and J. Ba, 2015: Adam: A method for stochastic optimization. arXiv, 1412.6980v9, https://doi.org/10.48550/arXiv.1412.6980.
Kingma, D. P., and M. Welling, 2014: Auto-encoding variational Bayes. arXiv, 1312.6114v11, https://doi.org/10.48550/arXiv.1312.6114.
Kingma, D. P., and M. Welling, 2019: An introduction to variational autoencoders. Found. Trends Mach. Learn., 12, 307–392, https://doi.org/10.1561/2200000056.
Kolouri, S., P. E. Pope, C. E. Martin, and G. K. Rohde, 2018: Sliced-Wasserstein autoencoder: An embarrassingly simple generative model. arXiv, 1804.01947v3, https://doi.org/10.48550/arXiv.1804.01947.
Kynkäänniemi, T., T. Karras, S. Laine, J. Lehtinen, and T. Aila, 2019: Improved precision and recall metric for assessing generative models. arXiv, 1904.06991v3, https://doi.org/10.48550/arXiv.1904.06991.
Leinonen, J., D. Nerini, and A. Berne, 2021: Stochastic super-resolution for downscaling time-evolving atmospheric fields with a generative adversarial network. IEEE Trans. Geosci. Remote Sens., 59, 7211–7223, https://doi.org/10.1109/TGRS.2020.3032790.
Lim, J. H., and J. C. Ye, 2017: Geometric GAN. arXiv, 1705.02894v2, https://doi.org/10.48550/arXiv.1705.02894.
Mallat, S., 2012: Group invariant scattering. Commun. Pure Appl. Math., 65, 1331–1398, https://doi.org/10.1002/cpa.21413.
Marin, I., S. Gotovac, M. Russo, and D. Božić-Štulić, 2021: The effect of latent space dimension on the quality of synthesized human face images. J. Commun. Software Syst., 17, 124–133, https://doi.org/10.24138/jcomss-2021-0035.
Mescheder, L., A. Geiger, and S. Nowozin, 2018: Which training methods for GANs do actually converge? arXiv, 1801.04406v4, https://doi.org/10.48550/arXiv.1801.04406.
Miyato, T., T. Kataoka, M. Koyama, and Y. Yoshida, 2018: Spectral normalization for generative adversarial networks. arXiv, 1802.05957v1, https://doi.org/10.48550/arXiv.1802.05957.
Montmerle, T., Y. Michel, E. Arbogast, B. Ménétrier, and P. Brousseau, 2018: A 3D ensemble variational data assimilation scheme for the limited-area AROME model: Formulation and preliminary results. Quart. J. Roy. Meteor. Soc., 144, 2196–2215, https://doi.org/10.1002/qj.3334.
Mustafa, M., D. Bard, W. Bhimji, Z. Lukić, R. Al-Rfou, and J. M. Kratochvil, 2019: CosmoGAN: Creating high-fidelity weak lensing convergence maps using generative adversarial networks. Comput. Astrophys. Cosmol., 6, 1, https://doi.org/10.1186/s40668-019-0029-9.
Odena, A., C. Olah, and J. Shlens, 2017: Conditional image synthesis with auxiliary classifier GANs. arXiv, 1610.09585v4, https://doi.org/10.48550/arXiv.1610.09585.
Olea, R. A., 1994: Fundamentals of semivariogram estimation, modeling, and usage. Stochastic Modeling and Geostatistics: Principles, Methods, and Case Studies, J. M. Yarus and R. L. Chambers, Eds., American Association of Petroleum Geologists, 27–36, https://doi.org/10.1306/CA3590.
Pannekoucke, O., L. Berre, and G. Desroziers, 2008: Background-error correlation length-scale estimates and their sampling statistics. Quart. J. Roy. Meteor. Soc., 134, 497–508, https://doi.org/10.1002/qj.212.
Pantillon, F., P. Knippertz, and U. Corsmeier, 2017: Revisiting the synoptic-scale predictability of severe European winter storms using ECMWF ensemble reforecasts. Nat. Hazards Earth Syst. Sci., 17, 1795–1810, https://doi.org/10.5194/nhess-17-1795-2017.
Paszke, A., and Coauthors, 2019: PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, Vol. 32, H. Wallach et al., Eds., Curran Associates Inc., 8024–8035.
Ponzano, M., B. Joly, L. Descamps, and P. Arbogast, 2020: Systematic error analysis of heavy-precipitation-event prediction using a 30-year hindcast dataset. Nat. Hazards Earth Syst. Sci., 20, 1369–1389, https://doi.org/10.5194/nhess-20-1369-2020.
Rabin, J., G. Peyré, J. Delon, and M. Bernot, 2011: Wasserstein barycenter and its application to texture mixing. Scale Space and Variational Methods in Computer Vision (SSVM’11), Springer, 435–446.
Radford, A., L. Metz, and S. Chintala, 2015: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv, 1511.06434v2, https://doi.org/10.48550/arXiv.1511.06434.
Ramdas, A., N. Garcia, and M. Cuturi, 2015: On Wasserstein two sample testing and related families of nonparametric tests. arXiv, 1509.02237v2, https://doi.org/10.48550/arXiv.1509.02237.
Ravuri, S., and Coauthors, 2021: Skilful precipitation nowcasting using deep generative models of radar. Nature, 597, 672–677, https://doi.org/10.1038/s41586-021-03854-z.
Raynaud, L., and O. Pannekoucke, 2013: Sampling properties and spatial filtering of ensemble background-error length-scales. Quart. J. Roy. Meteor. Soc., 139, 784–794, https://doi.org/10.1002/qj.1999.
Raynaud, L., and F. Bouttier, 2016: Comparison of initial perturbation methods for ensemble prediction at convective scale. Quart. J. Roy. Meteor. Soc., 142, 854–866, https://doi.org/10.1002/qj.2686.
Roberts, N. M., and H. W. Lean, 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Wea. Rev., 136, 78–97, https://doi.org/10.1175/2007MWR2123.1.
Rubner, Y., C. Tomasi, and L. J. Guibas, 2000: The Earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis., 40, 99–121, https://doi.org/10.1023/A:1026543900054.
Saxe, A. M., J. L. McClelland, and S. Ganguli, 2014: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv, 1312.6120v3, https://doi.org/10.48550/arXiv.1312.6120.
Seity, Y., P. Brousseau, S. Malardel, G. Hello, P. Bénard, F. Bouttier, C. Lac, and V. Masson, 2011: The AROME-France convective-scale operational model. Mon. Wea. Rev., 139, 976–991, https://doi.org/10.1175/2010MWR3425.1.
Sergeev, A., and M. D. Balso, 2018: Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv, 1802.05799v3, https://doi.org/10.48550/arXiv.1802.05799.
Sha, Y., D. J. Gagne II, G. West, and R. Stull, 2020: Deep-learning-based gridded downscaling of surface meteorological variables in complex terrain. Part II: Daily precipitation. J. Appl. Meteor. Climatol., 59, 2075–2092, https://doi.org/10.1175/JAMC-D-20-0058.1.
Vincendon, B., V. Ducrocq, O. Nuissier, and B. Vié, 2011: Perturbation of convection-permitting NWP forecasts for flash-flood ensemble forecasting. Nat. Hazards Earth Syst. Sci., 11, 1529–1544, https://doi.org/10.5194/nhess-11-1529-2011.
Weaver, A. T., and I. Mirouze, 2013: On the diffusion equation and its application to isotropic and anisotropic correlation modelling in variational assimilation. Quart. J. Roy. Meteor. Soc., 139, 242–260, https://doi.org/10.1002/qj.1955.
Xu, R., X. Wang, K. Chen, B. Zhou, and C. C. Loy, 2020: Positional encoding as spatial inductive bias in GANs. arXiv, 2012.05217v1, https://doi.org/10.48550/arXiv.2012.05217.
Zhang, H., I. Goodfellow, D. Metaxas, and A. Odena, 2019: Self-attention generative adversarial networks. arXiv, 1805.08318v2, https://doi.org/10.48550/arXiv.1805.08318.
Zhang, R., 2019: Making convolutional networks shift-invariant again. Proc. 36th Int. Conf. on Machine Learning, Long Beach, CA, PMLR, 7324–7334, http://proceedings.mlr.press/v97/zhang19a.html.