Multivariate Emulation of Kilometer-Scale Numerical Weather Predictions with Generative Adversarial Networks: A Proof of Concept

Clément Brochet (Ecole des Ponts Paris-Tech, Champs-sur-Marne, France; CNRM, Université de Toulouse, Météo-France, CNRS, Toulouse, France), Laure Raynaud (CNRM, Université de Toulouse, Météo-France, CNRS, Toulouse, France), Nicolas Thome (CNAM, Paris, France), Matthieu Plu (CNRM, Université de Toulouse, Météo-France, CNRS, Toulouse, France), and Clément Rambour (CNAM, Paris, France)

Abstract

Emulating numerical weather prediction (NWP) model outputs is important to compute large datasets of weather fields in an efficient way. The purpose of the present paper is to investigate the ability of generative adversarial networks (GANs) to emulate distributions of multivariate outputs (10-m wind and 2-m temperature) of a kilometer-scale NWP model. For that purpose, a residual GAN architecture, regularized with spectral normalization, is trained against a kilometer-scale dataset from the AROME Ensemble Prediction System (AROME-EPS). A wide range of metrics is used for quality assessment, including pixelwise and multiscale Earth-mover distances, spectral analysis, and correlation length scales. The use of wavelet-based scattering coefficients as meaningful metrics is also presented. The GAN generates samples with good distribution recovery and good skill in average spectrum reconstruction. Important local weather patterns are reproduced with a high level of detail, while the joint generation of multivariate samples matches the underlying AROME-EPS distribution. The different metrics introduced describe the GAN's behavior in a complementary manner, highlighting the need to go beyond spectral analysis in generation quality assessment. An ablation study then shows that removing variables from the generation process is globally beneficial, pointing to the GAN's limited ability to leverage cross-variable correlations. The role of absolute positional bias in the training process is also characterized, explaining both the accelerated learning and the quality-diversity trade-off observed in the multivariate emulation. These results open perspectives on the use of GANs to enrich NWP ensemble approaches, provided that the aforementioned positional bias is properly controlled.

© 2023 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Clément Brochet, clement.brochet@meteo.fr

1. Introduction

Having access to large sets of weather forecasts or reforecasts is of major importance in many applications. For instance, some fundamental and applied studies in weather science rely on large reforecasts of events, for example, to detect climatological trends in specific patterns such as extratropical depressions (Pantillon et al. 2017) or heavy precipitating events (Ponzano et al. 2020). Such reforecasts, like operational forecasts in many centers, are usually based on ensemble prediction systems (EPSs). However, the high computing and storage costs of these systems can limit their configuration to a few dozen members for convection-permitting, kilometer-scale ensembles. This is often not enough to accurately sample the future distributions of weather variables. As a consequence, emulation of forecasts at different scales has potential applications for both climatological studies and operational weather forecasting.

Increasing the ensemble size without resorting to the costly option of running additional EPS members remains an open challenge. Existing solutions mainly rely on different flavors of neighborhood approaches (Roberts and Lean 2008; Ebert 2008) that are based on the assumption of locally homogeneous weather. An original approach has been proposed by Vincendon et al. (2011) to design perturbed precipitation forecasts by applying location and intensity perturbations to the deterministic forecast. In recent years, generative deep learning has emerged as a novel approach that can produce accurate synthetic data, and it has recently seen a broadening use by the NWP community. In particular, generative adversarial networks (GANs; Goodfellow et al. 2014) and variational autoencoders (VAEs; Kingma and Welling 2014) have already been used for several NWP applications (Bihlo 2020; Ravuri et al. 2021; Leinonen et al. 2021; Bhatia et al. 2021; Harris et al. 2022). Of particular interest is the recent work of Besombes et al. (2021) that demonstrates the ability of a GAN to emulate realistic atmospheric states (accounting for several variables at different atmospheric levels) when trained on outputs of a simple climate model at a relatively coarse resolution (≈300 km).

Generative techniques allow sample draws from a simple, latent distribution, which are then mapped into higher-dimensional spaces (e.g., the space of NWP model outputs). VAEs have an encoder-latent-decoder structure; their latent space is used both to embed the input samples and then to generate maximum-likelihood samples from a parameterized distribution. This setup is prone to creating noisy samples, as exemplified by Dumoulin et al. (2016), and the denoised samples can be blurry (Kingma and Welling 2019). GANs, on the other hand, are composed of two competing networks (named the generator and the discriminator). Once trained, the generator of a GAN usually produces highly detailed images (Radford et al. 2015; Karras et al. 2018).

Although the performance of GANs can be appealing, it can be challenging to train a GAN model stably (Goodfellow et al. 2014; Arjovsky et al. 2017), as training is commonly affected by several obstacles. The first one is mode collapse, which is the concentration of the produced samples on a small portion of the training distribution and, in extreme cases, on a single sample. This can happen if the generator begins to reproduce a specific subset of the training set to anomalously sharp numerical precision (Radford et al. 2015); it is a form of overfitting. A second difficulty is a sudden loss of quality of the generated samples. This can be due to inefficient feature extraction by the discriminator or to discriminator overfitting (Brock et al. 2018). This tension between sample quality and distribution recovery has been termed the quality-diversity trade-off (Brock et al. 2018) or, more recently, the perception-distortion trade-off (Blau and Michaeli 2018). Therefore, evaluating a GAN should focus on two main aspects: the intrinsic quality of the samples and the recovery of the main features of the training dataset; this requires specific metrics.

Building on the work of Besombes et al. (2021), the objective of the present article is to examine the ability of GANs to emulate multivariate outputs of NWP models at a kilometric resolution, close to the one studied by Ravuri et al. (2021). To the authors’ knowledge, this aspect has not been evaluated yet, and the sensitivity of GAN training advocates for a dedicated study. Two important questions will be addressed: Are GANs effectively able to emulate multivariate outputs with a proper representation of every spatial scale? How can one evaluate the diversity and realism of the outputs of a GAN trained on such data? This is a preliminary step before using GANs to enhance EPSs, although such a task is left for future work.

This study proposes the training of a residual, spectrally normalized Wasserstein–GAN (Miyato et al. 2018), using kilometer-scale model outputs from the AROME-EPS. The AROME-EPS dataset involves several fields exhibiting fine-scale variations, such as 10-m wind speed components and 2-m temperature. To analyze what effects different variables have on training, several configurations will be examined with distinct sets of variables. Borrowing from weather science and computer vision, a comprehensive set of metrics is considered to assess different aspects of the GAN’s outputs. The spatial structure of emulated fields is evaluated with spectral transforms, correlation length scales, and scattering coefficients. These metrics are complemented with distributional distances, using pixelwise Wasserstein distance (Besombes et al. 2021) and sliced Wasserstein distance (Rabin et al. 2011; Karras et al. 2018). With this set of metrics, a detailed view of the GAN’s capabilities and weaknesses is provided, and we assess its sensitivity to the choice of hyperparameters and to the chosen architecture and setup.

The outline of the paper is as follows. The dataset and choices made for the setup are detailed in section 2; this includes choice of data, network architecture, and implementation of the GAN training algorithm. Section 3 details the whole set of metrics to be used in evaluation, and the evaluation strategy. Section 4 presents the main results obtained for the joint emulation of three AROME variables with a GAN. Section 5 compares the results obtained when varying the number and nature of the AROME variables used as predictors. Section 6 discusses the results. Section 7 provides conclusions and opens some perspectives for future work.

2. Generating AROME forecasts with a GAN: Setup choices

a. Dataset and problem formulation

The dataset used is made of forecasts from the 16-member, 1.3-km resolution AROME-EPS (Bouttier et al. 2012; Raynaud and Bouttier 2016), covering about 17 continuous months, from 15 June 2020 up to 12 November 2021. AROME-EPS is the ensemble version of the AROME limited-area, convection-permitting model (Seity et al. 2011) used operationally at Météo-France. The version of AROME-EPS considered uses a 1.3-km grid resolution and produces outputs on a regular latitude–longitude grid at 0.025° resolution. Initial conditions are built using the AROME 3D-Var analysis (Brousseau et al. 2011) and perturbations from the AROME Ensemble Data Assimilation (Montmerle et al. 2018), while lateral boundary conditions are given by forecasts of the global ARPEGE-EPS model (Descamps et al. 2015). AROME-EPS also uses the stochastically perturbed parameterization tendencies (SPPTs) scheme (Bouttier et al. 2012) and surface perturbations (Bouttier et al. 2016).

AROME-EPS 16-member forecasts are launched daily at 2100 UTC, and the first 24 h of prediction with 3-hourly outputs are used for training. The fields considered are the two horizontal directions of wind at 10-m height, referred to as u (the zonal component) and υ (the meridional component), and the 2-m temperature (t2m). Additionally, as will be detailed in section 5, orography is used in some experiments as a constant field for the GAN to generate.

To provide a flexible experimental setup and to keep the training runs within reasonable time windows, a small subregion of the AROME domain is selected. It corresponds to the Mediterranean coastal region and the Rhône valley (see Fig. 1). The subdomain considered thus spans 128 × 128 grid points, with a side length of approximately 330 km. This choice of location is motivated by the variable terrain features and identifiable weather patterns known to occur in this region. The joint presence of the French Alps and the Mediterranean Sea, as well as marked episodes of strong northerly winds (mistral events) and heavy precipitating events characterized by localized strong gradients, makes this an interesting region for experiments. Moreover, the high resolution of the samples allows for investigation of several scales of variability, from the regional scale down to the typical grid scale of state-of-the-art convection-permitting models.

Fig. 1. (left) The full AROME domain is shown, along with the 128 × 128 subdomain used for the GAN training. (top right) Its main geographical features are represented on a topographic map (altitude; m). (bottom right) Dataset organization with its main variability directions. Each sample is a 3 × 128 × 128 array, corresponding to a "volume element" of the "dataset box" shown at the bottom right.

In the baseline configuration, fields of u, υ, and t2m for a given lead time, date, and ensemble member are learned jointly as part of the same sample. One single day of forecasts hence yields 8 × 16 = 128 distinct data samples. Altogether, the usable dataset is then composed of 66 048 samples. The shape of the GAN output tensors is then 3 × 128 × 128, with 3 being the number of variables and 128 × 128 the domain size.

Using a dataset from AROME-EPS increases the volume of training data compared to using the deterministic AROME forecasts over the same period. It is important to note that many of the samples are correlated, whether they correspond to close lead times or to different members of the same forecast. However, for a given forecast, samples corresponding to different lead times and different ensemble members are physically distinct, notably because of the fine-scale variability of wind fields and of the diurnal cycle of temperature. The ensemble-based dataset thus exhibits an increased small-scale diversity compared to a deterministic dataset. It is possible, therefore, to view the ensemble as an NWP-specific data augmentation strategy, implemented on top of a deterministic forecast system over a given period. This study will thus assess to what extent a GAN can recover this given, fixed distribution of high-resolution samples enriched by the EPS.

b. Choices for the GAN architecture

The GAN framework has been thoroughly investigated in recent years, and some guidelines have emerged to design efficient and reliable GAN training algorithms. Let us denote by p_data the distribution of the target dataset to be emulated (in our case, AROME-EPS). The purpose of the GAN is to provide a function G mapping a high-dimensional, a priori-defined distribution p_z onto p_data. The distribution p_z is defined on a latent space Z taken as input to the deep network supporting G (the generator). Outputs from G are then given as inputs to the discriminator network D, which tries to distinguish between "fake" samples G(Z) produced by the generator and "real" samples X drawn from p_data. The output of D is a single scalar assigning a "score" to each sample.

The objective function must ensure that D confers high scores to samples from the "real" distribution while maintaining low scores on outputs from G. With D being optimized, G then aims at producing samples that confuse D, that is, that obtain high scores from D. An optimal training then results in D being unable to separate real samples from fake ones, despite being optimally designed to separate them. Ideally, then, the distribution produced by G completely and correctly recovers p_data.

The GAN training framework exhibits convergence and stability issues. Notably, the well-known "mode collapse" problem consists of G concentrating the mass of p_G(Z) on a small part of the data distribution while "forgetting" about the rest. To tackle this phenomenon, Arjovsky et al. (2017) emphasize the need for the discriminator to be a smooth (Lipschitz) function so that it continuously separates fake and real distribution samples, and introduce the Wasserstein-GAN framework (WGAN), noting that the discriminator is trained to assess a Wasserstein distance between the fake and real distributions. Making the discriminator a Lipschitz function of its input samples comes down to bounding its gradient. Several formulations of the GAN objective have since implemented this regularization constraint, which effectively improved on the original GAN formulation. The guidelines of Miyato et al. (2018), that is, the use of spectrally normalized convolution layers, are followed in this work. Spectral normalization (SN) consists of renormalizing the weight matrices of the discriminator to bring their highest singular value to one (hence the term spectral). SN thus naturally enforces the Lipschitz constraint while being more efficient than other techniques, such as gradient clipping (Arjovsky et al. 2017), which imposes an arbitrary upper bound on the gradient, or gradient penalty (GP; Gulrajani et al. 2017), which penalizes the gradient when its norm deviates from unity.

To the authors' knowledge, this study is one of the first to propose a WGAN-SN framework for geophysics applications. Among the studies dealing with the generation of atmospheric fields with GANs, only Ravuri et al. (2021) use SN, while others, such as Besombes et al. (2021) and Harris et al. (2022), use the WGAN-GP formulation. Miyato et al. (2018) only apply spectral normalization to discriminator layers, but the literature has since acknowledged the positive effect of using SN on both the generator and the discriminator (Brock et al. 2018). This double regularization is implemented in the present setup.
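
A minimal sketch of how such a spectrally normalized convolution can be built in PyTorch is given below (the helper name sn_conv and the tensor shapes are illustrative assumptions, not the exact implementation used in this work):

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


def sn_conv(in_ch: int, out_ch: int, kernel_size: int = 3) -> nn.Module:
    """Convolution whose weight matrix is renormalized so that its largest
    singular value is 1, enforcing the Lipschitz constraint discussed above."""
    return spectral_norm(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
    )


# Example: one spectrally normalized convolution applied to a batch of
# 3-channel (u, v, t2m) fields on a 128 x 128 subdomain.
x = torch.randn(8, 3, 128, 128)
y = sn_conv(3, 64)(x)  # output shape: (8, 64, 128, 128)
```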

GANs can also produce unstructured or strongly corrupted samples, depending on which features of the dataset the discriminator is able to identify as crucial. This failure mode can occur in the absence of mode collapse or concomitantly with it (Brock et al. 2018; Mescheder et al. 2018). This makes the training of GANs difficult, even within the Wasserstein-GAN framework, and special care has to be devoted to the choice of network hyperparameters.

Finally, the “hinge-loss” objective formulation given by Lim and Ye (2017) is used, where both G and D are trained to minimize their loss:
$$\mathcal{L}(D) = \mathbb{E}_{X \sim p_{\mathrm{data}}}\!\left[\max\left(0,\, 1 - D(X)\right)\right] + \mathbb{E}_{Z \sim p_z}\!\left[\max\left(0,\, 1 + D\!\left[G(Z)\right]\right)\right],$$
$$\mathcal{L}(G) = -\,\mathbb{E}_{Z \sim p_z}\!\left[D\!\left[G(Z)\right]\right].$$
These quantities are minimized through stochastic gradient descent, using an Adam optimizer (Kingma and Ba 2015) to adapt the weights of the two networks. In practice, expectations are estimated by randomly drawing samples from the AROME-EPS dataset (for the p_data estimator) and from the p_z distribution. The parameters of D and G are then updated alternately. We set p_z to a centered normal distribution of dimension d = 64 with an identity covariance matrix. The choice of d follows previous literature (Besombes et al. 2021; Mustafa et al. 2019). This dimension was kept fixed throughout the study, since Marin et al. (2021) indicate that this parameter might not have a significant influence on the quality of generation, provided it is not too small (typically, below 64).
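
For illustration, a hedged sketch of one such alternating hinge-loss update is given below (PyTorch; the names generator, discriminator, opt_g, and opt_d are assumed to be defined elsewhere, and this is not the exact training code used in the study):

```python
import torch
import torch.nn.functional as F


def train_step(generator, discriminator, opt_g, opt_d, x_real, latent_dim=64):
    """One alternating hinge-loss update of D and G on a batch of real samples."""
    batch_size = x_real.shape[0]

    # Discriminator update: hinge loss on real and generated samples.
    z = torch.randn(batch_size, latent_dim)
    x_fake = generator(z).detach()
    loss_d = (F.relu(1.0 - discriminator(x_real)).mean()
              + F.relu(1.0 + discriminator(x_fake)).mean())
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator update: minimize -E[D(G(Z))], i.e., seek high discriminator scores.
    z = torch.randn(batch_size, latent_dim)
    loss_g = -discriminator(generator(z)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    return loss_d.item(), loss_g.item()
```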

A residual architecture is chosen, consisting of two residual networks directly taken from Miyato et al. (2018), as shown in Fig. 2. Starting from a 64-dimensional latent space, samples are shaped into feature maps whose resolution gradually increases as they pass through the generator, before being output through a tanh layer. The discriminator follows a symmetric downscaling pattern, though it is one layer deeper and involves more channels before the last dense, output layer. It should be emphasized that although the networks are relatively shallow, their estimated receptive fields (i.e., the size of the input region that contributes to a given output pixel) are large enough to model long-range correlations. For example, a single residual block of the generator involves two 3 × 3 convolutions and one upsampling layer with a dilation factor of 2. This configuration gives a maximal local receptive field of 10 pixels for each residual block. Since we stack five of these blocks, the final, global receptive field spans beyond 128 pixels, which is the size of our samples. Therefore, the final sample is able to take into account each degree of freedom of the random input.

Fig. 2. Network architectures of (a) the generator and (b) the discriminator. Input layers are at the bottom of the schematic, and the networks process tensors from bottom to top.

Before training, the samples are rescaled so that the global minimum and maximum values of the dataset fit within the [−0.95, 0.95] range. This ensures that the hyperbolic tangent output of the generator can reach the dataset's extremes and even go beyond these limits. The mean, minimum, and maximum values are precomputed over all grid points and all data samples. Supplementary training parameters and procedures (floating precision, warm-up, initialization) are detailed in appendix A. The models are trained for a fixed number of 60 000 update steps, on a cluster of four NVIDIA V100 graphics processing units (GPUs) with 32 GB of RAM. Models are thus trained for 4–12 h of wall-clock time, depending on the batch size. A step corresponds to one update of both networks after a forward and backward pass. Thus, for two different batch sizes, this fixed number of steps corresponds to different numbers of epochs (i.e., sets of steps corresponding to 100% of the dataset seen by both networks).
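
As an illustration, one plausible implementation of this min-max rescaling is sketched below (function names and the exact scope of the precomputed min/max are assumptions):

```python
import numpy as np


def rescale(data: np.ndarray, vmin: float, vmax: float) -> np.ndarray:
    """Map values from [vmin, vmax] (precomputed over the whole dataset)
    to the [-0.95, 0.95] range expected by the tanh output layer."""
    return -0.95 + 1.9 * (data - vmin) / (vmax - vmin)


def unscale(data: np.ndarray, vmin: float, vmax: float) -> np.ndarray:
    """Inverse transform, back to physical units."""
    return vmin + (data + 0.95) * (vmax - vmin) / 1.9
```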

3. Evaluation metrics

a. Distributional metrics

The training is monitored using three estimates of the Earth-mover distance (EMD; Rubner et al. 2004) between p_data and p_GAN = G(p_z). This metric quantifies the proximity of two multidimensional distributions. The exact retrieval of the EMD is a hard problem in general, and approximate EMD estimators converge slowly, necessitating a large number of samples to be accurate (Ramdas et al. 2015). However, when the data are univariate, one is left with the following:
$$W_1(P, Q) = \int_0^1 \left| F_P^{-1}(t) - F_Q^{-1}(t) \right| dt,$$
where F_P and F_Q are the cumulative distribution functions associated with P and Q, respectively.

Following the strategy of Besombes et al. (2021), two 1D-EMD estimates are computed. At each test step, a random selection of pixels is sampled to evaluate the average of per-pixel, per-variable 1D EMDs; another average of 1D EMDs is also evaluated on a fixed set of pixels, covering the central 64 × 64 crop of the domain (and averaged over variables). These estimates are hereafter termed W1,r (random pixels) and W1,c (central crop) and are a global measure of the quality of the generation of marginal (per-pixel, per-variable) distributions.
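
A hedged sketch of such a per-pixel estimate is given below, using the closed-form 1D Wasserstein distance (the function name pixelwise_w1 and the array layout are assumptions):

```python
import numpy as np
from scipy.stats import wasserstein_distance


def pixelwise_w1(real: np.ndarray, fake: np.ndarray, pixels) -> float:
    """Average 1D EMD over a list of (variable, i, j) pixel indices.
    `real` and `fake` have shape (n_samples, n_vars, H, W)."""
    dists = [wasserstein_distance(real[:, v, i, j], fake[:, v, i, j])
             for (v, i, j) in pixels]
    return float(np.mean(dists))


# Toy example: 100 samples, 3 variables, 8 x 8 grid, all pixels used.
real = np.random.randn(100, 3, 8, 8)
fake = np.random.randn(100, 3, 8, 8)
pixels = [(v, i, j) for v in range(3) for i in range(8) for j in range(8)]
print(pixelwise_w1(real, fake, pixels))
```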

A third estimate of EMD is taken from Karras et al. (2018) and termed multilevel sliced Wasserstein distance (SWDmulti). This estimate measures multidimensional EMDs of images at different resolutions. Precisely, it decomposes the image signals on four different resolution levels and generates a Laplacian pyramid: starting from the finest-grained level, each image is obtained from the previous by Gaussian filter convolution, difference, and subsampling. One then measures EMD on each of the levels obtained, for the joint distribution of all variables. These estimates go from the fine-grained level (where images have 128 × 128 dimensions and conserve small-scale fluctuations) to the coarse-grained level (where images only have 16 × 16 dimensions and conserve low-frequency fluctuations). They are used to compare the distribution of patterns at each level. The name sliced Wasserstein distance refers to the way the estimates of EMD on multidimensional spaces are performed. This estimation procedure is unbiased (Rabin et al. 2011; Kolouri et al. 2018) and robust as long as the number of samples is large enough. This yields a four-component distance (one component for each level): SWDmulti = (SWD128, SWD64, SWD32, SWD16). Following the intuition of Rabin et al. (2011), the SWDmulti metric follows a kind of “wavelet decomposition” approach, as it measures the discrepancy between two image distributions at several scales of fluctuations for local neighborhoods, merging the contributions of different variables. A visual explanation of this metric is shown in Fig. 3, and a detailed description is given in appendix B.
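
The core of the sliced estimation can be sketched as follows: multidimensional descriptors (e.g., flattened local neighborhoods extracted from one pyramid level) are projected onto random directions, and the 1D EMDs of the projections are averaged. The snippet below is a minimal sketch under these assumptions and omits the Laplacian pyramid and neighborhood extraction steps:

```python
import numpy as np


def sliced_wasserstein(a: np.ndarray, b: np.ndarray, n_dirs: int = 128, seed: int = 0) -> float:
    """Sliced Wasserstein distance between two point clouds of equal size.
    `a` and `b` have shape (n_points, n_features)."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_dirs, a.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit projection directions
    proj_a = np.sort(a @ dirs.T, axis=0)                  # sorted 1D projections
    proj_b = np.sort(b @ dirs.T, axis=0)
    # For equal-size samples, the 1D EMD is the mean absolute difference of
    # the sorted projections; average it over all directions.
    return float(np.mean(np.abs(proj_a - proj_b)))
```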

Fig. 3. Schematic of the SWDmulti computation. Starting from a base image (left), several levels are recursively created via a Laplacian pyramid procedure. Random neighborhoods (small black squares) are then selected at each level, allowing estimation of the EMD on these neighborhoods for each level with the SWD algorithm. Appendix B details this procedure.

A lower bound for these distances is set by their estimate from the AROME-EPS dataset to itself. While the theoretical distance is 0, estimating it through finite batches from the AROME-EPS dataset yields a positive result. This distance is estimated by a bootstrap procedure. We select several batches of 16 384 samples (with no replacement within one batch but with replacement from one batch to the other) and then compute the EMD estimates between pairs of these batches. The average value of the pair-EMD series is kept as the distance estimate (see Table 1). The 16 384 samples represent nearly 25% of the full dataset; we thus deem that this procedure correctly represents the diversity of the dataset. The values obtained correspond to lower bounds for EMDs, as they reflect the internal variability of the dataset and account for finite sampling effects. If an EMD estimate reaches the lower-bound values, the GAN dataset can be said to be completely indistinguishable from the AROME-EPS dataset with regard to this estimate.
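
A sketch of this bootstrap floor estimation is shown below (the emd_estimate callable stands for any of the estimators above; names and defaults are assumptions):

```python
import numpy as np


def bootstrap_floor(dataset: np.ndarray, emd_estimate, n_pairs: int = 32,
                    batch: int = 16384, seed: int = 0) -> float:
    """Average EMD between pairs of batches drawn from the same dataset, used
    as a lower bound reflecting internal variability and finite sampling."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_pairs):
        idx_a = rng.choice(len(dataset), size=batch, replace=False)
        idx_b = rng.choice(len(dataset), size=batch, replace=False)
        vals.append(emd_estimate(dataset[idx_a], dataset[idx_b]))
    return float(np.mean(vals))
```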

Table 1. Estimates of the EMD from the AROME-EPS dataset to itself, computed from 32 independent selections of pairs of 16 384-sample batches. Shown is the average/standard deviation over the series of tested pairs.

b. Power spectral density error

The EMD estimates are completed with the average power spectral density (PSD) spectrogram obtained from AROME-EPS and GAN samples for each variable. A root-mean-square error (RMSE) is taken on the difference of spectrograms (logarithmic scale) to give each scale the same weight. It is measured in decibels and reads as follows:
$$\mathrm{PSD}_{\mathrm{err}} = \sqrt{\left\langle \left(10\log\left[\mathrm{PSD}_{\mathrm{GAN}}\right] - 10\log\left[\mathrm{PSD}_{\mathrm{AROME}}\right]\right)^{2}\right\rangle} \simeq 10\,\sqrt{\left\langle \left(\frac{\mathrm{PSD}_{\mathrm{GAN}}}{\mathrm{PSD}_{\mathrm{AROME}}} - 1.0\right)^{2}\right\rangle},$$
where the angle brackets denote the average over spectral scales. This measures the deviation of the GAN from the AROME spectral distribution of energy. The PSDerr metric is frequently used to assess the realism of GAN predictions (Leinonen et al. 2021; Ravuri et al. 2021; Harris et al. 2022), where errors of a few decibels are usually considered to indicate good quality. Spectrograms are computed with the discrete cosine transform (Denis et al. 2002) to avoid aliasing effects due to the nonperiodicity of our samples. While this metric provides a sound evaluation of sample quality, it does not provide a complete view of the organization of two-dimensional fields or of multiscale interactions. Notably, Gaussian noise fields can be tuned to recover the exact 2D spectrum of any other, fixed 2D field (Bruna and Mallat 2013). On the other hand, evaluating the diversity of samples produced by the GAN requires metrics that can evaluate the proximity between probability distributions, rather than between individual samples.
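
A hedged sketch of this metric is given below; a simple radially averaged DCT power spectrum is used here in place of the exact spectrogram construction of Denis et al. (2002), so the numbers it produces are only indicative:

```python
import numpy as np
from scipy.fft import dctn


def radial_psd(fields: np.ndarray) -> np.ndarray:
    """Average radially binned DCT power spectrum of (n_samples, H, W) fields."""
    spectra = []
    for f in fields:
        power = dctn(f, norm="ortho") ** 2
        h, w = power.shape
        ky, kx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        k = np.hypot(ky, kx).astype(int)                    # integer radial wavenumber
        counts = np.bincount(k.ravel())
        radial = np.bincount(k.ravel(), weights=power.ravel()) / np.maximum(counts, 1)
        spectra.append(radial[: min(h, w)])
    return np.mean(spectra, axis=0)


def psd_error_db(gan: np.ndarray, arome: np.ndarray) -> float:
    """RMSE between the two average spectra, in decibels."""
    p_gan, p_arome = radial_psd(gan), radial_psd(arome)
    return float(np.sqrt(np.mean((10 * np.log10(p_gan) - 10 * np.log10(p_arome)) ** 2)))
```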

Going further requires us to consider supplementary diagnostics in the evaluation procedure, which are presented in the remainder of this section.

c. Correlation lengths

The PSD metric has a known drawback: as a completely nonlocal metric, it cannot provide insight into the spatial distribution of the field's variability. In other words, it is not sensitive to complex, hierarchical textures. Accounting for local structures and cross-scale interactions is easier with metrics involving the calculation of local quantities. An example is the local correlation length scale, which can be defined as the typical length scale over which a given field is spatially correlated with itself. Large correlation lengths at a given grid point indicate that this point is often part of long-range structures. This diagnostic is common in data assimilation (Pannekoucke et al. 2008; Raynaud and Pannekoucke 2013), where it can be used to fit the error covariance matrix. Correlation lengths are also related to semivariogram scores (Olea 1994), which are commonly used to evaluate the performance of meteorological models. According to Weaver and Mirouze (2013), this length scale can be given as
$$L_{\mathrm{corr}} = \sqrt{\frac{2}{\mathrm{Tr}\left[\mathbf{g}\right]}},$$
where g is the mean local metric tensor obtained from the normalized field X, with components g_ij = E[∂iX ∂jX] for i, j in {x, y}, and Tr[⋅] represents the trace operator. The ∂xX (respectively ∂yX) notation corresponds to the spatial gradient of X in the zonal (respectively meridional) direction. These gradients are estimated from the grid data using finite differences; the expectation is then taken over the samples of a given dataset. The resulting g tensor can be evaluated at each grid point and for each variable, yielding maps of Lcorr for each variable. The trace operator makes Lcorr an isotropic quantity; further manipulation of the g tensor can provide insights about local anisotropy (Pannekoucke et al. 2008), but we chose to keep the simpler Lcorr quantity as an indicator of structure sizes.
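
A minimal sketch of this computation is given below, assuming fields already normalized to unit variance and using the reconstruction of Lcorr given above (the grid-spacing parameter dx and the function name are assumptions):

```python
import numpy as np


def correlation_length(fields: np.ndarray, dx: float = 1.0) -> np.ndarray:
    """Map of correlation length scales from (n_samples, H, W) normalized fields.
    `dx` is the grid spacing; the result is in the same units as dx."""
    gx = np.diff(fields, axis=2)[:, :-1, :] / dx   # zonal finite-difference gradient
    gy = np.diff(fields, axis=1)[:, :, :-1] / dx   # meridional finite-difference gradient
    trace_g = np.mean(gx ** 2 + gy ** 2, axis=0)   # Tr[g] at each grid point
    return np.sqrt(2.0 / np.maximum(trace_g, 1e-12))
```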

d. Second-order scattering coefficients

Here, we present scattering coefficient approaches (Andén and Mallat 2014; Bruna and Mallat 2013; Cheng et al. 2020) and show how they can be used to extract useful information from our meteorological dataset. Scattering coefficients are derived from recursive wavelet filtering of the fields. They have already been applied to meteorological information classification with satisfactory results, for example, in Garcia et al. (2015). The calculation of first- and second-order scattering coefficients is detailed in appendix B. Let λ1 and λ2 denote scales with λ1 < λ2. The scales considered can go from two grid points (∼2.5 km) to 64 grid points (∼160 km), and wavelet filters ψ for first- and second-order coefficients have different orientations θ1, θ2.

The convolution of a field X with ψλ1,θ1, followed by a modulus, yields a set of first-order scattering maps M1 and the resulting first-order coefficients:
$$M_1(\lambda_1, \theta_1) = \left| X \ast \psi_{\lambda_1, \theta_1} \right|, \qquad S_1(\lambda_1) = \left\langle M_1(\lambda_1, \theta_1) \right\rangle,$$
where the angle brackets denote spatial averaging and the dependence on θ1 is omitted.

First-order scattering coefficients S1(λ1) relate to the intensity of signal energy at scale λ1. As indicated by Cheng et al. (2020), they play a similar role to spectral decomposition.

Convolving again with ψλ2,θ2 and taking modulus yields second-order maps and coefficients:
$$M_2(\lambda_1, \lambda_2, \theta_1, \theta_2) = \left| M_1(\lambda_1, \theta_1) \ast \psi_{\lambda_2, \theta_2} \right|, \qquad S_2(\lambda_1, \lambda_2) = \left\langle M_2(\lambda_1, \lambda_2, \theta_1, \theta_2) \right\rangle.$$
Second-order coefficients S2(λ1, λ2) with λ2 > λ1 measure the energy at λ2 of a signal which has already been filtered to show variations of scale λ1 only. This accounts for the organization of patterns of typical scale λ1 on a larger scale λ2. They give an average effect of multiscale interactions.

Figure 4 shows the chain of transforms: starting from a full AROME image, successive scattering steps are applied. Regions with sharp wind gradients are highlighted by the first wavelet pass at scale λ1 = 2 pixels. The second-order pass enhances clusters of such regions at scale λ2 = 4 pixels. Such clusters contribute significantly to the S2 coefficient (e.g., the top-right and center-left green frames in Fig. 4). Regions whose sharp variations are organized at larger scales (central green frame in Fig. 4) contribute to a lesser extent, while quasi-uniform regions are smoothed out and make only a negligible contribution (bottom-left white frames in Fig. 4).

Fig. 4. Transformation chain for scattering coefficients. (left) The base image represents wind speed for an AROME-EPS field. (center) A first-order, small-scale scattering map, and (right) a second-order, medium-scale scattering map. Regions of interest are framed: regions having an important S2 contribution are in green, and a region with a negligible S2 contribution is in white. Spatial subsampling is due to the wavelet decomposition algorithm (Andreux et al. 2018).

Scattering coefficients can be used to measure the sparsity of a signal. A sparse signal will concentrate small-scale variations on localized points and exhibit large-scale organization of these local variations. Cheng and Ménard (2021) introduced the following quantity as a sparsity estimator:
$$s_{21}(\lambda_1, \lambda_2) = \left\langle \frac{S_2(\lambda_1, \lambda_2)}{S_1(\lambda_1)} \right\rangle_{\theta_1, \theta_2}.$$
A high value of s21 for a given scale pair λ1, λ2 indicates that a significant quantity of signal information lies in the second-order coefficient: the organization of the λ1 map is important to account for the image's global structure, and the field is rather sparse. On the contrary, a field exhibiting a uniform λ1 map would have a low s21 for λ2 > λ1 (since only a few regions would exhibit variations at scale λ2). Typically, Gaussian noise maps are not sparse: most of the information they contain can be extracted from their spectrum or, almost equivalently, from their S1 coefficients. In appendix B, we show that a Gaussian noise field with the exact spectrum of AROME-EPS samples exhibits lower s21 than the AROME-EPS samples do.

Averaging over orientation reduces the number of coefficients at the expense of orientation-related information. To obtain information about field anisotropy, Cheng and Ménard (2021) proposed a "shape" estimator:
$$s_{22}(\lambda_1, \lambda_2) = \left\langle \frac{S_2(\lambda_1, \lambda_2)\big|_{\theta_1 = \theta_2}}{\left\langle S_2(\lambda_1, \lambda_2) \right\rangle_{\theta_1 \neq \theta_2}} \right\rangle_{\theta_1}.$$
The s22 estimator helps determine in which directions multiscale interaction is most likely to happen, regardless of the initial orientation of patterns (represented by θ1). They interpret a large s22 as a marker for filaments (with θ2 = θ1 directions producing higher S2 values), while a lower s22 indicates the presence of more rounded shapes (with θ1 ≠ θ2 directions producing higher values). Again, this measure goes beyond radially averaged spectra, as a Gaussian field with the spectrum of AROME-EPS shows lower s22.

Both the s21 and s22 estimators provide a set of coefficients (one for each λ1, λ2 pair with λ1 < λ2). One can estimate the discrepancy between the GAN and AROME-EPS data for both the s21 and s22 distributions as a measure of how well atmospheric structures are reproduced. The sets of average s21 and s22 can be used to calculate RMSE distances between AROME-EPS and the GAN, yielding two new metrics per variable. These distances are evaluated during training with 16 384-sample batches and serve as quality estimators, similar to PSD errors.
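
As a rough sketch, and assuming the spatially averaged scattering coefficients have already been computed and stored as arrays S1[λ1, θ1] and S2[λ1, λ2, θ1, θ2] (this array layout is an assumption; any wavelet scattering implementation could supply these coefficients), the estimators and their RMSE distance can be obtained as follows:

```python
import numpy as np


def s21(S1: np.ndarray, S2: np.ndarray) -> np.ndarray:
    """Sparsity estimator <S2/S1> averaged over both orientations.
    S1 has shape (L1, T); S2 has shape (L1, L2, T, T). Returns (L1, L2)."""
    ratio = S2 / S1[:, None, :, None]          # broadcast S1 over (lambda2, theta2)
    return ratio.mean(axis=(2, 3))


def s22(S2: np.ndarray) -> np.ndarray:
    """Shape estimator: parallel-orientation S2 divided by the mean S2 over
    non-parallel orientation pairs, averaged over theta1. Returns (L1, L2)."""
    n_l1, n_l2, n_theta, _ = S2.shape
    diag = np.einsum("abtt->abt", S2)                      # theta1 == theta2
    off_mask = ~np.eye(n_theta, dtype=bool)
    off = S2[:, :, off_mask].reshape(n_l1, n_l2, n_theta, n_theta - 1).mean(axis=-1)
    return (diag / off).mean(axis=-1)


def scattering_rmse(est_gan: np.ndarray, est_arome: np.ndarray) -> float:
    """RMSE between the average estimators of the two datasets."""
    return float(np.sqrt(np.mean((est_gan - est_arome) ** 2)))
```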

e. Complementary metrics

Cross-variable correlations will be assessed using bidimensional histograms similar to those used by Gagne et al. (2020). For a given pair of variables and a given dataset, these histograms provide the empirical density function for the values taken by the pair of variables. Ideally, the densities outputted by the GAN should overlap the densities extracted from the AROME-EPS. This is a graphical illustration of the precision-recall metrics used, for example, in Kynkäänniemi et al. (2019). Among others, this enables the identification of systematic biases in the GAN-produced distribution. Finally, maps of 10th and 90th percentiles and interpercentile range will be examined in order to focus on the representation of distribution tails.
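
For illustration, a minimal sketch of such a bivariate histogram with NumPy is given below (array layout and names are assumptions; the plotting and log-scale contouring are left out):

```python
import numpy as np


def bivariate_density(data: np.ndarray, var_a: int, var_b: int, bins: int = 100):
    """Empirical 2D density of a variable pair over all pixels of all samples.
    `data` has shape (n_samples, n_vars, H, W)."""
    a = data[:, var_a].ravel()
    b = data[:, var_b].ravel()
    hist, a_edges, b_edges = np.histogram2d(a, b, bins=bins, density=True)
    return hist, a_edges, b_edges


# The AROME-EPS and GAN histograms can then be compared, e.g., by contouring
# np.log10(hist + 1e-12) for both datasets on the same axes.
```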

The whole set of metrics used is summarized in Table 2, detailing the attributes of each. As explained above, the metrics measure either diversity or quality. They make use of several types of information: pixelwise information does not take any spatial correlation into account, neighborhood information represents the aggregation of information over a limited range of pixels, and nonlocal information aggregates information over the full sample scale, either by Fourier transform or by random sampling of neighborhoods. Finally, the metrics measure different features of the signal, notably scale-by-scale information, multiscale organization, positional information (when the metric is plotted on a map), or anisotropy.

Table 2. Summary of the metrics used in the text. Check marks indicate which attributes a given metric possesses; crosses indicate the absence of such an attribute. Correlation is abbreviated as corr., err. means error, and est. indicates estimate.

f. Evaluation strategy

We chose not to separate the dataset a priori into training, validation, and testing subsets. Though similar to the methodology of Besombes et al. (2021), this is arguably not a common practice in machine learning. Nevertheless, it makes sense in our setup for the following reasons:

  • The generator is unconditional and never takes any inputs other than latent vectors. Its ability to generate good-quality, high-resolution samples and a correct distribution, once trained, only depends on the mapping it makes between the Gaussian latent distribution and the distribution of samples in the “physical space.”

  • The aim of this study is to assess what features of the data distribution the GAN is able to produce. Comparing the output distribution of the GAN to the training distribution thus avoids having to take into account the necessary distribution shift that occurs when splitting the dataset between train, test, and validation.

  • Detecting mode collapse or, more generally, the loss of diversity can be done through the combined use of EMD distances, since these metrics assign large scores to distributions with too little spread. Cross-variable correlations can also help detect biased or nonoverlapping distributions.

  • Loss of quality can be examined through average PSD error, average correlation length scales estimation, and average scattering coefficient errors, each metric having its own scope.

To compare different hyperparameter sets, we focus exclusively on EMD distances and PSD error. Once a satisfactory set is selected, we examine the samples produced by the GAN with the other metrics.

To be consistent with the lower-bound estimation procedures detailed in section 3a, each metric is applied to 16 384 random samples from the AROME-EPS dataset and to the same number of random GAN outputs. Especially for SWDmulti distances, this rather large number of samples reduces the estimator’s variance, as in Odena et al. (2017) and Karras et al. (2018).

4. Results

a. Training stability and convergence

Even with widely used regularization strategies, training a GAN requires careful tuning of parameters to ensure local convergence.

Initializing with a learning rate lr0 = 4 × 10−3 and using an exponential learning-rate decay (lr = lr0 × γ^t, with t the number of epochs and γ = 0.9) avoids mode collapse and produces realistic-looking samples for all batch sizes except the largest (512). Setting different learning rates or different decay rates γ for D and G was detrimental to quality and convergence, even from a mere visual perspective. In agreement with Mescheder et al. (2018), removing the learning-rate decay severely degraded performance, leading to severe mode collapse that forces each pixel to the global dataset average value (close to 0). Keeping γ = 0.9, we select the best-performing configuration among several batch sizes and learning rates according to estimates of W1,r/c, SWDmulti, and PSD errors, keeping the remaining metrics for posttraining evaluation. The configuration with a batch size (BS) of 32 and lr0 = 4 × 10−3 is selected, as it simultaneously has the lowest distributional distances and PSD errors overall. Details of the hyperparameter selection are given in appendix C.
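
A brief sketch of such an exponential decay using PyTorch schedulers is shown below (the optimizer construction is illustrative only; the actual training code is not reproduced here):

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import ExponentialLR

# Stand-in parameters; in practice these are the generator and discriminator weights.
params_g = [torch.nn.Parameter(torch.zeros(1))]
params_d = [torch.nn.Parameter(torch.zeros(1))]

opt_g = Adam(params_g, lr=4e-3)
opt_d = Adam(params_d, lr=4e-3)
sched_g = ExponentialLR(opt_g, gamma=0.9)   # lr = lr0 * gamma**epoch
sched_d = ExponentialLR(opt_d, gamma=0.9)

for epoch in range(10):
    # ... run the training steps for this epoch ...
    sched_g.step()
    sched_d.step()
```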

Table 3 shows the scores obtained at the end of the run (i.e., when the loss curve plateaus). Figure 5 compares the average spectrum produced by the GAN to the AROME-EPS spectrum, showing good agreement and low error (below or around 1 dB for each variable), for all scales. Even the sharp variations of temperature spectra (mostly due to topographic variations) are correctly reconstructed despite slightly higher PSD errors. Notably, for the smallest scales, no significant drop of PSD can be observed, meaning that even large wavenumbers are given a correct amount of energy.

Table 3. Scores obtained with the different metrics by the best-performing configuration of hyperparameters: PSD errors, SWD estimates, and RMSE of scattering estimators with respect to AROME. Each configuration was run three times to account for training variability; the scores presented are the best obtained among these runs. Appendix B (especially Figs. B1 and B2) provides means to interpret the absolute values provided here.

Fig. 5. Typical GAN PSD spectrograms (red line) obtained for the best configuration (BS = 32; lr0 = 4 × 10−3), for each variable. Black dots indicate the average AROME-EPS spectrum.

One noticeable feature of Table 3 is the large difference observed from one SWD component to another: they improve as the component's scale decreases. Such behavior was observed with several hyperparameter configurations. Small-scale components such as SWD128 and SWD64 get the lowest (i.e., best) scores: small-scale distributions of local patterns are hence more correctly fitted, at least by the best configurations. On the other hand, even these configurations have noticeably higher SWD32 and SWD16 values. The floor values given in Table 1 are indeed larger for SWD16 than for the other SWD components, indicating a larger intrinsic variability of the dataset at these scales. However, the gap between the GAN scores and the AROME lower bound is larger for SWD32 and SWD16: this observation might then indicate a poorer fit of large-scale pattern diversity. This result was observed for all hyperparameter configurations (cf. appendix C). It is then reasonable to think that it is related to the network architecture or to the training setup.

In the remainder of this section, we provide an in-depth analysis of the GAN performances using the set of metrics previously introduced.

b. Validation of results

1) Visual examination

As a first step, the quality of the GAN generations can be assessed subjectively. Some samples are presented in Figs. 6 and 7 for visual comparison.

Fig. 6. A random AROME-EPS sample is shown in the left column; the right columns show random samples from the GAN. The AROME-EPS sample is provided for visual comparison only, as there is no one-to-one correspondence between the AROME-EPS samples and the GAN samples.

Fig. 7. Wind speed with superimposed wind direction. AROME-EPS and random GAN samples are shown from left to right. Arrows are regularly spaced, with length proportional to intensity. The GAN produces diverse and consistent wind situations but lacks the detailed structure of the AROME data, especially over the sea.

First, the resolution of GAN-produced samples appears to be correct when compared to AROME-EPS outputs. As can be seen especially on the temperature maps, fine-grained mountainous regions are correctly generated by the GAN, with a visible cooling with altitude. More detailed comparisons can be made with the help of the geographical features given in Fig. 1. Elongated wind structures are preferentially created over the sea, while they are more granular over land. Specific weather patterns are also present in the GAN's samples, such as a strong northerly wind running down the Rhône valley, an event locally known as mistral. Figure 7 also shows that the GAN is capable of producing consistent wind direction and speed at the highest detail level. The wind map especially confirms the ability of the GAN to generate not only events such as mistral but also a rather large diversity of wind patterns. Qualitatively speaking, our GAN is thus arguably devoid of mode collapse. What is more, appendix D shows that the GAN does not simply memorize exact samples from the dataset but, indeed, produces unseen, distinct data samples.

On the other hand, the GAN often produces blurry wind patterns over the sea, where AROME-EPS samples are significantly more structured. The clear wind fronts present in the AROME-EPS dataset also appear in the GAN's samples, but, except in the mistral case, they lack long-range consistency and the filamentary aspect of the AROME-EPS wind.

2) EMD maps

Pixelwise maps of EMD at chosen iterations indicate which regions contribute the largest divergence. Figure 8 shows a comparison between the original dataset variance for each variable and two maps of pixelwise EMD at two different steps of training. Regions with the highest variance are globally less easily learned by the GAN than their low-variance counterparts, especially at the beginning of training. The land–sea mask is clearly visible here, as can be expected from physical arguments. Indeed, wind variability over the sea is higher: it depends more on the global weather situation (e.g., the presence of fronts) and is not forced by the topography. It coincides with long-range structures with a broad range of directions and intensities. This is not the case over land, where the surface plays a major role in reducing the range of correlations. The northerly mistral path is clearly highlighted, as well as the easternmost and westernmost wind variability poles (roughly corresponding to tramontane wind episodes in the west). On the other hand, sea temperature is relatively stable because of the thermal inertia of water, while the diurnal cycle is far more pronounced over land. In particular, temperatures at mountain summits are difficult to reproduce because the cold extremes of the distribution are likely to occur there.

Fig. 8. (top) AROME-EPS per-variable variance maps (normalized between 0 and 1) and (middle),(bottom) pixelwise EMD maps for two different training steps. The EMD values (estimated with 16 384 samples) are shown on a common logarithmic scale to emphasize spatial variations and training progression. Regions of higher variance (circled in black in the top row) exhibit higher error than others at the beginning of training (middle row); lower-variance regions have lower error (green oval in the top row); this difference tends to vanish near the end of training (bottom row).

This implies that variance-related error is probably a strong learning signal for the GAN, especially at the beginning of the process. It is consistent with the observations made in the previous subsection about position-related distributions being the easiest to fit.

3) Correlation length scales

The average spectrum of our dataset is almost perfectly fitted by our GAN, but it does not take much time for a human observer to distinguish between AROME-EPS and GAN samples. To further analyze the spatial structures of AROME and GAN fields, maps of correlation lengths Lcorr are shown in Fig. 9. These maps show a correct reconstruction of length scales over land, where the location of high- and low-correlation areas is accurate and the order of magnitude of the length scales is right. On the contrary, length scales over the sea are noisy and exhibit artifacts (checkerboard patterns and border effects), showing a clear quality gap with respect to land. Note that these artifacts only appear when inspecting this specific metric; they are either difficult or impossible to spot on individual samples. As will be discussed in section 6, this is linked to positional information provided by specific land patterns. As this information vanishes over the sea, nonoptimized gradients may show up on this subdomain.

Fig. 9. Correlation length maps for (top) AROME-EPS, (middle) the GAN, and (bottom) the AROME-EPS and GAN difference, for each variable separately. Color scales (km) are different for each variable but common between AROME-EPS and the GAN.

4) Second-order scattering metrics

The s21 and s22 coefficients are plotted in Fig. 10, allowing the distributions of these estimators for GAN samples to be compared with those of AROME-EPS. At least for small scales, the AROME-EPS dataset is significantly sparser than its GAN counterpart (higher s21). Moreover, AROME-EPS presents noticeably higher s22, indicating that it contains more anisotropic, filamentary structures than the GAN samples do. The s22 "shape" estimators are better fitted by the GAN than the s21 sparsity estimators. These observations are consistent with visual inspection of the samples. While the average spectrograms are nearly indistinguishable for this run, most s21 estimators differ significantly. Indeed, the average coefficients are at least one standard deviation away from one another for small λ1. This difference weakens with larger λ1, showing that large-scale organization is better recovered by the GAN.

Fig. 10. Scattering (top) s21 and (bottom) s22 estimators for u, υ, and t2m. Scales range from λ1 = 2 grid points (5.2 km) to λ1 = 8 grid points (20.8 km), and λ2 goes up to 16 grid points (41.6 km). Dashed lines represent average quantities; shading represents ±1 standard deviation.

Both the s21 and s22 distances decrease with training, consistent with the increasing quality of GAN outputs (not shown). However, the training dynamics differ from one variable to another. While the s21 sparsity distance for t2m is higher than for u and υ, the s22 distance is lower for t2m than for the wind variables. Globally, this indicates that both s21 and s22 are reasonable estimators to describe the GAN's ability to reproduce the structures of the AROME-EPS fields.

5) Bivariate histograms

Figure 11 shows bivariate histograms of AROME-EPS and GAN samples. A first observation is that the mean and variance of all variables are adequately captured by the GAN. The GAN also, surprisingly, extrapolates beyond the AROME-EPS data, putting significant probability mass on regions closer to the dataset extremes. Meanwhile, it withdraws mass from other parts of the AROME-EPS distribution. Nevertheless, the logarithmic density scale of the histogram shows that the main modes of the distributions overlap, strengthening the assessment of correct global behavior.

Fig. 11. Bivariate histograms representing the cross-variable correlations. The axes represent the values taken by each variable (on the common, normalized scale used for the training). Contours represent the density of data points for each value, on a logarithmic scale: full, colorized contours account for the AROME-EPS distribution; grayscale contour lines account for the GAN distribution with levels identical to AROME-EPS. Histograms are computed from 16 384 samples for each dataset, each pixel of which is one data point (2.7 × 10^5 data points altogether). Note the parts where the GAN extrapolates beyond AROME-EPS (red ovals) and the parts where it does not recover AROME-EPS (green ovals).

6) Percentiles and interpercentile range

To complete the overview of generation performance, a comparison of percentiles is performed over 66 048 samples (i.e., the exact size of the dataset to avoid sampling-related bias). The quantities considered are the 90th and 10th percentiles (Q90, Q10), as well as the 10–90 interpercentile range (ΔQ). Figure 12 compares the GAN and AROME-EPS statistics. The maximum percentile error of the GAN is limited but can go up to 3–4 m s−1 for wind data and 4 K for temperature.
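
A short sketch of these per-grid-point statistics is given below (array layout and function name are assumptions):

```python
import numpy as np


def percentile_maps(fields: np.ndarray):
    """Q10, Q90, and interpercentile-range maps from (n_samples, H, W) fields."""
    q10 = np.percentile(fields, 10, axis=0)
    q90 = np.percentile(fields, 90, axis=0)
    return q10, q90, q90 - q10
```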

Fig. 12. Difference of (top),(middle) percentiles and (bottom) relative interpercentile range for each variable. Subscript A denotes AROME-EPS; subscript G denotes the GAN. Red indicates a positive bias of the GAN with respect to AROME-EPS; blue denotes a negative bias.

Some regions also show a larger interpercentile range for the GAN, mostly over land for wind and over the sea for t2m. Others show a narrower range, mostly the Rhône valley for u and the mountains for t2m. The interpercentile range of the GAN is closer to that of AROME for temperature than for the wind components, where the discrepancy can be as large as 100%. This further suggests that the GAN is probably influenced by the positional nature of the temperature data.

The average bias of the GAN over all grid points is close to zero for all variables and statistics, except for Q10 on t2m. Localized, stronger biases exist, however, and they depend on the location as well as on the variable considered. This further supports that the GAN fits the distribution of values, including relatively extreme ones, in an unbiased manner but can locally exhibit strong deviations from the AROME-EPS distribution, as shown in the cross-variable analysis.

As a partial conclusion for this section, it has been shown that the GAN achieves very good quality in terms of sample realism and diversity, power spectrum reproduction, and joint distribution recovery. Moreover, the GAN can generate thousands of samples in a matter of seconds (inference time is around 60 s for 16 384 samples), making the approach interesting in an ensemble generation framework. Finally, the set of metrics used has been shown to provide a detailed, complementary view of the GAN's capabilities and weaknesses.

5. Multivariate configurations: A comparison

The impact of multivariate generation on training now has to be assessed in order to determine whether adding variables helps the GAN identify useful correlations or just makes the task more difficult. The experiment is conducted with four different configurations (summarized in Table 4):

  1. Baseline configuration using (u, υ, t2m) as generated fields.

  2. Configuration (Config.) 1: Removing the t2m field and keeping only the generation of the (u, υ) couple.

  3. Config. 2: Removing the (u, υ) couple and keeping the generation of the t2m field.

  4. Config. 3: Adding the generation of orography to Config. 2. The constant field of orography is generated by the GAN and also taken into account by the discriminator. This is done to test whether explicitly adding information related to position adds value for the generation of temperature (which is more correlated to position than the wind).

Table 4.

Summary of the hyperparameters selected for the multivariate experiments.


For each configuration, the best hyperparameters are selected within the previously used parameter range for lr0 and BS. This selection is made by averaging the scores of three runs, using on-the-fly validation metrics (W1,r/c and SWDmulti) computed on 4096-sample batches and complemented by visual inspection of samples. This allows for fast and reliable selection of hyperparameters, which are summarized in Table 4. Once these are selected, evaluation is performed on another set of three runs for each selected configuration. This final evaluation is performed with batches of 16 384 samples, and all the previously described metrics are used to yield the most extensive evaluation possible. The results are reported in Tables 5 and 6.

Table 5.

Global score card to compare multivariate experiments. Reported scores correspond to the average best score obtained after training saturation for the three runs. For all metrics considered, lower is better. Better scores with respect to the baseline are shown in bold; worse scores are in italic. A — is used when the metric is not applicable to the configuration.

Table 6.

Mean absolute error for correlation length maps. Reported scores correspond to the best average score obtained after training saturation for the three runs. For all metrics considered, lower is better. Better scores with respect to the baseline are shown in bold; worse scores are in italic. A — is used when the metric is not applicable to the configuration.


Tables 5 and 6 show that a general effect of reducing the number of generated variables is improved performance on most metrics related to spatial consistency (PSD, correlation lengths, scattering metrics). Another interesting pattern is that the scores of Config. 3 (t2m and orography) are generally worse than those of Config. 2 (t2m only), and sometimes even worse than the baseline. Adding orography information thus seems to have a mixed effect. On the one hand, it degrades the synthesis of temperature spatial structures, as emphasized by the PSD error, correlation length error, and scattering metrics. On the other hand, W1,r/c scores, as well as the SWD16 scores, are noticeably improved when orography is added. Removing temperature and orography while keeping the wind variables has the opposite effect. Indeed, Config. 1 shows improved spectral, scattering, and correlation length metrics, while W1,r/c scores slightly degrade and SWDmulti scores dramatically degrade. This shows a lack of ability of the GAN to capture the diversity of patterns in the dataset, while improving the quality of individual samples. The large distributional discrepancy of Config. 1 even hints at some form of mode collapse.

Altogether, the quality of samples also improves when comparing Configs. 1 and 2 to the baseline (Fig. 13). This is consistent with the majority of metrics involving spatial consistency. In particular, scattering metrics show a drastic improvement when reducing the number of variables. The GAN is, therefore, much more able to identify and generate multiscale organization in the samples, albeit at the expense of pattern diversity. This points to the GAN using cross-variable correlations to improve the diversity, rather than the quality, of samples.

Fig. 13.

Comparison of random samples from the GAN when shifting the training configuration. One random AROME-EPS sample is shown for comparison. The most successful runs for Config. 1 (wind variables only) and Config. 2 (temperature only) are shown. In both configurations, most quality-related metrics improve. The organization of long-range structures is enhanced in both Config. 1 and Config. 2, with fronts more visible and showing better long-range structuring. The value scale is intentionally left free to enable a visual-only comparison.


6. Learning absolute gridpoint position: Analysis and consequences

The above experimental results give a set of observations that can be exploited to diagnose the strengths and flaws of the training design and their interaction with neural network architecture.

Let us summarize some of them:

  1. The error signal at the beginning of the training is strongly correlated to the pixelwise variance of the dataset (section 4).

  2. Large-scale EMDs are far worse than small-scale EMDs in all configurations (sections 4 and 5).

  3. Performance for correlation length scales is far better on land than over the sea (section 4).

Given that learning is performed on a fixed spatial domain, it is very likely that the main source of information for learning in our setup is the implicit encoding of absolute gridpoint position. This phenomenon is already acknowledged in the literature for convolutional networks (Alsallakh et al. 2021; Zhang 2019) and has been extensively studied in the case of GANs by Xu et al. (2020). It is usually explained by the use of padding in convolutional layers: adding rows and columns of zeros in the intermediate layers allows the network to detect the boundaries of the feature maps and, thus, implicitly infer the position of each pixel.
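This boundary effect can be illustrated with a toy experiment, independent of the architecture used here: a single zero-padded convolution applied to a spatially constant field already produces position-dependent responses near the borders, which deeper stacks can turn into an implicit coordinate encoding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)  # zero padding
x = torch.ones(1, 1, 8, 8)          # spatially constant field: no positional content
with torch.no_grad():
    y = conv(x)[0, 0]

print(y[4, 4].item(), y[0, 0].item())            # interior vs corner responses differ
print(torch.allclose(y[1:-1, 1:-1], y[4, 4]))    # True: interior is constant, borders are not
```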

The present setup goes a step further by adding variables that are more or less directly correlated to the surface state and, thus, to absolute gridpoint position. Temperature variability is mostly position related over land, as is obviously the case for orography and, more moderately, for 10-m wind. Over sea, this position-related information fades out, while transient features, such as wind fronts, are prominent: the bias is weaker and the GAN struggles to generate correct correlation structures.

This positional bias probably plays a key role by dedicating most of the networks' capacity to extracting and fitting position-related features. Hence, this analysis is a plausible explanation for another set of observations:

  1. Right from the baseline configuration (where no explicit position is given to the network), the reconstruction of temperature correlation with altitude is very accurate. In this case, learning position-related features with fine-grained spatial detail is largely helpful.

  2. Adding orography as a constant field to generate is largely detrimental to the sample quality, according to the scores used, but improves the largest scales of multiscale SWD (section 5). Positional information at large scales gives information about the overall structure of the field and is then most useful to generate the right distribution of patterns. Conversely, reinforcing this bias through orography accelerates position-based overfitting.

  3. The GAN is not able to use cross-variable correlations to improve individual sample quality but maintains more diversity (section 5): the added information of each variable in the baseline configuration is likely redundant if it is only position related. This prevents the GAN from improving further by using the less redundant, transient features that differentiate the three variables.

  4. The GAN trades off diversity for quality in the wind-only configuration (section 5). This configuration is the one where positional information has the least weight. It is, thus, probable that reducing the positional bias makes the GAN focus on the quality of transient features, which play a larger role in discrimination, while relaxing the diversity constraint and narrowing the distribution of patterns. This is likely favored by large-scale pattern detection being harder without positional bias, as evoked in point 2.

These explanations imply that the setup is prone to overfitting and that, contrary to the common assumption, increasing the batch size will degrade the performance of the GAN. We thus conduct a final analysis using different initial learning rates and batch sizes (BS ∈ {32, 64, 128, 256, 512}, lr0 ∈ {4 × 10−4, 2 × 10−3, 4 × 10−3}). Using the baseline configuration, we train the GAN from scratch for each pair of learning rate and batch size, up to loss saturation. We first observe that loss saturation occurs earlier with increasing batch size (cf. appendix C), supporting the above hypothesis. Figure 14 shows the relative degradation or improvement of metrics with respect to the BS = 32 configuration for each learning rate. Once saturated, W1,c only slightly increases with batch size at all learning rates. On the other hand, quality metrics such as the PSD and the s21 and s22 errors drastically degrade when the batch size increases (up to ×5 and ×10 degradation for PSD). SWDavg falls in between. The effect is less pronounced with diminishing learning rates, but it was observed that smaller learning rates degrade scores globally (cf. appendix C).

Fig. 14.

Relative evolution of different scores with respect to their value at the BS = 32 configuration for different, decreasing learning rates.

Reproducing pixelwise, variable-wise distributions is thus rather an “easy” mode of convergence, achievable for most GAN configurations. Increasing batch size then likely strengthens the position-learning dynamics. The GAN rapidly memorizes position-related features and more or less forgets about the transient structures, which are smoothed out by large batches.

This set of explanations is consistent with the hypothesis of absolute gridpoint position learning. It also echoes classical quality-diversity trade-offs encountered in GANs (Radford et al. 2015; Zhang et al. 2019; Brock et al. 2018) and the general perception-distortion trade-off of generative models (Blau and Michaeli 2018); in our case, the trade-off is balanced by the variables fed to the GAN and the amount of positional information they contain. Learning on a fixed domain is a crucial component of the present setup, which explains both the good quality of individual samples and gridpoint distributions, as well as the scale-dependent diversity fit.

Whether such positional bias should be learned is an open question in general. As was shown by temperature correlation with altitude, this is very helpful in weather-related tasks, where many features depend on the absolute position. The way it is added in the training process, through architectural features and the data themselves, can thus be clearly framed and controlled with a priori heuristics. As an example, the positional bias could be strongly attenuated with variables that are much less dependent on absolute position (e.g., temperature at 850 hPa). This could also be the case if the networks were trained on random domain patches while conditioning both networks through orography. In this case, the task is strictly more difficult as the diversity of the dataset increases. We performed some preliminary experiments in this randomized setup: while it seems that one, indeed, gets rid of positional bias (notably, increasing batch size improves SWDmulti and PSD errors), this remains to be detailed and confirmed in future work.
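As an illustration of that randomized setup, the sketch below crops random subdomains and appends orography as a conditioning channel for both networks; the patch size, tensor layout, and field names are assumptions made for illustration, not the configuration used in our preliminary experiments.

```python
import torch

def random_patch(fields, orography, patch=64):
    """Crop a random subdomain and stack orography as a conditioning channel.

    fields: tensor (C, H, W) of prognostic variables (e.g., u, v, t2m).
    orography: tensor (H, W), static over the dataset.
    Returns a (C + 1, patch, patch) tensor; the orography channel can be used
    to condition both the generator and the discriminator on the crop location.
    """
    _, H, W = fields.shape
    i = torch.randint(0, H - patch + 1, (1,)).item()
    j = torch.randint(0, W - patch + 1, (1,)).item()
    crop = fields[:, i:i + patch, j:j + patch]
    oro = orography[None, i:i + patch, j:j + patch]
    return torch.cat([crop, oro], dim=0)
```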

It is also possible that more sophisticated architectures such as ProGAN (Karras et al. 2018) and StyleGAN (Karras et al. 2019, 2020), which explicitly handle scale-dependent pattern generation and disentangle features, perform notably better on the same setup.

7. Conclusions and perspectives

In this paper, meaningful metrics have been developed and applied to assess the ability of a GAN to emulate outputs of the kilometer-scale AROME-EPS weather forecasting system. From the above evaluation, the following conclusions can be drawn:

  1. Multiple metrics, borrowed either from weather science or computer vision, are necessary to diagnose the ability of a GAN to consistently generate weather states. Namely, going beyond a mere spectral analysis to describe the quality of generated samples proved useful. Multiscale SWD was successful in characterizing the diversity evolution with scale; scattering coefficients were used to assess the consistency of structures, while the local correlation length scales enabled a positionwise analysis of correlation reconstruction.

  2. A residual GAN architecture is clearly able to generate multivariate distributions of NWP model outputs at kilometer scale. In particular, it can reproduce detailed textures as well as long-range events with good diversity. State-of-the-art regularization techniques and networks are necessary for this task, and their training parameters must be carefully set to avoid divergence.

  3. A study of multivariate generation was performed to probe the effects of adding and removing variables to the training setup. An important phenomenon happening in our GAN was characterized: the positional bias, induced both by padding and by surface variables such as 2-m temperature, is a prominent driver of the learning process. This is a double-edged sword: it enables fast convergence and emulation of crucial features such as temperature correlation with altitude, while it degrades the ability of the GAN to generate high-quality transient structures and accelerates the occurrence of overfitting. This seems to be part of the wider quality-diversity trade-off, whose major component in the present case is given by the specific weather variables used.

Precipitation has not been addressed in this work because of the extremely skewed nature of its distribution, specifically the overwhelming class of zero-precipitation days. Resampling techniques (Sha et al. 2020; Ravuri et al. 2021) would arguably tackle this point, but the relatively small size of the database discouraged us from going further in this direction on a first attempt. This is a natural path for future work.

Another promising path is the generation of states based on the current weather situation. This framework has been used by many downscaling studies (Leinonen et al. 2021; Harris et al. 2022) that take low-resolution data as inputs to generate ensembles of high-resolution outputs. One could then ask whether a GAN framework could be used to increase the size of operational ensemble forecasts at a minimal computing cost. Open challenges would then be the precise way to condition the GAN with ensemble outputs at the same resolution, as well as the ability to control GAN-produced outliers. We believe the results shown in this study are encouraging enough to go further in this direction.

Finally, a largely unexplored path is the production of temporal sequences of forecasts at the lead times usually covered by the operational NWP models (up to 48–72 h). While it remains open whether the GAN framework is adapted to such a task, this would be a necessary step in order to use data-driven, high-resolution, real-time ensemble emulation.

Acknowledgments.

We deeply thank Léa Berthomier and Bruno Pradel for their thorough technical support with Météo-France’s computing infrastructure. We thank Camille Besombes, Olivier Pannekoucke, Ronan Fablet, Arnaud Mounier, and Thomas Rieutord for insightful discussions. This work was performed as part of the ANR Project 21-CE46-007 “Probabilistic Prediction of Extreme Weather Events Based on AI/Physics Synergy (POESY)” led by one of the authors (L.R.). One of the authors (C.B.) performed this work during his Ph.D. program, which was funded by the French Ministère de la Transition Ecologique as part of the Ph.D. program of the Ingénieurs des Ponts, Eaux et Forêts Civil Corps.

Data availability statement.

The code used to train networks and analyze data through all metrics can be found at https://github.com/flyIchtus/multivariate-GAN. The AROME-EPS dataset used in this study will be made available at the end of the Agence Nationale de la Recherche (ANR) project.

APPENDIX A

Implementation Details

The implementation of our GAN makes use of several techniques acknowledged to either facilitate convergence or accelerate computing. Here are reported the ones that were helpful. The code is written with PyTorch (Paszke et al. 2019), using the multi-GPU Horovod API (Sergeev and Balso 2018).

  1. The residual blocks we use follow the usual guidelines of the literature (Miyato et al. 2018; Besombes et al. 2021; Ravuri et al. 2021). The main block consists of two stacked 3 × 3 convolutions, each followed by LeakyReLU and BatchNorm, with a bilinear upscaling or downscaling layer. Either a 1 × 1 convolution or a direct sum is used as the residual shortcut.

  2. We use automatic mixed precision (AMP), casting most operations to half precision. This leads to a dramatic acceleration of training and slashes memory consumption, keeping all runs shorter than 12 h and leaving room for later development of architectures. Unfortunately, this also comes with stability issues: some of the runs produced not-a-number (NaN) values at their very beginning, with specific hyperparameter configurations being completely hampered by AMP while running smoothly in single precision.

  3. It was found that the discriminator's gradients, for the failed cases, oscillated violently at the beginning of training. We then introduced a small warm-up procedure whereby the discriminator is updated several times for one update of the generator. Choosing an update ratio of 5, as in Gulrajani et al. (2017), for the first generator step only, considerably reduced the oscillations of gradients and made training stable for most of the 180 runs conducted for this study.

  4. Initialization of the neural network weights has been shown to be an important factor for training convergence (Bengio and Glorot 2010). Here we keep the default random initialization for all linear layers while using orthogonal initialization for all convolutional layers (a minimal sketch is given after this list). Besides generally having a beneficial impact on training (Saxe et al. 2014), this naturally helps the spectral normalization regularization we adopted by starting with already spectrally normalized (random) weights.
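The following minimal sketch illustrates item 4, applying orthogonal initialization to convolutional layers only; how the networks themselves are built is left abstract.

```python
import torch.nn as nn

def init_weights(module):
    """Orthogonal init for convolutions; linear layers keep PyTorch's default init.

    Orthogonal weight matrices have unit singular values, so convolutional layers
    start out already compatible with spectral normalization.
    """
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.orthogonal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# generator.apply(init_weights)
# discriminator.apply(init_weights)
```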

APPENDIX B

Detailed Description of Metrics

a. Pixelwise Earth-mover distance

For a distribution with only one degree of freedom, the Wasserstein (Earth-mover) distance amounts to comparing cumulative distribution functions through the integral
W_1(p, q) = \int \left| F_p(x) - F_q(x) \right| dx.
Being given two series of N sorted samples S_p and S_q drawn from p and q, respectively, computing this integral comes down to averaging the difference of sorted values:
W_1 \approx \frac{1}{N} \sum_{i} \left| S_{p,i} - S_{q,i} \right|.
The computing complexity is essentially due to the sort [O(N log N)]. The absolute value of W1 depends on the unit of the data, hence, on the normalization process. To give a view of what absolute values mean in our case, Fig. B1 presents two pixels with different W1 distances and gives a correspondence to the shape of GAN and AROME-EPS distributions. Poor distributional fit (misplaced density, poor reconstruction of bimodal data) is characterized by a rather high W1 value, while a good fit (correct spread and tails, bimodality captured) shows lower W1. This can be compared to the averaged W1,r/c obtained by the best-performing configuration (around 12 × 10−3–13 × 10−3).
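A minimal sketch of this sorted-sample estimator is given below, assuming the samples at each grid point are available as equal-size NumPy arrays; it illustrates the formula above rather than reproducing the exact evaluation code.

```python
import numpy as np

def wasserstein1_1d(samples_p, samples_q):
    """One-dimensional Earth-mover distance between two equal-size sample sets."""
    sp = np.sort(samples_p)
    sq = np.sort(samples_q)
    return np.mean(np.abs(sp - sq))

# Per-gridpoint W1 between AROME-EPS and GAN samples of shape (N, H, W):
# w1_map = np.array([[wasserstein1_1d(arome[:, i, j], gan[:, i, j])
#                     for j in range(arome.shape[2])]
#                    for i in range(arome.shape[1])])
```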
Fig. B1.

Correspondence between Wasserstein distance estimation and distributional fit for two grid points in the best-performing configuration. (left) Grid point with poor fit; (right) correctly fitted grid point. Each distribution is estimated from 16 384 samples.

b. Sliced Wasserstein distance

Here, we draw extensively from the work of Rabin et al. (2011) and Karras et al. (2018). The Wasserstein distance is known to be an informative metric for distributions (Arjovsky et al. 2017), yet it is computationally intractable and exhibits large variance in high-dimensional spaces (Ramdas et al. 2015). A Monte Carlo approximation of this distance, the sliced Wasserstein distance (SWD), is defined in Rabin et al. (2011) and is an unbiased and robust way to cope with the burden of W1 estimation in high-dimensional spaces (Kolouri et al. 2018).

For multilevel estimation, we first decompose the original field into a Laplacian pyramid, from finest to coarsest scales. The process then consists of selecting a batch of random neighborhoods of 7 × 7 pixels and normalizing each variable of the samples with respect to batch and spatial mean and standard deviation. The distributions of neighborhoods from the GAN and from AROME-EPS samples are finally compared with the help of SWD. Since neighborhoods include several pixels, they have several degrees of freedom; thus, SWD is a way to estimate the multidimensional EMD on these neighborhoods. We use the parameters of Karras et al. (2018) without modification, with 512 unit directions for SWD and 128 random neighborhoods for each level.
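The core of the estimator can be sketched as follows for two sets of flattened patch descriptors (the Laplacian pyramid decomposition and neighborhood extraction are omitted); the number of projection directions is the one quoted above, and the array names are illustrative.

```python
import numpy as np

def sliced_wasserstein(desc_p, desc_q, n_dirs=512, rng=None):
    """SWD between two equal-size sets of descriptors of shape (N, D).

    Each random unit direction reduces the problem to a 1D Earth-mover distance,
    estimated from sorted projections; the slices are then averaged.
    """
    rng = np.random.default_rng(rng)
    dims = desc_p.shape[1]
    dirs = rng.standard_normal((n_dirs, dims))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj_p = np.sort(desc_p @ dirs.T, axis=0)   # shape (N, n_dirs)
    proj_q = np.sort(desc_q @ dirs.T, axis=0)
    return np.mean(np.abs(proj_p - proj_q))
```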

c. Scattering coefficients

Obtaining scattering coefficients consists of successive convolutions with wavelet filter banks $\{\psi_\lambda\}_{\lambda \in \{2^0, \ldots, 2^J\}}$. The λ index corresponds to the discrete scale of the filter bank, that is, the number of pixels in the filter's bandwidth. The largest scale probed, denoted by index J, typically corresponds to a half of the spatial extent of the field. To treat two-dimensional fields, angular dependency is added to the family of wavelets ψ, so λ indexes both scale and direction θ within the discrete set $\{0, \pi/8, \ldots, 7\pi/8\}$: $\lambda = (2^j, \theta)$. We consider the common family of complex Morlet wavelets, satisfying stability and invertibility constraints (Mallat 2012), and we use the Python package Kymatio developed by Andreux et al. (2018).

The convolution of a field X with this wavelet family yields a set of first-order scattering maps $M_1(\lambda_1)$:
M_1(\lambda_1) = \left| X \star \psi_{\lambda_1} \right|.
The convolution on a given λ1 identifies structures of typical length scale λ1. The modulus then provides robustness of the transform to local deformations: slightly deformed patterns at the scale of λ1 produce close values of $|X \star \psi_{\lambda_1}|$. Combining convolution and modulus yields first-order scattering images. These images themselves exhibit specific structures that vary on several (larger) scales: second-order maps can also be extracted at scale λ2 > λ1:
M_2(\lambda_1, \lambda_2) = \left| \, |X \star \psi_{\lambda_1}| \star \psi_{\lambda_2} \right|.
These second-order maps represent the organization of λ1 structures at the scale λ2. Maps can then be spatially averaged to produce global, translation invariant coefficients. Namely,
S_1(\lambda_1) = \langle M_1(\lambda_1) \rangle_{\mathrm{space}}, \qquad S_2(\lambda_1, \lambda_2) = \langle M_2(\lambda_1, \lambda_2) \rangle_{\mathrm{space}},
where $\langle \cdot \rangle_{\mathrm{space}}$ denotes spatial averaging. As emphasized by Cheng et al. (2020), S1 coefficients are similar to spectral density, as they decompose the signal scale by scale and then average over the spatial dimension. Second-order scattering coefficients probe the organization of each scale.
As a reminder, the summary statistics we use are drawn from Cheng and Ménard (2021). They compare the amount of information stored in different scattering coefficients. Signal sparsity is probed through an orientation-averaged comparison between second- and first-order coefficients:
s_{21}(\lambda_1, \lambda_2) = \left\langle \frac{S_2(\lambda_1, \lambda_2)}{S_1(\lambda_1)} \right\rangle_{\theta_1, \theta_2},
while distinction between roundish and filamentary shapes (accounting for anisotropy) is better probed with a ratio of colinear versus orthogonal orientations for second-order coefficients:
s_{22}(\lambda_1, \lambda_2) = \left\langle \frac{\langle S_2(\lambda_1, \lambda_2) \rangle_{\theta_1 = \theta_2}}{\langle S_2(\lambda_1, \lambda_2) \rangle_{\theta_1 \perp \theta_2}} \right\rangle_{\theta_1}.
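As an illustration, the scattering coefficients can be obtained with Kymatio as sketched below; the scale and orientation counts are illustrative, and the grouping of the output channels into the s21 and s22 estimators of Cheng and Ménard (2021) is omitted here.

```python
import torch
from kymatio.torch import Scattering2D

# J dyadic scales and L orientations on 128 x 128 fields, up to second order
scattering = Scattering2D(J=4, shape=(128, 128), L=8, max_order=2)

x = torch.randn(16, 128, 128)        # placeholder batch of normalized fields
Sx = scattering(x)                   # shape (16, n_coeffs, 128 // 2**4, 128 // 2**4)
coeffs = Sx.mean(dim=(-2, -1))       # spatial averaging -> global S1 and S2 coefficients
```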
To illustrate our claims, we take a subset of 256 AROME samples of wind speed, and we generate Gaussian noise fields from the exact same spectrum. The created Gaussian maps have little multiscale organization, and AROME samples are thus sparser than their Gaussian counterparts. Figure B2 shows that while spectra are well aligned, there is a significant difference in the field’s structure, as shown by the s21 coefficient discrepancy. AROME is also slightly less isotropic, as shown by its higher s22 for the smallest λ1. This correlates well with the visual examination of samples.
Fig. B2.

Comparison of Gaussian noise and AROME wind speed with respect to (top) scattering estimators s21 and s22 and (bottom) to power spectral density. Samples are shown to provide visual assessment. Dashed lines correspond to average quantities, while shaded areas correspond to ± standard deviation. The scales shown range from 2 to 16 grid points.

To provide estimates of absolute values for the RMSE on scattering coefficients and for SWDmulti, Table B1 summarizes the scores obtained by the Gaussian field maps and by the baseline configuration. Note that since the samples are normalized before applying the RMSE, all variables share a common scale of values. This table shows that the GAN performs substantially better than a Gaussian field in any configuration and that a Gaussian field does not recover the correct distribution of patterns as measured by SWDmulti.

Table B1.

Comparison between scores against AROME-EPS dataset for Gaussian field and the baseline GAN configuration. Scattering estimators have been averaged over all three variables for baseline configuration.


APPENDIX C

Hyperparameter Search

Experiments were carried out on five different batch sizes (BS ∈ {32, 64, 128, 256, 512}) and three different initial learning rates (lr0 ∈ {4 × 10−4, 2 × 10−3, 4 × 10−3}). Each configuration was run three times to account for training variability.

For all configurations, the discriminator loss curves exhibit a deep trough followed by a slower ascent, up to a value below 2.0, after which the loss plateaus; meanwhile, the generator loss produces bumps after an abrupt decrease, before oscillating around 0.0 when the D loss reaches its maximum level. Figure C1 shows some examples of this behavior. The converged regime corresponds to situations where D is unable to separate the AROME-EPS and the GAN samples: it is likely to indicate convergence of the algorithm to a local minimum. Using BS ≥ 256 leads to rapid saturation of the losses, indicating early stagnation of learning. Reducing the learning rate attenuates this effect and lengthens the ascent phase. The BS = 32 experiment did not reach the plateau for any of the learning rates but the highest, indicating that batch size and learning rate both control the learning speed. Another direct effect of increasing the batch size is the reduction of loss oscillations.

Fig. C1.

Typical patterns observed in training. The network losses (blue: discriminator; red: generator) are represented along with the number of steps for different batch sizes and initial learning rates. Discriminator losses tend to converge near 2.0 (blue lines), while generator losses oscillate around 0.0 (red lines).

At the point where the D loss reaches saturation, training is completed for most runs, as our control metrics often saturate as well (not shown). In some cases, some components of SWDmulti tend to slightly increase after the plateau, indicating possible overfitting. A score card for all the tested hyperparameter configurations can then be drawn. For each configuration and each metric, the best (i.e., lowest) score obtained after D loss saturation is selected when saturation occurs; otherwise, the best score overall is used. Results are reported in Table C1.

Table C1.

Scores obtained by each configuration for our metrics panel. Reported scores correspond to the best score obtained after training saturation for the three runs and are given in the order average/best/worst. For all metrics considered, lower is better. For a given configuration and a given run, all “best scores after saturation” do not necessarily correspond to the same step for all metrics. Overall the best scores are in bold; the worst scores are in italic.


Trying to select the best-performing configuration from this table, one can rule out the BS = 512 configurations, mainly because of high PSD errors. This corresponds to a very low visual quality of samples (blurry images with checkerboard artifacts). Small batch sizes seem to perform better than the others, especially with respect to the EMD and SWD metrics, even if they do not always perform best on PSD errors. Moreover, the score spread for each configuration is rather wide, and it is not rare for different configurations to produce scores in overlapping ranges. Altogether, lr0 = 4 × 10−3 seems to perform better than any other, with the BS = 32 configuration scoring best on a wide range of metrics.

Only the most successful configurations of Table C1 are kept to examine their EMD and scattering scores and check their ranking. These observations are summarized in Table C2. The ranking slightly changes: while increasing batch size degrades the scores similarly to the previous experiments, an intermediate learning rate of 2 × 10−3 produces the best scores obtained. All experiments that are not reported in Table C2 show worse scores than the ones shown.

Table C2.

Scattering RMSE estimators for each variable, averaged over the three runs. Best results are in bold. Only the most successful configurations from Table C1 are shown. The score ranking previously obtained with PSD and SWD slightly changes with this metric.


APPENDIX D

Does the GAN Copy the Dataset?

An important consistency check consists of verifying that the GAN does not memorize the training samples and is able to generate sufficiently different samples. To examine this aspect, we use the mean square error (MSE) and look for the pair of GAN and AROME-EPS samples with the lowest global (i.e., including all variables and grid points) MSE distance across the GAN-generated and AROME-EPS datasets. Such samples are shown in Fig. D1 and exhibit a noticeable visual difference. The distribution of the global MSE distance of this specific GAN sample to the whole AROME-EPS dataset is also plotted. As can be seen in Fig. D2, this distribution peaks at about 0.1, which is approximately the variance of the normalized AROME-EPS dataset. The MSE-nearest GAN sample is thus clearly distinct from any AROME sample while being at a consistent average MSE distance from the whole dataset, further confirming the absence of mode collapse.
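A minimal sketch of this memorization check is given below, assuming both datasets fit in memory as NumPy arrays of shape (N, C, H, W); it is a straightforward, not memory-optimized, illustration with illustrative array names.

```python
import numpy as np

def nearest_mse_pair(gan, arome):
    """Return (minimum MSE, GAN index, AROME-EPS index) over all sample pairs.

    gan, arome: arrays of shape (N, C, H, W); the MSE is global, i.e., averaged
    over all variables and grid points of a pair of samples.
    """
    g = gan.reshape(gan.shape[0], -1)
    a = arome.reshape(arome.shape[0], -1)
    best = (np.inf, -1, -1)
    for i, sample in enumerate(g):
        mse = ((a - sample) ** 2).mean(axis=1)  # MSE to every AROME-EPS sample
        j = int(np.argmin(mse))
        if mse[j] < best[0]:
            best = (float(mse[j]), i, j)
    return best

# mse_min, i_gan, j_arome = nearest_mse_pair(gan_samples, arome_samples)
# The distribution shown in Fig. D2 is the vector of MSE distances from that
# GAN sample to all AROME-EPS samples.
```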

Fig. D1.

(top) Comparing the MSE-nearest samples in the GAN and AROME-EPS datasets. These samples clearly differ from each other. (bottom) The pixelwise absolute distance is compared to the pixelwise standard deviation of the AROME-EPS sample. The global MSE related to this sample is reported at the bottom.

Fig. D2.

Visualizing the distribution of MSE distance from AROME-EPS to the GAN sample of Fig. D1. The red dot denotes the distance to the nearest AROME-EPS sample. The distribution is broad, and its peak is of the order of magnitude of the normalized dataset variance.

REFERENCES

• Alsallakh, B., N. Kokhlikyan, V. Miglani, J. Yuan, and O. Reblitz-Richardson, 2021: Mind the pad—CNNs can develop blind spots. ICLR 2021: Int. Conf. on Learning Representations, Vienna, Austria, ICLR, https://openreview.net/pdf?id=m1CD7tPubNy.

• Andén, J., and S. Mallat, 2014: Deep scattering spectrum. IEEE Trans. Signal Process., 62, 4114–4128, https://doi.org/10.1109/TSP.2014.2326991.

• Andreux, M., and Coauthors, 2018: Kymatio: Scattering transforms in Python. arXiv, 1812.11214v3, https://doi.org/10.48550/ARXIV.1812.11214.

• Arjovsky, M., S. Chintala, and L. Bottou, 2017: Wasserstein generative adversarial networks. Proc. 34th Int. Conf. on Machine Learning, Sydney, Australia, PMLR, 214–223, https://proceedings.mlr.press/v70/arjovsky17a.html.

• Bengio, Y., and X. Glorot, 2010: Understanding the difficulty of training deep feed forward neural networks. Proc. 13th Int. Conf. on Artificial Intelligence and Statistics, Sardinia, Italy, PMLR, 249–256, https://proceedings.mlr.press/v9/glorot10a.html.

• Besombes, C., O. Pannekoucke, C. Lapeyre, B. Sanderson, and O. Thual, 2021: Producing realistic climate data with generative adversarial networks. Nonlinear Processes Geophys., 28, 347–370, https://doi.org/10.5194/npg-28-347-2021.

• Bhatia, S., A. Jain, and B. Hooi, 2021: ExGAN: Adversarial generation of extreme samples. arXiv, 2009.08454v3, https://doi.org/10.48550/arXiv.2009.08454.

• Bihlo, A., 2020: A generative adversarial network approach to (ensemble) weather prediction. arXiv, 2006.07718v1, https://doi.org/10.48550/arXiv.2006.07718.

• Blau, Y., and T. Michaeli, 2018: The perception-distortion tradeoff. 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Salt Lake City, UT, IEEE, 6228–6237, https://doi.org/10.1109/CVPR.2018.00652.

• Bouttier, F., B. Vié, O. Nuissier, and L. Raynaud, 2012: Impact of stochastic physics in a convection-permitting ensemble. Mon. Wea. Rev., 140, 3706–3721, https://doi.org/10.1175/MWR-D-12-00031.1.

• Bouttier, F., L. Raynaud, O. Nuissier, and B. Ménétrier, 2016: Sensitivity of the AROME ensemble to initial and surface perturbations during HyMeX. Quart. J. Roy. Meteor. Soc., 142, 390–403, https://doi.org/10.1002/qj.2622.

• Brock, A., J. Donahue, and K. Simonyan, 2018: Large scale GAN training for high fidelity natural image synthesis. arXiv, 1809.11096v2, https://doi.org/10.48550/arXiv.1809.11096.

• Brousseau, P., L. Berre, F. Bouttier, and G. Desroziers, 2011: Background-error covariances for a convective-scale data-assimilation system: Arome–France 3D-Var. Quart. J. Roy. Meteor. Soc., 137, 409–422, https://doi.org/10.1002/qj.750.

• Bruna, J., and S. Mallat, 2013: Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell., 35, 1872–1886, https://doi.org/10.1109/TPAMI.2012.230.

• Cheng, S., and B. Ménard, 2021: How to quantify fields or textures? A guide to the scattering transform. arXiv, 2112.01288v1, https://doi.org/10.48550/ARXIV.2112.01288.

• Cheng, S., Y.-S. Ting, B. Ménard, and J. Bruna, 2020: A new approach to observational cosmology using the scattering transform. Mon. Not. Roy. Astron. Soc., 499, 5902–5914, https://doi.org/10.1093/mnras/staa3165.

• Denis, B., J. Côté, and R. Laprise, 2002: Spectral decomposition of two-dimensional atmospheric fields on limited-area domains using the discrete cosine transform (DCT). Mon. Wea. Rev., 130, 1812–1829, https://doi.org/10.1175/1520-0493(2002)130<1812:SDOTDA>2.0.CO;2.

• Descamps, L., C. Labadie, A. Joly, E. Bazile, P. Arbogast, and P. Cébron, 2015: PEARP, the Météo-France short-range ensemble prediction system. Quart. J. Roy. Meteor. Soc., 141, 1671–1685, https://doi.org/10.1002/qj.2469.

• Dumoulin, V., I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, 2016: Adversarially learned inference. arXiv, 1606.00704v3, https://doi.org/10.48550/ARXIV.1606.00704.

• Ebert, E. E., 2008: Fuzzy verification of high-resolution gridded forecasts: A review and proposed framework. Meteor. Appl., 15, 51–64, https://doi.org/10.1002/met.25.

• Gagne, D. J., II, H. M. Christensen, A. C. Subramanian, and A. H. Monahan, 2020: Machine learning for stochastic parameterization: Generative adversarial networks in the Lorenz '96 model. J. Adv. Model. Earth Syst., 12, e2019MS001896, https://doi.org/10.1029/2019MS001896.

• Garcia, G. B., M. Lagrange, I. Emmanuel, and H. Andrieu, 2015: Classification of rainfall radar images using the scattering transform. 2015 23rd European Signal Processing Conf. (EUSIPCO), Nice, France, IEEE, 1940–1944, https://doi.org/10.1109/EUSIPCO.2015.7362722.

• Goodfellow, I. J., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, 2014: Generative adversarial networks. arXiv, 1406.2661v1, https://doi.org/10.48550/ARXIV.1406.2661.

• Gulrajani, I., F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, 2017: Improved training of Wasserstein GANs. arXiv, 1704.00028v3, https://doi.org/10.48550/arXiv.1704.00028.

• Harris, L., A. T. T. McRae, M. Chantry, P. D. Dueben, and T. N. Palmer, 2022: A generative deep learning approach to stochastic downscaling of precipitation forecasts. arXiv, 2204.02028v2, https://doi.org/10.48550/ARXIV.2204.02028.

• Karras, T., T. Aila, S. Laine, and J. Lehtinen, 2018: Progressive growing of GANs for improved quality, stability, and variation. arXiv, 1710.10196v3, https://doi.org/10.48550/arXiv.1710.10196.

• Karras, T., S. Laine, and T. Aila, 2019: A style-based generator architecture for generative adversarial networks. arXiv, 1812.04948v3, https://doi.org/10.48550/arXiv.1812.04948.

• Karras, T., S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, 2020: Analyzing and improving the image quality of StyleGAN. arXiv, 1912.04958v2, https://doi.org/10.48550/arXiv.1912.04958.

• Kingma, D. P., and M. Welling, 2014: Auto-encoding variational Bayes. arXiv, 1312.6114v11, https://doi.org/10.48550/arXiv.1312.6114.

• Kingma, D. P., and J. Ba, 2015: Adam: A method for stochastic optimization. arXiv, 1412.6980v9, https://doi.org/10.48550/arXiv.1412.6980.

• Kingma, D. P., and M. Welling, 2019: An introduction to variational autoencoders. Found. Trends Mach. Learn., 12, 307–392.

• Kolouri, S., P. E. Pope, C. E. Martin, and G. K. Rohde, 2018: Sliced-Wasserstein autoencoder: An embarrassingly simple generative model. arXiv, 1804.01947v3, https://doi.org/10.48550/arXiv.1804.01947.

• Kynkäänniemi, T., T. Karras, S. Laine, J. Lehtinen, and T. Aila, 2019: Improved precision and recall metric for assessing generative models. arXiv, 1904.06991v3, https://doi.org/10.48550/arXiv.1904.06991.

• Leinonen, J., D. Nerini, and A. Berne, 2021: Stochastic super-resolution for downscaling time-evolving atmospheric fields with a generative adversarial network. IEEE Trans. Geosci. Remote Sens., 59, 7211–7223, https://doi.org/10.1109/TGRS.2020.3032790.

• Lim, J. H., and J. C. Ye, 2017: Geometric GAN. arXiv, 1705.02894v2, https://doi.org/10.48550/arXiv.1705.02894.

• Mallat, S., 2012: Group invariant scattering. Commun. Pure Appl. Math., 65, 1331–1398, https://doi.org/10.1002/cpa.21413.

• Marin, I., S. Gotovac, M. Russo, and D. Božić-Štulić, 2021: The effect of latent space dimension on the quality of synthesized human face images. J. Commun. Software Syst., 17, 124–133, https://doi.org/10.24138/jcomss-2021-0035.

• Mescheder, L., A. Geiger, and S. Nowozin, 2018: Which training methods for GANs do actually converge? arXiv, 1801.04406v4, https://doi.org/10.48550/ARXIV.1801.04406.

• Miyato, T., T. Kataoka, M. Koyama, and Y. Yoshida, 2018: Spectral normalization for generative adversarial networks. arXiv, 1802.05957v1, https://doi.org/10.48550/arXiv.1802.05957.

• Montmerle, T., Y. Michel, E. Arbogast, B. Ménétrier, and P. Brousseau, 2018: A 3D ensemble variational data assimilation scheme for the limited-area AROME model: Formulation and preliminary results. Quart. J. Roy. Meteor. Soc., 144, 2196–2215, https://doi.org/10.1002/qj.3334.

• Mustafa, M., D. Bard, W. Bhimji, Z. Lukić, R. Al-Rfou, and J. M. Kratochvil, 2019: CosmoGAN: Creating high-fidelity weak lensing convergence maps using generative adversarial networks. Comput. Astrophys. Cosmol., 6, 1, https://doi.org/10.1186/s40668-019-0029-9.

• Odena, A., C. Olah, and J. Shlens, 2017: Conditional image synthesis with auxiliary classifier GANs. arXiv, 1610.09585v4, https://doi.org/10.48550/arXiv.1610.09585.

• Olea, R. A., 1994: Fundamentals of semivariogram estimation, modeling, and usage. Stochastic Modeling and Geostatistics: Principles, Methods, and Case Studies, J. M. Yarus and R. L. Chambers, Eds., American Association of Petroleum Geologists, 27–36, https://doi.org/10.1306/CA3590.

• Pannekoucke, O., L. Berre, and G. Desroziers, 2008: Background-error correlation length-scale estimates and their sampling statistics. Quart. J. Roy. Meteor. Soc., 134, 497–508, https://doi.org/10.1002/qj.212.

• Pantillon, F., P. Knippertz, and U. Corsmeier, 2017: Revisiting the synoptic-scale predictability of severe European winter storms using ECMWF ensemble reforecasts. Nat. Hazards Earth Syst. Sci., 17, 1795–1810, https://doi.org/10.5194/nhess-17-1795-2017.

• Paszke, A., and Coauthors, 2019: PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, Vol. 32, H. Wallach et al., Eds., Curran Associates Inc., 8024–8035.

• Ponzano, M., B. Joly, L. Descamps, and P. Arbogast, 2020: Systematic error analysis of heavy-precipitation-event prediction using a 30-year hindcast dataset. Nat. Hazards Earth Syst. Sci., 20, 1369–1389, https://doi.org/10.5194/nhess-20-1369-2020.

• Rabin, J., G. Peyré, J. Delon, and B. Marc, 2011: Wasserstein barycenter and its application to texture mixing. Scale Space and Variational Methods in Computer Vision (SSVM'11), Springer, 435–446.

• Radford, A., L. Metz, and S. Chintala, 2015: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv, 1511.06434v2, https://doi.org/10.48550/arXiv.1511.06434.

• Ramdas, A., N. Garcia, and M. Cuturi, 2015: On Wasserstein two sample testing and related families of nonparametric tests. arXiv, 1509.02237v2, https://doi.org/10.48550/ARXIV.1509.02237.

• Ravuri, S., and Coauthors, 2021: Skilful precipitation nowcasting using deep generative models of radar. Nature, 597, 672–677, https://doi.org/10.1038/s41586-021-03854-z.

• Raynaud, L., and O. Pannekoucke, 2013: Sampling properties and spatial filtering of ensemble background-error length-scales. Quart. J. Roy. Meteor. Soc., 139, 784–794, https://doi.org/10.1002/qj.1999.

• Raynaud, L., and F. Bouttier, 2016: Comparison of initial perturbation methods for ensemble prediction at convective scale. Quart. J. Roy. Meteor. Soc., 142, 854–866, https://doi.org/10.1002/qj.2686.

• Roberts, N. M., and H. W. Lean, 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Wea. Rev., 136, 78–97, https://doi.org/10.1175/2007MWR2123.1.

• Rubner, Y., C. Tomasi, and L. J. Guibas, 2004: The Earth mover's distance as a metric for image retrieval. Int. J. Comput. Vis., 40, 99–121, https://doi.org/10.1023/A:1026543900054.

• Saxe, A. M., J. L. McClelland, and S. Ganguli, 2014: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv, 1312.6120v3, https://doi.org/10.48550/arXiv.1312.6120.

• Seity, Y., P. Brousseau, S. Malardel, G. Hello, P. Bénard, F. Bouttier, C. Lac, and V. Masson, 2011: The AROME-France convective-scale operational model. Mon. Wea. Rev., 139, 976–991, https://doi.org/10.1175/2010MWR3425.1.

• Sergeev, A., and M. D. Balso, 2018: Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv, 1802.05799v3, https://doi.org/10.48550/arXiv.1802.05799.

• Sha, Y., D. J. Gagne II, G. West, and R. Stull, 2020: Deep-learning-based gridded downscaling of surface meteorological variables in complex terrain. Part II: Daily precipitation. J. Appl. Meteor. Climatol., 59, 2075–2092, https://doi.org/10.1175/JAMC-D-20-0058.1.

• Vincendon, B., V. Ducrocq, O. Nuissier, and B. Vié, 2011: Perturbation of convection-permitting NWP forecasts for flash-flood ensemble forecasting. Nat. Hazards Earth Syst. Sci., 11, 1529–1544, https://doi.org/10.5194/nhess-11-1529-2011.

• Weaver, A. T., and I. Mirouze, 2013: On the diffusion equation and its application to isotropic and anisotropic correlation modelling in variational assimilation. Quart. J. Roy. Meteor. Soc., 139, 242–260, https://doi.org/10.1002/qj.1955.

• Xu, R., X. Wang, K. Chen, B. Zhou, and C. C. Loy, 2020: Positional encoding as spatial inductive bias in GANs. arXiv, 2012.05217v1, https://doi.org/10.48550/ARXIV.2012.05217.

• Zhang, H., I. Goodfellow, D. Metaxas, and A. Odena, 2019: Self-attention generative adversarial networks. arXiv, 1805.08318v2, https://doi.org/10.48550/arXiv.1805.08318.

• Zhang, R., 2019: Making convolutional networks shift-invariant again. Proc. 36th Int. Conf. on Machine Learning, Long Beach, CA, PMLR, 7324–7334, http://proceedings.mlr.press/v97/zhang19a.html.
