    Fig. 5.

    An example of visible artifacts introduced by a generator architecture that is based on residual blocks. Clouds in (left) the input image are still partially visible in the clear-sky regions of (right) the output image because the residual transformation learned by the generator does not fully cancel their appearance.

    Fig. 6.

    Examples of artifacts introduced by the upsampling method used in the generator. (left) Nearest-neighbor upsampling often produced axis-aligned cloud patterns, while (right) bilinear upsampling produced bloblike cloud shapes with a smooth boundary. Using transposed convolutions in the decoder stages avoided both problems.

    Fig. 7.

    Examples of generated images that (left) were judged to look realistic and (right) contain obvious artifacts such as repeating cloud patterns.

    Fig. 8.

    (top) A sequence of images taken by the Cevio camera between 1000 and 1600 UTC 6 Feb 2020, and (bottom) the corresponding images generated by G(I0, z|w0, wt). This visualization satisfies our four evaluation criteria. The generated images look realistic and are free of artifacts. They match the real images of the future w.r.t. atmospheric, ground, and illumination conditions. The transition from observation to forecast is seamless: I0 (at top left) is well approximated by I^0 (at bottom left), and I^1 retains the persisting weather conditions of I0. Finally, there is good visual continuity, because the progression of shadows appears natural.

    Fig. 9.

    The cloud cover amount in (left) the generated image is too large, compared to (right) the observed conditions at Cevio on 1200 UTC 25 Apr 2020. But the mismatch in cloud cover is not a failure of the visualization method, as the COSMO-1 forecast predicts a 100% cloud area fraction in the medium troposphere.

    Fig. 10.

    (top) A sequence of images taken by the Etziken camera between 1400 and 2000 UTC 6 Feb 2020 and (bottom) the corresponding images generated by G(I0, z|w0, wt). The generator was able to fully transform I0 (at top left) from daylight to nighttime conditions, including the visual appearance of illuminated street lamps and windows at t = 4 h.

    Fig. 11.

    (top) A sequence of images taken by the Cevio camera between 1000 and 1600 UTC 17 Jun 2020 and (bottom) the corresponding images generated by G(I0, z|w0, wt). The generator synthesizes the correct scenery (the Pizzo Paràula mountain in the center of the image) that is occluded at t = 0 and becomes visible at t = 5 h.

    Fig. 12.

    (top) The same sequence of images as in the top row of Fig. 2, taken by the Flüela camera between 1000 and 1600 UTC 2 Jul 2020. (bottom) Because $\hat{I}_t = G(I_0, z \mid w_0, w_t)$ is a transformation of $I_0$, the exact position and shape of snow patches are retained in $\hat{I}_1, \hat{I}_2, \ldots, \hat{I}_6$.

    Fig. 13.

    (top) A sequence of images taken by the Flüela camera between 0600 and 1200 UTC 12 Jan 2020. (bottom) The generator accurately transformed the illumination conditions and shadows, but it failed to learn how the sun moves across the sky. Instead of shifting the position of the sun, the sun stays in the same position and fades away gradually.

    Fig. 14.

    (top) A sequence of images taken by the Cevio camera between 1000 and 1600 UTC 17 Feb 2020. (bottom) The generated images accurately match the ground and illumination conditions, and the development of cloud shapes appears natural. But the positions of individual clouds remain too static over time.

    Fig. 15.

    The effect of the noise variance on the diversity of visualizations synthesized by the generator. Increasing $\sigma$ when sampling $z_i \sim \mathcal{N}(0, \sigma^2)$ leads to a greater diversity of images that are deemed consistent with the weather forecast. But note that the realized diversity is also a function of $t$: it is smallest at t = 0, and greatest when there is a change of weather conditions at t = 3 h. (top) A sequence of images taken by the Cevio camera between 0600 and 1200 UTC 16 Mar 2020. (bottom rows) Nowcasting visualizations generated with σ = (0, 0.2, 0.5, 1.0).

Photographic Visualization of Weather Forecasts with Generative Adversarial Networks

Christian Sigg (a), Flavia Cavallaro (b), Tobias Günther (c), and Martin R. Oswald (d, e)

(a) Federal Office of Meteorology and Climatology MeteoSwiss, Zurich, Switzerland
(b) Comerge, Zurich, Switzerland
(c) Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
(d) ETH Zürich, Zurich, Switzerland
(e) University of Amsterdam, Amsterdam, Netherlands

Abstract

Outdoor webcam images jointly visualize many aspects of the past and present weather, and since they are also easy to interpret, they are consulted by meteorologists and the general public alike. Weather forecasts, in contrast, are communicated as text, pictograms, or charts, each focusing on separate aspects of the future weather. We therefore introduce a method that uses photographic images to also visualize weather forecasts. This is a challenging task because photographic visualizations of weather forecasts should look real and match the predicted weather conditions, the transition from observation to forecast should be seamless, and there should be visual continuity between images for consecutive lead times. We use conditional generative adversarial networks to synthesize such visualizations. The generator network, conditioned on the analysis and the forecasting state of the numerical weather prediction (NWP) model, transforms the present camera image into the future. The discriminator network judges whether a given image is the real image of the future, or whether it has been synthesized. Training the two networks against each other results in a visualization method that scores well on all four evaluation criteria. We present results for three camera sites across Switzerland that differ in climatology and terrain. We show that even experts struggle to distinguish real from generated images, achieving only a 59% accuracy. The generated images match the atmospheric, ground, and illumination conditions visible in the true future images in 67% up to 99% of cases. Nowcasting sequences of generated images achieve a seamless transition from observation to forecast and attain good visual continuity.

© 2023 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Christian Sigg, christian.sigg@meteoswiss.ch

1. Introduction

Outdoor web camera (webcam) images visualize past and present weather conditions and are consulted by meteorologists and the general public alike, for example, in aviation weather nowcasting or the planning of a leisure outdoor activity. A single image jointly visualizes multiple aspects of the past and present weather: it simultaneously provides information about atmospheric, ground, and illumination conditions, such as cloud cover, visibility, precipitation, snow cover, or sunshine. And even though the information density of this visualization medium is high, the information is presented in a manner that is intuitive and easily accessible. For these reasons, MeteoSwiss currently operates 40 cameras on measurement sites of its surface network and provides the images to the public through its web page¹ and smartphone app (see Fig. 1). Rega (the Swiss air rescue service) operates a similar camera network for aviation weather forecasting and flight route planning. Private companies offer cameras as a service to communities and the tourism industry, operating hundreds of outdoor web cameras all across Switzerland.

Fig. 1.

Cropped screenshots of the MeteoSwiss smartphone app, showing different ways to seamlessly combine measurements and forecasts in the same visualization. (left) Air temperature and precipitation at a location of interest, visualized as line and bar charts. Pictograms provide additional summaries of cloud cover, sunshine, and precipitation forecasts. (center) Animation of radar maps. By pushing the time slider into the future, the visualization provides a seamless transition from measurement to forecast. (right) Time-lapse video of past and present images taken by the MeteoSwiss web camera in the vicinity. We envision an analogous transition from observed to synthesized images by pushing the time slider into the future.

Citation: Artificial Intelligence for the Earth Systems 2, 1; 10.1175/AIES-D-22-0028.1

However, these resources have not yet been utilized for forecast visualization. Instead, weather forecasts are communicated as text, numbers, pictograms, or charts (Fig. 1, left and center). Each of these communication media focuses on separate aspects of the future weather (e.g., air temperature and precipitation amount in the chart of Fig. 1, left). We therefore introduce a novel visualization method for weather forecasts that synthesizes future images of an outdoor web camera. As with the animation of radar maps (Fig. 1, center), we imagine a time slider that can be pushed beyond the present into the future (Fig. 1, right), to provide a seamless transition from observation to forecast.

Photographic visualizations of weather forecasts could have multiple applications. Meteorological services could use them to communicate localized forecasts over their own webcam feeds, smartphone apps, and other distribution channels. They could also provide a service to communities and tourism organizations for creating forecast visualizations that are specific for their webcam feeds.

The introduction is structured as follows. In section 1a, we propose four evaluation criteria that a successful photographic visualization should satisfy. In section 1b, we introduce two baseline methods that use analog image retrieval for photographic visualization and discuss their fundamental limitations with respect to the proposed criteria. These limitations motivate our use of image synthesis (section 1c) instead of image retrieval. Section 1d introduces generative adversarial networks (GANs) as our method of choice for image synthesis, reviews related work on GAN design, and explains the specific choices that we made so that the method performs well on the proposed evaluation criteria. Finally, in section 1e we discuss other related work that has made use of GANs for meteorological applications.

a. Evaluation criteria

Forecast visualizations must satisfy several criteria to achieve their purpose. We propose the following four to evaluate the quality of photographic visualizations of weather forecasts:

  1. Realism: The images should look real and be free of obvious artifacts. Ideally, it should not be possible to tell whether a given image was taken by an actual camera, or if it was synthesized by a visualization method.

  2. Matching future conditions: The images should match the future atmospheric, ground, and illumination conditions in the view of the camera. However, matching every pixel of a future observed image is not possible, as the forecast does not uniquely determine the positions and shapes of clouds, for example.

  3. Seamless transition: The visualization method should achieve a seamless transition from observation to forecast. It should reproduce the present image and must retain the present weather conditions as long as they persist into the future.

  4. Visual continuity: The method should attain visual continuity between images for consecutive lead times. For example, ground and illumination conditions should not show unnatural changes between images.

b. Analog retrieval

For a fixed camera view, such visualizations could be created using analog retrieval from an annotated image database. Given an archive of past images from the camera, annotated with the weather conditions that were present at the time the picture was taken, the forecast could be visualized by retrieving the image that most closely matches the predicted conditions.

Analogs can be retrieved as individual images or sequences of images. Using per-image analog retrieval, $\hat{I}_t^{\mathrm{ind}}$ is the individual image from the archive where the associated weather descriptor $\hat{w}_t$ is most similar to the forecast $w_t$ for lead time $t$ (e.g., as measured by Euclidean distance). As can be seen in the middle row of Fig. 2, using per-image analogs prioritizes matching the visible future weather conditions (our second evaluation criterion), but sacrifices the visual continuity between consecutive visualizations (our fourth criterion), with ground and illumination conditions changing abruptly between images.

Fig. 2.

(top) A sequence of images $I_0, \ldots, I_6$, taken by the Flüela camera in the Swiss Alps between 1000 and 1600 UTC 2 Jul 2020. (middle) A nowcasting visualization created using analog retrieval of individual images from an annotated archive (see section 3 for a description of the data). Here, $\hat{I}_t^{\mathrm{ind}}$ is the individual image from the archive where the associated weather descriptor $\hat{w}_t$ has the smallest Euclidean distance to the forecast $w_t$. (bottom) A visualization created using analog retrieval of image sequences. $\hat{I}_0^{\mathrm{seq}}, \ldots, \hat{I}_6^{\mathrm{seq}}$ is the image sequence from the archive where the associated weather descriptors, concatenated as the vector $(\hat{w}_0, \ldots, \hat{w}_6)$, have the smallest Euclidean distance to $(w_0, \ldots, w_6)$.


Using sequence analog retrieval, $\hat{I}_0^{\mathrm{seq}}, \ldots, \hat{I}_6^{\mathrm{seq}}$ is the image sequence where the associated weather descriptors $(\hat{w}_0, \ldots, \hat{w}_6)$ have the smallest distance to $(w_0, \ldots, w_6)$. As can be seen in the bottom row of Fig. 2, this method satisfies the fourth criterion by construction. However, scoring well on the second criterion would require the archive to contain full sequences of images (and not just individual ones) that match the forecasted conditions. Consequently, sequence analogs poorly match the future atmospheric, ground, and illumination conditions, even if the archive contains a full year of past data (see Table 4 for our results).
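The two retrieval strategies can be illustrated with a minimal sketch. It is not the implementation used for Fig. 2: the function names, the two-element weather descriptors, and the toy archive are hypothetical, and the archive is assumed to be a time-ordered list of descriptors at consecutive lead-time steps.

```python
import math


def retrieve_image_analog(archive, forecast):
    """Index of the archived descriptor with the smallest Euclidean distance to the forecast."""
    return min(range(len(archive)), key=lambda i: math.dist(archive[i], forecast))


def retrieve_sequence_analog(archive, forecasts):
    """Start index of the archived sequence closest to the concatenated forecast sequence."""
    n = len(forecasts)
    target = [x for w in forecasts for x in w]  # concatenate the forecast descriptors

    def distance(start):
        candidate = [x for w in archive[start:start + n] for x in w]
        return math.dist(candidate, target)

    return min(range(len(archive) - n + 1), key=distance)


# Hypothetical descriptors, e.g. (cloud area fraction, air temperature).
archive = [(0.0, 10.0), (1.0, 12.0), (0.0, 20.0), (0.5, 11.0), (0.9, 15.0)]
forecasts = [(0.9, 12.5), (0.4, 11.0)]

per_image = [retrieve_image_analog(archive, w) for w in forecasts]
sequence_start = retrieve_sequence_analog(archive, forecasts)
```

In this toy archive, the per-image analogs are entries 1 and 3, which match each forecast well but are not consecutive in time. The best sequence starts at entry 0, which is temporally coherent but matches the individual forecasts less closely.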

Both analog methods satisfy the first criterion by construction, since the images were taken by the actual camera, but they score poorly on the third criterion (seamless transition from observation to forecast). Because webcam images jointly visualize many aspects of the visible weather, it is unlikely in practice that the archive contains an image that exactly matches all of them, including the specific position and shape of clouds, ground conditions, illumination, etc. We found that one year of archive data was clearly insufficient (see section 4c for results). One would therefore have to collect multiple years of archive data to increase this likelihood, and wait that long before the method could be deployed. This severely limits the practical use of analog methods for our application.

c. Image synthesis

We therefore use image synthesis instead of analog image retrieval, which can be formalized as a regression problem in the following way. Given the forecast $w_t$ for lead time $t$, the generator $G_0: w_t \mapsto \hat{I}_t$ synthesizes a corresponding image $\hat{I}_t$. This image should closely match the real future image $I_t$, that is, the dissimilarity measured by a suitable loss function $L(\hat{I}_t, I_t)$ should be small. Here, $G_0(w; \theta)$ is a neural network with parameters $\theta$, and the optimal parameters $\theta^*$ are found by minimizing the expected loss

$$\theta^* = \arg\min_{\theta} \, \mathbb{E}_{w_t, I_t}\bigl[ L\bigl( G_0(w_t; \theta), I_t \bigr) \bigr], \tag{1}$$

over pairs $(w_t, I_t)$ of weather forecasts and the corresponding real camera images, where $\mathbb{E}_{w_t, I_t}[\cdot]$ is the expected value w.r.t. the joint distribution of forecasts and real images. If the expected loss is small, then $G_0$ scores well on our second evaluation criterion.

The choice of $L$ is not obvious, however. Common regression losses, such as the $L_2$ or $L_1$ distance of pixel intensities, are not suitable for our application, since it is not possible to predict the exact location and shape of clouds from the weather forecast. Thus, a pixel-by-pixel equivalence of $\hat{I}_t$ and $I_t$ should not be sought. In fact, using a pixelwise loss function results in generated images that show a mostly uniform sky (see Fig. 3), unless $G_0$ is trained until it overfits and reproduces examples from the training data. Instead, $L$ should measure how well $\hat{I}_t$ matches the overall atmospheric, ground, and illumination conditions of $I_t$. That is, a human examiner should not be able to tell whether $\hat{I}_t$ or $I_t$ is the true camera image of the future, even though they will not be identical.
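A toy example may help to see why a pixelwise loss favors a uniform sky. The sketch below is an illustration under simplifying assumptions, not an experiment from this paper: the "image" is a single row of eight sky pixels, every sample shares the same forecast, and each sample contains one cloud pixel at a different position. For a fixed forecast, the prediction that minimizes the expected $L_1$ loss is the pixelwise median of the samples, which here contains no cloud at all.

```python
from statistics import median

WIDTH = 8  # hypothetical number of sky pixels in the toy image


def sky_with_cloud(position, width=WIDTH):
    """A row of clear-sky pixels (0.0) with a single cloud pixel (1.0)."""
    return [1.0 if x == position else 0.0 for x in range(width)]


def l1_loss(a, b):
    """Mean absolute difference of pixel values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Same forecast, but the cloud appears at a different position in every sample.
samples = [sky_with_cloud(p) for p in range(WIDTH)]

# The pixelwise median minimizes the expected L1 loss for a fixed forecast.
l1_optimal = [median(column) for column in zip(*samples)]


def expected_l1(prediction):
    """Average L1 loss of one prediction over all samples."""
    return sum(l1_loss(prediction, s) for s in samples) / len(samples)


loss_uniform = expected_l1(l1_optimal)    # cloud-free prediction
loss_realistic = expected_l1(samples[0])  # one realistic sample used as prediction
```

The cloud-free prediction attains a lower expected loss than any of the realistic samples, even though it is the only candidate that does not look like a cloudy sky.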

Fig. 3.

A pair of (left) generated and (right) real camera images, where G0 was trained by minimizing the expected absolute difference of pixel values (the L1 distance). While the ground and illumination conditions match quite well, the synthesized sky lacks detail because it is not possible to predict the exact location and shape of the clouds from the weather forecast.


d. GANs

It is unclear how to design such a loss function by hand. Goodfellow et al. (2014) introduced GANs as a method for learning the loss function from training data instead. Here, $G_1: z \mapsto \hat{I}$ synthesizes an image from a random input $z$, which is sampled from a suitable distribution $z \sim p(z)$, for example, a Gaussian distribution. A discriminator $D_1: I \mapsto [0, 1]$ is introduced to mimic the human expert and to estimate the probability that the examined image $I$ is a real image, instead of having been synthesized by the generator. The discriminator $D_1(I; \eta)$ is also a neural network, with parameters $\eta$. Here, $G_1$ and $D_1$ are trained jointly and in an adversarial fashion by optimizing the objective

$$\min_{\theta} \max_{\eta} \; \mathbb{E}_{I}\bigl[ \log D_1(I; \eta) \bigr] + \mathbb{E}_{z}\bigl[ \log\bigl\{ 1 - D_1[ G_1(z; \theta); \eta ] \bigr\} \bigr]. \tag{2}$$

In this minimum–maximum optimization, the generator aims to fool the discriminator, and the discriminator tries to correctly distinguish between real and generated images. The training is complete when $G_1$ generates images with a high degree of realism and $D_1$ can no longer distinguish between real and generated images, thus scoring well on our first evaluation criterion (see section 4a for our results).
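As a small numerical illustration of the adversarial objective, the following sketch estimates its value from discriminator outputs on a batch of real and generated images. The function name and the example probabilities are hypothetical, and the networks themselves are not modeled. When the discriminator can no longer tell the two apart and outputs 0.5 everywhere, the value equals $2 \log 0.5 = -\log 4$.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E_I[log D(I)] + E_z[log(1 - D(G(z)))].

    d_real: discriminator outputs in (0, 1) for real images.
    d_fake: discriminator outputs in (0, 1) for generated images.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical outputs: the discriminator separates real from generated images well.
confident = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])

# Hypothetical outputs: the discriminator is fooled and returns 0.5 for every image.
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator maximizes this value (pushing it toward zero), while the generator minimizes it by making its images harder to reject.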

To synthesize images that not only look realistic but also correspond to the weather forecast at lead time $t$ (our second evaluation criterion), $w_t$ is provided as an additional input to both the generator and the discriminator, $G_2(z \mid w_t; \theta)$ and $D_2(I \mid w_t; \eta)$, in what is called conditional adversarial training (Mirza and Osindero 2014). Section 4b (Table 3) reports how well the synthesized images match the visible future weather conditions, according to the criteria introduced there.

Our third goal is a seamless transition from the present image $I_0$ to generated future images $\hat{I}_t$. For t = 0, the generator should therefore reproduce the present image, $\hat{I}_0 = I_0$. This can be achieved by extending the generator and discriminator once more. Instead of synthesizing images from a random input $z$, $G_3(I_0, z \mid w_0, w_t)$ transforms the current image $I_0$ into $\hat{I}_t$, based on the analysis state $w_0$ and the forecast $w_t$ of the NWP model. The random input $z$ adds a random component to the transformation, enabling the generator to synthesize more than one image that is consistent with the forecast when t > 0 (see Fig. 15). The discriminator $D_3(I \mid I_0, w_0, w_t)$ is also conditioned on the full input. For the rest of the paper, we drop the subscripts and refer to $G_3$ and $D_3$ as $G$ and $D$.

GAN-based image transformation was introduced by Isola et al. (2017). The authors trained the networks using a linear combination of the regression objective in Eq. (1) and the adversarial objective in Eq. (2). The L1 norm was used as the pixelwise regression loss, and the relative importance of the regression objective was set using a tuning parameter λ. We have found that in our application, setting λ > 0 sped up the training of G for synthesizing the ground but was detrimental for synthesizing clouds. Because their positions and shapes evolve over time, encouraging a pixelwise consistency was again not helpful and resulted in a sky lacking structure (as in the pure regression setting, Fig. 3). We therefore only use an adversarial objective for the training of G and D.

When the atmosphere is relatively stable on the scale of the lead time step size, there will be only small changes from $I_t$ to $I_{t+1}$ in the position and shape of clouds. This should be reflected in the sequence of generated images $\hat{I}_0, \hat{I}_1, \hat{I}_2, \ldots$. The results in section 4d show that the generated sequences have a good degree of temporal continuity (the fourth criterion), even though we do not explicitly model the statistical dependency between consecutive lead times $t$ and $t + 1$. Recurrent GANs (Mogren 2016) or adversarial transformers (Wu et al. 2020) are two potential avenues for further work in this area.

The pair of neural networks G and D can be trained for specific or arbitrary camera views. A view-independent generator could enable novel interactive applications, such as providing on-demand forecast visualizations for users equipped with a smartphone camera and a global navigation satellite system (GNSS) receiver. However, a view-independent generator will show its limits when the present view of the scenery is partially or fully blocked by opaque clouds or fog, and wt predicts a better visibility in the future. Although G could generate natural looking scenery for the newly visible regions (Yu et al. 2018), it will not be the real scenery of this location, thus potentially confusing the user. We therefore only consider the view-dependent case in this paper, where the synthesized and real scenery match very well (see Fig. 11).

e. Related work

We are not aware of prior work that synthesizes web camera images to visualize weather forecasts. However, GANs have been, and continue to be, used as powerful tools for several related meteorological and climatological applications.

GANs have been used for the statistical downscaling and nowcasting of specific atmospheric fields, such as radar-measured precipitation (cf. Fig. 1, center) and cloud optical thickness. For example, Leinonen et al. (2021) and Price and Rasp (2022) developed stochastic superresolution GANs to generate ensembles of time-evolving high-resolution fields from low-resolution input fields. Ravuri et al. (2021) made use of adversarial training for probabilistic precipitation nowcasting, improving the forecast quality over previous deep learning approaches (e.g., Sønderby et al. 2020) by producing realistic and spatiotemporally consistent predictions, thus avoiding the problem of blurring and improving the skill on mid- to high-intensity rainfall.

GANs have also been used to transform images from one kind of visible weather to another. For example, Qu et al. (2019) used pix2pix for image dehazing, that is, improving the meteorological visibility by removing the effects of atmospheric scattering due to aerosols. Schmidt et al. (2019) developed a GAN to simulate the effects of flooding on user-provided images, visualizing the possible effects of catastrophic climate change. And Li et al. (2021) developed a cycle-consistent adversarial network (cf. Zhu et al. 2020) to transform images labeled as having either sunny, cloudy, foggy, rainy, or snowy weather. Transforming images from one distinct type of weather to another could in principle also be used for our task of forecast visualization. But describing the joint atmospheric, ground, and illumination conditions with a single discrete class label would result in a combinatorially large number of classes, rendering this approach impractical for our task. Recently, Requena-Mesa et al. (2021) proposed to synthesize satellite imagery conditioned on future weather and published a high-resolution dataset containing Sentinel-2 images, matching topography and meteorological variables. The three baseline models published by the authors do not include a GAN, but incorporating adversarial training would be a natural extension of their Channel-U-Net model that could possibly improve the forecast skill.

2. Method

Neural networks have many design degrees of freedom, such as the number of layers, the number of neurons per layer, or whether to include skip connections between layers. The choice of the loss function and the optimization algorithm are also important, especially so for training GANs. Because the generator and discriminator are trained against each other, the optimization landscape is changing with every step. Possible training failure modes are a sudden divergence of the objective (sometimes after making progress for hundreds of thousands of optimization steps), or a collapse of diversity in the generator output (Goodfellow et al. 2014). Past research therefore has focused on finding network architectures (e.g., Radford et al. 2016), optimization algorithms (e.g., Heusel et al. 2017), and regularization schemes (e.g., Miyato et al. 2018) that increase the likelihood of training success.

The following subsections will present the network architecture (section 2a), optimization algorithm (section 2b), and weight regularization scheme (section 2c) that produced the results discussed in section 4. We also briefly mention alternatives that we tried but that did not lead to consistent improvements for our application.

a. Network architecture

Both G and D are encoder–decoder networks with skip connections (see Fig. 4), similar to the U-Net of Ronneberger et al. (2015). At each stage Es of the encoder, the layer height and width are halved and the number of channels (the depth) is doubled, while the inverse is done at each decoder stage Ds. By concatenating the output of Es to the output of the corresponding Ds, long-range connections are introduced that skip the in-between stages.

Fig. 4.

The conceptual encoder–decoder architecture of the generator network $G: (I_0, w_0, w_t, z) \mapsto \hat{I}_t$. Each encoder stage $E_s$ halves the layer height and width and doubles the depth (blue dotted arrows), while each decoder stage $D_s$ performs the inverse (orange dashed arrows). The output of $E_s$ is also concatenated to the output of the corresponding $D_s$ (green dot–dashed arrows), providing additional long-range connections that skip the in-between stages. A transformation of the random input $z$ is concatenated to the output of the innermost encoder stage $E_5$. The complete architecture of $G$ has additional pre- and postprocessing blocks and two more stages than shown in this figure. A full schematic of the network (including layer shape information) is available from the companion repository (https://zenodo.org/record/6962721/files/generator.png).

Citation: Artificial Intelligence for the Earth Systems 2, 1; 10.1175/AIES-D-22-0028.1

Each encoder stage Es consists of three layers: a strided convolution (Conv) layer, followed by a batch normalization (BN) layer (Ioffe and Szegedy 2015) and a ReLU activation layer. The decoder stages have the same layer structure, except that a transposed convolution is used in the first layer.
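
As a rough illustration of how a strided convolution halves and a transposed convolution doubles the layer height and width, the standard output-size formulas can be evaluated directly. The kernel size 4, stride 2, and padding 1 used below are assumptions chosen for illustration only; the actual layer parameters are those given in the companion repository.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution along one axis (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# Image size used in this paper (section 3), read here as height x width.
height, width = 64, 128
h_enc, w_enc = conv_out(height), conv_out(width)
h_dec, w_dec = conv_transpose_out(h_enc), conv_transpose_out(w_enc)
print((h_enc, w_enc), (h_dec, w_dec))
```

With these assumed parameters, one encoder stage followed by one decoder stage restores the original height and width, which is what allows the outputs of Es and Ds to be concatenated.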

A full specification and implementation of the networks are provided in the companion repository. Here, we summarize the important architectural elements.

1) Generator

The input to G consists of the three color channels of I0 and the weather descriptors. The latter are concatenated as w = (w0, wt), and repeated and reshaped to form additional input channels, such that each channel holds the constant value of one element of w. After a Conv-BN-ReLU preprocessing block, there are five encoder stages E1, E2, …, E5 that progressively halve the input height and width and double the depth, except for the last stage, where the depth is no longer doubled to conserve graphics processing unit (GPU) memory.
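
The construction of the additional input channels can be sketched as follows. This is a minimal illustration in plain Python with nested lists in channels-first order; the function names `descriptor_channels` and `generator_input` and the two-element toy descriptors are hypothetical and not taken from the companion repository.

```python
def descriptor_channels(w0, wt, height, width):
    """Turn w = (w0, wt) into constant-valued channels of shape (len(w), height, width)."""
    w = list(w0) + list(wt)
    return [[[value] * width for _ in range(height)] for value in w]


def generator_input(image, w0, wt):
    """Stack the color channels of I0 with the descriptor channels (channels-first)."""
    height, width = len(image[0]), len(image[0][0])
    return list(image) + descriptor_channels(w0, wt, height, width)


# Toy example: a 3-channel 2 x 4 image and hypothetical 2-element descriptors.
i0 = [[[0.0] * 4 for _ in range(2)] for _ in range(3)]
x = generator_input(i0, w0=[0.25, 1.0], wt=[0.5, 0.0])
print(len(x), len(x[0]), len(x[0][0]))  # channels, height, width
```

Because every descriptor channel is spatially constant, each convolution kernel sees the same weather information at every pixel location.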

The random input consists of 100 elements zi ∼ N(0, 1), each sampled from a Gaussian distribution with zero mean and unit variance. A dense linear layer and a reshaping layer transform (z1, z2, …, z100) into a tensor with the same height and width as the innermost encoder stage and a depth of 128 channels.

The input to the innermost decoder stage D4 consists of the output of the last encoder stage E5, concatenated with the transformed random input. Five decoder stages D4, D3, …, D0 restore the original height and width, followed by a Conv-BN-ReLU postprocessing block and a final 1 × 1 convolution with a hyperbolic tangent activation function that regenerates the three color channels.
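
The shape bookkeeping of the encoder and of the innermost concatenation can be traced with a few lines of Python. The depth of 32 channels after the preprocessing block is an assumption made only for this sketch, and 64 × 128 is read as height × width; the decoder depths are not traced because they depend on details given in the companion repository.

```python
def encoder_shapes(height, width, depth, stages=5):
    """Shapes (height, width, depth) after each encoder stage E1..E5.

    Every stage halves height and width and doubles the depth, except the
    last stage, which keeps the depth unchanged.
    """
    shapes = []
    for s in range(1, stages + 1):
        height, width = height // 2, width // 2
        if s < stages:
            depth *= 2
        shapes.append((height, width, depth))
    return shapes


BASE_DEPTH = 32  # assumed depth after the preprocessing block (not from the paper)
Z_DEPTH = 128    # depth of the transformed random input (from the text)

enc = encoder_shapes(64, 128, BASE_DEPTH)
h, w, d = enc[-1]
innermost = (h, w, d + Z_DEPTH)  # E5 output concatenated with the transformed z
for name, shape in zip(["E1", "E2", "E3", "E4", "E5"], enc):
    print(name, shape)
print("input to D4:", innermost)
```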

2) Discriminator

The input to D consists of the color channels of I0 and It, and the weather descriptors w0 and wt, which are transformed into additional input channels as in G. The encoder and decoder again have five stages each, and there is a Conv-BN-ReLU preprocessing block before the first encoding stage and a postprocessing block after the last decoding stage.

The output of the discriminator has two heads to discriminate between real and generated images on the patch and on the pixel level (Schonfeld et al. 2020). The patch-level output Dp (see section 2b) is computed by combining the output channels of the last encoder stage using a 1 × 1 convolution. The pixel-level output Dij is computed by a 1 × 1 convolution of the postprocessing output.

We tried alternative network architectures based on residual blocks (which add the block input to its output; see He et al. 2015), similar to the BigGAN architecture (Brock et al. 2019). While a generator using residual blocks learned to transform the ground faster than our final generator, it also created visible artifacts when transforming cloudy skies (see Fig. 5 for an example). We conjecture that residual blocks are less suited for transformations of objects that move or change shape, because the generator has to learn a transformation that perfectly cancels them from the block input if they are not to appear in the output.

Fig. 5.

An example of visible artifacts introduced by a generator architecture that is based on residual blocks. Clouds in (left) the input image are still partially visible in the clear-sky regions of (right) the output image because the residual transformation learned by the generator does not fully cancel their appearance.

We also tried replacing the transposed convolution layers in the decoder stages with nearest-neighbor or bilinear interpolation followed by regular convolution, which was suggested by Odena et al. (2016) to avoid checkerboard artifacts. This kind of artifact was not noticeable in the output of our final generator architecture. In contrast, nearest-neighbor upsampling often produced axis-aligned cloud patterns, and bilinear upsampling often produced overly smooth clouds (see Fig. 6).
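
The qualitative difference between the two interpolation schemes can be seen in a one-dimensional toy example. The sketch below is a simplified illustration (factor-2 upsampling of a single row, with half-pixel sample positions and clamping at the borders assumed for the linear case); it is not the upsampling code used in our experiments.

```python
import math


def upsample_nearest(row, factor=2):
    """Repeat every value `factor` times (piecewise-constant, blocky result)."""
    return [v for v in row for _ in range(factor)]


def upsample_linear(row, factor=2):
    """Linear interpolation between neighboring values (smooth result)."""
    n = len(row)
    out = []
    for j in range(n * factor):
        # Map the output position to a source coordinate, clamped to the row.
        x = min(max((j + 0.5) / factor - 0.5, 0.0), n - 1.0)
        lo = int(math.floor(x))
        hi = min(lo + 1, n - 1)
        frac = x - lo
        out.append((1 - frac) * row[lo] + frac * row[hi])
    return out


edge = [0.0, 1.0]  # a sharp edge, e.g., sky next to a cloud boundary
print(upsample_nearest(edge))
print(upsample_linear(edge))
```

Nearest-neighbor upsampling keeps the edge sharp but aligned to the pixel grid, while linear interpolation spreads it over several pixels, in line with the axis-aligned and overly smooth cloud shapes described above.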

Fig. 6.

Examples of artifacts introduced by the upsampling method used in the generator. (left) Nearest-neighbor upsampling often produced axis-aligned cloud patterns, while (right) bilinear upsampling produced bloblike cloud shapes with a smooth boundary. Using transposed convolutions in the decoder stages avoided both problems.

Adding a second Conv-BN-ReLU block to each Es and Ds also did not lead to a significant improvement, while doubling the number of network weights that had to be trained.

b. Training objective and optimizer

The training objective for G and D is an extension of Eq. (2). We omit the dependency on the trainable network weights θ and η in the following equations for the sake of brevity.

The discriminator objective to be maximized consists of a sum of three components. The first two components measure how well the patch-level head Dp of the discriminator can distinguish between real images,
$$\mathbb{E}_{I_0, w_0, I_t, w_t}\Big[\sum_p \log D_p(I_t \mid I_0, w_0, w_t)\Big],$$
and generated images,
$$\mathbb{E}_{I_0, w_0, w_t}\,\mathbb{E}_z\Big[\sum_p \log\big\{1 - D_p[G(I_0, z \mid w_0, w_t) \mid \ldots]\big\}\Big].$$
The third component measures how well the pixel-level head Dij can distinguish between the real and generated pixels of a random cut-mix composite C (Yun et al. 2019):
$$\mathbb{E}_C\Big[\sum_{i,j} \big\{M_{ij} \log D_{ij}(C) + (1 - M_{ij}) \log[1 - D_{ij}(C)]\big\}\Big].$$
The cut-mix operator combines a real and a generated image into a composite image C using a randomly generated pixel mask M, where Mij = 0 if the pixel Cij comes from the generated image, and Mij = 1 otherwise; see Fig. 3 in Schonfeld et al. (2020) for an illustration. Cut-mixing augments the training data, and Mij serves as the target label for the pixel-level head Dij of the discriminator. For our application, we apply the cut-mix operator to all the input channels of the discriminator, including the channels corresponding to the weather descriptors.
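
The cut-mix composite and the pixel-level component can be sketched in plain Python. This is a simplified illustration under stated assumptions: a single-channel image, a single random rectangle that is always filled from the generated image, and the usual binary cross-entropy form with Mij as the target label. The function names are hypothetical.

```python
import math
import random


def cutmix(real, fake, rng):
    """Combine two single-channel images with a random rectangular mask M.

    M[i][j] = 1 where the composite pixel comes from the real image and
    M[i][j] = 0 where it comes from the generated image.
    """
    h, w = len(real), len(real[0])
    top, left = rng.randrange(h), rng.randrange(w)
    bottom, right = rng.randrange(top, h) + 1, rng.randrange(left, w) + 1
    mask = [[0 if top <= i < bottom and left <= j < right else 1
             for j in range(w)] for i in range(h)]
    comp = [[real[i][j] if mask[i][j] else fake[i][j]
             for j in range(w)] for i in range(h)]
    return comp, mask


def pixel_objective(d_out, mask, eps=1e-7):
    """Sum over pixels of M*log(D) + (1 - M)*log(1 - D), to be maximized by D."""
    total = 0.0
    for d_row, m_row in zip(d_out, mask):
        for d, m in zip(d_row, m_row):
            d = min(max(d, eps), 1.0 - eps)  # keep log() finite
            total += m * math.log(d) + (1 - m) * math.log(1.0 - d)
    return total


rng = random.Random(0)
real = [[1.0] * 6 for _ in range(4)]  # toy "real" image
fake = [[0.0] * 6 for _ in range(4)]  # toy "generated" image
comp, mask = cutmix(real, fake, rng)
perfect = [[float(m) for m in row] for row in mask]  # D outputs the mask exactly
unsure = [[0.5] * 6 for _ in range(4)]               # D cannot tell pixels apart
print(pixel_objective(perfect, mask), pixel_objective(unsure, mask))
```

A discriminator head that reproduces the mask attains an objective close to zero, while an uninformative head that outputs 0.5 everywhere attains log(0.5) per pixel.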
The generator objective to be minimized also consists of a sum of three components. The first two components measure how much the generator struggles to fool the discriminator on the patch level,
$$\mathbb{E}_{I_0, w_0, w_t}\,\mathbb{E}_z\Big[-\sum_p \log\big\{D_p[G(I_0, z \mid w_0, w_t) \mid \ldots]\big\}\Big],$$
and the pixel level,
$$\mathbb{E}_{I_0, w_0, w_t}\,\mathbb{E}_z\Big[-\sum_{i,j} \log\big\{D_{ij}[G(I_0, z \mid w_0, w_t) \mid \ldots]\big\}\Big].$$
The third component measures how similar two generated images look at the pixel level, given two different random inputs z1 and z2,
$$\mathbb{E}_{I_0, w_0, w_t}\,\mathbb{E}_{z_1, z_2}\Big[-\sum_{i,j,c} \big|G_{ijc}(I_0, z_1 \mid \ldots) - G_{ijc}(I_0, z_2 \mid \ldots)\big|\Big],$$
where Gijc is the intensity of channel c at pixel location (i, j). Including this component in the objective for the generator avoids the problem of mode collapse, where the generator ignores z and produces a deterministic output given (I0, w0, wt). It also encourages the generator to make use of all stages of the encoder–decoder architecture, since z is injected at the innermost stage.
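
The effect of the third component can be illustrated with a toy stand-in for the generator. In the sketch below, `toy_generator` is a hypothetical function that merely shifts all pixels of I0 by an amount proportional to a scalar z; it is not our network. The term is written with a negative sign so that minimizing it rewards outputs that differ for different random inputs.

```python
def diversity_term(img_a, img_b):
    """Negative sum of absolute differences over pixels (i, j) and channels c.

    Minimizing this term pushes the two generated images apart.
    """
    return -sum(abs(a - b)
                for ch_a, ch_b in zip(img_a, img_b)
                for row_a, row_b in zip(ch_a, ch_b)
                for a, b in zip(row_a, row_b))


def toy_generator(i0, z, gain):
    """Hypothetical stand-in for G: shifts every pixel of I0 by gain * z."""
    return [[[p + gain * z for p in row] for row in ch] for ch in i0]


i0 = [[[0.1, 0.2], [0.3, 0.4]]]  # one channel, 2 x 2 pixels
# gain = 0 mimics mode collapse: the output ignores the random input z.
collapsed = diversity_term(toy_generator(i0, 0.5, 0.0), toy_generator(i0, -1.0, 0.0))
diverse = diversity_term(toy_generator(i0, 0.5, 0.1), toy_generator(i0, -1.0, 0.1))
print(collapsed, diverse)
```

The collapsed toy generator yields a term of zero, the largest (worst) possible value, whereas a generator that responds to z yields a negative value.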

We evaluated three different optimization algorithms to train θ and η: stochastic gradient descent, RMSprop (Hinton et al. 2012), and Adam (Kingma and Ba 2017). Adam achieved the fastest improvement rate, but it could suffer from erratic spikes in the loss curve. Spectral normalization of the trainable weights (see section 2c) and small learning rates were necessary to achieve smooth training progress.

We also evaluated multiple discriminator updates per generator update, and different learning rates for the training of G and D. We found that using a learning rate for the discriminator that is 2 times that of the generator (Heusel et al. 2017) achieved the best results, and we used 5 × 10⁻⁵ as the learning rate for the generator and 1 × 10⁻⁴ for the discriminator. The other Adam hyperparameters were set to β1 = 0 and β2 = 0.9.
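
To make the role of these hyperparameters concrete, the following sketch applies one textbook Adam update to a scalar parameter. It is an illustration of the update rule with the values stated above, not our training code; the function `adam_step` and the toy gradient value are assumptions.

```python
import math


def adam_step(param, grad, state, lr, beta1=0.0, beta2=0.9, eps=1e-8):
    """One Adam update for a scalar parameter, with bias correction."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


LR_G, LR_D = 5e-5, 1e-4  # learning rates from the text (D is twice as fast as G)

theta = adam_step(1.0, grad=3.0, state={"t": 0, "m": 0.0, "v": 0.0}, lr=LR_G)
eta = adam_step(1.0, grad=3.0, state={"t": 0, "m": 0.0, "v": 0.0}, lr=LR_D)
print(1.0 - theta, 1.0 - eta)
```

With β1 = 0 the first-moment estimate equals the current gradient, so no momentum is carried over between steps, and the first update has a magnitude close to the learning rate regardless of the gradient scale.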

c. Spectral normalization

GANs are difficult to train, because the optimization landscape of the adversarial training changes with every iteration. The loss can spike suddenly (after many iterations of consistent improvement), or the training can stall completely if the discriminator becomes too good at distinguishing real from generated images.

We use spectral normalization of trainable weights (Miyato et al. 2018) in all convolution layers to enforce Lipschitz continuity of the discriminator. As in Zhang et al. (2019), we found that using spectral normalization also in the generator further stabilizes the training.
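
The core of spectral normalization is an estimate of the largest singular value of a weight matrix, obtained by power iteration. The sketch below shows the idea for a small dense matrix in plain Python; it is a simplified illustration, whereas practical implementations typically reshape each convolution kernel into a matrix and reuse the iteration vectors across training steps.

```python
import math


def spectral_norm(weight, iterations=50):
    """Estimate the largest singular value of a matrix by power iteration."""
    rows, cols = len(weight), len(weight[0])
    v = [1.0 / math.sqrt(cols)] * cols
    sigma = 0.0
    for _ in range(iterations):
        # u <- W v, normalized; its length before normalization estimates sigma.
        u = [sum(weight[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        sigma = math.sqrt(sum(x * x for x in u))
        u = [x / sigma for x in u]
        # v <- W^T u, normalized.
        v = [sum(weight[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
    return sigma


w = [[3.0, 0.0], [0.0, 1.0]]  # toy weight matrix with singular values 3 and 1
sigma = spectral_norm(w)
w_sn = [[x / sigma for x in row] for row in w]  # spectrally normalized weights
print(sigma, spectral_norm(w_sn))
```

Dividing the weights by the estimated value bounds the largest singular value of the layer by approximately 1, which is what limits the Lipschitz constant of the network.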

3. Data

We present results for three camera sites that belong to the networks operated by MeteoSwiss and Rega, where we have access to an archive of past images. We chose sites that show a good diversity in terms of terrain and weather conditions. The camera at Cevio (e.g., Fig. 3) is located in the Maggia Valley, in the southern part of Switzerland at an elevation of 421 m MSL. The camera at Etziken (Fig. 10) is located on the Swiss main plateau, at an elevation of 524 m MSL. The camera at Flüela (e.g., Fig. 7) is located on a mountain pass in the Alps, at an elevation of 2177 m MSL.

Fig. 7.

Examples of generated images that (left) were judged to look realistic and (right) contain obvious artifacts such as repeating cloud patterns.

Each camera takes a picture every 10 min, resulting in 144 images per day. We excluded images from our training and testing datasets that are not at all usable, such as when the camera moving head was stuck pointing to the ground, or the lens was completely covered with ice, but there was otherwise no need to clean the dataset. Images were retained if there were water droplets on the lens, or if there was a minor misalignment of the camera moving head. We also did not exclude particular weather conditions such as full fog. Nevertheless, there are gaps of several days in the data series, due to camera failures that could not be fixed in a timely manner.

To limit the memory consumption and to speed up the training on the available hardware (26 GB RAM on a single NVIDIA A100 GPU), the images were downscaled from their original size to 64 × 128 pixels. Choosing powers of two for the height and width facilitates the down- and upsampling in the encoder–decoder architecture (section 2a). The original aspect ratios were restored using an additional horizontal resizing operation.

The weather descriptor w consists of the time of day, day of year, and a subset of the hourly forecast fields provided by the COSMO-1 model, evaluated at the location of the camera (see Table 1). The subset selection was made in consultation with the COSMO-1 experts at MeteoSwiss. Our goal was to define a small set of features that are predictive for the atmospheric, illumination, and ground conditions that are visible in the image.

Table 1

The weather descriptor w consists of the time of day, day of year, and the following subset of COSMO-1 output fields (Schättler et al. 2021), evaluated at the location of the camera.

Unfortunately, a follow-up numerical evaluation and optimization of the set of descriptors was not feasible. Computing Shapley additive explanations (SHAP) feature importance values (Lundberg and Lee 2017) was impossible, because the DeepExplainer and GradientExplainer implementations do not support all of the operators used in our model, and there are too many model parameters to use the model agnostic KernelExplainer. A brute-force optimization of the feature set (e.g., by leaving out a single descriptor and comparing the relative performance loss) was infeasible due to the long training time (several days for a single training run).

For the training and evaluation of the networks, we used the analysis (the initial state of the NWP model) as the optimal forecast for w0 and also for wt. This simplifies the evaluation of our visualization method, as it reduces the error between the NWP model state and the actual weather conditions at time t. For an operational use of the trained generator, where the analysis of the future is of course not available, one would use the current forecast for the lead time t instead, without having to modify the visualization method in any other way.

The analysis w0 and the forecast wt can fail to accurately describe the actual weather conditions visible in the camera image. The COSMO-1 fields have a limited spatial and temporal resolution of 1 km and 1 h. The camera location can thus be atypical for the NWP model grid point, and changes in weather conditions can be observed by the camera before or after they become apparent in the forecast fields. Evaluating the fields at a single point can also be insufficient to describe all the weather conditions in the view of the camera. When comparing the weather conditions visible in It and I^t, we therefore have to distinguish between mismatches that are due to the weather descriptors (where w0 or wt do not accurately describe I0 or It), and mismatches that are due to the generator (where I^t does not properly visualize wt); see section 4b for additional discussion and our results.

We used data from the year 2019 for training and data from the year 2020 up to the end of August (when COSMO-1 was decommissioned at MeteoSwiss) for testing. The training datasets consist of all possible tuples (w0, wt, I0, It), where t varies from 0 to 6 h. The resulting numbers of tuples are 712 889 for Cevio, 450 486 for Etziken, and 493 096 for Flüela.
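
The enumeration of training tuples can be sketched as pairing every available image time with all later image times up to 6 h ahead. The code below is an illustration on a toy archive with an artificial gap. The function name `training_pairs` is hypothetical, and the sketch pairs all available 10-min time stamps, which may differ from the lead-time spacing behind the tuple counts given above.

```python
from datetime import datetime, timedelta


def training_pairs(timestamps, max_lead=timedelta(hours=6)):
    """Enumerate all (t0, t) pairs of available images with 0 <= t - t0 <= max_lead."""
    times = sorted(timestamps)
    pairs = []
    for i, t0 in enumerate(times):
        for t in times[i:]:
            if t - t0 > max_lead:
                break
            pairs.append((t0, t))
    return pairs


# Toy archive: one image every 10 min for 8 h, with a gap after the second hour.
start = datetime(2019, 1, 1, 6, 0)
archive = [start + timedelta(minutes=10 * k) for k in range(49)
           if not 12 < k < 18]
pairs = training_pairs(archive)
print(len(archive), len(pairs))
```

Gaps in the data series simply reduce the number of pairs; no special handling is needed because only existing images are paired.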

4. Results

We evaluate the forecast visualization method according to the four criteria that have been introduced in section 1a: how realistic the generated images look, how well they match future predicted and observed weather conditions, whether the transition from observation to forecast is seamless, and whether there is visual continuity between consecutive images.

Our evaluation is primarily perception based. Visualization methods transform abstract data into a concrete form that should be meaningful to humans. A perceptual evaluation of the created visualization is therefore the gold standard and is ideally accompanied by easy to compute quantitative measures. Developing such measures is challenging, however, for several reasons.

The visual fidelity of images created by GANs is largely independent of their log-likelihood (Theis et al. 2016). Commonly used measures to evaluate their realism, such as the inception score (IS) by Salimans et al. (2016) or the Fréchet inception distance (FID) by Heusel et al. (2017), have been shown to not generalize well beyond the dataset that was used to train the underlying neural network (Barratt and Sharma 2018). Computing the IS or FID with an Inception network that was pretrained on ImageNet (Deng et al. 2009), which mostly consists of images that are wholly unlike outdoor webcam images, would therefore result in scores that are likely uninformative or misleading. Instead, we would have to develop and validate a visual fidelity measure that is specific to our application—a substantial research project on its own, that we have to leave for future work.

An accurate visualization of future weather conditions does not imply a pixel-by-pixel correspondence to the true webcam image of the future (see section 1c). Instead, photographic visualizations have several undetermined degrees of freedom, such as the specific shape and position of clouds. Therefore, evaluation measures that are commonly used in the nowcasting literature, such as the pixel-based root-mean-square error (RMSE) or the continuous ranked probability score (CRPS) used in Leinonen et al. (2021), are not appropriate for determining how accurately the photographic visualization matches the true image of the future.

Photographic visualizations also show a superposition of several weather aspects, that is, they jointly visualize the atmospheric, ground, and illumination conditions of the visible weather. To measure how accurately each aspect is visualized (e.g., the amount of cloud cover, or the presence of snow on the ground), it has to be extracted from the superposition. But estimating the amount of cloud cover (Krinitskiy et al. 2021) or the prevailing visibility (Palvanov and Cho 2019) from camera images (to give just two examples) are still active areas of research.

There is one evaluation where a pixel-based RMSE is appropriate: to measure how accurately the visualization I^0 reproduces the current image I0, as they should be pixelwise identical, I^0=I0. Otherwise, we have used quantitative measures only to monitor the GAN training progress. The current lack of quantitative measures for our proposed evaluation criteria has no bearing on the practicability of our method, however. Training the GAN is entirely automated and does not need any expert input.

a. Realism

To evaluate the realism of individual generated images, we asked five coworkers at MeteoSwiss (who regularly consult the cameras of the MeteoSwiss network for their job duties, but otherwise were not involved in this project) to examine whether a presented image looks realistic or artificially generated.

The evaluation data were generated in the following way. For every camera, 75 pairs (I0, w0) were sampled from the test data uniformly at random, with the time of I0 (t = 0) restricted to between 0600 and 1400 UTC to avoid too many similar-looking nighttime views. Then a lead time t was sampled from 0 to 360 min (in increments of 10 min), and both the real future image It and the generated image I^t = G(I0, z|w0, wt) were added to the evaluation dataset. This sampling strategy ensured that there was no overall difference in the meteorological conditions of the real and generated images, which otherwise could have influenced the examiners’ accuracy.

Each examiner was then assigned 30 randomly selected images from each camera and asked to provide their judgment on the realism of each presented image. The examiners could take as much time as was necessary to come to a decision, consistent with the Human Eye Perceptual Evaluation Infinity (HYPE) protocol proposed by Zhou et al. (2019) for the perceptual evaluation of generative models. The examiners were told that they would see real and generated images, but they did not know that the dataset was balanced, and neither did they keep a count of their judgments (i.e., to monitor whether their judgments were biased toward real or generated images). They could inspect the images at arbitrary zoom levels and also look at other images in the evaluation set before giving an answer. But they were not given any background information about the images, such as the date, the lead time of the forecast, or the predicted weather conditions.

The results of the evaluation are presented in Table 2. The overall accuracy of the examiners was 59% (corresponding to a HYPE score of 41%), with a 95% bootstrap confidence interval (CI) of 55%–64% (based on 10 000 bootstrap samples). The low accuracy of the examiners’ judgment indicates that it was challenging for them to distinguish between real and generated images (a 50% accuracy would correspond to random guessing). Many of the generated images look realistic enough to pass for a real image, but there were also instances where artifacts introduced by the generator were obvious at first glance (see Fig. 7 for an example).
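
A percentile bootstrap CI of this kind can be computed as in the following sketch. The list of judgments is purely illustrative (450 judgments, of which 266 are taken to be correct, chosen so that the accuracy rounds to 59%); it does not reproduce the confusion matrices of Table 2, and the function `bootstrap_ci` is an assumed helper, not the code used for our evaluation.

```python
import random


def bootstrap_ci(correct, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample the judgments with replacement and record each resampled accuracy.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = means[int((1.0 - level) / 2.0 * n_boot)]
    hi = means[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lo, hi


# Illustrative only: 450 judgments with 266 correct (not the counts of Table 2).
judgments = [1] * 266 + [0] * 184
accuracy = sum(judgments) / len(judgments)
hype = 1.0 - accuracy  # HYPE score: share of misjudged images
low, high = bootstrap_ci(judgments)
print(f"accuracy {accuracy:.1%}, HYPE {hype:.1%}, 95% CI {low:.1%} to {high:.1%}")
```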

Table 2

Results for the perceptual evaluation of real and generated images for the Cevio, Etziken, and Flüela cameras. Using a browser-based data labeling tool, images were presented to five examiners with the question “What is your first impression of this image?,” and they could answer either with “looks realistic” or “looks artificially generated.” Their answers were aggregated into a confusion matrix for each camera, where the rows correspond to the ground truth, and the columns correspond to the examiners’ judgment.

b. Matching future conditions

To evaluate how well the forecast visualization I^t matches the future real image It, three examiners compared the atmospheric, ground, and illumination conditions visible in the images. As we do not expect I^t to match It in every detail (e.g., the precise shape and position of clouds do not matter), we used the following descriptive criteria to determine their overall agreement:

  • Atmospheric conditions: cloud cover (clear sky | few | cloudy | overcast), cloud type (cumuliform | stratiform | stratocumuliform | cirriform), visibility (good | poor)

  • Ground conditions: dry | wet | frost | snow

  • Illumination: time of day (dawn | daylight | dusk | night), sunlight (diffuse | direct, casting shadows)

A mismatch between the actual and visualized conditions can happen because of two different kinds of failures. The forecast descriptor wt can fail to accurately capture the weather conditions visible in It, or the generated image I^t can be inconsistent with wt:

  1) Inaccurate forecast: Besides COSMO-1 not having a perfect forecasting skill, evaluating the forecast fields only at the camera site can be insufficient to describe all of the visible weather. Furthermore, the spatial or temporal resolution of wt can be too coarse for highly variable weather conditions. For example, the hourly granularity of the forecast can only approximately predict the onset of rain.
  2) Inconsistent visualization: The generator G can fail to properly account for the changes from w0 to wt in the transformation of I0 to I^t.

The three examiners compared 50 pairs (It,I^t) per camera (450 pairs in total), selected according to the same methodology as described in section 4a. For each pair, they determined whether the descriptive criteria match between the real and synthesized images It and I^t. To better understand if mismatches are due to either the first or second kind of failure, they also determined whether wt accurately described all the weather conditions visible in It, and whether I^t was consistent with wt.

Table 3 summarizes the results of the evaluation. In general, cloud cover and cloud type were the most difficult criteria to get right. At Cevio, for example, in 44 out of 150 cases (29%) the generated amount of cloud cover differed from the observed future conditions. But in only 9 of those 44 cases was the mismatch due to the visualization method, meaning that wt was accurate but I^t was inconsistent with it (see Fig. 9 for a counterexample). The visualization method would therefore benefit from improving the accuracy of the weather forecast, for example, by using a downscaled version of the COSMO-1 forecast with a higher spatial and temporal resolution.

Table 3

Evaluation of the matching atmospheric, ground, and illumination conditions between real and synthesized images It and I^t. Three experts compared 50 randomly selected pairs (It,I^t) per camera (450 pairs in total), according to the criteria specified in section 4b. They also examined whether the forecast wt accurately describes the conditions visible in It, and whether I^t is consistent with the forecast. All values are percentages of the possible maximum 150, and the values in parentheses are 95% bootstrap CIs. For values in bold font, the CIs do not overlap with the corresponding CIs for sequence analogs (see Table 4).

Table 4 shows the corresponding evaluation for sequence analogs (section 1b). Comparing the number of matching conditions for GAN-based image synthesis and analog sequence retrieval, we find that image synthesis achieves significantly more matches for cloud cover, cloud type, ground conditions, and time of day. It also creates visualizations that are overall 27 percentage points more consistent with the weather forecast.

Table 4

Evaluation of matching weather conditions for sequence analogs I^tseq (see section 1b), following the same procedure as the evaluation of synthesized images in Table 3.

Figures 8–12 show further examples of the range of transformations that can be achieved by the generator, transforming the atmospheric and illumination conditions to match It, while retaining the specific ground conditions of I0.

Fig. 8.

(top) A sequence of images taken by the Cevio camera between 1000 and 1600 UTC 6 Feb 2020, and (bottom) the corresponding images generated by G(I0, z|w0, wt). This visualization satisfies our four evaluation criteria. The generated images look realistic and are free of artifacts. They match the real images of the future w.r.t. atmospheric, ground, and illumination conditions. The transition from observation to forecast is seamless: I0 (at top left) is well approximated by I^0 (at bottom left), and I^1 retains the persisting weather conditions of I0. Finally, there is good visual continuity, because the progression of shadows appears natural.

Fig. 9.

The cloud cover amount in (left) the generated image is too large compared to (right) the observed conditions at Cevio at 1200 UTC 25 Apr 2020. But the mismatch in cloud cover is not a failure of the visualization method, as the COSMO-1 forecast predicts a 100% cloud area fraction in the medium troposphere.

Fig. 10.

(top) A sequence of images taken by the Etziken camera between 1400 and 2000 UTC 6 Feb 2020 and (bottom) the corresponding images generated by G(I0, z|w0, wt). The generator was able to fully transform I0 (at top left) from daylight to nighttime conditions, including the visual appearance of illuminated street lamps and windows at t = 4 h.

Fig. 11.

(top) A sequence of images taken by the Cevio camera between 1000 and 1600 UTC 17 Jun 2020 and (bottom) the corresponding images generated by G(I0, z|w0, wt). The generator synthesizes the correct scenery (the Pizzo Paràula mountain in the center of the image) that is occluded at t = 0 and becomes visible at t = 5 h.

Fig. 12.

(top) The same sequence of images as in the top row of Fig. 2, taken by the Flüela camera between 1000 and 1600 UTC 2 Jul 2020. (bottom) Because I^t = G(I0, z|w0, wt) is a transformation of I0, the exact position and shape of snow patches are retained in I^1, I^2, …, I^6.

c. Seamless transition

For a seamless transition between observation and forecast, the generator must be able to reproduce the input image, I^0 = I0, for t = 0. It must also retain the conditions of I0 in I^1, I^2, …, as long as they persist into the future.

We evaluated the third criterion on hourly nowcasts up to six hours into the future. As can be seen in Figs. 8–13, the generator reproduces the ground and illumination conditions of I0 very well in I^0. The shape and positions of clouds are reproduced closely but not exactly, giving the impression that the generator reconstructs the overall cloud pattern but not every small detail. The resulting pixelwise RMSE between the synthesized and real images I^0 and I0 is on average 4 times lower than for sequence analogs I^0seq: it is 1.11 × 10⁻² versus 4.64 × 10⁻², computed on 1000 pairs (I^0, I0) per camera. We could achieve pixel-level accurate reproductions of I0 by using a residual architecture for the generator network. But as discussed in section 2a, the realism of images generated by a residual network suffered from noticeable visual artifacts for t > 0.
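
For reference, the pixelwise RMSE between two images can be computed as in the sketch below, which uses nested lists in channels-first order. The toy pixel values are invented for illustration, and the scaling of the pixel intensities is an assumption.

```python
import math


def rmse(img_a, img_b):
    """Root-mean-square error over all pixels and channels of two equally sized images."""
    diffs = [a - b
             for ch_a, ch_b in zip(img_a, img_b)
             for row_a, row_b in zip(ch_a, ch_b)
             for a, b in zip(row_a, row_b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


i0 = [[[0.2, 0.4], [0.6, 0.8]]]          # toy "real" image, one channel
i0_hat = [[[0.21, 0.39], [0.61, 0.79]]]  # toy reconstruction, off by 0.01 everywhere
print(rmse(i0_hat, i0))

# Ratio of the two average RMSE values reported in the text.
print(4.64e-2 / 1.11e-2)
```

The ratio of the two reported values is slightly above 4, consistent with the statement that the RMSE of the synthesized images is on average 4 times lower.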

Fig. 13.

(top) A sequence of images taken by the Flüela camera between 0600 and 1200 UTC 12 Jan 2020. (bottom) The generator accurately transformed the illumination conditions and shadows, but it failed to learn how the sun moves across the sky. Instead of shifting the position of the sun, the sun stays in the same position and fades away gradually.

As can be seen in the bottom row of Fig. 12, the generator retains the specific conditions of I0 (such as the position and shape of snow patches) in the visualizations I^1, …, I^6. This is only possible because G transforms I0 into I^t. As already shown in the corresponding Fig. 2, achieving a seamless transition is not feasible using analog retrieval (section 1b), because an image with the specific shapes and positions of clouds and snow patches is unlikely to exist in the archive.

d. Visual continuity

Finally, the consecutive visualizations I^t, I^t+1 must show visual continuity, with a natural-looking cloud development, change of daylight, movement of shadows, and so on. As can be seen in Figs. 8–14, the evolution of the ground and illumination conditions matches the future observed images very closely, even though G and D do not explicitly account for the statistical dependency between t and t + 1. The increase or decrease in cloud cover and visibility also looks natural. By contrast, the visualizations based on individual analogs I^tind lack continuity (Fig. 2, middle row), with snow patches appearing and disappearing, and illumination conditions changing unnaturally.

Fig. 14.

(top) A sequence of images taken by the Cevio camera between 1000 and 1600 UTC 17 Feb 2020. (bottom) The generated images accurately match the ground and illumination conditions, and the development of cloud shapes appears natural. But the positions of individual clouds remain too static over time.

But the generator struggled with learning image transformations that involve translations of objects across the camera view, such as the movement of the sun (Fig. 13) or of isolated clouds (Fig. 14). We conjecture that a network architecture based on Conv-BN-ReLU blocks is highly effective at transforming the appearance of objects that remain in place, but less so for translation operations. We therefore investigated whether including nonlocal self-attention layers (Zhang et al. 2019) could be beneficial. We did not achieve a consistent improvement in our experiments, neither with full nor with axis-aligned self-attention, while the network training time increased significantly.

The balance between visual continuity and image diversity can be tuned by changing the standard deviation σ of the random input zi ∼ N(0, σ²). Increasing σ leads to a greater diversity of visualizations that are deemed consistent with the weather forecast, while decreasing σ promotes greater continuity between subsequent images (see Fig. 15). We found that setting σ = 0.5 gave the best trade-off between the two objectives.
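
Sampling the random input for a given σ can be sketched as follows. The helper `sample_z` and the fixed seed are assumptions made for reproducibility of the illustration; the number of elements (100) follows section 2a.

```python
import random


def sample_z(sigma, n=100, seed=None):
    """Draw n independent elements z_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]


for sigma in (0.0, 0.2, 0.5, 1.0):
    z = sample_z(sigma, seed=1)
    spread = (sum(v * v for v in z) / len(z)) ** 0.5  # empirical root-mean-square
    print(sigma, round(spread, 3))
```

Setting σ = 0 makes the random input identically zero, so that the generator output is deterministic given (I0, w0, wt), while larger values of σ spread the inputs further apart.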

Fig. 15.

The effect of the noise standard deviation σ on the diversity of visualizations synthesized by the generator. Increasing σ when sampling zi ∼ N(0, σ²) leads to a greater diversity of images that are deemed consistent with the weather forecast. But note that the realized diversity is also a function of t: it is smallest at t = 0, and greatest when there is a change of weather conditions at t = 3 h. (top) A sequence of images taken by the Cevio camera between 0600 and 1200 UTC 16 Mar 2020. (bottom rows) Nowcasting visualizations generated with σ = (0, 0.2, 0.5, 1.0).

5. Conclusions and future work

We have shown that photographic images not only visualize the weather conditions of the past and the present, but they can be useful for visualizing weather forecasts as well. Using conditional generative adversarial networks, it is possible to synthesize photographic visualizations that look realistic, match the predicted weather conditions, transition seamlessly from observation to forecast, and show a high degree of visual continuity between consecutive forecasting lead times.

Meteorological services could use such visualizations to communicate the multiple aspects of localized forecasts in a single medium that is immediately accessible to the user. They could also provide a service to communities and tourism organizations for creating forecast visualizations that are specific for their web camera feeds.

The visualization method introduced in this paper is mature enough to become the first generation of an operational forecast product. But there are several next steps that could improve its visual fidelity and accuracy.

Training GANs is computationally intensive, which is why we had to limit the image size to 64 × 128 pixels. But there exist techniques in the literature (e.g., Karras et al. 2018) that scale GANs to image sizes of at least a megapixel, enhancing the visual fidelity of the synthesized images. Training a production-grade, high-resolution network could be done cost-effectively in the cloud on sufficiently capable hardware.

Our results of section 4b indicate that the generator is rarely to blame for mismatches between the actual and the visualized weather conditions. Most errors occur when the weather descriptor fails to accurately capture the conditions visible in the image. Using subhourly and subkilometer forecasts will improve the accuracy of the weather descriptors, and therefore the accuracy of the visualization method. Evaluating the forecast output fields at multiple locations in the line of sight of the camera could also be beneficial.

The visual continuity of the generated images still suffers from inconsistencies. While the temporal evolution of the ground and illumination conditions looks natural, our GAN architecture still struggles with synthesizing the movement of the sun and isolated clouds across the sky. To address these issues, we will continue to investigate nonlocal network layers that could complement the convolution layers. We will also investigate whether synthesizing whole sequences of images (instead of single images) can further improve the temporal evolution of nowcast visualizations.

Finally, if it is possible to quickly adapt a view-independent generator to a specific view (ideally with a single image), this would open the door to novel interactive applications of photographic visualizations. Smartphone users could obtain personalized forecasts by taking an image of the local scenery and a reading of their geographic coordinates and then receive a generated time-lapse image sequence that shows the predicted weather conditions in their near future.

Acknowledgments.

We are grateful to Rega for giving us permission to use images from the Cevio camera in this project. We thank Tanja Weusthoff for the preparation of the COSMO-1 forecast data. We thank Christian Allemann, Yannick Bernard, Thérèse Obrist, Eliane Thürig, Deborah van Geijtenbeek, and Abbès Zerdouk for taking part in the perceptual evaluations. We further thank Daniele Nerini for providing his expertise on nowcasting and postprocessing of forecasts. Finally, we thank the reviewers for their valuable feedback and suggestions on the initial version of the text.

Data availability statement.

A TensorFlow implementation of the generator and discriminator networks, as well as the code to reproduce the experiments presented in section 4, is available in the companion repository (https://doi.org/10.5281/zenodo.6962721). The repository also contains the trained network weights for the three camera locations, and additional generated images (of which Figs. 8–15 are examples). The data used in the expert evaluations and their detailed results (summarized in Tables 2–4) are also available there. The complete image archive and COSMO-1 data used for the training and evaluation of the networks cannot be published online due to licensing restrictions. But they can be obtained free of charge for academic research purposes by contacting the MeteoSwiss customer service (https://www.meteoswiss.admin.ch/about-us/contact/contact-form.html).

REFERENCES
