Deep Learning for Postprocessing Global Probabilistic Forecasts on Subseasonal Time Scales

Nina Horat, Karlsruhe Institute of Technology, Karlsruhe, Germany

Sebastian Lerch, Karlsruhe Institute of Technology, Karlsruhe, Germany, and Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

Abstract

Subseasonal weather forecasts are becoming increasingly important for a range of socioeconomic activities. However, the predictive ability of physical weather models is very limited on these time scales. We propose four postprocessing methods based on convolutional neural networks to improve subseasonal forecasts by correcting systematic errors of numerical weather prediction models. Our postprocessing models operate directly on spatial input fields and are therefore able to retain spatial relationships and to generate spatially homogeneous predictions. They produce global probabilistic tercile forecasts for biweekly aggregates of temperature and precipitation for weeks 3–4 and 5–6. In a case study based on a public forecasting challenge organized by the World Meteorological Organization, our postprocessing models outperform the bias-corrected forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF), and achieve improvements over climatological forecasts for all considered variables and lead times. We compare several model architectures and training modes and demonstrate that all approaches lead to skillful and well-calibrated probabilistic forecasts. The good calibration of the postprocessed forecasts emphasizes that our postprocessing models reliably quantify the forecast uncertainty based on deterministic input information in the form of ECMWF ensemble mean forecast fields only.

© 2024 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Nina Horat, nina.horat@kit.edu


1. Introduction

Planning and decision making in public health, agriculture, energy supply, water resource management, and other weather-dependent sectors often happens weeks to months in advance (White et al. 2017, 2022). However, due to the lack of forecast skill of physics-based numerical weather prediction (NWP) models on the subseasonal to seasonal (S2S) time scale, stakeholders so far mostly rely on either short- to medium-range weather forecasts or on seasonal outlooks (White et al. 2022). Therefore, more research on improving S2S forecasts is urgently needed. One important and widely researched approach is to better identify and exploit windows of opportunity, i.e., times with higher predictability associated with modes of intraseasonal and seasonal variability, for example the Madden–Julian oscillation and the El Niño–Southern Oscillation (e.g., Mariotti et al. 2020; Robertson et al. 2020; Mayer and Barnes 2021). A complementary approach, the use of postprocessing models to correct systematic errors of NWP forecasts, has become standard practice in research and operations for short- to medium-range weather forecasting (Vannitsem et al. 2021). However, there is a remarkable lack of postprocessing approaches for S2S forecasting (Robertson et al. 2020), even though it seems reasonable to expect improvements on these time scales as well (Merryfield et al. 2020).

Over the past years, the use of modern machine learning (ML) techniques such as random forests (Taillardat et al. 2016), gradient boosting (Messner et al. 2017) or neural networks (Rasp and Lerch 2018) has become a key focal point of research activities in postprocessing (Haupt et al. 2021; Vannitsem et al. 2021). In many applications to medium-range forecasting, neural network–based methods have superseded traditional statistical approaches and show substantial improvements in predictive performance. The superior performance of the ML models can mainly be attributed to their ability to better incorporate arbitrary input predictors and to more flexibly model nonlinear relationships. For a general overview of recent developments, we refer to Vannitsem et al. (2021) and Haupt et al. (2021).

One major challenge in statistical postprocessing is to retain spatial and temporal relationships in the postprocessed forecasts, as well as relationships between variables (Vannitsem et al. 2021). In particular, the few examples of S2S postprocessing studies that we are aware of tend to operate on single grid cells separately, and thus are neither able to exploit spatial information in the raw forecasts nor to produce spatially homogeneous forecasts (e.g., Vigaud et al. 2017, 2019; Mouatadid et al. 2023a,b; Zhang et al. 2023). For short- to medium-range forecasts, convolutional neural network (CNN)-based architectures and model components have been used for a variety of postprocessing applications (Dai and Hemri 2021; Grönquist et al. 2021; Veldkamp et al. 2021; Chapman et al. 2022; Lerch and Polsterer 2022; Li et al. 2022; Ben-Bouallegue et al. 2023; Hu et al. 2023). CNN architectures are designed for image-like data and can operate on spatial fields directly. They therefore enable the generation of spatially homogeneous predictions and can extract and learn spatial error structures, in contrast to the gridcell-based methods cited above, which can neither exploit the spatial information in the raw NWP forecasts nor guarantee spatially homogeneous forecasts.

As noted above, research on S2S postprocessing with CNNs is scarce. One of the rare exceptions is the study of Scheuerer et al. (2020) who propose a CNN architecture that estimates coefficient values for a set of basis functions to create spatial forecasts for precipitation over California on S2S time scales. In 2021, the World Meteorological Organization (WMO) coordinated a challenge to assess and promote the use of artificial intelligence (AI) for improving S2S forecasts (Vitart et al. 2022). The data provided within the framework of the challenge mainly consists of ML-ready (re)forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF). The setup of the challenge and the corresponding data provide a convenient starting point for the development of ML-based postprocessing methods for S2S forecasts, which produce probabilistic, spatially coherent and calibrated forecasts.

We propose four CNN-based postprocessing methods for global temperature and precipitation forecasts with lead times of 14 and 28 days. Our models utilize spatial forecast fields from the ECMWF predictions of several weather variables as input, and provide global probabilistic predictions for terciles (below-normal, near-normal, and above-normal conditions), as commonly done in S2S forecasting (e.g., Vigaud et al. 2017, 2019; Mariotti et al. 2020; Robertson et al. 2020; White et al. 2022). We train our postprocessing models either directly on the full global input fields, or on smaller square subdomains. While the training on global forecasts allows the postprocessing models to learn from nonlocal information, in particular teleconnections, training on smaller subdomains reduces noise in the input since irrelevant information farther away from the target region is cropped. To the best of our knowledge, the proposed models are the first to consider spatial context information for global S2S postprocessing by operating on spatial inputs directly. Using the setup of the WMO-organized S2S AI Challenge as a case study, we compare our CNN models to climatological and ECMWF reference forecasts, and discuss the results within the context of the challenge submissions, which were largely based on gridcell-specific approaches (Vitart et al. 2022).

The remainder of this paper is organized as follows. Section 2 introduces the S2S AI Challenge data and setup, and section 3 discusses the benchmark methods. Our CNN-based postprocessing models are described in section 4. The main results are presented in section 5, and the paper concludes with a discussion in section 6. Python code with implementations of all methods is available at https://github.com/HoratN/pp-s2s.

2. Data

The data used for this study stem from the WMO Prize Challenge to Improve Subseasonal to Seasonal Predictions Using Artificial Intelligence, which took place from June to November 2021 (Vitart et al. 2022) and which we refer to as the S2S AI Challenge in the following. This challenge was coordinated by WMO to foster the use of AI for postprocessing S2S forecasts, and to promote the S2S database (Vitart et al. 2017) for training AI methods. The dataset consists of global forecasts and reforecasts for precipitation and temperature from ECMWF. Weekly ensemble reforecasts with 11 members are provided for the years 2000–19, which were to be used as training data, and forecasts with 51 members for the year 2020, which served as test data. The process of training postprocessing models on reforecasts and applying them to forecasts poses methodological challenges, for example because of the different number of ensemble members, but is a common approach in operational weather forecasting (e.g., Ben-Bouallegue et al. 2023; Demaeyer et al. 2023).

All (re)forecasts come at a spatial resolution of 1.5° and a temporal resolution of 2 weeks. The biweekly aggregates have lead times of 14 days (weeks 3–4) and 28 days (weeks 5–6) and are computed by averaging daily temperature over days 15–28 and days 29–42, respectively, as described in Vitart et al. (2022). Biweekly precipitation (re)forecasts are obtained by accumulating daily precipitation rates over the same periods. The challenge team provided verifying gridded observations over land from the NOAA Climate Prediction Center (CPC) for precipitation and temperature. The observational and (re)forecast data are available through the S2S AI Challenge repository and a dedicated CliMetLab plugin that provides forecast data for additional variables and from other forecasting centers.

The main prediction task within the framework of the S2S AI Challenge was to provide categorical probability forecasts for “below-normal,” “near-normal,” and “above-normal” values of the target variables. To define these categories, the challenge team computed category edges based on the empirical 1/3 and 2/3 quantiles of the observations from the training period (years 2000–19). These edges were computed separately for each grid cell and week of the year to account for spatial and seasonal differences, and were used for both training and testing of the postprocessing models.
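The edge computation is straightforward to express in code. The following NumPy sketch shows one way to derive weekly, gridcell-specific edges and the resulting category labels; the array layout and function names are our own illustration, not the challenge code.

```python
import numpy as np

def tercile_edges(obs_train):
    """Empirical tercile edges from the 2000-19 training observations.

    obs_train: hypothetical array of shape (n_years, n_weeks, n_lat, n_lon),
    one biweekly aggregate per week of the year. Returns arrays of shape
    (n_weeks, n_lat, n_lon), i.e., separate edges per week and grid cell.
    """
    lower = np.nanquantile(obs_train, 1 / 3, axis=0)
    upper = np.nanquantile(obs_train, 2 / 3, axis=0)
    return lower, upper

def categorize(obs, lower, upper):
    """Map observations to categories 0 (below), 1 (near), 2 (above normal)."""
    cat = np.ones_like(obs, dtype=int)   # near normal by default
    cat[obs < lower] = 0                 # below normal
    cat[obs > upper] = 2                 # above normal
    return cat
```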

3. Benchmark methods

a. Climatological forecasts

An exceedingly simple, yet often competitive benchmark in subseasonal to seasonal forecasting is the climatological forecast. In the setting described above, the probabilistic climatological forecast simply assigns a probability of 1/3 to each of the three categories. Given that the category edges are computed from historical data, the climatological forecast can generally be expected to be calibrated.

b. ECMWF baseline and correction

The S2S AI Challenge also provided a more advanced benchmark method for the year 2020 based on ECMWF ensemble forecasts, henceforth called the ECMWF baseline. Categorical probability forecasts are obtained by computing the empirical frequencies of the 51-member ensemble after binning the members into the three categories. To enforce unbiased forecasts, the binning is done with respect to the category edges computed from the reforecasts instead of using the respective category edges based on historical observations. For very dry regions (accumulated precipitation over two weeks < 0.01 mm), a climatological forecast is issued for precipitation since the tercile edges are too close together, preventing a robust computation.

The ECMWF baseline provided within the framework of the S2S AI Challenge comes with two limitations. First, it is only available for the test dataset (the year 2020). Second, a closer inspection of the precipitation forecasts revealed that the probabilities over the three categories sometimes do not sum up to 1 due to an erroneous treatment of missing values in the ensemble members. This problematic behavior becomes apparent in Fig. 1, which shows the sum of the predicted probabilities of the three categories for all grid points. Despite being averaged over the entire year 2020, substantial deviations from 1 can be observed for a large fraction of grid points.

Fig. 1. Sum of the forecast probabilities for the three tercile categories as provided by the ECMWF baseline within the framework of the S2S AI Challenge. Shown is the average over all forecasts from 2020.

Based on the FORTRAN code used to create the ECMWF baseline (F. Vitart 2022, personal communication), we implemented our own baseline method following the same idea, but accounting for members with missing values by computing the tercile probabilities based on the number of nonmissing members only. For very dry regions, we follow the construction of the ECMWF baseline and issue a climatological forecast for precipitation. We refer to this benchmark as corrected ECMWF baseline.
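The fix amounts to normalizing the category counts by the number of non-missing members. A minimal NumPy sketch of this corrected binning, under our assumptions about the array layout, could look as follows.

```python
import numpy as np

def corrected_baseline(ens, lower, upper):
    """Tercile probabilities from a (n_members, n_lat, n_lon) ensemble.

    lower/upper are the reforecast-based category edges, shape (n_lat, n_lon).
    NaN members are excluded from the count so that the three probabilities
    always sum to one. A sketch, not the exact baseline code.
    """
    n_valid = (~np.isnan(ens)).sum(axis=0)         # non-missing members per cell
    p_below = (ens < lower).sum(axis=0) / n_valid  # NaN comparisons are False
    p_above = (ens > upper).sum(axis=0) / n_valid
    p_near = 1.0 - p_below - p_above
    # Very dry cells would additionally be replaced by the climatological
    # forecast (1/3, 1/3, 1/3), as in the original ECMWF baseline (not shown).
    return np.stack([p_below, p_near, p_above], axis=0)
```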

4. Convolutional neural network architectures

In this section, we introduce the different postprocessing approaches employed in this study. We start by describing the predictors and the data preprocessing in section 4a. Section 4b then provides a brief introduction to CNNs, and UNets in particular, and motivates the choice of the four model architectures investigated in this work. Sections 4c and 4d provide technical details on these architectures, and section 4e covers model training.

a. Model inputs

The input to our postprocessing models consists of global ECMWF forecasts of multiple meteorological variables. In addition to the target variable forecasts, we use geopotential height forecasts at 500 and 850 hPa, as well as sea level pressure forecasts, as inputs for postprocessing temperature forecasts. For precipitation, total column water forecasts are used as an additional input, since we expect total column water to be forecast more skillfully than precipitation itself. The precipitation (re)forecasts contain negative and missing values, in particular in drier regions. We set all negative precipitation accumulations to zero.

Instead of using the whole ensemble as predictor for our postprocessing models, we only use predictors based on the ensemble mean. We do not use the ensemble spread or other measures of variability within the ensemble members since the benefit of using those as additional predictors has been found to be small in previous studies (e.g., Rasp and Lerch 2018; Schulz and Lerch 2022; Höhlein et al. 2024). For tercile probability forecasts, the exact values of the target variable forecasts matter less than their position relative to the boundaries between the three categories. Therefore, we compute the distance from the ensemble mean of the target variable forecast to the respective tercile category edges, which yields the distances to the upper and lower boundary of the middle tercile as two new predictors. For the nontarget predictors, we compute anomalies of the ensemble mean with respect to a local and week-specific climatology (i.e., a separate climatology for each grid cell and week of the year). All predictors are further standardized with the gridcell-specific standard deviation. Before feeding the spatial fields to the deep learning (DL) models, we fill all missing values (i.e., all ocean grid cells and some very dry grid cells) with zeros.
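The following sketch illustrates this preprocessing for a single initialization date and one auxiliary predictor; all names, shapes, and the dictionary of standard deviations are our assumptions rather than the actual pipeline.

```python
import numpy as np

def build_predictors(target_mean, lower, upper, aux_mean, aux_clim, std_grid):
    """Assemble standardized input channels for one forecast initialization.

    target_mean: ensemble-mean forecast of the target variable, shape (lat, lon).
    lower/upper: tercile edges valid for the forecast week, shape (lat, lon).
    aux_mean/aux_clim: ensemble mean and week-specific climatology of one
    auxiliary predictor (e.g., z500). std_grid: dict of gridcell-wise
    standard deviations per predictor. All names are illustrative.
    """
    # Position of the forecast relative to the middle-tercile boundaries.
    dist_lower = (target_mean - lower) / std_grid["target"]
    dist_upper = (target_mean - upper) / std_grid["target"]
    # Anomaly of the auxiliary predictor w.r.t. its local weekly climatology.
    aux_anom = (aux_mean - aux_clim) / std_grid["aux"]
    channels = np.stack([dist_lower, dist_upper, aux_anom], axis=-1)
    # Ocean (and some very dry) grid cells are NaN; the CNNs need dense fields.
    return np.nan_to_num(channels, nan=0.0)
```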

b. Overview of model architectures

We propose four different postprocessing methods for gridded data that are based on CNN architectures. All architectures follow an encoder–decoder structure. The encoder part detects structures in the global gridded input data and extracts lower-dimensional latent features from the image-like input. The decoder part can then make use of the encoded information to create a spatial prediction, providing a probabilistic forecast for tercile categories at every land grid cell globally.

The encoder part of our CNNs uses alternating layers of convolutional and pooling operations. Convolutional layers allow the model to learn and detect structures in image-like input data. Instead of learning separate weights for each pixel in the image (as a dense neural network would do), weights are shared among pixels by using so-called filters. These filters are three-dimensional weight tensors learned during training. The third dimension corresponds to the number of input maps; the number of filters corresponds to the number of different structures the model can detect. To detect the structures, the filter tensor slides over the image and a dot product between a part of the image and the filter is computed. Between the convolutional layers, pooling layers are used that aggregate grid points to amplify the signal found by the convolutional layers. With every pair of convolutional and pooling layers, the image resolution decreases and increasingly complex structures are learned from the input image. While this feature extraction part, the encoder, varies only slightly across the four methods proposed here, the decoder parts of the four postprocessing architectures differ significantly.

In the following we briefly summarize the key ideas behind the four postprocessing models. The first architecture that we propose is based on the concept of a UNet, a specific type of a CNN architecture that is used for image segmentation and was first proposed in medical applications (Ronneberger et al. 2015; Litjens et al. 2017). In atmospheric sciences, it is also gaining popularity in particular for detection, nowcasting and forecasting (e.g., Ayzel et al. 2020; Lagerquist et al. 2021; Chapman et al. 2022; Quinting and Grams 2022). The UNet architecture has a symmetric encoder–decoder architecture (see Fig. 2 for a schematic illustration of the model architecture) and is specifically tailored to image-to-image tasks. In contrast to traditional CNN architectures for image classification tasks, UNets do not classify an image as a whole but instead assign a class probability to every pixel. Therefore, UNets can be used to directly obtain probabilities for tercile categories jointly at all grid cells. In a UNet, the upscaling from the learned low-dimensional representation to the full image size (i.e., the decoding) is done using so-called upsampling layers, for which we use transposed convolutions. The output of the upsampling layer is concatenated with the feature maps of the same size from the feature extraction (encoder) part of the model (as depicted in Fig. 2). These concatenations are called skip connections and help to reconstruct the fine details from the high-resolution input.

Fig. 2. The UNet architecture used for temperature (for precipitation, one additional predictor is used, resulting in six inputs). The number of global maps (channels) is denoted on top of each box; the spatial extent of these feature maps is given on the left side of each box for the patchwise training (left number) and the global training (right number).

In addition to UNet-based models, we further explore ways of adapting a standard CNN architecture to enable spatial predictions. Previous work by Scheuerer et al. (2020) proposes a standard CNN that returns coefficient values for local, spatially smooth basis functions as output in order to create spatial forecasts for precipitation over California on subseasonal time scales. A direct application of this method to global data is infeasible since the coefficient matrix for an appropriate number of basis functions becomes prohibitively large for a global output domain at 1.5° resolution. We therefore extend this approach to global data by splitting the original global map into multiple smaller domains, so-called patches, and train the basis-function CNN (BF-CNN) on these patches. Figure 3 shows a schematic of this model architecture. Note that the patches of input data will generally be larger than the corresponding output patches on which the models produce probability forecasts, similar to the setting in Li et al. (2022).

Fig. 3. The architecture of the BF-CNN and the TC-CNN model for temperature (for precipitation, one dense layer less is used in the encoder and the input contains six channels). The two architectures share the same encoder, but use different strategies to derive a spatial prediction from the lower-dimensional representation. The decoder of the basis-function CNN is shown in the middle on top, and the corresponding decoder for the CNN with transposed convolutions is shown on the right. The number of channels is denoted on top of each box, and the size of the feature maps or the length of the vector is given on the left side of each box. Dark gray shaded boxes symbolize three-dimensional matrices, while light gray boxes correspond to two-dimensional fields, and the thin black lines denote one-dimensional vectors.

We further propose a slightly more flexible variant of the BF-CNN approach by replacing the basis functions with transposed convolutional layers, as depicted in Fig. 3. These transposed convolutional layers directly yield predictions of category probabilities for all pixels within a considered output patch. We introduce this model as a way to limit model complexity (compared to a complete UNet architecture), while still keeping part of the UNet's flexibility for creating spatial forecasts. We refer to this architecture as the TC-CNN.

The two models based on standard CNN architectures, namely, BF-CNN and TC-CNN, are always trained patchwise and produce separate predictions for each patch. The UNet architecture on the other hand can be trained either directly on global input data (global UNet) or patchwise on smaller parts of the global forecasts (patchwise UNet). We will abbreviate patchwise with “pw” and append this abbreviation to the model names of the patchwise trained models in figures and tables to improve clarity.

While the global UNet has the potential to learn from nonlocal information, in particular teleconnections, its performance might be degraded by an overload of information that is not clearly related to the forecast error at the target location. Since the signal-to-noise ratio is low in S2S forecasts, it is not necessarily clear whether the global UNet would be able to benefit from having access to global information. If the global information cannot be exploited effectively, it might be beneficial to crop irrelevant information farther away from the target region to reduce noise in the input data, which is what the patchwise training does. The patchwise UNet model, bridging the patchwise standard CNN architectures and the global UNet approach, will help us distinguish between the effect of using a more sophisticated model architecture and the opportunity to learn from global input information.

Given that the training data from 20 years of weekly initialized reforecasts amount to only roughly 1000 global training samples, the patchwise training can also be considered a helpful data augmentation technique, resulting in 30 000 or more samples depending on patch size and target variable. A conceptually related approach can be found in postprocessing models for short- to medium-range ensemble forecasts, where models for different grid cells are often trained by pooling all available locations and using samples from all grid cells to train one postprocessing model (e.g., Rasp and Lerch 2018).

c. Architecture details for the UNet models

The UNet architecture is a CNN with an encoder–decoder structure. The encoder part, responsible for decreasing the resolution of the input fields and for learning patterns, here consists of three blocks. The ith block comprises two convolutional layers with 4 × 2^i filters, followed by a batch-normalization and an average-pooling layer. These encoder blocks are followed by a bottleneck block that contains two convolutional layers with 64 filters and a batch-normalization layer. The output of the bottleneck is fed into the decoder part, which is responsible for upscaling the low-resolution patterns to the original resolution and also consists of three blocks. The ith block (with i from 3 to 1, i.e., in reversed order) starts with a transposed convolutional layer with 4 × 2^i filters, whose output is then concatenated with the output of the batch-normalization layer of the ith encoder block. The result of this so-called skip connection is further processed by two convolutional layers with 4 × 2^i filters and a batch-normalization layer. One final convolutional layer with three filters, kernel size and stride equal to one, and softmax activation function concludes the UNet architecture. Unless stated otherwise, all convolutional layers in the UNet architecture use a kernel size of 3 × 3, stride one, "same" padding, and exponential linear unit (ELU) activation. The transposed convolutional layers use the same parameters except for a stride of two.
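Under these specifications, the UNet can be sketched in Keras as follows; the input size is a placeholder (the actual models operate on padded patch or global fields, see appendix B), so this is an illustration of the described architecture rather than the exact training code.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_unet(input_shape=(32, 64, 5)):
    """UNet sketch following section 4c; five input channels for temperature."""
    inputs = keras.Input(shape=input_shape)
    x, skips = inputs, []
    for i in (1, 2, 3):                          # encoder blocks
        f = 4 * 2 ** i                           # 8, 16, 32 filters
        x = layers.Conv2D(f, 3, padding="same", activation="elu")(x)
        x = layers.Conv2D(f, 3, padding="same", activation="elu")(x)
        x = layers.BatchNormalization()(x)
        skips.append(x)                          # kept for the skip connections
        x = layers.AveragePooling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="elu")(x)   # bottleneck
    x = layers.Conv2D(64, 3, padding="same", activation="elu")(x)
    x = layers.BatchNormalization()(x)
    for i in (3, 2, 1):                          # decoder blocks, reversed order
        f = 4 * 2 ** i
        x = layers.Conv2DTranspose(f, 3, strides=2, padding="same",
                                   activation="elu")(x)
        x = layers.Concatenate()([x, skips[i - 1]])   # skip connection
        x = layers.Conv2D(f, 3, padding="same", activation="elu")(x)
        x = layers.Conv2D(f, 3, padding="same", activation="elu")(x)
        x = layers.BatchNormalization()(x)
    # 1 x 1 convolution with softmax: tercile probabilities per grid cell.
    outputs = layers.Conv2D(3, 1, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```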

d. Architecture details for the standard CNN models

We slightly adapt the model architecture of the basis-function CNN proposed in Scheuerer et al. (2020) depending on the target variable. For precipitation, we use the same encoder architecture as in Scheuerer et al. (2020); for temperature, we double the number of filters in the two convolutional layers, resulting in the encoder outlined in Table 1. The coarse-resolution two-dimensional fields obtained by these operations are flattened, and dropout is applied with a dropout rate of 0.4. One (for precipitation) or two (for temperature) dense layers with 10 nodes and ELU activation conclude the encoder part of the BF-CNN and TC-CNN models. We use input patches of 34 × 34 grid cells for both temperature and precipitation.

Table 1. Encoder for standard CNN models for temperature.

The decoder part responsible for the upscaling to the target domain differs between the BF-CNN and the TC-CNN models. For the BF-CNN, another dense layer with ELU activation and 27 nodes is applied, with the number of nodes matching the number of basis functions (nine) times the number of terciles (three). The resulting vector is reshaped into a 3 × 9 matrix and represents the coefficient values for the basis functions. The nine entries per tercile are then mapped to the 8 × 8 output domain by multiplying with the nine basis functions (a 64 × 9 matrix). The basis functions are spatially smooth functions with circular bounded support (with a radius of 16 grid cells) that are uniformly distributed over the output domain; see Scheuerer et al. (2020) for details. An illustration is included as part of Fig. 3. Following Scheuerer et al. (2020), we add a vector of logarithmic climatological probabilities [log(1/3) for each category in our case] and reshape the output into a tensor of shape 8 × 8 × 3. Applying the softmax function to this tensor across the third dimension yields a probabilistic tercile forecast for a domain of 8 × 8 grid cells.
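The decoder logic can be made concrete with a short TensorFlow sketch; the construction of the basis matrix is not shown, and the function is an eager-mode illustration rather than the authors' implementation.

```python
import tensorflow as tf

def bf_decode(coef, basis):
    """Map basis-function coefficients to tercile probabilities.

    coef:  (batch, 27) output of the final dense layer, i.e., 9 coefficients
           for each of the 3 terciles.
    basis: (64, 9) matrix with the 9 smooth basis functions evaluated at the
           64 cells of the 8 x 8 output patch (construction not shown).
    """
    coef = tf.reshape(coef, (-1, 3, 9))
    # Linear combination of basis functions: one smooth logit field per tercile.
    logits = tf.einsum("btk,gk->btg", coef, basis)       # (batch, 3, 64)
    logits = logits + tf.math.log(1.0 / 3.0)             # add log climatology
    logits = tf.reshape(logits, (-1, 3, 8, 8))
    # Softmax across the tercile dimension yields the probability forecast.
    return tf.nn.softmax(logits, axis=1)                 # (batch, 3, 8, 8)
```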

The decoder part of the TC-CNN model is slightly simpler. The output of the last dense layer of the encoder is passed through another dense layer with 48 (3 × 16) nodes and ELU activation. This vector is reshaped into a 4 × 4 × 3 tensor. Next, two 3 × 3 transposed convolutional layers with three filters, stride two, ELU activation, and "same" padding are applied, increasing the domain size from 4 × 4 via 8 × 8 to 16 × 16, matching the output domain of 16 × 16 grid cells. Again, applying the softmax function across the third dimension of the output tensor yields a probabilistic tercile forecast for a domain of 16 × 16 grid cells. The optimal size of the output domain (8 vs 16 grid cells) was found by hyperparameter optimization using cross validation, as further described below.
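A hedged Keras sketch of this decoder, with the encoder output represented by a 10-dimensional latent vector as described above, could read as follows.

```python
from tensorflow import keras
from tensorflow.keras import layers

def tc_decoder(latent_dim=10):
    """TC-CNN decoder sketch (section 4d); shapes follow the text."""
    latent = keras.Input(shape=(latent_dim,))
    x = layers.Dense(48, activation="elu")(latent)   # 4 x 4 x 3 = 48 nodes
    x = layers.Reshape((4, 4, 3))(x)
    # Two stride-2 transposed convolutions: 4 x 4 -> 8 x 8 -> 16 x 16.
    x = layers.Conv2DTranspose(3, 3, strides=2, padding="same",
                               activation="elu")(x)
    x = layers.Conv2DTranspose(3, 3, strides=2, padding="same",
                               activation="elu")(x)
    probs = layers.Softmax(axis=-1)(x)               # terciles per grid cell
    return keras.Model(latent, probs)
```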

e. Model training

Our DL methods are implemented in Python using Keras (Chollet et al. 2015) and TensorFlow (Abadi et al. 2016). During model training, we minimize the categorical cross-entropy with the Adam optimizer (Kingma and Ba 2017). The training and hyperparameter optimization is done with 10-fold cross validation on 20 years of training data (2000–19). The 10 folds consist of two consecutive years each, i.e., 2000 and 2001, 2002 and 2003, and so on. Each fold is used once as validation fold, while the remaining nine folds are used for model training. The preprocessing is done separately for training and validation data of each split to prevent data leakage. The final prediction for the test year 2020 is computed by averaging over the predictions of the 10 models obtained by 10-fold cross validation.
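The fold construction and prediction averaging can be summarized in a few lines; fit_model and x_test are hypothetical helpers standing in for the actual training and test pipeline.

```python
import numpy as np

# The 10 folds consist of two consecutive years each: (2000, 2001), (2002, 2003), ...
folds = [(year, year + 1) for year in range(2000, 2020, 2)]

test_predictions = []
for val_years in folds:
    train_years = [y for y in range(2000, 2020) if y not in val_years]
    # Preprocessing (category edges, climatologies, standardization) is
    # fitted on the training years only, to prevent data leakage.
    model = fit_model(train_years, val_years)       # hypothetical helper
    test_predictions.append(model.predict(x_test))  # x_test: 2020 predictors
# The final 2020 forecast averages the 10 cross-validation models.
final_forecast = np.mean(test_predictions, axis=0)
```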

For the patchwise models, the data preparation also includes generating patches, i.e., selecting small square domains from the global input fields of, e.g., eight grid cells in both the longitude and latitude directions. As indicated in Fig. 4, input patches (in red) are chosen slightly larger than output patches (in green) to improve the prediction at the patch edges. We allow for overlapping training patches to increase the number of training samples and use a patch stride of 12, which means that once we have selected a patch, we move 12 grid cells east or south before selecting the next patch. The amount of overlap thus differs between architectures depending on the size of the output patch. Not all patches sampled in this way are useful for training, since no ground truth data are available over the ocean. Therefore, we discard all patches with more than 50% missing values in the output patch. For the prediction, the output patches are defined without overlap (Fig. 4, dashed and solid lines) and the separate predictions for the individual patches are stitched together to obtain a global prediction. As discussed earlier, the predictor patches are larger than the output patches. To make the output patches cover the whole globe, we therefore pad the global input fields before patch selection. The padding, together with the specific patch sizes for the different model architectures, is detailed in appendix B. Note that we also pad the input of the globally trained UNet to avoid large discontinuities at the date line and the poles.

Fig. 4. Visualization of example input patches (red) and the respective output patches (green) for the patchwise models. The solid and dashed squares represent two different patches used for creating a connected spatial prediction for a larger domain.
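A possible implementation of the patch extraction with overlap and the 50% ocean filter is sketched below; sizes follow the TC-CNN configuration and all names are illustrative.

```python
import numpy as np

def extract_patches(fields, labels, in_size=34, out_size=16, stride=12,
                    max_missing=0.5):
    """Cut overlapping training patches from padded global fields.

    fields: (lat, lon, channels) padded predictors; labels: (lat, lon) with
    NaN over the ocean. Sizes match the TC-CNN; names are illustrative.
    """
    margin = (in_size - out_size) // 2   # input patches extend beyond output
    patches = []
    n_lat, n_lon = labels.shape
    for i in range(0, n_lat - in_size + 1, stride):
        for j in range(0, n_lon - in_size + 1, stride):
            out = labels[i + margin:i + margin + out_size,
                         j + margin:j + margin + out_size]
            if np.isnan(out).mean() > max_missing:
                continue  # discard patches that are mostly ocean
            patches.append((fields[i:i + in_size, j:j + in_size], out))
    return patches
```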

Hyperparameters related to model training vary between global and patchwise training and between the different model architectures. The global UNet is trained for at most 50 epochs with a batch size of 16. The patchwise models are trained for a maximum of 20 epochs with a larger batch size of 32 or 64 since more training samples are available: we have roughly 1000 global forecasts, resulting in more than 30 000 valid samples for the patchwise models. We always use early stopping but adapt the patience to the total number of training epochs, resulting in a patience of 10 and 3 epochs for the global and the patchwise models, respectively. Training stops for all models before the maximum number of epochs (50 or 20) is reached. Since the precipitation forecasts are much less accurate and the corresponding ground truth data much noisier than for temperature, the postprocessing models have more difficulty learning from precipitation data. We therefore introduce delayed early stopping and adjust the learning rate for the precipitation models. The training configuration is detailed in appendix B.

The loss function averages over all grid cells of the entire globe (for the global UNet) or over the output patch (for the patchwise models). While it is standard practice in meteorological applications to weight the contribution of each grid cell with the corresponding area fraction of the grid cell (cosine of the latitude), DL architectures operating on image-like data usually do not use any pixel (gridcell) weighting scheme when computing the loss function on the entire image. However, DL models for data-driven weather prediction often adapt model architecture and data processing to account for the spherical nature of atmospheric data (e.g., Weyn et al. 2020). We therefore adapt the standard categorical cross-entropy loss to account for the gridcell weights during training. This modification can be used for both patchwise and global training. For the patchwise training, it leads to differently weighted patches, depending on the patch location. The effect of introducing the gridcell weights on the predictive performance is discussed in detail in section 5f.
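A minimal TensorFlow sketch of such a latitude-weighted categorical cross-entropy is given below; the exact normalization of the weights in our training code may differ.

```python
import numpy as np
import tensorflow as tf

def latitude_weighted_loss(lats):
    """Categorical cross-entropy weighted by gridcell area (cosine of latitude).

    lats: 1D array of gridcell center latitudes in degrees.
    """
    w = np.cos(np.deg2rad(lats))
    w = tf.constant(w / w.mean(), tf.float32)[None, :, None]  # (1, lat, 1)

    def loss(y_true, y_pred):
        # Per-cell cross-entropy has shape (batch, lat, lon); broadcasting
        # multiplies each latitude row by its area weight.
        ce = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
        return tf.reduce_mean(ce * w)
    return loss
```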

5. Results

a. Forecast evaluation setup

Following the setup of the S2S AI Challenge, we evaluate the proposed postprocessing methods based on the ranked probability skill score (RPSS) for the test set (the year 2020), using the climatological forecast as a reference. The RPSS is based on a strictly proper scoring rule and is introduced in detail in appendix A. It is positively oriented, i.e., larger values indicate better forecasts. Following the challenge configuration, only land grid cells are considered for the computation of the spatially aggregated RPSS since no observations are available for sea grid cells. Only grid cells north of 60°S are included in the aggregated RPSS. For precipitation, very dry regions (lower tercile edge smaller than 0.01 mm) are not taken into account. For each grid cell, the average ranked probability score (RPS; see appendix A for details) of the model predictions and of the climatological reference forecast (described in section 3) is computed by aggregating over the 53 Thursday forecasts in 2020, which form the test dataset of the S2S AI Challenge. From these gridcell-wise RPS values, we compute a gridcell-wise RPSS, which is then averaged over the target domain (weighted by the area fraction of each grid cell).
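For reference, the RPS for tercile forecasts and the resulting skill score can be computed as follows; shapes and variable names are illustrative.

```python
import numpy as np

def rps(probs, obs_onehot):
    """Ranked probability score for tercile forecasts.

    probs, obs_onehot: arrays of shape (..., 3) with predicted category
    probabilities and one-hot encoded observed categories.
    """
    cum_f = np.cumsum(probs, axis=-1)       # forecast CDF over categories
    cum_o = np.cumsum(obs_onehot, axis=-1)  # observation CDF (step function)
    return np.sum((cum_f - cum_o) ** 2, axis=-1)

# Gridcell-wise skill against climatology, averaged over the 53 test dates
# (rps_model and rps_clim would have shape (53, lat, lon)):
# rpss = 1.0 - rps_model.mean(axis=0) / rps_clim.mean(axis=0)
```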

b. General results

Figure 5 summarizes the globally averaged RPSS of the DL methods and the two ECMWF benchmarks, consisting of the benchmark provided by the challenge organizers and our corrected ECMWF baseline. All proposed postprocessing methods show positive skill when compared to climatological forecasts, and generally outperform the ECMWF baseline forecasts as indicated by the higher RPSS. The largest improvements over the ECMWF baselines are achieved for the precipitation forecasts, where all postprocessing methods are able to turn the highly nonskillful forecasts into skillful forecasts (RPSS > 0). For temperature, the relative improvement is smaller, in particular for the shorter lead time of 14 days (i.e., weeks 3–4). Further, the positive effects of the correction of the ECMWF baseline forecasts discussed in section 3 are clearly indicated by the substantial improvement in terms of RPSS.

Fig. 5. Globally aggregated RPSS for the test year 2020. RPSS values larger than zero indicate that forecasts are better than climatology. The RPSS of the ECMWF temperature forecasts for a lead time of 28 days is positive, but orders of magnitude too small to be visible in the figure.

The UNet architectures achieve the best scores for all four test cases (temperature and precipitation for weeks 3–4 and weeks 5–6). We attribute this to the fact that the UNet architecture is very well suited to the task of splitting the global forecasts into different regions (above, near, and below normal), since it was specifically developed for image segmentation. This allows us to train a well-performing model with as little as roughly 1000 global forecasts. Overall, the global UNet yields the best results with a global RPSS of 3.1% averaged over all four test cases. For temperature for weeks 3–4, the global UNet is outperformed by the patchwise UNet; for weeks 5–6, both UNet variants achieve very similar results. While the global UNet receives global information and therefore might be able to exploit teleconnection information, the patchwise UNet likely benefits from only seeing the local surroundings and therefore does not have to deal with less informative (i.e., noisier) information from farther away. The BF-CNN and TC-CNN models show fairly similar predictive performance, with slight advantages for the more complex, but also more flexible, TC-CNN approach.

All figures and numbers show the RPSS of the average prediction obtained from 10 model runs based on the cross-validation procedure applied for training. The skill of this mean prediction is always higher than the expected skill of the individual model runs. Model averaging yields the largest benefit for the global UNet with a relative improvement of almost 30% compared to the average skill of the individual model runs. The skill of the other approaches improves by 10%–17%.

Figure 6 shows the postprocessed predictions for an example forecast issued in January 2020 for temperature for weeks 3–4. Additional example forecasts for precipitation are available in appendix C, and forecasts for weeks 5–6 are provided in the online supplemental material. While all methods generally agree on the spatial structure of the temperature distribution, there are visible differences between the methods due to the different training modes and architectures. All patchwise models suffer from artifacts at the patch edges. While these artifacts are hardly visible for the patchwise UNet, they are much more pronounced for the two standard CNN methods. For the patchwise UNet, these artifacts could potentially be reduced by choosing a larger margin between input and output domain. For the BF-CNN, these artifacts are likely due to the limited ability of the decoder to recover the full spatial variability within the patch. For the TC-CNN, there are also visible artifacts inside the patches, which suggests that they stem from the upscaling with transposed convolutions. It is interesting to note that the forecasts also look reasonable over the ocean even though no ground truth data were available there.

Fig. 6. Example predictions for temperature with a lead time of 14 days issued on 2 Jan 2020. In the top row, the corresponding observations are displayed, followed by the predictions of the two benchmark methods and the four postprocessed forecasts.

The postprocessed forecasts are generally much smoother and less sharp than the ECMWF benchmarks. Further, the forecasts for weeks 5–6 are less sharp than the week 3–4 forecasts. Since ensemble forecasts from numerical weather prediction models have been demonstrated to often be overconfident (e.g., Vannitsem et al. 2021), it is not surprising that postprocessing reduces the forecast sharpness, as discussed in more detail in section 5d. The reduced sharpness of the postprocessed forecasts is also in line with the findings of Vigaud et al. (2017), who postprocessed subseasonal precipitation forecasts. Even though the reduction in forecast sharpness leads to well-calibrated forecasts in our study, it could also be a side effect of the convolutional layers or the gridcell-wise loss function we employed, as discussed in Lagerquist and Ebert-Uphoff (2022). We also find that the probabilities for the near-normal category are very close to the climatological probability of 1/3, in particular for the postprocessed forecasts. The scattered/intermittent nature of the near-normal tercile and the small differences between the boundaries of the upper and lower terciles likely favored conservative forecasts for the middle tercile. This issue is discussed in detail in section 5g.

c. Spatial verification

To evaluate spatial differences in model performance, we define three regions: the northern extratropics (NH; 90°–30°N), the tropics (30°N–30°S), and the southern extratropics (SH; 30°–60°S). We also present global gridcell-wise RPSS maps in Figs. 7 and 8 to show the fine spatial details.

Fig. 7. Gridcell-wise RPSS for temperature for 2020. RPSS values larger than zero indicate that forecasts are better than climatology.

Fig. 8. Gridcell-wise RPSS for precipitation for 2020. RPSS values larger than zero indicate that forecasts are better than climatology. Very dry grid cells are omitted from the evaluation (e.g., parts of the Sahara).

For temperature, the ECMWF baselines already achieve good skill over the tropics (for both lead times) and the Northern Hemisphere (for weeks 3–4), as shown in Table 2. Our UNet-based postprocessing methods further improve these forecasts. The largest skill is reached over Russia, parts of equatorial Africa and America, and Southeast Asia (see Fig. 7). The largest negative skill can be observed in the tropics, but southern Africa and Australia also contain poorly forecast regions. The skill of the two standard CNN architectures (BF-CNN and TC-CNN) is partly deteriorated by patch-edge artifacts.

Table 2. RPSS averaged over the northern extratropics (NH; 90°–30°N), the tropics (30°N–30°S), and the southern extratropics (SH; 30°–60°S). The global domain consists of all three regions mentioned before, and corresponds to the values shown in Fig. 5. The bold font highlights the best performing model for each task.

For precipitation, the picture is less clear since the regions with positive skill are more scattered than for temperature (see Fig. 8), even though postprocessing yields smoother skill maps. The high skill over Antarctica is striking; there, the postprocessing models presumably were able to correct the bias in the ECMWF ensemble mean.

Certainly, one year of weekly forecasts is not enough to make conclusive and robust statements about the regions in which the ECMWF forecasts and the postprocessed predictions are skillful. Nevertheless, these maps provide a starting point to better understand how our postprocessing methods correct subseasonal forecasts. Overall, postprocessing improves the forecasts in regions where the ECMWF baseline has negative skill. The improvement either comes from reducing the amplitude of the negative skill in these regions or from achieving positive skill with respect to climatology for regions that formerly showed negative skill. For regions where the ECMWF baseline is already skillful, the reduction in sharpness is detrimental and leads to a slight reduction in skill. To facilitate the comparison to the ECMWF baseline, RPSS maps with respect to the corrected ECMWF baseline instead of climatology are provided in the supplemental material.

The overall benefit of postprocessing is further clearly demonstrated by the increase in the area with positive forecast skill compared to the benchmark methods. For temperature, the percentage of grid cells with positive skill for weeks 3–4 increases from roughly 68% for the ECMWF benchmark methods to up to 78% for the patchwise UNet architecture. The increase for weeks 5–6 is slightly larger, with postprocessing increasing the share of grid cells with positive skill from 55% to up to 66%. For precipitation, postprocessing increases the share of grid cells with positive forecast skill from 51% to roughly 70% for weeks 3–4, and from 36%–40% to around 60% for weeks 5–6. Note that, analogous to the computation of the spatially aggregated RPSS, only grid cells north of 60°S were used for this comparison.

d. Forecast calibration

The overall aim of probabilistic forecasting is to maximize the sharpness of the predictive distribution, subject to calibration (Gneiting et al. 2007). The example predictions in Fig. 6 clearly indicate that the postprocessed forecasts are less sharp than the ECMWF baseline, calling for an assessment of calibration. While the calibration of binary (dichotomous) categorical forecasts can readily be checked with reliability diagrams (Murphy and Winkler 1977; Wilks 2020), evaluating the calibration of a probabilistic three-category forecast is less straightforward. Wilks (2013) proposes a so-called calibration simplex, an extension of the well-known reliability diagram to the three-category case. Based on the R package “CalSim” (Resin 2021), we created calibration simplices for the corrected ECMWF baseline and all postprocessing methods. Since the simplices for the different postprocessing approaches look very similar, we only show results for one of the best performing models in terms of RPSS, the global UNet, in Fig. 9.

Fig. 9. Calibration simplices for temperature and precipitation of the corrected ECMWF baseline and the global UNet. The simplices are based on all forecasts for 2020, but only consider land grid cells north of 60°S, i.e., omit Antarctica. For each of the three terciles, forecast probabilities are binned in 10 bins (from 0/9 to 9/9) and the size of the dots in the hexagons represents the fraction of probability vectors falling in that "three-dimensional" bin. The position of the dots shows the observed conditional event frequency, i.e., the conditional occurrence of each tercile. The displacement of the dot with respect to the center of each hexagon (indicated by the red lines) shows the miscalibration error. A displacement to the edge of the respective hexagon corresponds to a probability mismatch of 0.6.

Analogous to the reliability diagram, the calibration simplex depicts the predicted probabilities and the conditional frequency of the observed category. For each of the three terciles, forecast probabilities are binned in 10 bins (from 0/9 to 9/9) and the size of the dots in the hexagons represents the fraction of probability vectors falling in that “three-dimensional” bin. The position of the dots shows the observed conditional event frequency, i.e., the conditional occurrence of each tercile. The displacement of the dot with respect to the center of each hexagon (indicated by the red lines) shows the miscalibration error. The closer the dot is located to the center of the respective hexagons, the better calibrated the forecast. Further details on the calibration simplex can be found in Wilks (2013) where section 2 contains an extensive explanation of the calibration simplex and its relation to the reliability diagram.

For both target variables, the corrected ECMWF baseline produces overly sharp forecasts, indicated by the substantial displacement of the dots toward the center of the calibration simplex. (Note that a displacement to the edge of the respective hexagon corresponds to a probability mismatch of 0.6.) In fact, the calibration simplex for the ECMWF baseline in Fig. 9 is a perfect example of an overconfident forecast. Additional examples of over- and underconfident as well as biased forecasts are available in Wilks (2013, section 3). By contrast, the postprocessing methods yield well-calibrated forecasts with appropriate sharpness given the limited predictability. Since the input to the postprocessing models consists of ensemble mean quantities only, this indicates that the DL models are able to correctly quantify forecast uncertainty from deterministic predictor information alone. This finding is in line with results from the postprocessing literature (e.g., Rasp and Lerch 2018), as well as with Sacco et al. (2022), who investigate how well neural networks can estimate forecast uncertainty related to model error and initial conditions and conclude that they can do so reliably. On the other hand, Lagerquist and Ebert-Uphoff (2022) emphasize that gridcell-wise loss functions incentivize smooth predictions, and hence more elaborate architectures with a spatial loss function might provide sharper forecasts that are also calibrated.

The point clouds for the temperature forecasts from the ECMWF baseline and the global UNet approach are both displaced toward the upper left corner of the calibration simplex in Fig. 9, i.e., toward higher-than-normal probabilities for the above-normal tercile. This either indicates that the year 2020 was exceptionally warm or is simply a signal of climate change, as was also observed by Wilks (2013) in his case study. All temperature forecasts correctly predict the above-normal category more often than climatologically expected, since there is no miscalibration error (i.e., red lines) visible that would match this displacement. The probabilities for the near-normal category are particularly close to the climatological value of 1/3, which can be seen from the vertical alignment of the dots belonging to the 3/9 bin of the near-normal tercile.

e. Comparison to S2S AI challenge submissions

Overall, the postprocessing methods presented here would (retrospectively) place in the top three of the S2S AI Challenge submissions [Table 2 and Vitart et al. (2022)]. For most combinations of target variable and forecast horizon, our models show solid performance and would place second or third. That said, the gap to the winning team is in part fairly large. This is not surprising, since the winning team proposed a complex mixture model based on locally corrected forecasts from ECMWF and other forecasting centers, a CNN-based correction of the ECMWF forecasts, and climatology. The multimodel approach presumably leads to large skill improvements, independent of the applied postprocessing methods (Doblas-Reyes et al. 2005; Vigaud et al. 2017; Li et al. 2021).

A key distinguishing feature of our models, in particular of the global UNet approach, is that they do not rely on information from single grid cells only, but rather utilize spatial context information to produce probabilistic predictions. On the one hand, this leads to lower performance for some regions with, for example, very specific temperature biases and a potentially lower number of grid cells, such as the southern extratropics or the tropics. On the other hand, the use of spatial information might be the reason why our models perform very well for precipitation and even achieve performance similar to the winning team's predictions. In the tropics, our UNet models outperform all challenge predictions for precipitation.

f. Effects of using a weighted loss function for model training

Overall, the absolute differences between the models trained with and without the weighted loss are small. This implies that weighting grid cells during training according to their area does not yield fundamentally different postprocessing models. We compare the forecast skill of the models trained with the weighted loss to those trained without it by computing the RPSS of the weighted-loss models using the unweighted models as reference:
$$\mathrm{RPSS}_{\mathrm{weighted}} = \frac{\mathrm{RPS}_{\mathrm{unweighted}} - \mathrm{RPS}_{\mathrm{weighted}}}{\mathrm{RPS}_{\mathrm{unweighted}} - \mathrm{RPS}_{\mathrm{opt}}},$$

where $\mathrm{RPS}_{\mathrm{opt}} = 0$.

The corresponding RPSS values are shown in Table 3. The weighting improves the forecast skill the most in the tropics, since errors in this area influence the overall loss more strongly under the weighting. When using the weighted loss for the patchwise models, patches from higher latitudes influence model training less than patches from equatorial regions. The weighting improves the predictions of the global UNet model the most, also having a positive effect on extratropical scores for temperature. For the patchwise models, the signal is less clear and changes in skill are mostly small. Hence, the weighted loss seems to influence the patchwise training less, likely because the individual patches are too small for the weighting to have any real effect and the weighting of the patches as a whole happens only within training batches. We conclude that for the patchwise models, adapting the loss function has no clear effect, but globally trained models benefit if the error at each grid cell is weighted by the respective gridcell area. A weighted loss function for a globally trained model could also be used to fine-tune the model to specific regions, since the weights can be changed during model training.

Table 3. RPSS of the weighted models with respect to the unweighted models. The bold font highlights cases where the weighting yields better models.

g. Critical perspective on tercile approach

The example forecasts (Fig. 6) clearly show that the models issue the most prudent forecasts for the middle tercile, with probabilities close to climatology. The same behavior can be observed for the precipitation forecasts (shown in appendix C) and agrees with the general understanding that skill for the middle tercile is limited (van den Dool and Toth 1991; Becker and van den Dool 2016). Wilks (2013) observed the same behavior for temperature forecasts over the United States. Indeed, predicting the middle category correctly is challenging, since the lower and upper tercile edges confining the middle category can be very close together, as shown in Fig. 10. For example, in the tropics, the temperature tercile edges differ by less than 1 K. Correctly predicting such small temperature fluctuations 3–6 weeks ahead seems very challenging in light of the generally low signal-to-noise ratio, and likely provides hardly any added value to the forecast user. For precipitation, forecasts in desert areas with category edge differences below 0.01 mm of precipitation in two weeks are presumably also difficult and of low relevance to the forecast user.

Fig. 10. Difference between the upper tercile edge (2/3 quantile) and the lower tercile edge (1/3 quantile) for four selected weeks of the year representing the four seasons. Shown are the tercile edges for mean temperature and accumulated precipitation for weeks 3–4.

It seems that framing the forecasting task as a tercile classification problem might not simplify it compared to predicting the actual values of the target variable, but rather adds another level of complexity. However, forecast quality appears to be independent of the distance between tercile edges, since the correlation between the RPS of the forecasts and the distance between the tercile edges is close to zero. Nevertheless, the benefit of tercile forecasts for variables with a very narrow climatological distribution remains questionable.

White et al. (2017) point out that users of probabilistic tercile forecasts also have to know the underlying climatology, i.e., the specific values of the tercile edges, to make use of the forecasts, and that such forecasts do not provide actionable information on the timing, location, and scale of weather events. Further, White et al. (2017) and Brunet et al. (2010) argue that S2S predictions should realistically represent the weather and its statistics. The ECMWF baseline forecasts represent the day-to-day weather conditions well, as seen in the finely detailed structure of the example forecast in Fig. 6, but are not calibrated according to Fig. 9 and the discussion in section 5d. The postprocessed predictions, on the other hand, are very smooth and certainly do not represent day-to-day weather, but they are well calibrated. Given the limited predictability on S2S time scales, probabilistic tercile forecasts can presumably not accomplish both goals: accurately representing the weather statistics while at the same time providing calibrated probabilities.

6. Conclusions

We propose CNN-based postprocessing methods to generate spatially homogeneous probabilistic predictions for precipitation and temperature on subseasonal time scales. Two of the four presented postprocessing methods extend earlier work by Scheuerer et al. (2020) to a global scale and build on the idea of combining CNN models with basis-function approaches from geostatistics. The remaining two models adapt the well-known UNet architecture (Ronneberger et al. 2015) to the S2S postprocessing application. The postprocessing models are either trained on collections of small quadratic domains of the global input fields, so-called patches, or directly on the global predictor fields. Spatial fields of ECMWF ensemble mean forecasts of selected meteorological variables serve as input variables. (Besides the target variable, we used geopotential height at 500 and 850 hPa, and mean sea level pressure for temperature, and additionally total column water for precipitation.) The postprocessing models yield global probabilistic tercile forecasts for biweekly aggregates of temperature and precipitation on a 1.5° grid as output. The setting of our study is based on the S2S AI Challenge (Vitart et al. 2022), in which our postprocessing models (retrospectively) rank among the top three submissions.

All our postprocessing models consistently achieve better skill than the bias-corrected ECMWF baseline forecast and outperform the climatological forecasts for both variables and forecast horizons. The UNet architecture clearly outperforms the previously proposed standard CNN architecture in most test cases. The global UNet model performs best in three out of four test cases and achieves an RPSS of 0.064 for temperature for weeks 3–4 (and of 0.026 for weeks 5–6), and RPSS values of 0.025 and 0.01 for precipitation, respectively. All models including the ECMWF baseline achieve the highest skill in the tropics (30°N–30°S). The postprocessed forecasts are substantially less sharp and much smoother than the ECMWF baseline. However, in contrast to the ECMWF forecasts, they are well calibrated and their reduced sharpness reflects the limited predictability on S2S time scales. The improvements in calibration further demonstrate that our postprocessing models enable a reliable quantification of forecast uncertainty based on deterministic input information in the form of the ECMWF ensemble mean forecast fields only.

The main methodological innovation of our approach is the use of spatial input fields. Therefore, our CNN-based methods are able to retain spatial relationships in the forecasts and have the potential to learn spatial error structures, which may explain our particularly good results for precipitation. With the spatially weighted loss function we employ, the models can in principle be fine-tuned to every target region. The global UNet is the only one of the four approaches that has the potential to learn from nonlocal information, and in particular from teleconnections, since the patchwise training impedes the exploitation of cross-patch information. Nevertheless, the global UNet variant was not able to outperform the patchwise version for temperature in our case study. The patchwise training allows us to train the models longer, leading to lower variability across model instances from different cross-validation folds. Therefore, we assume that the patchwise training is actually beneficial for postprocessing forecasts with higher signal-to-noise ratio. However, if the signal-to-noise ratio is low, additional training samples do not provide further details on the forecast error and therefore do not improve the postprocessing. Further work is needed to investigate the trade-off between sample size and field of view.

Of the four postprocessing approaches we proposed in this study, the globally trained UNet is superior to the others in two ways. First, it is computationally the least expensive: training the patchwise models takes more than seven times longer, since the large number of patches vastly increases the number of training samples. Second, the UNet is a standard architecture that can be readily implemented; in particular, the global UNet can easily be adapted to other datasets, since no sophisticated data loader is needed. These factors make the global UNet the preferable method for all target variables and lead times.

The proposed postprocessing models provide several avenues for future work. An important pathway toward further improvements is the incorporation of additional predictor information. On the one hand, additional spatial fields of meteorological variables such as soil moisture and sea surface temperature (SST) are known to carry predictability on S2S time scales (White et al. 2017; Merryfield et al. 2020; Robertson et al. 2020). However, including those variables did not improve the predictive performance and came with additional technical difficulties: for example, the SST inputs are only available over the ocean (and soil moisture only over land), which would require the CNN models to learn different filters for land and ocean grid cells. On the other hand, a promising option to generalize and improve our postprocessing models is the incorporation of information from slowly varying components of the climate system (e.g., the Madden–Julian oscillation or the state of the stratosphere), which might enable skillful forecasts on longer time scales via teleconnection pathways (Lang et al. 2020). First tests indicated that combining SST inputs with teleconnection information could lead to further improvements of the postprocessing models; however, adaptations of the model architectures will be required to appropriately combine the different types of input information.

Despite the clear improvements over the physical and climatological reference forecasts, the skill of the postprocessed forecasts remains limited, with RPSS values on the order of a few percent. To allow for a better assessment of the utility of postprocessing on S2S time scales, it would be interesting to further investigate the effects of postprocessing on downstream applications of weather forecasts such as fully integrated renewable energy forecasting systems (e.g., Haupt et al. 2020). For example, in the context of medium-range wind power prediction, it has been demonstrated that postprocessing wind speed forecasts can even be detrimental to the quality of the resulting wind power forecasts, calling for a targeted development of application-specific postprocessing methods (Phipps et al. 2022).

4 Note that the forecasts we originally submitted to the S2S AI Challenge, which placed fourth, came from a much simpler variant of the BF-CNN approach described here.

Acknowledgments.

The research leading to these results has been done within the Young Investigator Group “Artificial Intelligence for Probabilistic Weather Forecasting” funded by the Vector Stiftung. We further acknowledge funding by the Federal Ministry of Education and Research (BMBF) and the Baden-Württemberg Ministry of Science, Research and Arts as part of the Excellence Strategy of the German federal and state governments. We thank Frederic Vitart for providing Fortran code for the ECMWF baseline, and Johannes Resin for advice on the calibration simplices. We further thank Peter Knippertz, Christian Grams, Julian Quinting, Jieyu Chen, Michael Scheuerer, and all team members contributing to the original submission to the S2S AI Challenge for helpful discussions, and three anonymous reviewers for their insightful and constructive comments.

Data availability statement.

The data are available through the S2S AI Challenge repository (https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/tree/master/data) and a dedicated CliMetLab plugin that provides forecast data for additional variables and from other forecasting centers (https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge). Python code with implementations of all methods is provided via the repository at https://github.com/HoratN/pp-s2s.

APPENDIX A

Evaluation Metrics

We evaluate our predictions with the ranked probability skill score (RPSS; Epstein 1969; Murphy 1971). This skill score is based on the ranked probability score (RPS), which is a strictly proper scoring rule for verifying probabilistic multicategory forecasts. It computes the squared difference between the forecast cumulative distribution function $F_m$ and the observation vector $O_m$:
$$\mathrm{RPS} = \sum_{m=1}^{M} (F_m - O_m)^2.$$
Here, $M$ denotes the number of categories and is in our case equal to three (below normal, near normal, above normal). A perfect forecast achieves an RPS of 0. For details, see Wilks (2020).
The RPSS compares the RPS of the DL prediction $f$ with the score obtained by a reference forecast, which is in our case a climatological forecast that assigns equal probabilities of 1/3 to all three outcomes. The RPSS is computed as
$$\mathrm{RPSS}_f = \frac{\mathrm{RPS}_{\mathrm{ref}} - \mathrm{RPS}_f}{\mathrm{RPS}_{\mathrm{ref}} - \mathrm{RPS}_{\mathrm{opt}}},$$
where $\mathrm{RPS}_{\mathrm{opt}} = 0$. The RPSS is positively oriented: values larger than zero correspond to forecasts that are more skillful than the reference, and larger values indicate better forecasts.
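For concreteness, the following minimal NumPy sketch evaluates the RPS and RPSS for a single three-category forecast; the probability values are purely illustrative.

```python
import numpy as np

def rps(probs, obs_category):
    """RPS for one multicategory forecast: probs holds the category
    probabilities, obs_category the index of the observed category."""
    f_cum = np.cumsum(probs)                             # forecast CDF F_m
    o_cum = np.cumsum(np.eye(len(probs))[obs_category])  # observation vector O_m
    return np.sum((f_cum - o_cum) ** 2)

forecast = np.array([0.2, 0.3, 0.5])  # illustrative tercile probabilities
reference = np.full(3, 1 / 3)         # climatological reference forecast
observed = 2                          # the upper tercile was observed

rps_f, rps_ref = rps(forecast, observed), rps(reference, observed)
rpss = (rps_ref - rps_f) / rps_ref    # RPS_opt = 0
print(rps_f, rps_ref, rpss)           # approximately 0.29, 0.56, 0.48
```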

APPENDIX B

Further Details on Model Training and Hyperparameters

Model hyperparameters were tuned using cross validation, resulting in different combinations depending on model architecture, target variable, and training mode (patchwise or global). Hyperparameters related to the generation of the training batches are shown in Table B1; parameters related more closely to model training are detailed in Table B2. These hyperparameter choices are discussed in the following.

Table B1. Hyperparameters related to the generation of training batches.

Table B2. Hyperparameters related to model training.

a. Padding

For the global training, we pad the input fields with eight grid cells on every side to avoid large discontinuities at the date line and the poles. For the patchwise training, we need the padding of the input fields to ensure that the smaller output patches (i.e., the predictions) cover the whole globe. We pad the global fields on all sides before the patch selection according to the difference between input and output patch size as well as the output patch size itself:
$$\mathrm{pad} = (\mathrm{ps}_{\mathrm{input}} - \mathrm{ps}_{\mathrm{output}})/2 + \mathrm{ps}_{\mathrm{output}},$$
where ps denotes the patch size, i.e., the number of grid cells per side in the latitude and longitude directions. The corresponding hyperparameters for global and patchwise training are detailed in Table B1.
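The following sketch illustrates this padding for a global field with NumPy. The patch sizes and the padding modes (edge replication toward the poles, periodic wrapping in longitude) are assumptions for illustration; the actual configurations are those in Table B1.

```python
import numpy as np

def pad_for_patches(field, ps_input, ps_output):
    """Pad a global (lat, lon) field before patch extraction so that the
    smaller output patches still tile the whole globe."""
    pad = (ps_input - ps_output) // 2 + ps_output
    field = np.pad(field, ((pad, pad), (0, 0)), mode="edge")  # latitude
    field = np.pad(field, ((0, 0), (pad, pad)), mode="wrap")  # longitude
    return field

# Illustrative patch sizes on the 1.5-degree grid (121 x 240 grid cells):
padded = pad_for_patches(np.zeros((121, 240)), ps_input=32, ps_output=16)
print(padded.shape)  # (169, 288), i.e., pad = 24 grid cells on every side
```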

b. Early stopping and learning rate decay

Using standard early stopping did not work for precipitation, since some validation years were particularly hard to predict and training would stop already after the first epoch. We nevertheless assumed that a model trained for more than one epoch would perform better than a barely trained model, which is why we introduced delayed early stopping: the stopping criterion is checked for the first time at the end of the second epoch for the patchwise models and at the end of the fifth epoch for the global model (with the epoch count starting at 0). Further, the learning rate was adapted during training to prevent the models from getting stuck in local optima. For precipitation, we introduced learning rate decay, since this seemed to improve the generalization ability of the models. The final configurations are summarized in Table B2.
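As an illustration, delayed early stopping and learning rate decay can be realized with standard Keras callbacks, as sketched below. The patience, delay, and decay factor are illustrative values rather than the tuned settings from Table B2; recent Keras versions expose the delay directly through the start_from_epoch argument of EarlyStopping.

```python
import tensorflow as tf

# Delayed early stopping: monitor the validation loss as usual, but only
# evaluate the stopping criterion from a given epoch onward.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,            # illustrative value
    start_from_epoch=5,    # delay for the global model (epoch count from 0)
    restore_best_weights=True,
)

# Exponential learning rate decay, as used for the precipitation models.
def lr_schedule(epoch, lr):
    return lr * 0.9        # illustrative decay factor

lr_decay = tf.keras.callbacks.LearningRateScheduler(lr_schedule)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           callbacks=[early_stopping, lr_decay])
```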

APPENDIX C

Additional Figure

Figure C1 displays the postprocessed predictions for an example forecast issued in January 2020 for precipitation for weeks 3–4, complementing Fig. 6 in section 5.

Fig. C1. Example predictions for precipitation with a lead time of 14 days issued on 2 Jan 2020.

REFERENCES

• Abadi, M., and Coauthors, 2016: TensorFlow: A system for large-scale machine learning. Proc. USENIX 12th Symp. on Operating Systems Design and Implementation, Savannah, GA, USENIX Association, 265–283, https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.

• Ayzel, G., T. Scheffer, and M. Heistermann, 2020: RainNet v1.0: A convolutional neural network for radar-based precipitation nowcasting. Geosci. Model Dev., 13, 2631–2644, https://doi.org/10.5194/gmd-13-2631-2020.

• Becker, E., and H. van den Dool, 2016: Probabilistic seasonal forecasts in the North American Multimodel Ensemble: A baseline skill assessment. J. Climate, 29, 3015–3026, https://doi.org/10.1175/JCLI-D-14-00862.1.

• Ben-Bouallegue, Z., J. A. Weyn, M. C. A. Clare, J. Dramsch, P. Dueben, and M. Chantry, 2023: Improving medium-range ensemble weather forecasts with hierarchical ensemble transformers. arXiv, 2303.17195v3, https://arxiv.org/abs/2303.17195.

• Brunet, G., and Coauthors, 2010: Collaboration of the weather and climate communities to advance subseasonal-to-seasonal prediction. Bull. Amer. Meteor. Soc., 91, 1397–1406, https://doi.org/10.1175/2010BAMS3013.1.

• Chapman, W. E., L. D. Monache, S. Alessandrini, A. C. Subramanian, F. M. Ralph, S.-P. Xie, S. Lerch, and N. Hayatbini, 2022: Probabilistic predictions from deterministic atmospheric river forecasts with deep learning. Mon. Wea. Rev., 150, 215–234, https://doi.org/10.1175/MWR-D-21-0106.1.

• Chollet, F., and Coauthors, 2015: Keras: The Python deep learning library version 3.0. Keras, accessed 19 July 2023, https://keras.io.

• Dai, Y., and S. Hemri, 2021: Spatially coherent postprocessing of cloud cover ensemble forecasts. Mon. Wea. Rev., 149, 3923–3937, https://doi.org/10.1175/MWR-D-21-0046.1.

• Demaeyer, J., and Coauthors, 2023: The EUPPBench postprocessing benchmark dataset v1.0. Earth Syst. Sci. Data, 15, 2635–2653, https://doi.org/10.5194/essd-15-2635-2023.

• Doblas-Reyes, F. J., R. Hagedorn, and T. N. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting—II. Calibration and combination. Tellus, 57A, 234–252, https://doi.org/10.3402/tellusa.v57i3.14658.

• Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. J. Appl. Meteor., 8, 985–987, https://doi.org/10.1175/1520-0450(1969)008<0985:ASSFPF>2.0.CO;2.

• Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. J. Roy. Stat. Soc., 69B, 243–268, https://doi.org/10.1111/j.1467-9868.2007.00587.x.

• Grönquist, P., C. Yao, T. Ben-Nun, N. Dryden, P. Dueben, S. Li, and T. Hoefler, 2021: Deep learning for post-processing ensemble weather forecasts. Philos. Trans. Roy. Soc., A379, 20200092, https://doi.org/10.1098/rsta.2020.0092.

• Haupt, S. E., and Coauthors, 2020: Combining artificial intelligence with physics-based methods for probabilistic renewable energy forecasting. Energies, 13, 1979, https://doi.org/10.3390/en13081979.

• Haupt, S. E., W. Chapman, S. V. Adams, C. Kirkwood, J. S. Hosking, N. H. Robinson, S. Lerch, and A. C. Subramanian, 2021: Towards implementing artificial intelligence post-processing in weather and climate: Proposed actions from the Oxford 2019 workshop. Philos. Trans. Roy. Soc., A379, 20200091, https://doi.org/10.1098/rsta.2020.0091.

• Höhlein, K., B. Schulz, R. Westermann, and S. Lerch, 2024: Postprocessing of ensemble weather forecasts using permutation-invariant neural networks. Artif. Intell. Earth Syst., 3, e230070, https://doi.org/10.1175/AIES-D-23-0070.1.

• Hu, W., M. Ghazvinian, W. E. Chapman, A. Sengupta, F. M. Ralph, and L. Delle Monache, 2023: Deep learning forecast uncertainty for precipitation over western United States. Mon. Wea. Rev., 151, 1367–1385, https://doi.org/10.1175/MWR-D-22-0268.1.

• Kingma, D. P., and J. Ba, 2017: Adam: A method for stochastic optimization. arXiv, 1412.6980v9, https://arxiv.org/abs/1412.6980.

• Lagerquist, R., and I. Ebert-Uphoff, 2022: Can we integrate spatial verification methods into neural network loss functions for atmospheric science? Artif. Intell. Earth Syst., 1, e220021, https://doi.org/10.1175/AIES-D-22-0021.1.

• Lagerquist, R., J. Q. Stewart, I. Ebert-Uphoff, and C. Kumler, 2021: Using deep learning to nowcast the spatial coverage of convection from Himawari-8 satellite data. Mon. Wea. Rev., 149, 3897–3921, https://doi.org/10.1175/MWR-D-21-0096.1.

• Lang, A. L., K. Pegion, and E. A. Barnes, 2020: Introduction to special collection: “Bridging weather and climate: Subseasonal-to-seasonal (S2S) prediction.” J. Geophys. Res. Atmos., 125, e2019JD031833, https://doi.org/10.1029/2019JD031833.

• Lerch, S., and K. L. Polsterer, 2022: Convolutional autoencoders for spatially-informed ensemble post-processing. arXiv, 2204.05102v1, https://arxiv.org/abs/2204.05102.

• Li, W., B. Pan, J. Xia, and Q. Duan, 2022: Convolutional neural network-based statistical post-processing of ensemble precipitation forecasts. J. Hydrol., 605, 127301, https://doi.org/10.1016/j.jhydrol.2021.127301.

• Li, Y., D. Tian, and H. Medina, 2021: Multimodel subseasonal precipitation forecasts over the contiguous United States: Skill assessment and statistical postprocessing. J. Hydrometeor., 22, 2581–2600, https://doi.org/10.1175/JHM-D-21-0029.1.

• Litjens, G., and Coauthors, 2017: A survey on deep learning in medical image analysis. Med. Image Anal., 42, 60–88, https://doi.org/10.1016/j.media.2017.07.005.

• Mariotti, A., and Coauthors, 2020: Windows of opportunity for skillful forecasts subseasonal to seasonal and beyond. Bull. Amer. Meteor. Soc., 101, E608–E625, https://doi.org/10.1175/BAMS-D-18-0326.1.

• Mayer, K. J., and E. A. Barnes, 2021: Subseasonal forecasts of opportunity identified by an explainable neural network. Geophys. Res. Lett., 48, e2020GL092092, https://doi.org/10.1029/2020GL092092.

• Merryfield, W. J., and Coauthors, 2020: Current and emerging developments in subseasonal to decadal prediction. Bull. Amer. Meteor. Soc., 101, E869–E896, https://doi.org/10.1175/BAMS-D-19-0037.1.

• Messner, J. W., G. J. Mayr, and A. Zeileis, 2017: Nonhomogeneous boosting for predictor selection in ensemble postprocessing. Mon. Wea. Rev., 145, 137–147, https://doi.org/10.1175/MWR-D-16-0088.1.

• Mouatadid, S., P. Orenstein, G. Flaspohler, J. Cohen, M. Oprescu, E. Fraenkel, and L. Mackey, 2023a: Adaptive bias correction for improved subseasonal forecasting. Nat. Commun., 14, 3482, https://doi.org/10.1038/s41467-023-38874-y.

• Mouatadid, S., and Coauthors, 2023b: SubseasonalClimateUSA: A dataset for subseasonal forecasting and benchmarking. 37th Conf. on Neural Information Processing Systems Datasets and Benchmarks Track, New Orleans, LA, NeurIPS, https://openreview.net/forum?id=pWkrU6raMt.

• Murphy, A. H., 1971: A note on the ranked probability score. J. Appl. Meteor., 10, 155–156, https://doi.org/10.1175/1520-0450(1971)010<0155:ANOTRP>2.0.CO;2.

• Murphy, A. H., and R. L. Winkler, 1977: Reliability of subjective probability forecasts of precipitation and temperature. J. Roy. Stat. Soc., Appl. Stat., 26C, 41–47, https://doi.org/10.2307/2346866.

• Phipps, K., S. Lerch, M. Andersson, R. Mikut, V. Hagenmeyer, and N. Ludwig, 2022: Evaluating ensemble post-processing for wind power forecasts. Wind Energy, 25, 1379–1405, https://doi.org/10.1002/we.2736.

• Quinting, J. F., and C. M. Grams, 2022: EuLerian Identification of ascending AirStreams (ELIAS 2.0) in numerical weather prediction and climate models—Part 1: Development of deep learning model. Geosci. Model Dev., 15, 715–730, https://doi.org/10.5194/gmd-15-715-2022.

• Rasp, S., and S. Lerch, 2018: Neural networks for postprocessing ensemble weather forecasts. Mon. Wea. Rev., 146, 3885–3900, https://doi.org/10.1175/MWR-D-18-0187.1.

• Resin, J., 2021: CalSim: The calibration simplex, version 0.5.2. R package, accessed 19 December 2023, https://CRAN.R-project.org/package=CalSim.

• Robertson, A. W., F. Vitart, and S. J. Camargo, 2020: Subseasonal to seasonal prediction of weather to climate with application to tropical cyclones. J. Geophys. Res. Atmos., 125, e2018JD029375, https://doi.org/10.1029/2018JD029375.

• Ronneberger, O., P. Fischer, and T. Brox, 2015: U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, N. Navab et al., Eds., Springer, 234–241.

• Sacco, M. A., J. J. Ruiz, M. Pulido, and P. Tandeo, 2022: Evaluation of machine learning techniques for forecast uncertainty quantification. Quart. J. Roy. Meteor. Soc., 148, 3470–3490, https://doi.org/10.1002/qj.4362.

• Scheuerer, M., M. B. Switanek, R. P. Worsnop, and T. M. Hamill, 2020: Using artificial neural networks for generating probabilistic subseasonal precipitation forecasts over California. Mon. Wea. Rev., 148, 3489–3506, https://doi.org/10.1175/MWR-D-20-0096.1.

• Schulz, B., and S. Lerch, 2022: Machine learning methods for postprocessing ensemble forecasts of wind gusts: A systematic comparison. Mon. Wea. Rev., 150, 235–257, https://doi.org/10.1175/MWR-D-21-0150.1.

• Taillardat, M., O. Mestre, M. Zamo, and P. Naveau, 2016: Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Mon. Wea. Rev., 144, 2375–2393, https://doi.org/10.1175/MWR-D-15-0260.1.

• van den Dool, H. M., and Z. Toth, 1991: Why do forecasts for “near normal” often fail? Wea. Forecasting, 6, 76–85, https://doi.org/10.1175/1520-0434(1991)006<0076:WDFFNO>2.0.CO;2.

• Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world. Bull. Amer. Meteor. Soc., 102, E681–E699, https://doi.org/10.1175/BAMS-D-19-0308.1.

• Veldkamp, S., K. Whan, S. Dirksen, and M. Schmeits, 2021: Statistical postprocessing of wind speed forecasts using convolutional neural networks. Mon. Wea. Rev., 149, 1141–1152, https://doi.org/10.1175/MWR-D-20-0219.1.

• Vigaud, N., A. W. Robertson, and M. K. Tippett, 2017: Multimodel ensembling of subseasonal precipitation forecasts over North America. Mon. Wea. Rev., 145, 3913–3928, https://doi.org/10.1175/MWR-D-17-0092.1.

• Vigaud, N., M. K. Tippett, J. Yuan, A. W. Robertson, and N. Acharya, 2019: Probabilistic skill of subseasonal surface temperature forecasts over North America. Wea. Forecasting, 34, 1789–1806, https://doi.org/10.1175/WAF-D-19-0117.1.

• Vitart, F., and Coauthors, 2017: The Subseasonal to Seasonal (S2S) prediction project database. Bull. Amer. Meteor. Soc., 98, 163–173, https://doi.org/10.1175/BAMS-D-16-0017.1.

• Vitart, F., and Coauthors, 2022: Outcomes of the WMO prize challenge to improve subseasonal to seasonal predictions using artificial intelligence. Bull. Amer. Meteor. Soc., 103, E2878–E2886, https://doi.org/10.1175/BAMS-D-22-0046.1.

• Weyn, J. A., D. R. Durran, and R. Caruana, 2020: Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere. J. Adv. Model. Earth Syst., 12, e2020MS002109, https://doi.org/10.1029/2020MS002109.

• White, C. J., and Coauthors, 2017: Potential applications of Subseasonal-to-Seasonal (S2S) predictions. Meteor. Appl., 24, 315–325, https://doi.org/10.1002/met.1654.

• White, C. J., and Coauthors, 2022: Advances in the application and utility of Subseasonal-to-Seasonal predictions. Bull. Amer. Meteor. Soc., 103, E1448–E1472, https://doi.org/10.1175/BAMS-D-20-0224.1.

• Wilks, D. S., 2013: The calibration simplex: A generalization of the reliability diagram for three-category probability forecasts. Wea. Forecasting, 28, 1210–1218, https://doi.org/10.1175/WAF-D-13-00027.1.

• Wilks, D. S., 2020: Statistical Methods in the Atmospheric Sciences. 4th ed. Elsevier, 840 pp., https://doi.org/10.1016/C2017-0-03921-6.

• Zhang, L., T. Yang, S. Gao, Y. Hong, Q. Zhang, X. Wen, and C. Cheng, 2023: Improving subseasonal-to-seasonal forecasts in predicting the occurrence of extreme precipitation events over the contiguous U.S. using machine learning models. Atmos. Res., 281, 106502, https://doi.org/10.1016/j.atmosres.2022.106502.
