Beyond Ensemble Averages: Leveraging Climate Model Ensembles for Subseasonal Forecasting

Elena Orlova,a Haokun Liu,a Raphael Rossellini,b Benjamin A. Cash,c and Rebecca Willetta,b

a Department of Computer Science, University of Chicago, Chicago, Illinois
b Department of Statistics, University of Chicago, Chicago, Illinois
c Department of Atmospheric, Oceanic, and Earth Sciences, George Mason University, Fairfax, Virginia
Open access

Abstract

Producing high-quality forecasts of key climate variables, such as temperature and precipitation, on subseasonal time scales has long been a gap in operational forecasting. This study explores an application of machine learning (ML) models as postprocessing tools for subseasonal forecasting. Lagged numerical ensemble forecasts (i.e., an ensemble where the members have different initialization dates) and observational data, including relative humidity, pressure at sea level, and geopotential height, are incorporated into various ML methods to predict monthly average precipitation and 2-m temperature 2 weeks in advance for the continental United States. For regression, quantile regression, and tercile classification tasks, we consider using linear models, random forests, convolutional neural networks, and stacked models (a multimodel approach based on the prediction of the individual ML models). Unlike previous ML approaches that often use ensemble mean alone, we leverage information embedded in the ensemble forecasts to enhance prediction accuracy. Additionally, we investigate extreme event predictions that are crucial for planning and mitigation efforts. Considering ensemble members as a collection of spatial forecasts, we explore different approaches to using spatial information. Trade-offs between different approaches may be mitigated with model stacking. Our proposed models outperform standard baselines such as climatological forecasts and ensemble means. In addition, we investigate feature importance, trade-offs between using the full ensemble or only the ensemble mean, and different modes of accounting for spatial variability.

Significance Statement

Accurately forecasting temperature and precipitation on subseasonal time scales—2 weeks–2 months in advance—is extremely challenging. These forecasts would have immense value in agriculture, insurance, and economics. Our paper describes an application of machine learning techniques to improve forecasts of monthly average precipitation and 2-m temperature using lagged physics-based predictions and observational data 2 weeks in advance for the entire continental United States. For lagged ensembles, the proposed models outperform standard benchmarks such as historical averages and averages of physics-based predictions. Our findings suggest that utilizing the full set of physics-based predictions instead of the average enhances the accuracy of the final forecast.

© 2024 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Elena Orlova, eorlova@uchicago.edu


1. Introduction

High-quality forecasts of key climate variables such as temperature and precipitation on subseasonal time scales, defined here as the time range between 2 weeks and 2 months, have long been a gap in operational forecasting (Ban et al. 2016). Advances in weather forecasting on time scales of days to about a week (Lorenc 1986; National Academies of Sciences 2016; National Research Council (NRC) 2010; Simmons and Hollingsworth 2002) or seasonal forecasts on time scales of 2–9 months (Barnston et al. 2012) do not necessarily translate to the challenging subseasonal regime. Addressing the crucial need for forecasts on the seasonal-to-subseasonal (S2S) time scale, collaborative initiatives led by the World Weather Research Programme and the World Climate Research Programme aim to advance S2S forecasting by focusing on mesoscale–planetary-scale interactions, high-resolution simulations, data assimilation methods, and tailored socioeconomic support (Brunet et al. 2010). Skillful forecasts on subseasonal time scales would have immense value in agriculture, insurance, and economics (White et al. 2022; Mouatadid et al. 2023). The importance of improved subseasonal predictions is also detailed by Ban et al. (2016) and NRC (2010).

The National Centers for Environmental Prediction (NCEP), part of the National Oceanic and Atmospheric Administration (NOAA), currently issues a “week 3–4 outlook” for the contiguous United States (CONUS).1 The NCEP outlooks are constructed using a combination of dynamical and statistical forecasts, with statistical forecasts based largely on how conditions in the past have varied (linearly) with indices of El Niño–Southern Oscillation (ENSO), Madden–Julian Oscillation (MJO), and global warming (i.e., the 30-yr trend). There exists great potential to advance subseasonal forecasting (SSF) using machine learning (ML) techniques. Haupt et al. (2021) provide an overview of using ML methods for postprocessing of numerical weather predictions. Vannitsem et al. (2021) highlight the crucial role of statistical postprocessing techniques, including ML methods, in national meteorological services. They discuss theoretical developments and operational applications, current challenges, and potential future directions, particularly focusing on translating research findings into operational practices. A real-time forecasting competition called the Subseasonal Climate Forecast Rodeo (Hwang et al. 2018), sponsored by the Bureau of Reclamation in partnership with NOAA, USGS, and the U.S. Army Corps of Engineers, illustrated that teams using ML techniques can outperform forecasts from NOAA’s operational seasonal forecast system.

Here, we present a work focused on developing ML-based forecasts that leverage lagged ensembles (i.e., an ensemble whose members are initialized from a succession of different start dates) of forecasts produced by NCEP in addition to observed data and other features. Previous studies, including successful methods in the Rodeo competition (e.g., Hwang et al. 2019), incorporate the ensemble mean as a feature in their ML systems but do not use any other information about the ensemble. In other words, variations among the ensemble members are not reflected in the training data or incorporated into the learned model. In contrast, this paper demonstrates that the full ensemble contains important information for subseasonal forecasting beyond the ensemble mean. Specifically, we consider the test case of predicting monthly 2-m temperatures and precipitation 2 weeks in advance at over 3000 locations across the continental United States using physics-based predictions, such as NCEP Climate Forecast System version 2 (CFSv2) hindcasts (Kirtman et al. 2014; Saha et al. 2014), an ensemble of 24 distinct forecasts. We repeat this experiment for the Global Modeling and Assimilation Office from the National Aeronautics and Space Administration (NASA-GMAO) ensemble, which has 11 ensemble members (Nakada et al. 2018).

In this context, this paper makes the following contributions:

  • We train a variety of ML models (including neural networks, random forests, linear regression, and model stacking) that input all ensemble member predictions as features in addition to observations of geopotential heights, relative humidity, precipitation, and temperature from past months to produce new forecasts with higher accuracy than the ensemble mean; forecast accuracy is measured with a variety of metrics (section 7). These models are considered in the context of regression, quantile regression, and tercile classification. Systematic experiments are used to characterize the influence of individual ensemble members on the predictive skill [section 8a(2)].

  • The collection of ML models employed allows us to consider different modes of accounting for spatial variability. ML models can account for spatial correlations among both features and targets; for example, when predicting Chicago precipitation, our models can leverage information not only about Chicago but also about neighboring regions. Specifically, we consider the following learning frameworks: 1) training a predictive model for each spatial location independently; 2) training a predictive model that inputs the spatial location as a feature and hence can be applied to any single spatial location; and 3) training a predictive model for the full spatial map of temperature or precipitation—i.e., predicting an outcome for all spatial locations simultaneously. ML models present various ways to account for spatial variability, each with distinct advantages and disadvantages. Our application of model stacking (an ML technique where multiple models are combined, with their predictions used as input features for another model that produces the final prediction) allows our final learned model to exploit the advantages of each method.

  • We conduct a series of experiments to help explain the learned model and which features the model uses most to make its predictions. We systematically explore the impact of using lagged observational data in addition to ensemble forecasts and positional encoding to account for spatial variations (section 8c).

  • The ensemble of forecasts from a physics-based model (e.g., NCEP-CFSv2 or NASA-GMAO) contains information salient to precipitation and temperature forecasting besides their mean, and ML models that leverage the full ensemble generally outperform methods that rely on the ensemble mean alone [section 8a(1)].

  • Finally, we emphasize that the final validation of our approach was conducted on data from 2011 to 2020 that were not used during any of the training, model development, parameter tuning, or model selection steps. We only conducted our final assessment of the predictive skill for 2011–20 after we had completed all other aspects of this manuscript. Because of this, our final empirical results accurately reflect the anticipated performance of our methods on new data.

This paper is organized as follows: section 2 discusses related work, section 3 introduces the data used in the experiments, section 4 describes forecasting problems, while baselines and learning-based methods are described in section 5, and experimental setup and evaluation metrics are given in section 6. Finally, we present our results in section 7 and discuss them in section 8. Conclusions and directions for future work are given in section 9.

2. Related work

While statistical models were common for weather prediction in the early days of weather forecasting (Nebeker 1995), forecasts using physics-based dynamic system models have been carried out since the 1980s and have been the dominant forecasting method in climate prediction centers since the 1990s (Barnston et al. 2012). Many physics-based forecast models are used both in academic research and operationally. Such systems often produce ensembles of forecasts—e.g., the result of running a physics-based simulation multiple times with different initial conditions or parameters—and are a mainstay of operational forecast centers around the globe.

Recently, skillful ML approaches have been developed for short-range weather prediction (Chen et al. 2023; Nagaraj and Kumar 2023; Frnda et al. 2022; Herman and Schumacher 2018; Ghaderi et al. 2017; Grover et al. 2015; Radhika and Shashi 2009; Cofiño et al. 2002) and longer-term weather forecasting (Lam et al. 2023; Yang et al. 2023; Chen et al. 2023; Hewage et al. 2021; Cohen et al. 2019; Totz et al. 2017; Iglesias et al. 2015; Badr et al. 2014). However, forecasting on the subseasonal time scale, with 2–8 week outlooks, has been considered a far more difficult task than seasonal forecasting due to its complex dependence on both local weather and global climate variables (Vitart et al. 2012; Min et al. 2020). Seasonal prediction also benefits from targeting a much larger averaging period.

Some ML algorithms for subseasonal forecasting use purely observational data (i.e., not using any physics-based ensemble forecasts). He et al. (2020) focus on analyzing different ML methods, including gradient boosting trees and deep learning (DL) for SSF. They propose a careful construction of feature representations of observational data and show that ML methods are able to outperform a climatology baseline, i.e., predictions corresponding to the 30-yr mean at a given location and time. This conclusion is based on comparing the relative R2 scores for the ML approaches and climatology. Srinivasan et al. (2021) propose a Bayesian regression model that exploits spatial smoothness in the data.

Other works use the ensemble mean as a feature in their ML models. For example, in the subseasonal forecasting Rodeo (Hwang et al. 2018), a prediction challenge for temperature and precipitation at weeks 3–4 and 5–6 in the western United States sponsored by NOAA and the U.S. Bureau of Reclamation, simple yet thoughtful statistical models consistently outperform NOAA’s dynamical systems forecasts. In particular, the winning approach uses a stacked model from two nonlinear regression models, a selection of climate variables such as temperature, precipitation, sea surface temperature, and sea ice concentration and a collection of physics-based forecast models including the ensemble mean from various modeling centers in the North American Multi-Model Ensemble (NMME). From the local linear regression with multitask feature selection model analysis, the ensemble mean is the first- or second-most important feature for forecasting, especially for precipitation. He et al. (2021) perform a comparison of modern ML models that use data from the Subseasonal Experiment (SubX) project for SSF in the western contiguous United States. The experiments show that incorporating the ensemble mean as an input feature to ML models leads to a significant improvement in forecasting performance, but that work does not explore the potential value of individual ensemble members aside from the ensemble mean. Grönquist et al. (2020) note that physics-based ensembles are computationally demanding to produce and propose an ML method that can input a subset of ensemble forecasts and generate an estimate of the full ensemble; they observe that the output ensemble estimate has more prediction skill than the original ensemble. Loken et al. (2022) analyze the forecast skill of random forests leveraging the ensemble members for next-day severe weather prediction compared to only using the ensemble mean. However, their results only cover forecasts with a lead time of up to 48 h, so it is unclear if their methods would have succeeded in the tougher subseasonal forecasting setting.

This paper complements the prior work above by developing powerful learning-based approaches that incorporate both physics-based forecast models and observational data to improve SSF over CONUS.

3. Data

Table 1 describes the variables used in the experiments. Climatological means of precipitation and temperature are calculated using 1971–2000 NOAA data (NOAA 2022). There are many ensembles of physics-based predictions produced by forecasting systems. NMME is a collection of physics-based forecast models from various modeling centers in North America, including NOAA/NCEP, NOAA/Geophysical Fluid Dynamics Laboratory (GFDL), International Research Institute for Climate and Society (IRI), National Center for Atmospheric Research (NCAR), NASA, and Canadian Meteorological Centre (Kirtman et al. 2014). The NMME project has two predictive periods: hindcast and forecast. A hindcast period refers to when a dynamic model reforecasts historical events, which can help climate scientists develop and test new models to improve forecasting and evaluate model biases. In contrast, a forecast period has real-time predictions generated from dynamic models.

Table 1.

Description of climate variables and their data sources. Our target climate variables for subseasonal forecasting are precipitation and 2-m temperature. We use NOAA data to calculate the climatology from 1971 to 2000. We also perform linear spatial interpolation on the historical values to get values with the same resolution and support as target climate variables.


In this manuscript, we use ensemble forecasts from the NMME’s NCEP-CFSv2 (Kirtman et al. 2014; Saha et al. 2014), which has K = 24 ensemble members at a 1° × 1° resolution over a 2-week lead time. NCEP-CFSv2 is the operational prediction model currently used by the U.S. Climate Prediction Center. The NCEP-CFSv2 model has two different products available in the NMME archive: we use its hindcasts from 1982 to 2010 for training and validation of our models, and we use its forecasts from April 2011 to December 2020 for the final evaluation of our models.

In order to ensure our results are not unique to a single forecasting model, we also analyze output from the NASA-Global Modeling and Assimilation Office (GMAO) from the Goddard Earth Observing System Model, version 5 (GEOS-5; Nakada et al. 2018), which has K = 11 ensemble members at a 1° × 1° resolution over a 2-week lead time. Similarly, we use its hindcasts from 1981 to 2010 for training and validation of our models and we use its forecasts from January 2011 to 2018 for final evaluation. The test periods of NCEP-CFSv2 and NASA-GMAO data differ due to data availability. Note that the identical version of each model is used to generate the test, train, and validation data.

Different ensemble members correspond to different initial conditions of the underlying physical model. The NCEP-CFSv2 forecasts are initialized in the following way: four initializations at times 0000, 0600, 1200, and 1800 UTC every fifth day, starting 1 month prior to the lead time of 2 weeks [Table B1 in Saha et al. (2014)]. NASA-GMAO is a fully coupled atmosphere–ocean–land–sea ice model, with five forecasts initialized every 5 days. While additional members are generated through perturbation methods closest to the beginning of each month,2 not all members are initialized on different dates, meaning that the ensemble is not strictly lagged. However, NASA-GMAO members are not interchangeable, as each is created using a distinct method.

All data are interpolated to lie on the same 1° × 1° grid, resulting in L = 3274 U.S. locations. Climate variables available daily (such as pressure at sea level or precipitation) are converted to monthly average values. When data are available as monthly averages only, we ensure that our forecast for time t + δt does not use any information from the interval (t, t + δt).
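The paper does not provide its regridding code; the following is a minimal sketch of this kind of linear spatial interpolation onto a common 1° × 1° grid, assuming SciPy is available. The function name and array layout are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.interpolate import griddata

def regrid_to_one_degree(src_lats, src_lons, src_values, tgt_lats, tgt_lons):
    """Linearly interpolate a field from its native grid onto a 1-degree grid.

    src_lats, src_lons : 1D arrays of the source grid coordinates
    src_values         : 2D array (len(src_lats), len(src_lons)) of the variable
    tgt_lats, tgt_lons : 1D arrays defining the 1-degree target grid
    """
    src_lon, src_lat = np.meshgrid(src_lons, src_lats)
    points = np.column_stack([src_lat.ravel(), src_lon.ravel()])
    tgt_lon, tgt_lat = np.meshgrid(tgt_lons, tgt_lats)
    targets = np.column_stack([tgt_lat.ravel(), tgt_lon.ravel()])
    interp = griddata(points, src_values.ravel(), targets, method="linear")
    return interp.reshape(len(tgt_lats), len(tgt_lons))
```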

4. Forecast tasks

The learning task can be formulated as learning a model fθ : Xy with parameters θ. This model fθ can be a linear regression (where θ is a set of regression weights), the mean of ensemble members (no θ needs to be learned), a random forest (where θ parameterizes the set of trees in the forest), a convolutional neural network (where θ is the collection of neural network weights), or other learned models. We consider three forecasting tasks: regression, tercile classification, and quantile regression.

a. Regression

The goal of regression is to predict monthly average values of precipitation and 2-m temperature 2 weeks in the future. These models are generally trained using the squared error loss function:
\ell_{\text{sq-err}}(\theta) = \mathbb{E}\{[y - f_\theta(x)]^2\}.

b. Tercile classification

The goal of tercile classification is to predict whether the precipitation or 2-m temperature will be “high” (above the 66th percentile, denoted by q = 1), “medium” (between the 33rd and the 66th percentiles, denoted by q = 0), or “low” (below the 33rd percentile, denoted by q = −1). We compute these percentile values using the 1971–2000 climatology (see section 3 for details), and these percentiles are computed for each calendar month m and location l pair. These models are generally trained using the cross-entropy loss function:
\ell_{\text{CE}}(\theta) = -\mathbb{E}\left\{\sum_{q=-1}^{1} \mathbb{I}\{y = q\}\,\log\,[f_\theta(x)]_q\right\},
where \mathbb{I}\{A\} := \begin{cases} 1, & \text{if } A \text{ is true} \\ 0, & \text{if } A \text{ is false} \end{cases} is the indicator function and [f_\theta(x)]_q is the predicted probability that the target y corresponding to feature vector x will be in tercile q.
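As a concrete illustration of how the tercile labels can be constructed from the 1971–2000 climatology, the sketch below computes the per-month, per-location 33rd and 66th percentiles and assigns labels q ∈ {−1, 0, 1}. The function name and array layout are ours, not the authors'.

```python
import numpy as np

def tercile_labels(y, climatology_monthly, month_index):
    """Assign tercile labels q in {-1, 0, 1} to monthly targets y.

    y                   : array (T, L) of monthly observations
    climatology_monthly : array (30, 12, L) of 1971-2000 monthly values used to
                          define the percentiles for each month and location
    month_index         : array (T,) with the calendar month (0-11) of each sample
    """
    q33 = np.percentile(climatology_monthly, 33, axis=0)  # (12, L)
    q66 = np.percentile(climatology_monthly, 66, axis=0)  # (12, L)
    lo, hi = q33[month_index], q66[month_index]            # (T, L) thresholds
    labels = np.zeros_like(y, dtype=int)                   # "medium" by default
    labels[y < lo] = -1                                    # "low"
    labels[y > hi] = 1                                     # "high"
    return labels
```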

c. Quantile regression

For a given percentile α, the goal of quantile regression is to predict the value z so that, conditioned on features x, the target y satisfies yz with probability α. When we set α to a value close to one, such as α = 0.9, this value z indicates what we can expect in “extreme outcomes,” not just on average. These models are generally trained using the pinball loss function:
\ell_{\text{quantile}}(\theta) = \mathbb{E}\{\rho_\alpha[y - f_\theta(x)]\},
where
\rho_\alpha(z) := z\,[\alpha - \mathbb{I}\{z < 0\}] = \begin{cases} \alpha|z|, & \text{if } z \ge 0 \\ (1 - \alpha)|z|, & \text{if } z < 0. \end{cases}
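For reference, a minimal NumPy implementation of the pinball loss ρ_α averaged over samples might look as follows; the function name is illustrative.

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha=0.9):
    """Average pinball (quantile) loss rho_alpha applied to z = y - f(x)."""
    z = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.where(z >= 0, alpha * np.abs(z), (1 - alpha) * np.abs(z)))
```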

5. Prediction methods

Our goal is to predict either the monthly average precipitation or the monthly average 2-m temperature 2 weeks in advance (for example, we predict the average monthly precipitation for February on 15 January). This section describes the notation used for features and targets, baselines and learning methods, and how spatial features are accounted for.

a. Notation

We let T denote the number of time steps used in our analysis and L denote the number of spatial locations. We define the following variables:

  • The term ut,l(k) is the kth ensemble member at time t and location l, where k = 1, …, K, t = 1, …, T, and l = 1, …, L. Every ensemble member represents the output of a given physics-based model forecast from different initial states.

  • The term υt,l(p) is the pth observational variable, such as precipitation or temperature, geopotential height at 500 hPa, relative humidity near the surface, pressure at sea level, and sea surface temperature, at time t and location l, with p = 1, …, P.

  • The terms zl(1) and zl(2) represent information about the longitude and latitude of location l, respectively; each is a vector of length d. More details about this representation, called positional encoding (PE), can be found in section 6a.

  • The term xt,l := [ut,l(1), …, ut,l(K), υt,l(1), …, υt,l(P), zl(1), zl(2)] is a set of features at time t and location l.

  • The term yt,l is the target, the ground truth monthly average precipitation or 2-m temperature at the target forecast time t + δt at location l, where δt = 14 days is our forecast horizon. For simplicity, we use a subscript t for yt,l instead of t + δt to match with the input features notation. The same holds for our ensuing definitions.

  • The term y^t,l is the output of a forecast model for a given task at target forecast time t + δt and location l.

  • The term sm,l is a 30-yr mean (climatology) of an observed climate variable, such as precipitation or temperature, at a month m = 1, …, 12 and location l.

  • The term s^m,l is a 30-yr climatology of a predicted climate variable, such as precipitation or temperature, at a month m = 1, …, 12 and location l. For each location l and each month m, it is calculated as a mean of ensemble member predictions over the training period, as defined formally in Eq. (7).

  • The terms yt,lanomaly and y^t,lanomaly are the true anomaly and the predicted anomaly, respectively, at time t and location l. They are used during evaluation. We define anomalies as
    y_{t,l}^{\text{anomaly}} = y_{t,l} - s_{m(t),l},
    \hat{y}_{t,l}^{\text{anomaly}} = \hat{y}_{t,l} - s_{m(t),l}.

For the special case of y^ corresponding to the ensemble mean, the ensemble members may exhibit bias, in which case we also consider y^t,lanomaly = y^t,l − s^m(t),l, where s^m(t),l is evaluated on the model's (ensemble mean) predictions:
\hat{s}_{m,l} := \frac{1}{T}\sum_{t=1}^{T} \hat{y}_{t,l}\,\mathbb{I}\{m = m(t)\}, \qquad l = 1, \ldots, L.
The model climatology s^m,l is computed using the training data. Note that we do not subtract the climatology from the input features and target variables, i.e., precipitation and temperature, when training our ML models. We subtract climatology from the model outputs only when evaluating their performance, as including climatology in the inputs to our ML models during training improves performance. Section 6c provides more details on the model evaluation.

In our analyses, the number of locations is L = 3274, and there are K = 24 NCEP-CFSv2 ensemble members or K = 11 NASA-GMAO ensemble members. The ensemble members are used as input features to the learning-based methods as they are; we do not perform any feature extraction from them. The number of observational variables is usually P = 17. The details on these variables can be found in sections 3 and 6b.

The target variable y is observed from 1985 to 2020. Data from January 1985 to September 2005 are used for training (249 time steps), and data from October 2005 to December 2010 are used for validation and model selection (63 time steps). Data from 2011 to 2020 (or from 2011 to 2018 in the case of NASA-GMAO data) are used to test our methods after all model development, selection, and parameter tuning are completed.

b. Baselines

1) Climatology

Climatology is the fundamental benchmark for weather and climate predictability. In particular, for a given time t, let m(t) := (t mod 12) denote the calendar month corresponding to t; we then compute the 30-yr climatology of the target variable for a given location and time via
\hat{y}_{t,l}^{\text{hist}} = s_{m(t),l}, \qquad t = 1, \ldots, T, \quad l = 1, \ldots, L.

2) Ensemble mean

This is the mean of all ensemble members for each location l at each time step t:
\hat{y}_{t,l}^{\text{ens mean}} := \frac{1}{K}\sum_{k=1}^{K} u_{t,l}^{(k)}, \qquad t = 1, \ldots, T, \quad l = 1, \ldots, L.

3) Linear regression

Finally, we consider, as a baseline, a linear regression model applied to input features corresponding to the ensemble member predictions: xt,l = [ut,l(1), …, ut,l(K)]. The model's output is then
\hat{y}_{t,l}^{\text{LR}} := \langle \theta_l, x_{t,l} \rangle + \theta_l^0,
where θl are the trained coefficients for input features for each location l and θl0 are the learned intercepts for each location l. Note that we train a different model for each spatial location, and the illustration for this model and its input’s format is given in Fig. 1a.
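A minimal sketch of this per-location baseline, assuming scikit-learn and with illustrative function names and array shapes, is shown below: one LinearRegression is fit for each of the L locations on the K ensemble-member features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_locationwise_lr(U_train, y_train):
    """Fit one linear regression per location.

    U_train : array (T, L, K) of ensemble-member forecasts
    y_train : array (T, L) of observed monthly targets
    """
    T, L, K = U_train.shape
    return [LinearRegression().fit(U_train[:, l, :], y_train[:, l]) for l in range(L)]

def predict_locationwise(models, U_test):
    """Apply the per-location models; returns an array of shape (T_test, L)."""
    return np.column_stack([m.predict(U_test[:, l, :]) for l, m in enumerate(models)])
```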
Fig. 1.

An illustration of different forecasting paradigms: (a) spatial independence models, with a separate model for each spatial location and no accounting for spatial information; (b) conditional spatial independence models, with one model for all locations that may take spatial information as an input; (c) spatial dependence models, which account for spatial information by design. We replace "precipitation" in the illustration with "temperature" for temperature prediction, but the overall structure remains the same.


c. Learning-based methods

1) LR

In contrast to the linear regression (LR) baseline, here other climate variables are added to the input features: xt,l=[ut,l(1),,ut,l(K),υt,l(1),,υt,l(P)]. Then, the model’s output is defined with Eq. (10). Because the feature vector is higher dimensional here than for the baseline, the learned θl is also higher dimensional. We train a different model for each spatial location. In our experiments with linear models, we do not include positional encoding [zl(1),zl(2)] as input features, since they would be constants for each location’s linear model.

In the context of regression, we minimize the squared error loss. The linear quantile regressor (Linear QR) is a linear model trained to minimize the quantile loss:
\ell_{\text{QR}} = \frac{1}{L}\sum_{l=1}^{L}\left[\frac{1}{T}\sum_{t=1}^{T}\rho_\tau\big(y_{t,l} - \hat{y}_{t,l}\big)\right],
where ρτ is defined in Eq. (4).

2) Random forest

In the context of regression and tercile classification, we train random forests that use ensemble predictions, the spatial location, and additional climate features to form the feature vector xt,l = [ut,l(1), …, ut,l(K), υt,l(1), …, υt,l(P), zl(1), zl(2)] for all location–time pairs (l, t). One random forest is trained to make predictions for any spatial location. The RF model and its input format are illustrated in Fig. 1b: we train one RF model for all locations, and the spatial information is encoded as input features via the PE vectors zl(1) and zl(2).

In the context of quantile regression, we train a random forest quantile regressor (RFQR, Meinshausen 2006), which grows trees the same way as the original random forest while storing all training samples. To make a prediction for a test point, the RFQR computes a weight for each training sample that corresponds to the number of leaves (across all trees in the forest) that contain the test sample and the training sample. The RFQR prediction is then a quantile of the weighted training samples across all leaves that contain the test sample. We show a figure representation of the RFQR in appendix D, section a(2). With this formulation, training a single RFQR for all locations is computationally demanding, so we train individual RFQRs for every location.

Random forests are often referred to as the best off-the-shelf classifiers (Hastie et al. 2009), even when using the default hyperparameters (Biau and Scornet 2016). Our cross-validation (CV) and grid search experiments show that the random forest (RF) hyperparameters have little impact on the accuracy, and thus, we use the default parameters for RFs from the Scikit-learn library (Pedregosa et al. 2011).
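A rough sketch of this conditional-spatial-independence setup is given below: all (time, location) samples are pooled into one design matrix with the PE vectors appended, and a single random forest with scikit-learn's default hyperparameters is fit. The array layout and function name are our own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_pooled_random_forest(U, V, Z, y):
    """Fit one RF over all (time, location) samples.

    U : (T, L, K) ensemble members; V : (T, L, P) observational features;
    Z : (L, 2 * d) positional encodings; y : (T, L) targets.
    """
    T, L, _ = U.shape
    Z_rep = np.broadcast_to(Z, (T,) + Z.shape)          # repeat the PE over time
    X = np.concatenate([U, V, Z_rep], axis=-1).reshape(T * L, -1)
    rf = RandomForestRegressor()                        # default hyperparameters
    rf.fit(X, y.reshape(T * L))
    return rf
```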

3) Convolutional neural network

To produce a forecast map for the United States, we adapted a U-Net architecture (Ronneberger et al. 2015), which has an encoder–decoder structure with convolutional layer blocks. The U-Net maps a stack of images to an output image; in our context, we treat each spatial map of a climate variable or forecast as an image. Thus, the input to our U-Net can be represented as a tensor composed of matrices: Xt=[Ut(1),,Ut(K),Vt(1),,Vt(P),Z(1),Z(2)].

Note that here we use capital letters because the input to our U-Net consists of 2D spatial maps, which are represented as matrices instead of vectors. The model output is a spatial map of the predicted target. This process is illustrated in Fig. 1c.

For the U-Net, we modify an available PyTorch implementation (Yakubovskiy 2020). The training set consists of 249 samples (images), which may be considered relatively limited for CNN training. To address this concern, we conduct bootstrapping experiments for the U-Net architecture, offering detailed insights into the impact of sample size on model performance. Further details are presented in appendix C, section b. We use a 10-fold CV over our training data and grid search to select parameters such as learning rate, weight decay, batch size, and number of epochs. The Adam optimizer (Kingma and Ba 2014) is used in all experiments. After selecting hyperparameters, we train the U-Net model with those parameters on the full training dataset. The validation set is used to perform feature importance analysis. For regression, we train using squared error loss. In the context of quantile regression, we initialize the weights with those learned on squared error loss and then train on the quantile loss [Eq. (11)].
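The paper's exact U-Net configuration is not reproduced here; the sketch below only illustrates how stacked 2D input maps can be passed to a U-Net from the segmentation-models-pytorch package referenced above (Yakubovskiy 2020). The encoder choice, grid size, and channel count are illustrative assumptions, not the authors' settings.

```python
import torch
import segmentation_models_pytorch as smp

K, P, d = 24, 17, 12
in_channels = K + P + 2 * d              # ensemble members + climate variables + PE maps

# A U-Net mapping a stack of input maps to one output map per grid point.
model = smp.Unet(encoder_name="resnet18", encoder_weights=None,
                 in_channels=in_channels, classes=1)

x = torch.randn(8, in_channels, 64, 96)  # (batch, channels, lat, lon); sizes illustrative
y_hat = model(x)                         # (8, 1, 64, 96) predicted spatial map
loss = torch.nn.functional.mse_loss(y_hat, torch.randn_like(y_hat))
loss.backward()                          # squared error training step (regression task)
```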

4) Nonlinear model stacking

Model stacking can improve model performance by combining the outputs of several models (usually called base models) (Pavlyshenko 2018). In our case, linear regression, random forests, and the U-Net are substantially different in architecture and computation and we observe that they produce qualitatively different forecasts. We stack the linear model, random forest, and U-Net forecasts using a nonparametric approach:
\hat{y}_{t,l} = h\big(\hat{y}_{t,l}^{\text{LR}}, \hat{y}_{t,l}^{\text{RF}}, \hat{y}_{t,l}^{\text{U-Net}}\big),
where h is a simple feed-forward neural network with a nonlinear activation and y^t,lLR, y^t,lRF, y^t,lUNET are the predictions of the linear model, random forest, and U-Net, respectively; these are referred to as "base models." One stacking model is trained to make predictions for any spatial location. The stacking model follows the framework of Fig. 1b, with the base-model predictions as input features and no PE vectors. Model stacking can improve the forecast quality by combining predictions from three forecasting paradigms, spatial independence, conditional spatial independence, and spatial dependence (section 5d), and is analogous to the multimodel ensemble approach commonly used in weather and climate forecasting. The architecture details can be found in appendix D.

We apply the following procedure for model stacking: the base models are first trained on the first half of the training data, and their predictions on the second half are used to train the stacking model h. Then, we retrain the base models on all the training data and apply the trained stacking model to the outputs of these base models. This procedure helps to avoid overfitting.
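A simplified sketch of this two-stage procedure is shown below, with scikit-learn's MLPRegressor standing in for the feed-forward network h (the paper's actual architecture is given in appendix D). The function names and hidden-layer size are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_stacking_model(preds_second_half, y_second_half):
    """Train the stacking model h on held-out base-model predictions.

    preds_second_half : array (n, 3) of LR, RF, and U-Net predictions for the second
                        half of the training period, produced by base models fit on
                        the first half
    y_second_half     : array (n,) of the corresponding targets
    """
    h = MLPRegressor(hidden_layer_sizes=(32,), activation="relu", max_iter=2000)
    h.fit(preds_second_half, y_second_half)
    return h

def stacked_forecast(h, pred_lr, pred_rf, pred_unet):
    """Combine test-time predictions from base models refit on the full training set."""
    return h.predict(np.column_stack([pred_lr, pred_rf, pred_unet]))
```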

d. Models of spatial variation

We consider three different forecasting paradigms. In the first, which we call the spatial independence model, we ignore all spatial information and train a separate model for each spatial location. In the second, which we call the conditional spatial independence model, we consider samples corresponding to different locations l as independent conditioned on the spatial location as represented by features [zl(1),zl(2)]. In this setting, a training sample corresponds to (xi, yi) = (xt,l, yt,l), where, with a small abuse of notation, we let i index a t, l pair. In this case, the number of training samples is n = TL. In the third paradigm, which we call the spatial dependence model, we consider a single training sample as corresponding to full spatial information (across all l) for a single t; that is, (Xi,Yi)=([xt,l]l=1,,L,[yt,l]l=1,,L), where now i indexes t alone. Models developed under the spatial dependence model account for the spatial variations in the features and targets. For instance, a convolutional neural network might input “heatmaps” representing the collection of physics-based model forecasts across the continental United States and output a forecast heatmap predicting spatial variations in temperature or precipitation instead of treating each spatial location as an independent sample.

Figure 1 shows general frameworks of these paradigms. All models combine information from all the different ensemble forecasts, and so in a broad sense, we can think of each prediction at a given time and location as a weighted sum of the ensemble forecasts across space, time, and ensemble members, where the weights are learned during the model training and may be data dependent (i.e., nonlinear). From this perspective, we may think of different modeling paradigms as essentially placing different constraints on those weights:

  • Under spatial independence models, the weights may vary spatially but do not account for spatial correlations in the data.

  • Under conditional spatial independence models, the interpretation depends on the model being trained—linear models have the same weights on ensemble predictions regardless of spatial location, while nonlinear models (e.g., random forests) have weights that may depend on the spatial location.

  • Under spatial dependence models, the weights vary spatially, depend on the spatial location, and account for spatial correlations among the ensemble forecasts and other climate variables.

6. Experimental setup

This section provides details on the experimental setup, including positional encoding, removing climatology, and evaluation metrics. Data preprocessing details are presented in appendix E.

a. Positional encoding

Positional encoding (Vaswani et al. 2017) is a technique used in natural language processing (NLP) to inject positional information into data. In sequence-based tasks, such as language translation or text generation, the order of elements in the input sequence is important, but neural networks do not naturally capture this information. PE assigns unique encodings to each position in the sequence, which are then added to the original input before being processed by the model. This enables the model to consider the order and relative positions of elements, improving its ability to capture local and global context within a sequence and make accurate predictions (Devlin et al. 2018; Petroni et al. 2019; Narayanan et al. 2016). This technique is helpful to represent the positional information outside the original NLP tasks (Gamboa 2017; Gehring et al. 2017; Khan et al. 2022). Several of our models use the spatial location as an input feature. Rather than directly using latitudes and longitudes, we use PE (Vaswani et al. 2017):
z_l^{(1)}(i) = \text{PE}(l, 2i) = \sin\big(l/10000^{2i/d}\big),
z_l^{(2)}(i) = \text{PE}(l, 2i+1) = \cos\big(l/10000^{2i/d}\big),
where l is a longitude or latitude value, d = 12 is the dimensionality of the positional encoding, and i ∈ {1, …, d} is the index of the positional encoding vector. For the U-Net model, PE vectors are transformed into images in the following way: we take every value in the vector and fill the image of the desired size with this value. So, there are d images with the corresponding PE values. For the RF models, PE vectors can be used as they are.
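A small sketch of this encoding and of the tiling used for the U-Net inputs is given below; the grid size and function name are illustrative.

```python
import numpy as np

def positional_encoding(coord, d=12):
    """Sinusoidal encoding of a single longitude or latitude value.

    Returns the length-d vectors z^(1) (sine terms) and z^(2) (cosine terms).
    """
    i = np.arange(1, d + 1)
    angles = coord / (10000.0 ** (2 * i / d))
    return np.sin(angles), np.cos(angles)

# Example: encode a longitude and tile each entry into a constant map for the U-Net.
z1, z2 = positional_encoding(-87.6)   # longitude of Chicago
pe_maps = np.stack([np.full((64, 96), v) for v in np.concatenate([z1, z2])])
```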

b. Models’ inputs details

Based on the available data, we use the following input features for our ML models:

  • K ensemble forecasts for the target month.

  • Four climate variables: relative humidity, pressure, geopotential height, and temperature (if the target is precipitation) or precipitation (if the target is temperature) 2 months before the target month.

  • The lagged target variable (the target variable 2, 3, 4, 12, and 24 months before the target date—five additional features).

  • SSTs that are represented via principal components (PCs).

  • Finally, the positional embeddings.

SSTs are usually represented as eight PCs, and the embedding vector size is usually d = 12, as we describe in section 6a. For example, using the NCEP-CFSv2 members, there are K + 4 + 5 + 8 + 2d = 24 + 4 + 5 + 8 + 24 = 65 input features for every time step and location. Figure 1 provides an illustration of these input features.
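A small sketch of how such a feature vector can be assembled is given below; the feature counts follow the description above, while the function name and placeholder arrays are illustrative.

```python
import numpy as np

def build_feature_vector(ensemble, climate_vars, lagged_target, sst_pcs, z1, z2):
    """Concatenate the inputs for one (time, location) sample.

    ensemble      : (K,) physics-based forecasts for the target month
    climate_vars  : (4,) humidity, pressure, geopotential height, temp/precip
    lagged_target : (5,) target variable 2, 3, 4, 12, and 24 months earlier
    sst_pcs       : (8,) SST principal components
    z1, z2        : (d,) longitude and latitude positional encodings
    """
    return np.concatenate([ensemble, climate_vars, lagged_target, sst_pcs, z1, z2])

x = build_feature_vector(np.zeros(24), np.zeros(4), np.zeros(5),
                         np.zeros(8), np.zeros(12), np.zeros(12))
assert x.shape == (65,)   # K + 4 + 5 + 8 + 2d features per time step and location
```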

c. Evaluation metrics

1) Regression metrics

The forecast skill of our regression models is measured using the R2 value. For each location l and ground-truth values yt,l and predictions y^t,l at this location, we compute
R_l^2 = 1 - \frac{\sum_{t=1}^{T}\big(y_{t,l}^{\text{anomaly}} - \hat{y}_{t,l}^{\text{anomaly}}\big)^2}{\sum_{t=1}^{T}\big(y_{t,l}^{\text{anomaly}} - \bar{y}_l^{\text{anomaly}}\big)^2},
where
\bar{y}_l^{\text{anomaly}} = \frac{1}{T}\sum_{t=1}^{T} y_{t,l}^{\text{anomaly}}.
Then, the average R^2 over all locations is calculated as
R^2 = \frac{1}{L}\sum_{l=1}^{L} R_l^2.
In addition to the average R2 on the test data, we also estimate the median R2 score across all U.S. locations.
We further report the mean squared error (MSE) of our predictions across all locations:
\text{MSE}_l := \frac{1}{T}\sum_{t=1}^{T}\big(y_{t,l} - \hat{y}_{t,l}\big)^2,
for l = 1, …, L, and
\text{MSE} = \frac{1}{L}\sum_{l=1}^{L}\text{MSE}_l.
We also report the standard error (SE), median, and 90th percentile of {MSEl}l. We say the difference between two models is significant if their MSE ± SE intervals do not overlap. Note that the standard errors provided here should be used with caution since there are significant spatial correlations in the MSE values across locations, so we do not truly have L independent samples from an asymptotically normal distribution.
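The per-location skill scores described above can be computed with a few lines of NumPy; the sketch below assumes anomalies are formed by subtracting the monthly climatology, and the function name and array layout are ours.

```python
import numpy as np

def regression_scores(y, y_hat, clim, month_index):
    """Per-location anomaly R^2 and MSE.

    y, y_hat    : arrays (T, L) of observations and forecasts
    clim        : array (12, L) of monthly climatology s_{m,l}
    month_index : array (T,) of calendar months (0-11) for each time step
    """
    y_anom = y - clim[month_index]
    yhat_anom = y_hat - clim[month_index]
    resid = np.sum((y_anom - yhat_anom) ** 2, axis=0)
    total = np.sum((y_anom - y_anom.mean(axis=0)) ** 2, axis=0)
    r2_l = 1.0 - resid / total                   # R^2 per location
    mse_l = np.mean((y - y_hat) ** 2, axis=0)    # MSE per location
    return r2_l.mean(), np.median(r2_l), mse_l.mean(), np.median(mse_l)
```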

2) Tercile classification metrics

We estimate the accuracy of our tercile classification predictions as the proportion of correctly classified samples out of all observations.

3) Quantile regression metrics

For the quantile regression task, we report mean quantile loss from Eq. (11) across all locations.

7. Experimental results

In this section, we report the predictive skill of different models applied to SSF over the continental United States using NCEP-CFSv2 ensemble members for regression and quantile regression. Precipitation forecasting is known to be more challenging compared to temperature forecasting (Knapp et al. 2011). The results for the NASA-GMAO dataset are presented in appendix A. The skill of different models on the tercile classification task is presented in appendix B for both datasets. Recall that all methods are trained on data spanning January 1985–September 2005, with data spanning October 2005–December 2010 used for validation (i.e., model selection and hyperparameter tuning). Test data spanning 2011–20 was not viewed at any point of the model development and training process and only used to evaluate the predictive skill of our trained models on previously unseen data; we refer to this period as the “test period.” As a navigation tool for the reader, Table 2 gives references to the presented results for different tasks.

Table 2.

A table with references to the main results.


a. Regression

1) Precipitation regression using NCEP-CFSv2

Precipitation regression results are presented in Table 3. While the individual ML approaches produce results generally similar to those of the baselines, the stacked ML model, in particular, outperforms the baseline models in almost all metrics. Note that the best R2 value, associated with the stacked model, is still near zero; while this is a significant improvement over, for example, the ensemble mean, which has an R2 value of −0.08, the low values for all methods indicate the difficulty of the forecasting problem. It is important to note that R2 measures the accuracy of a model relative to a baseline corresponding to the mean of the target over the test period—that is, relative to a model that could never be used in practice as a forecaster because it uses future observations. The best practical analog to this would be the mean of the target over the train period—what we call the “historical mean” or climatology model. These two models are not the same, possibly because of the nonstationarity of the climate (Min et al. 2020). Thus, even when our R2 values are negative (i.e., we perform worse than the impractical mean of the target over the test period), we still perform much better than the practical climatology predictor. The model stacking approach is applied to the models trained on all available features (i.e., ensemble members, PE, and climate variables; linear regression is trained on all features except PE). We decide what models to include in the stacking approach based on their performance on validation data. The low 90th percentile error implies that our methods not only have high skill on average but also that there are relatively few locations with large errors. While acknowledging the overall performance may not be exceptional, it is important to recognize the potential of machine learning methods in improving the quality of estimates relative to the standard baselines. To further evaluate the capabilities of the stacking approach, we also apply the approach to the baseline predictions, which include historical and ensemble means, as well as linear regression. The performance of the stacked baseline model exceeds that of any of the individual baseline models and is similar to the performance of the stacked ML approach in terms of the R2 metric. However, the stacked ML approach outperforms it in all MSE-based metrics, indicating that the ML techniques can still provide additional skill even for as notoriously challenging a quantity as precipitation.

Table 3.

Results for precipitation regression using the NCEP-CFSv2 ensemble, with errors reported over the test period. LR refers to linear regression on all features, including ensemble members, lagged data, climate variables, and SSTs. ML model stacking is performed on models that are trained on all features. The best results are in bold. MSE is reported in squared millimeters.


Figure 2 illustrates the performance of key methods with R2 heatmaps over the United States to highlight spatial variation in errors. The RF and U-Net R2 fields are qualitatively similar, but they are still quite different in certain states such as Georgia, North Carolina, Virginia, Utah, and Colorado. The LR map is noticeably poor across most regions. The stacked ML model's heatmap reveals large regions where its predictive skill exceeds that of all other methods. Note that model stacking yields relatively accurate predictions even in regions where the three constituent models individually perform poorly (e.g., southwestern Arizona), highlighting the generalization abilities of our stacking approach. All methods tend to have higher accuracy along the Pacific Coast, in the Midwest, and in southern states such as Alabama and Missouri. The two stacking model heatmaps look similar. The stacking model applied to the baselines has better R2 scores in California compared to the stacked ML methods. However, the stacked ML model reveals larger positive R2 regions and fewer dark red spots, particularly evident in New Mexico, Minnesota, and Utah.

Fig. 2.

The R2 score heatmaps of baselines and learning-based methods for precipitation regression using NCEP-CFSv2 ensemble members; errors are computed over the test period. Positive values (blue) indicate better performance. See section 7a(1) for details.


2) Temperature regression

Table 4 shows results for 2-m temperature regression. The learning-based models, especially the random forest and stacked model, significantly outperform the baseline models in terms of MSE and R2 score. The random forest also outperforms linear regression and the U-Net. Note that LR, U-Net, and RF are trained without using SST information since SST features yielded worse performance over the validation period. Figure 3 illustrates the performance of these methods with R2 heatmaps over the United States. As expected, the model stacking approach shows the best results across spatial locations. We notice that there are still regions, such as parts of the West, Texas, Florida, and Georgia, where all models tend to achieve a negative R2 score.

Table 4.

Results for temperature regression using the NCEP-CFSv2 ensemble, with errors reported over the test period. LR refers to linear regression on all features including ensemble members, lagged data, and land variables. Model stacking is performed on models that are trained on all features except SSTs. The best results are in bold. MSE is reported in squared degrees Celsius.

Fig. 3.

The R2 score heatmaps of baselines and learning-based methods for temperature regression using NCEP-CFSv2 ensemble members; errors are computed over the test period. Positive values (blue) indicate better performance. See section 7a(2) for details.


b. Quantile regression

We explore the use of quantile regression to predict values z so that "there is a 90% chance that the average temperature will be below z° at your location next month" or, equivalently, "there is a 10% chance that the average temperature will exceed z° at your location next month." In this sense, quantile regression focused on the 90th percentile predicts temperature and precipitation extremes, a task highly relevant to many stakeholders. We train a linear regression model fitting the quantile loss (Linear QR), an RFQR (Meinshausen 2006), a U-Net, and the stacked model. The Linear QR and RFQR details are discussed in section 5. The experimental results below show that temperature extremes can be predicted with high accuracy by the learning-based models (particularly our stacked model), in stark contrast to historical quantiles or ensemble quantiles in the case of temperature quantile regression. The results for precipitation are less striking overall, though the learned models are significantly more predictive in some locations on this quantile regression task.

1) Quantile regression of precipitation

For each location, the 90th percentile value is calculated based on the historical data. For the ensemble 90th percentile, we simply take the 90th percentile of the K ensemble members. Table 5 summarizes results for precipitation quantile regression using the NCEP-CFSv2 ensemble. Our stacked model is able to significantly outperform all baselines. The performance illustration is given in appendix A, Fig. A3.

Table 5.

Test results for precipitation quantile regression using NCEP-CFSv2 dataset, with target quantile = 0.9. Linear QR refers to a linear quantile regressor. RFQR corresponds to a random forest quantile regressor. Model stacking is performed on models that are trained on all features. The best results are in bold. Quantile loss is reported in millimeters.


2) Quantile regression of temperature

Table 6 summarizes results for temperature quantile regression using the NCEP-CFSv2 ensemble. Note that we do not include SST features for temperature quantile regression in our learned models. We observe that all of our learned models are able to significantly outperform all baselines. In Fig. 4, we show the heatmaps of quantile loss of baselines and our learned models. We observe that the learned models produce predictions with varied quality, and the stacked model can pick up useful information from them. For example, in Arizona and Texas, the Linear QR, U-Net, and RFQR show some errors but in different locations, and the stacked model can exploit the advantages of each model.

Table 6.

Test results for temperature quantile regression using NCEP-CFSv2 dataset, with target quantile = 0.9. Linear QR refers to a linear quantile regressor. RFQR corresponds to a random forest quantile regressor. Model stacking is performed on models that are trained on all features except for SSTs. Learned models can predict highly likely temperature ranges accurately, meaning there are fewer unpredicted temperature spikes. The best results are in bold. Quantile loss is reported in degrees Celsius.

Fig. 4.

Test quantile loss heatmaps of baselines and learning-based methods for temperature quantile regression using NCEP-CFSv2 dataset. Blue regions indicate smaller quantile loss. See section 7b(2) for details.


8. Discussion

a. The efficacy of machine learning for SSF

Several hypotheses might explain why ML may be a promising approach for SSF, and we probe those hypotheses in this section.

1) Using full ensemble vs ensemble mean

Past works use ensemble mean as an input feature to machine learning methods in addition to the climate variables (Hwang et al. 2018; He et al. 2021). Ensembles provide valuable information not only about expected climate behavior but also about variance or uncertainty in multiple dimensions; methods that rely solely on ensemble mean lack information about this variance. Ensemble members may have systematic errors, either in the mean or the variability, arising from different initial conditions of the corresponding dynamic model that are not readily apparent to users. The more recently initialized an ensemble member is, the better it usually performs. While taking the average of these ensemble members may cancel out the deficiencies of each individual member, it is also possible that details of each member’s systematic errors may be directly discovered and corrected independently by a machine learning model. Therefore, using a single ensemble statistic, such as the ensemble mean, as a feature may not fully capitalize on the information provided by using all members of the lagged ensemble as features.

In our experiments, we find that using all available ensemble members enhances the prediction quality of our approaches. As an illustration, we show the results of the LR, RF, U-Net, and stacked models trained on all ensemble members, compared to the same ML models trained on the ensemble mean only. In addition to the full ensemble or the ensemble mean, we use the other available features (as in our previous regression results). Tables 7 and 8 present the precipitation and temperature forecasting results. For the linear regression, utilizing the ensemble mean with all other features produces better test performance than using the full ensemble. Such behavior is not surprising for the LR, since the full ensemble incorporates large variance across ensemble members, which may result in a worse linear fit. For the U-Net, RF, and stacked model, we observe significant performance improvements, in terms of an MSE at least one standard error smaller, when using the full ensemble instead of the ensemble mean. When we compare the performance of learned models using only the ensemble mean to that of learned models using both the ensemble mean and the ensemble standard deviation at each spatial location, we find that the addition of the standard deviation feature does not provide enough information to significantly improve the performance of the ML models; in fact, the U-Net exhibits a performance degradation, a potential sign of overfitting. These observations are visually supported by Figs. 5 and 6, where the R2 heatmaps of our methods (except the U-Net) utilizing the ensemble mean and standard deviation closely resemble those of methods relying solely on the ensemble mean. We conclude that the full ensemble contains important information for SSF beyond the ensemble mean, and our models can capitalize on this information for precipitation and temperature forecasting.

Table 7.

Precipitation forecasting performance comparison of the LR, RF, U-Net, and stacked model trained using the ensemble mean, using the sorted ensemble members, or using the original ensemble, in addition to other features. Scores on the test data are reported, and NCEP-CFSv2 data are used. The best results are in bold. MSE is reported in squared millimeters.

Table 8.

Temperature forecasting performance comparison of the LR, RF, U-Net, and stacked model trained using the ensemble mean, using the sorted ensemble members, or using the original ensemble, in addition to other features. Scores on the test data are reported and NCEP-CFSv2 data are used. The best results are in bold. MSE is reported in squared degrees Celsius.

Fig. 5.

Precipitation regression test R2 heatmaps of LR, U-Net, RF, and stacked model trained using ensemble mean only, using sorted and shuffled ensemble, or using the full ensemble. The NCEP-CFSv2 ensemble is used. See section 8a for details.

Fig. 6.

Temperature regression test R2 heatmaps of LR, U-Net, RF, and stacked model trained using ensemble mean only, using sorted and shuffled ensemble, or using the full ensemble. The NCEP-CFSv2 ensemble is used. See section 8a for details.

We can perform a statistical test to verify that the performance discrepancies between using the ensemble mean and using the full ensemble are statistically significant for the stacked model. As before, let $\hat{y}_{t,l}$ refer to the estimate under our usual stacked model (i.e., with all ensemble members). Let $\hat{y}_{t,l}^{\mathrm{SEA}}$ refer to a stacked model with just the ensemble mean as a feature, instead of all ensemble members. We can employ a sign test framework (DelSole and Tippett 2014; Cash et al. 2019) to compare model performance under minimal distributional assumptions. Namely, we only make the following i.i.d. assumption over the time dimension:
$$\mathbb{1}\left\{\left|\hat{y}_{t,l} - y_{t,l}\right| < \left|\hat{y}_{t,l}^{\mathrm{SEA}} - y_{t,l}\right|\right\} \overset{\mathrm{iid}}{\sim} \mathrm{Bernoulli}(p_l).$$
Intuitively, this corresponds to assuming it is a coin flip which model will perform better at each time point and location, and we would like to test whether each location’s “coin” is fair or not. We can then formulate our null and alternative hypotheses for each location l as follows:
$$H_{0,l}: p_l = 0.5, \qquad H_{1,l}: p_l > 0.5.$$
Thus, our overall test for significance is for the global null hypothesis $H_0 = \bigcap_{l=1}^{3274} H_{0,l}$. We calculate a p value for each $H_{0,l}$ and then check whether any of these p values fall below a Bonferroni-corrected threshold of 0.05/3274 ≈ 1.53 × 10⁻⁵, where 3274 is the number of locations. In fact, the minimum p values for this test are far below this threshold for the precipitation and temperature regression tasks alike (1.68 × 10⁻¹⁰ and 4.42 × 10⁻²⁴, respectively). This allows us to reject the global null hypothesis for both temperature and precipitation, and we conclude that including the full ensemble in our stacked model significantly outperforms including just the ensemble mean.
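As a concrete illustration of this procedure, the sketch below computes one-sided sign-test p values per location and applies the Bonferroni correction; the array names and shapes are illustrative assumptions rather than our released code.

```python
import numpy as np
from scipy.stats import binomtest

def sign_test_p_values(abs_err_full, abs_err_mean_only):
    """One-sided sign test per location: does the full-ensemble stacked model
    beat the ensemble-mean-only model more than half of the time?
    Both inputs are hypothetical (T, L) arrays of absolute errors."""
    wins = abs_err_full < abs_err_mean_only          # Bernoulli indicators
    n_times, n_locs = wins.shape
    p_values = np.empty(n_locs)
    for l in range(n_locs):
        k = int(wins[:, l].sum())
        p_values[l] = binomtest(k, n_times, p=0.5, alternative="greater").pvalue
    return p_values

# Reject the global null if any location falls below the Bonferroni threshold:
# p_vals = sign_test_p_values(err_full, err_mean)
# reject_global = (p_vals < 0.05 / p_vals.size).any()
```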

2) Sensitivity to ensemble formulation

We consider the hypothesis that there is a set of k ensemble members that are always best. To test this hypothesis, we use a training period to identify which k members perform best for each location, and then during the test period, we compute the average of only these k ensemble members. The performance of this approach depends on k, the number of ensemble members we allow to be designated "good." For no value of k does this selective average exhibit a significant improvement over the ordinary ensemble mean; a sketch of the procedure follows.
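The sketch below summarizes this "best-k members" baseline under the assumption that the ensemble is stored as a (time, member, location) array; the layout and names are illustrative.

```python
import numpy as np

def best_k_member_mean(ens_train, y_train, ens_test, k):
    """ens_* have shape (T, K, L); y_train has shape (T, L).
    Select the k members with the lowest training MSE at each location and
    average only those members during the test period."""
    mse = ((ens_train - y_train[:, None, :]) ** 2).mean(axis=0)   # (K, L)
    top_k = np.argsort(mse, axis=0)[:k]                           # (k, L)
    selected = np.take_along_axis(ens_test, top_k[None, :, :], axis=1)
    return selected.mean(axis=1)                                  # (T_test, L)
```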

If the ensemble members have different levels of accuracy over various seasons, locations, and conditions, then a machine learning model may be learning when to “trust” each member. We know that our ensemble members are lagged, meaning they are initialized at different times. We believe each ensemble member encapsulates valuable information derived from the underlying physical model during each initialization. To investigate the impact of ensemble member order, we perform the following experiment: we randomly permute ensemble members at every time step t for all locations (preserving the spatial information) and apply our ML models to these shuffled ensembles. From Tables 7 and 8, this approach negatively affects the performance of the ML models compared to using the full ensemble with the original order. One possible explanation is that the learned models lose the ability to learn which ensemble member to trust, as this information is tied to the initialization time of each ensemble member. Even though the spatial information remains intact after the shuffling, the models can no longer exploit dependencies associated with the original ensemble structure.

Additionally, we conduct an experiment designed to test whether it is important to keep track of which ensemble member made each prediction or whether it is the distribution of predictions that matters. The modeling approach for the former is to feed in ensemble member 1's forecast as the first feature, ensemble member 2's forecast as the second feature, and so on. The modeling approach under the distributional hypothesis is to make the smallest prediction the first feature, the second-smallest prediction the second feature, and so on, i.e., we sort the ensemble forecasts for each location separately. Note that this entails treating the ensemble members symmetrically: the model would give the same prediction if ensemble member 1 predicted a and ensemble member 2 predicted b or if ensemble member 1 predicted b and ensemble member 2 predicted a. In statistical parlance, this is passing in the order statistics of the forecasts as the features rather than their original ordering. Note that for NCEP-CFSv2, ensemble forecasts are originally ordered according to the time their initial conditions are set (Saha et al. 2014). According to Tables 7 and 8, using the sorted ensemble drastically degrades the U-Net's performance, essentially because sorting the ensemble members individually at each location may hamper the U-Net's ability to learn spatial structure. In the case of precipitation regression with the stacking model from Table 7, the MSE of the sorted approach is 2.24, which is worse than the 2.07 MSE for the original ordering. In the case of temperature forecasting from Table 8, the MSE of the sorted approach is 3.70, which is much worse than the 3.11 MSE for the original ordering. The mean R2 of the sorted approach is also lower than that of the original ordering. In both cases, the performance is better when we feed in the features in such a way that the machine learning model has an opportunity to learn aspects of each ensemble member, not merely their order statistics. Therefore, imposing a symmetric treatment of ensemble members degrades performance. Figures 5 and 6 show the corresponding R2 heatmaps of our models for the precipitation and temperature regression tasks.
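For concreteness, the sketch below shows how the shuffled and sorted ensemble features used in the two experiments above can be constructed; the (time, member, location) array layout is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sorted_features(ens):
    """Order statistics: sort members independently at every time and location."""
    return np.sort(ens, axis=1)          # ens has shape (T, K, L)

def shuffled_features(ens):
    """Randomly permute members at every time step, using the same permutation
    for all locations so the spatial maps stay intact."""
    out = ens.copy()
    for t in range(ens.shape[0]):
        out[t] = ens[t, rng.permutation(ens.shape[1]), :]
    return out
```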

b. Using spatial data

There are several ways to incorporate information about location in our models. The U-Net has access to spatial dependencies through its design. Specifically, our U-Net takes the spatial location of each point in the map as an input. Naively, we might represent each location using its latitude and longitude values. Alternatively, we may use positional encoding, which is known to be beneficial in many areas of ML beyond NLP (as we mention in section 6a). PE captures position and allows a model to learn contextual relationships, both local (between nearby elements) and global (dependencies across the entire domain). We assume that the PE approach represents spatial information in a manner more accessible to our learned models.
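As a concrete illustration, the sketch below builds sinusoidal positional-encoding channels from a latitude/longitude grid; the number of frequencies and the exact functional form are assumptions for illustration and not necessarily the precise encoding used in our experiments.

```python
import numpy as np

def positional_encoding(lat, lon, n_freq=6):
    """lat, lon: (H, W) arrays in degrees.
    Returns (4 * n_freq, H, W) channels of sine/cosine features at several
    frequencies, mixing local and global spatial context."""
    feats = []
    for coord in (lat, lon):
        x = np.deg2rad(coord)
        for i in range(n_freq):
            freq = 2.0 ** i
            feats.append(np.sin(freq * x))
            feats.append(np.cos(freq * x))
    return np.stack(feats, axis=0)
```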

As an illustration, Table 9 and Fig. 7 demonstrate the performance of a stacked model using LR, RF, and U-Net trained using positional encoding, using latitude and longitude values, and using no features representing the spatial information (no PE and no latitude or longitude values). Other inputs to the LR, RF, and U-Net models are ensemble member forecasts, lagged target variable, climate variables, and SSTs (except in the case of temperature forecasting). The results suggest that using PE enhances the predictive skill of our models, compared to using just the latitude/longitude values or no location information, especially for the temperature forecasting task. Using no information about locations hurts the performance of precipitation regression. Thus, our models can account for spatial dependencies using input features, and PE is more beneficial than the raw latitude and longitude information. These findings on PE effectiveness are consistent with prior findings in ML. For example, Wu et al. (2021) investigate the efficacy of PE in the context of the visual transformer model used for image classification and object detection. We show a more detailed analysis with results for the LR, RF, and U-Net in appendix C, section a.

Table 9.

Test performance comparison of the stacked model of LR, RF, and U-Net trained using no spatial features, using latitude and longitude values, or using PE. Utilizing spatial representations, including PE, latitude, and longitude values, helps advance the predictive skill. Furthermore, using positional encoding is more beneficial than using raw latitude and longitude values. The best results are in bold. MSE is reported in squared millimeters for precipitation and in squared degrees Celsius for temperature.

Fig. 7.

Test R2 heatmaps of the stacked model of LR, RF, and U-Net trained using no spatial features, using latitude and longitude values, or using PE. The NCEP-CFSv2 ensemble is used. See section 8b for details.

c. Variable importance

One consideration when implementing ML for SSF is that ML models can incorporate side information (such as spatial information, lagged temperature and precipitation values, and climate variables). In this section, we explore the importance of the various components of side information. We see that including the observational climate variables improves the performance of the random forest and the U-Net for precipitation regression. Furthermore, including positional encoding of the locations improves the performance of the U-Net, while the principal components of the sea surface temperature do not make a notable difference in the case of temperature prediction.

More specifically, Table 10 summarizes grouped feature importance of precipitation regression using the NCEP-CFSv2 ensemble. We observe that models, in particular random forest and U-Net, trained on all available data achieve the best performance. In the case of linear regression, the SST features are neither very helpful nor actively harmful. Therefore, to be consistent, we use predictions of these models trained on all features as input to the stacking model.

Table 10.

Grouped feature importance results on validation for the precipitation regression task using NCEP-CFSv2 ensemble members. The results suggest that using additional observational information helps to improve the performance of learning-based models for this task. The symbol –″– denotes a repetition of the features used in the row above. For example, in the U-Net part of the table, "–″– & lags" means that ensemble members, PE, and lags are used as features, and "–″– & SSTs" means that ensemble members, PE, lags, land features, and SSTs are used as features. The best results are in bold. MSE is reported in squared millimeters.


Table 11 summarizes grouped feature importance of temperature regression using the NCEP-CFSv2 ensemble. In this case, adding some types of side information may yield only very small improvements in predictive skill, and in some cases, the additional information may decrease predictive skill. In part, this effect can be explained by the different training set sizes of the models: as we outline in section 5d, the training set size for the RF is n = T × L, while it is n = T for the U-Net. This effect may also be a sign of overfitting, as temperature forecasting presents a comparatively less complex challenge than precipitation forecasting. We also note that SSTs provide only marginal (if any) improvement in predictive skill, in part because Pacific SSTs are less helpful away from the western United States (Mamalakis et al. 2018; Seager et al. 2007). It could also be that information from the SSTs is already well captured by the output from the dynamical models, and thus, including observed SSTs does not provide much additional information. To be consistent, we use the predictions of these models trained on all features except SSTs as input to the stacking model.

Table 11.

Grouped feature importance results on validation for the temperature regression task using NCEP-CFSv2 ensemble members. The results demonstrate that using some additional information may yield only very small improvements in predictive skill, and in some cases, the side information may decrease predictive skill. The symbol –″– denotes a repetition of the features used in the row above. For example, in the U-Net part of the table, "–″– & lags" means that ensemble members, PE, and lags are used as features, and "–″– & SSTs" means that ensemble members, PE, lags, land features, and SSTs are used as features. The best results are in bold. MSE is reported in squared degrees Celsius.


9. Conclusions and future directions

This paper systematically explores the use of machine learning methods for subseasonal forecasting, highlighting several important factors: 1) the importance of using ensembles of physics-based forecasts (as opposed to only using the mean, as is common practice); 2) the potential for forecasting temperature and precipitation extremes using quantile regression; 3) the efficacy of different mechanisms, such as positional encoding and convolutional neural networks, for modeling spatial dependencies; 4) the importance of various features, such as sea surface temperature and lagged temperature and precipitation values, for predictive accuracy; and 5) the substantial benefits of model stacking, which leverages the different ways the contributing models use spatial data. The stacked model likely capitalizes on this diversity to improve performance. Together, these results provide new insights into using ML for subseasonal weather forecasting in terms of the selection of features, models, and methods.

Our results also suggest several important directions for future research. In terms of features, there are many climate forecasting ensembles computed by organizations such as NOAA and ECMWF. This paper focuses on ensembles in which ensemble members have a distinct ordering (in terms of lagged initial conditions used to generate them), but other ensembles correspond to initial conditions or parameters drawn independently from some distribution. Leveraging such ensemble forecasts and potentially jointly leveraging ensemble members from multiple distinct ensembles may further improve the predictive accuracy of our methods.

In terms of models, newer neural architectures such as transformers show remarkable performance on several image analysis tasks (Dosovitskiy et al. 2020; Carion et al. 2020; Chen et al. 2021; Khan et al. 2022) and have potential in the context of forecasting temperature and precipitation maps. A careful study is needed, as past image analysis work using transformers generally relies on large quantities of training data, exceeding what is available in SSF contexts. Recent advancements in data-driven global weather forecasting models, such as Pangu-Weather (Bi et al. 2022), FourCastNet (Pathak et al. 2022), and GenCast (Price et al. 2023), demonstrate the potential of ML techniques to enhance forecasting capabilities across various time scales. These models outperform traditional numerical weather prediction approaches, suggesting that similar data-driven methods may hold promise for improving SSF quality.

In terms of methods, two outstanding challenges are particularly salient to the SSF community. The first is uncertainty quantification; that is, we wish not only to forecast temperature or precipitation but also to predict the likelihood of certain extreme events. Our work on quantile regression is an important step in this direction, and statistical methods like conformalized quantile regression (Romano et al. 2019) may provide additional insights. Second, we see in Fig. C7 that, at least in some geographic regions, the distribution of ensemble hindcast and forecast data may be quite different. Employing methods that are more robust to distribution drift (Wiles et al. 2021; Subbaswamy et al. 2021; Zhu et al. 2021) is particularly important not only for handling forecast and hindcast data but also for accurate SSF in a changing climate.

Acknowledgments.

The authors gratefully acknowledge the support of the NSF (OAC-1934637, DMS-1930049, and DMS-2023109) and C3.ai.

Data availability statement.

All data used in the experiments are publicly available. NOAA data for historical averages are available at https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/national/time-series. Other climate variables (tmp2m, precip, slp, hgt500, rhum, and SSTs) are also publicly available at https://sites.google.com/view/ssf-dataset. The NCEP-CFSv2 ensemble members are available at https://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/.NCEP-CFSv2/. The NASA-GMAO ensemble members are available at https://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/.NASA-GMAO/. Our code is available at https://github.com/elena-orlova/SSF-project.

APPENDIX A

Regression Results for NASA-GMAO and NCEP-CFSv2

a. Regression

1) Precipitation regression using NASA-GMAO

Precipitation regression results on the test data from NASA-GMAO are presented in Table A1. For this dataset, no learned method or method leveraging ensemble model forecasts significantly outperforms climatology. Note that the best R2 value, achieved by climatology, is still negative; the low values for all methods indicate the difficulty of this forecasting problem.

Table A1.

Test results for precipitation regression using NASA-GMAO dataset. LR refers to linear regression on all features including ensemble members, lagged data, land variables, and SSTs. Model stacking is performed on models that are learned on all features. Bold values indicate the best performance for each statistic. MSE is reported in squared millimeters.


Figure A1 illustrates the test performance of key methods on NASA-GMAO data with R2 heatmaps over the United States. Although the stacked model does not show the best performance in terms of mean R2 score, it has more geographic regions with positive R2 than any other method.

Fig. A1.

Test R2 score heatmaps of baselines and learning-based methods for precipitation regression using the NASA-GMAO dataset. Positive values (blue) indicate better performance. See appendix A, section a(1) for details.

2) Regression of temperature

Temperature regression results using NASA-GMAO ensemble members are presented in Table A2. The random forest and linear regression outperform all baselines in terms of both R2 score and MSE. However, the U-Net model’s performance is lower compared to other learned methods, which might be a sign of overfitting. Despite this performance drop of U-Net, the model stacking approach still demonstrates the best predictive skill. Note that the model stacking approach is applied to the models that are trained on all available features except SSTs (similar to NCEP-CFSv2 data).

Table A2.

Test results for temperature regression using NASA-GMAO dataset. LR refers to linear regression on all features including ensemble members, lagged data, and land variables. Model stacking is performed on models that are learned on all features except SSTs. Bold values indicate the best performance for each statistic. MSE is reported in squared degrees Celsius.


Figure A2 illustrates the test performance of key methods on NASA-GMAO data with R2 heatmaps over the United States. The stacked model shows the best performance across spatial locations. Similar to the NCEP-CFSv2 dataset, we notice that there are still regions where all models tend to exhibit a negative R2 score.

Fig. A2.

Test R2 score heatmaps of baselines and learning-based methods for temperature regression using NASA-GMAO dataset. Positive values (blue) indicate better performance. See appendix A, section a(2) for details.

b. Quantile regression

1) Quantile regression of precipitation using NCEP-CFSv2 ensemble

In Fig. A3, we show heatmaps of quantile loss using all locations in the United States, where blue means smaller quantile loss and yellow means larger quantile loss. We observe that the learning-based models outperform the baselines, especially in Washington, California, Idaho, and near the Gulf of Mexico.

Fig. A3.

Test quantile loss heatmaps of baselines and learning-based methods for precipitation quantile regression using NCEP-CFSv2 dataset. Blue regions indicate smaller quantile loss. See section 7b(1) for details.

Fig. A4.

Test quantile loss heatmaps of baselines and learning-based methods for precipitation quantile regression using NASA-GMAO dataset. Blue regions indicate smaller quantile loss. See appendix A, section b(2) for details.

Fig. A5.

Test quantile loss heatmaps of baselines and learning-based methods for temperature quantile regression using NASA-GMAO dataset. Blue regions indicate smaller quantile loss. See appendix A, section b(3) for details.

2) Quantile regression of precipitation using NASA-GMAO ensemble

Table A3 summarizes results for precipitation quantile regression using the NASA-GMAO ensemble. The models are the same as those applied to the NCEP-CFSv2 ensemble. Our best model shows performance similar to that of the baselines. One possible reason is that the NASA-GMAO ensemble empirically performs worse than the NCEP-CFSv2 ensemble. Furthermore, by design, the NASA-GMAO ensemble has fewer ensemble members and therefore provides less coverage of the distribution of precipitation than the NCEP-CFSv2 ensemble, so our learned models have access to less information about the true distribution of precipitation.

Table A3.

Test results for precipitation quantile regression using NASA-GMAO dataset, with target quantile = 0.9. Model stacking is performed on models that are learned on all features. The best results are in bold. Quantile loss is reported in millimeters.


3) Quantile regression of temperature

Table A4 and Fig. A5 summarize results for temperature quantile regression using the NASA-GMAO ensemble. All of our learned models are able to outperform all baselines.

Table A4.

Test results for temperature quantile regression using NASA-GMAO dataset, with target quantile = 0.9. Model stacking is performed on models that are learned on all features except for SSTs. The best results are in bold. Quantile loss is reported in degrees Celsius.


APPENDIX B

Tercile Classification

In this section, we present results for the tercile classification task for both climate variables and both datasets.

a. Tercile classification of precipitation

In this case, the proposed learning-based methods are directly trained on the classification task. Predictions of the baselines, such as the ensemble mean, are split into three classes according to the 33rd and 66th percentile values. Note that the random forest and U-Net are trained for classification using all available features. We do not observe a significant difference in the validation performance of logistic regression when the inputs are the ensemble members only versus the ensemble members with side information, so we use logistic regression on the ensemble members only. Model stacking is applied to the logistic regression, U-Net, and random forest outputs.
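A minimal sketch of this thresholding step follows; taking the 33rd and 66th percentiles from the training-period values at each location is an assumption about the reference distribution made for illustration.

```python
import numpy as np

def to_terciles(pred, y_train):
    """pred, y_train: (T, L) arrays. Returns integer classes 0/1/2 for
    below-normal, near-normal, and above-normal terciles."""
    lo = np.percentile(y_train, 33, axis=0)
    hi = np.percentile(y_train, 66, axis=0)
    return (pred > lo).astype(int) + (pred > hi).astype(int)
```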

Table B1 summarizes results for NCEP-CFSv2 and NASA-GMAO datasets on the test data. For this task, the learning-based methods achieve the best performance in terms of accuracy for both datasets. In the case of NCEP-CFSv2 data, U-Net achieves the highest accuracy score, and the performance of the stacked model is comparable with it. For NASA-GMAO data, the stacked model shows the best performance.

Table B1.

Test results for tercile classification of precipitation on different datasets. Accuracy in percentage is reported. Note that for this task, our models are trained for classification directly while baselines perform regression, and a threshold for predicted values is applied. For stacking, logistic regression, U-Net, and RF outputs are used.


The accuracy heatmaps over the U.S. land area are presented in Fig. B1 for the NCEP-CFSv2 dataset. The plots corresponding to the learning-based methods show the best results, especially on the West Coast, in Colorado, and in North America.

Fig. B1.

Test accuracy heatmaps of baselines and learning-based methods for tercile classification of precipitation using the NCEP-CFSv2 dataset. The accuracy color bar is recentered to be white at 1/3, which corresponds to a random-guess score. Blue pixels indicate better performance, while red pixels correspond to performance that is worse than a random guess. See appendix B, section a for details.

The accuracy heatmaps over the U.S. land are presented in Fig. B2 for the NASA-GMAO dataset. The plots corresponding to the learning-based methods show the best results, and the ensemble mean’s figure has the most red regions.

Fig. B2.

Test accuracy heatmaps of baselines and learning-based methods for tercile classification of precipitation using the NASA-GMAO dataset. The accuracy color bar is recentered to be white at 1/3, which corresponds to a random-guess score. Blue pixels indicate better performance, while red pixels correspond to performance that is worse than a random guess. See appendix B, section a for details.

b. Tercile classification of temperature

The next task is tercile classification of 2-m temperature. In this case, the threshold is applied to the regression predictions of all methods, meaning there is no direct training for classification. Table B2 summarizes results for the NCEP-CFSv2 and NASA-GMAO datasets on the test data. For this task, the learning-based methods achieve the best accuracy: the stacked model for the NCEP-CFSv2 data and linear regression, with all additional features (except SSTs), for the NASA-GMAO data. In general, all learning-based models significantly outperform the ensemble mean.

Table B2.

Test results for tercile classification of temperature on different datasets. Accuracy in percentage is reported. Note that for this task, our models are trained for regression and the threshold for predicted values is applied.


Figure B3 shows accuracy heatmaps over the United States for different methods using NCEP-CFSv2 data. The stacked model shows the best performance across spatial locations. For example, the ensemble mean does not show great performance in the Southeast and Middle Atlantic regions, while learning-based methods demonstrate much stronger predictive skill in these areas. However, there are still some areas, such as Texas or the Southwest region, with red pixels for all methods.

Fig. B3.

Test accuracy heatmaps of baselines and learning-based methods for tercile classification of temperature using the NCEP-CFSv2 dataset. The accuracy color bar is recentered to be white at 1/3, which corresponds to a random-guess score. Blue pixels indicate better performance, while red pixels correspond to performance that is worse than a random guess. See appendix B, section b for details.

Figure B4 shows accuracy heatmaps over the United States for different methods using NASA-GMAO data. In this case, linear regression on all features achieves the best scores. Other learning-based methods outperform the ensemble mean too, especially in the West and in Minnesota.

Fig. B4.

Test accuracy heatmaps of baselines and learning-based methods for tercile classification of temperature using the NASA-GMAO dataset. The accuracy color bar is recentered to be white at 1/3, which corresponds to a random-guess score. Blue pixels indicate better performance, while red pixels correspond to performance that is worse than a random guess. See appendix B, section b for details.

APPENDIX C

Extended Discussion

In this section, we present more detailed results for section 8. We also provide additional analysis of temperature forecasting and experiments on training set size and the bootstrap.

a. PE vs latitude/longitude values or no location information

In this section, we elaborate on our experiment with different uses of location information. Similar to section 8b, we train and test our models with three different settings: using no location information, using latitude/longitude values, or using positional encodings. For the stacked model, we first train the LR, RF, and U-Net using these different settings before training the stacked model on the corresponding LR, RF, and U-Net outputs. Table C1 summarizes test performance for precipitation regression with these three settings of using location information. For linear regression, we observe that adding latitude/longitude values or PE features does not improve its performance. One interesting result is that adding PE for the LR degrades its performance, which may be because the PE features are nonlinear transformations of the latitude and longitude values that are difficult for a linear model to fit.

Table C1.

Precipitation regression test performance comparison of LR, U-Net, RF, and stacked model trained using no spatial features, using latitude and longitude values, or using PE. The best results are in bold.


For the RF, U-Net, and the stacked model, we observe that adding PE significantly improves performance, i.e., the MSE decreases by at least one standard error. For the U-Net, using latitude/longitude values yields worse overall performance than using no location information. Figure C1 shows the test R2 heatmaps for the LR, U-Net, RF, and stacked model under these three settings for precipitation regression. We observe that adding PE features not only improves performance for the U-Net and RF but also improves the stacked model's performance, since the stacked model benefits from better predictions from the U-Net and RF.

Fig. C1.

Precipitation regression test R2 heatmaps of LR, U-Net, RF, and stacked model trained using no spatial features, using latitude and longitude values, or using PE. The NCEP-CFSv2 ensemble is used. See appendix C, section a for more details.

Table C2 shows the test performances of LR, U-Net, RF, and stacked model on temperature regression. Similar to precipitation regression, adding latitude/longitude values or adding PE does not help the LR, but we observe significant performance improvement when adding positional encoding features for the U-Net, RF, and the stacked model. Figure C2 then shows the test R2 heatmaps for temperature regression. We can see that adding PE to the U-Net, RF, and stacked model improves forecast performance, especially in regions like Arizona, New Mexico, and Texas.

Table C2.

Temperature regression test performance comparison of LR, U-Net, RF, and stacked model trained using no spatial features, using latitude and longitude values, or using PE. The best results are in bold.

Fig. C2.

Temperature regression test R2 heatmaps of LR, U-Net, RF, and stacked model trained using no spatial features, using latitude and longitude values, or using PE. The NCEP-CFSv2 ensemble is used. See appendix C, section a for more details.

b. Bootstrap experiments

To evaluate the stability of our machine learning models with small sample sizes, we perform the following bootstrap experiments: We take bootstrap samples of size 200 from our training set and retrain our U-Net, RF, and LR. Then, we evaluate these models on the test set. We repeat this process 50 times and show the results in Fig. C3. We observe from the plots that the U-Net performs consistently better than the LR in precipitation regression but not for temperature regression. This result is consistent with what we have shown in Tables 3 and 4.

Fig. C3.

Box plots of MSEs for the U-Net, LR, and RF trained on 50 different sets of bootstrap samples, each of size 200. The NCEP-CFSv2 ensemble is used. See appendix C, section b for more details.

We also observe that the U-Net is more sensitive to different bootstrap samples than the RF and LR, which is not surprising since for the U-Net, the bootstrap samples correspond to 200 different spatial maps for training. In contrast, for the RF and LR, the bootstrap samples correspond to 200 × 3274 training samples.
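The bootstrap procedure itself is simple; a sketch is given below, where fit_and_score is a placeholder for training one of our models on the resampled months and returning its test MSE.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_mses(X_train, y_train, fit_and_score, n_repeats=50, size=200):
    """Repeatedly draw `size` training samples with replacement, refit a model,
    and record its MSE on the fixed test set."""
    mses = []
    for _ in range(n_repeats):
        idx = rng.choice(len(X_train), size=size, replace=True)
        mses.append(fit_and_score(X_train[idx], y_train[idx]))
    return np.array(mses)
```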

c. Precipitation forecast example

While climate simulations and ensemble forecasts are designed to provide useful predictions of temperature and precipitation based on carefully developed physical models, we see that machine learning applied to those ensembles can yield significantly higher predictive skill for a range of SSF tasks. Figure C4 illustrates key differences between predictive models for monthly precipitation with a lead time of 14 days. Individual ensemble members produce spatially smooth predictions with more extreme values. Linear regression, the random forest, the U-Net, and the stacked model produce predictions with higher spatial frequencies. The linear regression result, which uses a separate model trained for each spatial location, has the least spatial smoothness of all methods; this is especially visible in the Southeast and potentially does not reflect realistic spatial structure. The learning-based models more accurately predict localized regions of high and low precipitation than the ensemble mean.

Fig. C4.

An illustration of precipitation predictions $\hat{y}_{t,l}^{\mathrm{anomaly}}$ (in mm) of different methods for February 2016 (in the test period). (a) True precipitation. (b) LR on ensemble members. (c) Climatology. (d) LR on all features. (e) Ensemble mean. (f) U-Net on all features. (g) Example single ensemble member. (h) Random forest on all features. (i) Stacked model. See appendix C, section c for details.

Figure C5 shows the differences between the ground truth and the different model predictions. In this figure, white is associated with the smallest errors, while red pixels indicate overestimated precipitation and blue pixels indicate underestimated precipitation. The individual ensemble member (Fig. C5e) exhibits dark red regions across the West, while the ensemble mean shows better performance in this area. The colors are more muted for the stacked model in Fig. C5h. The climatology in Fig. C5a has the most neutral areas; however, its MSE is slightly higher than the stacked model's MSE. In general, all methods, including linear regression (Figs. C5b,d), U-Net (Fig. C5f), and random forest (Fig. C5g), tend to underpredict precipitation in the Southeast, Mid-Atlantic, and North Atlantic regions and predict higher precipitation levels in the West.

Fig. C5.

An illustration of the differences $y_{t,l}^{\mathrm{anomaly}} - \hat{y}_{t,l}^{\mathrm{anomaly}}$ (in mm) in precipitation predictions of different methods for February 2016 (in the test period). Red pixels indicate areas where a forecasting method predicts higher precipitation levels than the ground truth, blue pixels indicate underestimation of the precipitation, and white pixels correspond to an accurate prediction. See appendix C, section c for details.

d. Temperature forecasting analysis

Figure 3 shows regions in Texas and Florida where the ensemble mean and linear regression performance is poor, while a random forest achieves far superior performance. We conduct an analysis of forecasts of the ensemble mean, linear regression, and random forests in these regions together with a region in Wisconsin where all methods show good performance. Figure C6 indicates these regions and Table C3 summarizes the performance of different methods in these regions: the ensemble mean prediction quality dramatically drops between the validation and test periods in Texas and Florida, which is not the case for the random forest.

Fig. C6.

Regions where the temperature forecast is analyzed. See appendix C, section d for details.

Table C3.

Train, validation, and test performance of different methods in Texas, Florida, and Wisconsin regions. The task is temperature regression; NCEP-CFSv2 dataset is used. The performance of the ensemble mean and linear regression in the test period significantly decreases in Texas and Florida while the random forest is able to demonstrate reasonable results. All methods perform well in Wisconsin.


Why does RF perform so much better than simpler methods in some regions? One possibility is that the RF is a nonlinear model capable of more complex predictions. However, if that were the only cause of the discrepancy in performance, then we would expect that the RF would be better not only during the test period but also during the validation period. Table C3 does not support this argument; it shows that the ensemble mean and linear regression have comparable, if not superior, performance to the random forest during the validation period. A second hypothesis is that the distribution of temperature is different during the test period than during the training and validation periods. This hypothesis is plausible for two reasons: 1) climate change and 2) the training and validation data use hindcast ensembles while the test data use forecast ensembles. To investigate this hypothesis, in Fig. C7, we plot the true temperature and ensemble mean in the training, validation, and test periods for the three geographic regions. The discrepancy between the true temperatures and ensemble means in the test period is generally greater than during the training and validation periods in Texas and Florida (though not in Wisconsin, a region where validation and test performance are comparable for all methods). This lends support to the hypothesis that hindcast and forecast ensembles exhibit distribution drift, and the superior performance of the RF during the test period may be due to a greater robustness to that distribution drift.

Fig. C7.

Temperature predictions in degrees Celsius of different methods in the Texas, Florida, and Wisconsin regions. Black lines correspond to the train/val and val/test splits; train and validation correspond to the hindcast regime of the ensemble, while test corresponds to the forecast regime. See appendix C, section d for details.

The hindcast and forecast ensembles may have different predictive accuracies because the hindcast ensembles have been debiased to fit past observations, a procedure not possible for forecast data. To explore the potential impact of debiasing, Fig. C7 shows the "oracle debiased ensemble mean," which is computed by using the test data to estimate the forecast ensemble bias and subtracting it from the ensemble mean. This procedure, which would not be possible in practice and is used only to probe ensemble bias and distribution drift, yields smaller discrepancies between the true data and the (oracle debiased) ensemble mean than between the true data and the original (biased) ensemble mean. Specifically, the oracle debiased ensemble mean achieves a −0.20 mean R2 score (TX) and a −0.28 mean R2 score (FL) vs the −1.55 (TX) and −0.87 (FL) mean R2 of the original forecast ensemble mean. The errors during the test period are generally larger than during the train and validation periods, even after debiasing the ensemble members using future data. This effect may be attributed both to 1) the nonstationarity of the climate (note that there are more extreme values during the test period than during the training and validation periods, particularly in Texas and Florida) and 2) the fact that in the train and validation periods, we use hindcast ensemble members, whereas in the test period, we use forecast ensemble members.
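For clarity, the oracle debiasing described above amounts to the following computation; array names are illustrative.

```python
import numpy as np

def oracle_debias(ens_mean_test, y_test):
    """Diagnostic only: estimate the forecast-period bias of the ensemble mean
    from the test observations (unavailable in practice) and remove it.
    Both inputs are hypothetical arrays of shape (T_test, L)."""
    bias = (ens_mean_test - y_test).mean(axis=0)   # per-location mean error
    return ens_mean_test - bias
```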

APPENDIX D

Architecture Details

a. Machine learning architectures

1) U-Net details

The U-Net has skip connections that pair layers in the encoder with the corresponding layers in the decoder, so that the architecture forms a U shape. Figure D1 shows the architecture of the U-Net. The U-Net is a powerful deep convolutional network that is widely used in image processing tasks such as image segmentation (Ronneberger et al. 2015; Hao et al. 2020) and style transfer (Gatys et al. 2016; Jing et al. 2020).

Fig. D1.

U-Net architecture with C input channels. In our case, C equals the number of ensemble members plus the number of channels of all other climate data.

Our U-Net differs from the original U-Net by modifying the first 2D convolutional layer after the input. Since the number of input channels changes when we choose a different subset of features or a different ensemble (NCEP-CFSv2 or NASA-GMAO), this 2D convolutional layer transforms our input with C channels into a latent representation with 64 channels. The number of channels C depends on which ensemble we are using and what task we are performing. For example, for precipitation tasks using the NCEP-CFSv2 ensemble, the input channels include 24 ensemble members, 5 lagged observations, 4 other observational variables, 8 principal components of SSTs, and 24 positional encodings, resulting in 65 channels in total. For temperature tasks using the NASA-GMAO ensemble, there are only 11 ensemble members and we do not include SST information; hence, there are only 44 channels in total. The remaining layers use the same configuration as the standard U-Net (Ronneberger et al. 2015).
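A minimal PyTorch sketch of this input-adaptation idea is shown below; the module name and kernel size are illustrative and not necessarily our exact implementation.

```python
import torch.nn as nn

class InputAdapter(nn.Module):
    """Maps the C input channels (ensemble members plus other features) to the
    64-channel representation expected by a standard U-Net encoder."""
    def __init__(self, in_channels):           # e.g., 65 for NCEP-CFSv2 precipitation
        super().__init__()
        self.proj = nn.Conv2d(in_channels, 64, kernel_size=3, padding=1)

    def forward(self, x):                       # x: (batch, C, height, width)
        return self.proj(x)                     # -> (batch, 64, height, width)
```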

We also perform careful hyperparameter tuning for the U-Net. In particular, we run a 10-fold cross validation on our training set and use grid search to tune the learning rate, batch size, number of epochs, and weight decay. Since we use different loss functions for different forecast tasks and different numbers of input channels for the NCEP-CFSv2 and NASA-GMAO ensembles, we run hyperparameter tuning with the same cross-validation scheme separately for these tasks. For instance, for precipitation regression, we choose from 100, 120, 150, 170, 200, and 250 epochs; the batch size may be 8, 16, or 32; learning rate values are chosen from 0.0001, 0.001, and 0.01; and the weight decay can be 0, 1 × 10⁻³, or 1 × 10⁻⁴. In the case of NCEP-CFSv2 precipitation regression, the optimal parameters are 170 epochs, batch size 16, learning rate 0.0001, and weight decay 1 × 10⁻⁴. For temperature regression using the same data, the best parameters are 100 epochs, batch size 16, learning rate 0.001, and weight decay 1 × 10⁻³. For tercile classification of precipitation, the best parameters are 80 epochs (we choose from 60, 70, 80, 90, and 100 epochs for classification), batch size 8, learning rate 0.001, and weight decay 1 × 10⁻⁴.

2) Random forest quantile regressor details

We show a figure representation of the RFQR in Fig. D2. The RFQR is essentially trained as a regular random forest, but it makes a quantile estimate by taking the sample quantile of the responses in all leaves associated with a new input.
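The sketch below, in the spirit of Meinshausen (2006), illustrates this prediction rule using scikit-learn's forest and its apply method; it is a simplified stand-in for the implementation we use.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class SimpleRFQR:
    """Train an ordinary forest; predict a quantile from the pooled training
    responses in the leaves reached by a new input."""
    def __init__(self, alpha=0.9, **rf_kwargs):
        self.alpha = alpha
        self.rf = RandomForestRegressor(**rf_kwargs)

    def fit(self, X, y):
        self.rf.fit(X, y)
        self.train_leaves = self.rf.apply(X)    # (n_train, n_trees) leaf ids
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        test_leaves = self.rf.apply(X)          # (n_test, n_trees)
        preds = np.empty(len(X))
        for i, row in enumerate(test_leaves):
            # Pool the training responses from the matching leaf of every tree.
            pooled = np.concatenate([
                self.y_train[self.train_leaves[:, t] == row[t]]
                for t in range(row.size)
            ])
            preds[i] = np.quantile(pooled, self.alpha)
        return preds
```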

Fig. D2.

Illustration of a random forest, which serves as a visual aid for our discussion. The quantile regression forest outputs the empirical α quantile of the collection of all responses in all the leaves associated with $x_{t,l}$ (marked with a star for each tree).

3) Stacking model details

The stacking model is a simple one-layer neural network with 100 hidden neurons, using a sigmoid activation function for regression and a softmax output for classification. We use the implementation from the scikit-learn library (Pedregosa et al. 2011). We choose 100 neurons based on the stacking model's performance on the validation data (we also try 50, 75, 100, and 120 neurons). The stacking model demonstrates stable performance in general, but with 100 neurons, it usually achieves the best results. We use the "lbfgs" quasi-Newton optimizer for the regression tasks and the SGD optimizer for the classification tasks.
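For the regression tasks, this corresponds roughly to the following scikit-learn configuration; the feature matrix of base-model predictions is schematic.

```python
from sklearn.neural_network import MLPRegressor

# X_stack_* hold the base-model predictions (e.g., LR, RF, and U-Net outputs)
# as columns; hyperparameters follow the description in the text.
stacker = MLPRegressor(
    hidden_layer_sizes=(100,),   # one hidden layer with 100 neurons
    activation="logistic",       # sigmoid activation
    solver="lbfgs",              # quasi-Newton optimizer for regression
    max_iter=1000,
)
# stacker.fit(X_stack_train, y_train)
# y_hat = stacker.predict(X_stack_test)
```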

APPENDIX E

Additional Preprocessing Details

Random forest and U-Net require different input formats. For the U-Net, all input variables have a natural image representation except the SSTs and the information about location. For example, ensemble predictions can be represented as a tensor of shape (K, W, H), where K corresponds to the number of ensemble members (or the number of channels of an image) and W and H are the width and height of the corresponding image. In our case, W = 64 and H = 128. For the U-Net model, we handle the missing land variables over the sea regions by nearest-neighbor interpolation of the available values.
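A minimal sketch of this nearest-neighbor fill for a single map is shown below, assuming missing values over the ocean are stored as NaNs; the helper name is illustrative.

```python
import numpy as np
from scipy.interpolate import griddata

def fill_sea_with_nearest(field):
    """field: 2D array (e.g., 64 x 128) with NaNs over the ocean.
    Fill every grid point from the nearest valid (land) value."""
    h, w = field.shape
    yy, xx = np.mgrid[0:h, 0:w]
    valid = ~np.isnan(field)
    return griddata((yy[valid], xx[valid]), field[valid], (yy, xx), method="nearest")
```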

a. Sea surface temperatures

There are more than 100 000 SST locations available. We extract the top eight principal components. Principal component analysis is fit on the training part and then applied to the rest of the data. In the case of the U-Net, we incorporate the SST PCs by adding input channels that are constant across space, with each channel corresponding to one PC. The random forest can use the SST PCs directly with no special preprocessing.
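In code, this corresponds roughly to the following scikit-learn usage, with hypothetical array names.

```python
from sklearn.decomposition import PCA

def sst_principal_components(sst_train, sst_all, n_components=8):
    """sst_* are hypothetical (n_months, n_sst_locations) arrays.
    Fit PCA on the training period only, then project all months."""
    pca = PCA(n_components=n_components)
    pca.fit(sst_train)
    return pca.transform(sst_all)   # (n_months, n_components)
```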

b. Normalization

For the U-Net, we apply channelwise min–max normalization to the input features at each location, with the statistics computed on the training part of the dataset. As for normalization of the true values, min–max normalization is applied for precipitation and standardization is applied for temperature. This choice affects the final layer of the U-Net model, too: for the precipitation regression task, a sigmoid activation is used, and no activation function is applied for temperature regression. For the stacking model, we apply min–max normalization to both input and target values.
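A sketch of this channelwise scaling follows; computing the statistics per channel and grid point over the training months is our reading of "at each location" and is an assumption for illustration.

```python
import numpy as np

def fit_minmax(x_train):
    """x_train: (T, C, H, W). Per-channel, per-grid-point min and max over time."""
    lo = x_train.min(axis=0, keepdims=True)
    hi = x_train.max(axis=0, keepdims=True)
    return lo, hi

def apply_minmax(x, lo, hi, eps=1e-8):
    """Scale features to [0, 1] using statistics from the training period."""
    return (x - lo) / (hi - lo + eps)
```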

REFERENCES

  • Badr, H. S., B. F. Zaitchik, and S. D. Guikema, 2014: Application of statistical models to the prediction of seasonal rainfall anomalies over the Sahel. J. Appl. Meteor. Climatol., 53, 614–636, https://doi.org/10.1175/JAMC-D-13-0181.1.

  • Ban, R. J., and Coauthors, 2016: Next Generation Earth System Prediction: Strategies for Subseasonal to Seasonal Forecasts. The National Academies Press, 350 pp.

  • Barnston, A. G., M. K. Tippett, M. L. L’Heureux, S. Li, and D. G. DeWitt, 2012: Skill of real-time seasonal ENSO model predictions during 2002–11: Is our capability increasing? Bull. Amer. Meteor. Soc., 93, 631–651, https://doi.org/10.1175/BAMS-D-11-00111.1.

  • Bi, K., L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, 2022: Pangu-weather: A 3D high-resolution model for fast and accurate global weather forecast. arXiv, 2211.02556v1, https://doi.org/10.48550/arXiv.2211.02556.

  • Biau, G., and E. Scornet, 2016: A random forest guided tour. TEST, 25, 197–227, https://doi.org/10.1007/s11749-016-0481-7.

  • Brunet, G., and Coauthors, 2010: Collaboration of the weather and climate communities to advance subseasonal-to-seasonal prediction. Bull. Amer. Meteor. Soc., 91, 1397–1406, https://doi.org/10.1175/2010BAMS3013.1.

  • Carion, N., F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, 2020: End-to-end object detection with transformers. Computer Vision—ECCV 2020, Lecture Notes in Computer Science, Vol. 12346, Springer, 213–229.

  • Cash, B. A., J. V. Manganello, and J. L. Kinter III, 2019: Evaluation of NMME temperature and precipitation bias and forecast skill for South Asia. Climate Dyn., 53, 7363–7380, https://doi.org/10.1007/s00382-017-3841-4.

  • Chen, H., and Coauthors, 2021: Pre-trained image processing transformer. Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Nashville, TN, Institute of Electrical and Electronics Engineers, 12 299–12 310, https://doi.org/10.1109/CVPR46437.2021.01212.

  • Chen, L., B. Han, X. Wang, J. Zhao, W. Yang, and Z. Yang, 2023: Machine learning methods in weather and climate applications: A survey. Appl. Sci., 13, 12019, https://doi.org/10.3390/app132112019.

  • Cofiño, A. S., R. Cano, C. Sordo, and J. M. Gutiérrez, 2002: Bayesian networks for probabilistic weather prediction. 15th European Conf. on Artificial Intelligence (ECAI), Lyon, France, IOS Press, 695–700, https://www.researchgate.net/publication/220836968_Bayesian_Networks_for_Probabilistic_Weather_Prediction.

  • Cohen, J., D. Coumou, J. Hwang, L. Mackey, P. Orenstein, S. Totz, and E. Tziperman, 2019: S2S reboot: An argument for greater inclusion of machine learning in subseasonal to seasonal forecasts. Wiley Interdiscip. Rev.: Climate Change, 10, e00567, https://doi.org/10.1002/wcc.567.

  • DelSole, T., and M. K. Tippett, 2014: Comparing forecast skill. Mon. Wea. Rev., 142, 4658–4678, https://doi.org/10.1175/MWR-D-14-00045.1.

  • Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova, 2018: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv, 1810.04805v2, https://doi.org/10.48550/arXiv.1810.04805.

  • Dosovitskiy, A., and Coauthors, 2020: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv, 2010.11929v2, https://doi.org/10.48550/arXiv.2010.11929.

  • Fan, Y., and H. van den Dool, 2008: A global monthly land surface air temperature analysis for 1948–present. J. Geophys. Res., 113, D01103, https://doi.org/10.1029/2007JD008470.

  • Frnda, J., M. Durica, J. Rozhon, M. Vojtekova, J. Nedoma, and R. Martinek, 2022: ECMWF short-term prediction accuracy improvement by deep learning. Sci. Rep., 12, 7898, https://doi.org/10.1038/s41598-022-11936-9.

  • Gamboa, J. C. B., 2017: Deep learning for time-series analysis. arXiv, 1701.01887v1, https://doi.org/10.48550/arXiv.1701.01887.

  • Gatys, L. A., A. S. Ecker, and M. Bethge, 2016: Image style transfer using convolutional neural networks. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, Institute of Electrical and Electronics Engineers, 2414–2423, https://doi.org/10.1109/CVPR.2016.265.

  • Gehring, J., M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, 2017: Convolutional sequence to sequence learning. Int. Conf. on Machine Learning, Sydney, New South Wales, Australia, PMLR, 1243–1252, https://dl.acm.org/doi/10.5555/3305381.3305510.

  • Ghaderi, A., B. M. Sanandaji, and F. Ghaderi, 2017: Deep forecast: Deep learning-based spatio-temporal forecasting. arXiv, 1707.08110v1, https://doi.org/10.48550/arXiv.1707.08110.

  • Grönquist, P., C. Yao, T. Ben-Nun, N. Dryden, P. Dueben, S. Li, and T. Hoefler, 2020: Deep learning for post-processing ensemble weather forecasts. arXiv, 2005.08748v2, https://doi.org/10.48550/arXiv.2005.08748.

  • Grover, A., A. Kapoor, and E. Horvitz, 2015: A deep hybrid model for weather forecasting. Proc. 21th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Sydney, New South Wales, Australia, Association for Computing Machinery, 379–386, https://dl.acm.org/doi/10.1145/2783258.2783275.

  • Hao, S., Y. Zhou, and Y. Guo, 2020: A brief survey on semantic segmentation with deep learning. Neurocomputing, 406, 302–321, https://doi.org/10.1016/j.neucom.2019.11.118.

  • Hastie, T., R. Tibshirani, and J. Friedman, 2009: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Vol. 2, Springer, 767 pp.

  • Haupt, S. E., W. Chapman, S. V. Adams, C. Kirkwood, J. S. Hosking, N. H. Robinson, S. Lerch, and A. C. Subramanian, 2021: Towards implementing artificial intelligence post-processing in weather and climate: Proposed actions from the Oxford 2019 workshop. Philos. Trans. Roy. Soc., A379, 20200091, https://doi.org/10.1098/rsta.2020.0091.

  • He, S., X. Li, T. DelSole, P. Ravikumar, and A. Banerjee, 2020: Sub-seasonal climate forecasting via machine learning: Challenges, analysis, and advances. arXiv, 2006.07972v2, https://doi.org/10.48550/arXiv.2006.07972.

  • He, S., X. Li, L. Trenary, B. A. Cash, T. DelSole, and A. Banerjee, 2021: Learning and dynamical models for sub-seasonal climate forecasting: Comparison and collaboration. arXiv, 2110.05196v1, https://doi.org/10.48550/arXiv.2110.05196.

  • Herman, G. R., and R. S. Schumacher, 2018: “Dendrology” in numerical weather prediction: What random forests and logistic regression tell us about forecasting extreme precipitation. Mon. Wea. Rev., 146, 1785–1812, https://doi.org/10.1175/MWR-D-17-0307.1.

  • Hewage, P., M. Trovati, E. Pereira, and A. Behera, 2021: Deep learning-based effective fine-grained weather forecasting model. Pattern Anal. Appl., 24, 343–366, https://doi.org/10.1007/s10044-020-00898-1.

  • Hwang, J., P. Orenstein, K. Pfeiffer, J. Cohen, and L. Mackey, 2018: Improving subseasonal forecasting in the Western U.S. with machine learning. arXiv, 1809.07394v3, https://doi.org/10.48550/arXiv.1809.07394.

  • Hwang, J., P. Orenstein, J. Cohen, K. Pfeiffer, and L. Mackey, 2019: Improving subseasonal forecasting in the western U.S. with machine learning. Proc. 25th ACM SIGKDD Int. Conf. on Knowledge Discovery & Data Mining, Anchorage, AK, Association for Computing Machinery, 2325–2335, https://dl.acm.org/doi/10.1145/3292500.3330674.

  • Iglesias, G., D. C. Kale, and Y. Liu, 2015: An examination of deep learning for extreme climate pattern analysis. The Fifth Int. Workshop on Climate Informatics, Boulder, CO, NCAR, 8–9.

  • Jing, Y., Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song, 2020: Neural style transfer: A review. IEEE Trans. Vis. Comput. Graphics, 26, 3365–3385, https://doi.org/10.1109/TVCG.2019.2921336.

  • Kalnay, E., and Coauthors, 1996: The NCEP/NCAR 40-Year Reanalysis Project. Bull. Amer. Meteor. Soc., 77, 437–472, https://doi.org/10.1175/1520-0477(1996)077<0437:TNYRP>2.0.CO;2.

  • Khan, S., M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, 2022: Transformers in vision: A survey. ACM Comput. Surv., 54, 200, https://doi.org/10.1145/3505244.

  • Kingma, D. P., and J. Ba, 2014: Adam: A method for stochastic optimization. arXiv, 1412.6980v9, https://doi.org/10.48550/arXiv.1412.6980.

  • Kirtman, B. P., and Coauthors, 2014: The North American multimodel ensemble: Phase-1 seasonal-to-interannual prediction; Phase-2 toward developing intraseasonal prediction. Bull. Amer. Meteor. Soc., 95, 585–601, https://doi.org/10.1175/BAMS-D-12-00050.1.

  • Knapp, K. R., and Coauthors, 2011: Globally gridded satellite observations for climate studies. Bull. Amer. Meteor. Soc., 92, 893–907, https://doi.org/10.1175/2011BAMS3039.1.

  • Lam, R., and Coauthors, 2023: Learning skillful medium-range global weather forecasting. Science, 382, eadi2336, https://doi.org/10.1126/science.adi2336.

  • Loken, E. D., A. J. Clark, and A. McGovern, 2022: Comparing and interpreting differently designed random forests for next-day severe weather hazard prediction. Wea. Forecasting, 37, 871–899, https://doi.org/10.1175/WAF-D-21-0138.1.

  • Lorenc, A. C., 1986: Analysis methods for numerical weather prediction. Quart. J. Roy. Meteor. Soc., 112, 1177–1194, https://doi.org/10.1002/qj.49711247414.

  • Mamalakis, A., J.-Y. Yu, J. T. Randerson, A. AghaKouchak, and E. Foufoula-Georgiou, 2018: A new interhemispheric teleconnection increases predictability of winter precipitation in southwestern US. Nat. Commun., 9, 2332, https://doi.org/10.1038/s41467-018-04722-7.

  • Meinshausen, N., 2006: Quantile regression forests. J. Mach. Learn. Res., 7, 983–999.

  • Min, Y.-M., S. Ham, J.-H. Yoo, and S.-H. Han, 2020: Recent progress and future prospects of subseasonal and seasonal climate predictions. Bull. Amer. Meteor. Soc., 101, E640–E644, https://doi.org/10.1175/BAMS-D-19-0300.1.

  • Mouatadid, S., P. Orenstein, G. Flaspohler, J. Cohen, M. Oprescu, E. Fraenkel, and L. Mackey, 2023: Adaptive bias correction for improved subseasonal forecasting. Nat. Commun., 14, 3482, https://doi.org/10.1038/s41467-023-38874-y.

  • Nagaraj, R., and L. S. Kumar, 2023: Univariate deep learning models for prediction of daily average temperature and relative humidity: The case study of Chennai, India. J. Earth Syst. Sci., 132, 100, https://doi.org/10.1007/s12040-023-02122-0.

  • Nakada, K., R. M. Kovach, J. Marshak, and A. Molod, 2018: Global modeling and assimilation office. GMAO Office Note 16, 85 pp., https://gmao.gsfc.nasa.gov/pubs/docs/Nakada1033.pdf.

  • Narayanan, A., M. Chandramohan, L. Chen, Y. Liu, and S. Saminathan, 2016: subgraph2vec: Learning distributed representations of rooted sub-graphs from large graphs. arXiv, 1606.08928v1, https://doi.org/10.48550/arXiv.1606.08928.

  • National Academies of Sciences, Engineering, and Medicine, 2016: Next Generation Earth System Prediction: Strategies for Subseasonal to Seasonal Forecasts. The National Academies Press, 350 pp.

  • NRC, 2010: Assessment of Intraseasonal to Interannual Climate Prediction and Predictability. The National Academies Press, 192 pp.

  • Nebeker, F., 1995: Calculating the Weather: Meteorology in the 20th Century. Elsevier, 225 pp.

  • NOAA, 2022: NOAA National Centers for Environmental Information, climate at a glance: National Time Series. NOAA, https://www.ncdc.noaa.gov/cag/.

  • Pathak, J., and Coauthors, 2022: FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv, 2202.11214v1, https://doi.org/10.48550/arXiv.2202.11214.

  • Pavlyshenko, B., 2018: Using stacking approaches for machine learning models. 2018 IEEE Second Int. Conf. on Data Stream Mining & Processing (DSMP), Lviv, Ukraine, Institute of Electrical and Electronics Engineers, 255–258, https://doi.org/10.1109/DSMP.2018.8478522.

  • Pedregosa, F., and Coauthors, 2011: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.

  • Petroni, F., T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel, 2019: Language models as knowledge bases? arXiv, 1909.01066v2, https://doi.org/10.48550/arXiv.1909.01066.

  • Price, I., and Coauthors, 2023: GenCast: Diffusion-based ensemble forecasting for medium-range weather. arXiv, 2312.15796v2, https://doi.org/10.48550/arXiv.2312.15796.

  • Radhika, Y., and M. Shashi, 2009: Atmospheric temperature prediction using support vector machines. Int. J. Comput. Theory Eng., 1, 17938201.

  • Reynolds, R. W., T. M. Smith, C. Liu, D. B. Chelton, K. S. Casey, and M. G. Schlax, 2007: Daily high-resolution-blended analyses for sea surface temperature. J. Climate, 20, 5473–5496, https://doi.org/10.1175/2007JCLI1824.1.

  • Romano, Y., E. Patterson, and E. J. Candès, 2019: Conformalized quantile regression. Advances in Neural Information Processing Systems, Vancouver, BC, Canada, NeurIPS, https://proceedings.neurips.cc/paper_files/paper/2019/hash/5103c3584b063c431bd1268e9b5e76fb-Abstract.html.

  • Ronneberger, O., P. Fischer, and T. Brox, 2015: U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Springer, 234–241.

  • Saha, S., and Coauthors, 2014: The NCEP Climate Forecast System version 2. J. Climate, 27, 2185–2208, https://doi.org/10.1175/JCLI-D-12-00823.1.

  • Seager, R., and Coauthors, 2007: Model projections of an imminent transition to a more arid climate in southwestern North America. Science, 316, 1181–1184, https://doi.org/10.1126/science.1139601.

  • Simmons, A. J., and A. Hollingsworth, 2002: Some aspects of the improvement in skill of numerical weather prediction. Quart. J. Roy. Meteor. Soc., 128, 647–677, https://doi.org/10.1256/003590002321042135.

  • Srinivasan, V., J. Khim, A. Banerjee, and P. Ravikumar, 2021: Subseasonal climate prediction in the Western US using Bayesian spatial models. Uncertainty in Artificial Intelligence UAI 2021, Online, PMLR, 961–970, https://experts.illinois.edu/en/publications/subseasonal-climate-prediction-in-the-western-us-using-bayesian-s.

  • Subbaswamy, A., R. Adams, and S. Saria, 2021: Evaluating model robustness and stability to dataset shift. 24th Int. Conf. on Artificial Intelligence and Statistics, AISTATS 2021, Online, PMLR, 2611–2619, https://pure.johnshopkins.edu/en/publications/evaluating-model-robustness-and-stability-to-dataset-shift.

  • Totz, S., E. Tziperman, D. Coumou, K. Pfeiffer, and J. Cohen, 2017: Winter precipitation forecast in the European and Mediterranean regions using cluster analysis. Geophys. Res. Lett., 44, 12 418–12 426, https://doi.org/10.1002/2017GL075674.

  • Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world. Bull. Amer. Meteor. Soc., 102, E681–E699, https://doi.org/10.1175/BAMS-D-19-0308.1.

  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, 2017: Attention is all you need. Advances in Neural Information Processing Systems, Curran Associates, Inc., 6000–6010.

  • Vitart, F., A. W. Robertson, and D. L. Anderson, 2012: Subseasonal to seasonal prediction project: Bridging the gap between weather and climate. Bull. WMO, 61, 23.

  • White, C. J., and Coauthors, 2022: Advances in the application and utility of subseasonal-to-seasonal predictions. Bull. Amer. Meteor. Soc., 103, E1448–E1472, https://doi.org/10.1175/BAMS-D-20-0224.1.

  • Wiles, O., and Coauthors, 2021: A fine-grained analysis on distribution shift. arXiv, 2110.11328v2, https://doi.org/10.48550/arXiv.2110.11328.

  • Wu, K., H. Peng, M. Chen, J. Fu, and H. Chao, 2021: Rethinking and improving relative position encoding for vision transformer. Proc. IEEE/CVF Int. Conf. on Computer Vision, Online, Institute of Electrical and Electronics Engineers, 10 033–10 041, https://doi.ieeecomputersociety.org/10.1109/ICCV48922.2021.00988.

  • Xie, P., M. Chen, and W. Shi, 2010: CPC unified gauge-based analysis of global daily precipitation. 24th Conf. on Hydrology, Atlanta, GA, Amer. Meteor. Soc., 2.3A, https://ams.confex.com/ams/90annual/techprogram/paper_163676.htm.

  • Yakubovskiy, P., 2020: Segmentation Models PyTorch. GitHub, https://github.com/qubvel/segmentation_models.pytorch.

  • Yang, S., F. Ling, Y. Li, and J.-J. Luo, 2023: Improving seasonal prediction of summer precipitation in the middle–lower reaches of the Yangtze River using a TU-Net deep learning approach. Artif. Intell. Earth Syst., 2, 220078, https://doi.org/10.1175/AIES-D-22-0078.1.

  • Zhu, Q., N. Ponomareva, J. Han, and B. Perozzi, 2021: Shift-robust GNNs: Overcoming the limitations of localized graph training data. Advances in Neural Information Processing Systems, Curran Associates, Inc., 27 965–27 977.
