1. Introduction
In recent years, the performance of machine learning (ML) weather forecast models has improved drastically (Rasp et al. 2024), leading some authors to speak of a “rise” of ML methods in weather forecasting (Ben Bouallègue et al. 2024) or even a second revolution of the field. While some studies have focused on short-term prediction (“nowcasting”; Espeholt et al. 2022; Leinonen et al. 2023; Andrychowicz et al. 2023) or on subseasonal-to-seasonal forecasting (2 weeks to 2 months ahead; Weyn et al. 2021; Lopez-Gomez et al. 2023), much of the work has concentrated on the medium range (Rasp and Thuerey 2021; Pathak et al. 2022; Nguyen et al. 2023a; K. Chen et al. 2023; Bi et al. 2023; Kochkov et al. 2024; Lam et al. 2023; L. Chen et al. 2023; Nguyen et al. 2023b; Price et al. 2023), i.e., forecasting days to 2 weeks into the future.
The established technique for medium-range weather forecasting is numerical weather prediction (NWP), which evolves an estimate of the current weather state, constructed from observations, forward in time under differential equations. The point of comparison for ML approaches is therefore ECMWF’s Integrated Forecasting System (IFS; Owens and Hewson 2018), in particular its high-resolution forecast system (HRES), which is generally considered the most reliable NWP model for global deterministic weather forecasts. The latest ML weather models match or even outperform HRES in terms of overall summary scores across many variables, pressure levels, and prediction lead times (Rasp et al. 2024). Beyond accuracy, other reasons to consider ML-based weather forecasting include energy efficiency during operations and improved inference speed. ML models that supplement or replace parts of the weather forecasting pipeline are increasingly seen as a realistic possibility (Bauer 2024; Ben Bouallègue et al. 2024). ECMWF is already publishing forecast data produced with its own artificial intelligence (AI) model, the AI Forecasting System (AIFS; Lang et al. 2024), as part of its experimental suite.
Given these recent advances and the immense importance of accurate and robust weather forecasting to many aspects of human life, thorough analyses are necessary before operationalizing ML weather prediction models. As extreme weather events often have severe impacts (Zscheischler et al. 2020; Seneviratne et al. 2023), such as crop loss, wildfires, and floods, effective mitigation measures require accurate predictions in the tails of the distribution.
While ML-based weather forecasts can achieve high overall accuracy, their performance for extreme events is not well understood. ML models generally face fundamental difficulties during extrapolation and generalization to unseen domains, and good test accuracy estimates do not guarantee good performance outside the range of previous observations or in regions of the input space where observations were scarce (Hastie et al. 2009; Watson 2022).
Summary scores, like the root-mean-square error (RMSE), play a central role in the evaluation of ML weather prediction models. Typically, one score is computed for each lead time, predicted variable, and (pressure) level to quantify the model’s performance over the entire test set (see, e.g., scorecards in Rasp et al. 2024). Several other aspects of ML forecasts have also been studied in the literature. For instance, Bonavita (2024), Lam et al. (2023), and Rasp et al. (2024) examined the smoothness of the predictions and found that most ML models tend to blur predictions for long lead times as a consequence of the way these models are conceptualized and trained. Bonavita (2024) studied Pangu-Weather, one of the best-performing ML models, and found it to be worse at maintaining physical balances than ECMWF’s HRES.
Olivetti and Messori (2024a) summarized some of the extreme event evaluations performed for the latest generation of ML weather forecast models. In previous work, extreme temperatures (both hot and cold) were studied by comparing threshold exceedances of predictions and ground truth data (Ben Bouallègue et al. 2024; Lam et al. 2023; Olivetti and Messori 2024b). Other types of investigated extreme events include tropical cyclones, atmospheric rivers, and storm systems (Magnusson 2023; Ben Bouallègue et al. 2024; Lam et al. 2023; Charlton-Perez et al. 2024). While some studies have looked into individual events, many types of extremes are still underexplored, especially on a case study level.
In addition, little attention has been paid to impact metrics that combine multiple predicted variables or to events where accurate assessment of their spatial or temporal extent is important. The compounding effect of multiple variables in space and time can lead to particularly large impacts (Zscheischler et al. 2020). Examining prediction performance for these events in case studies is necessary to increase public trust in ML models and also has the potential to uncover rare systematic errors in the ML model predictions that might be hidden by summary scores.
This study evaluates the ability of three popular ML weather prediction models to accurately forecast relevant impact metrics of extreme weather events through three case studies. The ML models GraphCast (Lam et al. 2023), Pangu-Weather (Bi et al. 2023), and FourCastNet (Pathak et al. 2022) are compared to IFS HRES (Owens and Hewson 2018) for the 2021 Pacific Northwest heatwave, the 2023 South Asian humid heatwave, and the 2021 North American winter storm.
2. Data and models
a. Data
In this paper, we use two kinds of data: ERA5 reanalysis data (Hersbach et al. 2020) and ECMWF HRES analysis data. All ML models considered in this study were trained on ERA5, which is produced using data assimilation, i.e., by combining observations with short-range forecasts to obtain a “best guess” of the actual weather state. ERA5 has a horizontal resolution of 0.25° × 0.25°, an hourly temporal resolution, and provides estimates of many atmospheric, land, and oceanic climate variables over the globe from 1940 to the present. The ML models GraphCast and FourCastNet have an internal time step of 6 h and were trained on a subset of ERA5 at 0000, 0600, 1200, and 1800 UTC. Therefore, we also restrict our analyses to these times of day.
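As an illustration, such a subset can be loaded with xarray along the following lines (a sketch; the Zarr path and the coordinate conventions are assumptions based on the public WeatherBench 2 bucket and may change):

```python
import xarray as xr

# Public WeatherBench 2 copy of ERA5 (path is an assumption; check the
# WeatherBench 2 documentation for the current bucket layout).
ERA5_ZARR = "gs://weatherbench2/datasets/era5/1959-2023_01_10-full_37-1h-0p25deg-chunk-1.zarr"

ds = xr.open_zarr(ERA5_ZARR)  # requires gcsfs for gs:// access

# Restrict to the four synoptic times used by the 6-h ML models.
ds_6h = ds.sel(time=ds["time"].dt.hour.isin([0, 6, 12, 18]))

# Example: 2-m temperature during the 2021 Pacific Northwest heatwave in the
# study box (45°-52°N, 123°-119°W); latitudes are assumed ascending and
# longitudes in [0, 360).
t2m = ds_6h["2m_temperature"].sel(
    time=slice("2021-06-20", "2021-07-05"),
    latitude=slice(45, 52),
    longitude=slice(360 - 123, 360 - 119),
)
```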
We use HRES forecasts of versions 47r1, 47r2, and 47r3 from ECMWF’s Integrated Forecasting System. They have a horizontal resolution of 0.1° × 0.1°, and we downsample them to the 0.25° × 0.25° grid using the default Meteorological Interpolation and Regridding (MIR) library in the ECMWF Meteorological Archival and Retrieval System (MARS). ERA5 and HRES data can be retrieved from online archives. HRES forecasts initialized at 0000 or 1200 UTC are archived for lead times up to 10 days, while forecasts initialized at 0600 and 1800 UTC are only available for lead times up to 3.75 days. To ensure a fair comparison, we use “HRES forecast at step 0” (HRES-fc0) as the ground truth for HRES forecasts. If ERA5 was used instead, HRES would have a nonzero error at lead time 0 h. We use HRES forecast data with lead times ranging from 0 h to the maximum available length in steps of 6 h so that the lead times match those of the ML forecasts.
A difference between our comparison study and that of Ben Bouallègue et al. (2024) is the data used for initializing and evaluating the ML models. They used HRES data for both, to ensure fairness in an operational context, since ERA5 reanalysis data are simply not available at the time of an operational prediction. From an ML perspective, however, this disadvantages the ML models because they are trained on ERA5 data. Here, we follow the conventional approach in ML studies (Pathak et al. 2022; Bi et al. 2023; Lam et al. 2023; L. Chen et al. 2023) and use ERA5 data to initialize and evaluate the ML models.
b. Machine learning models for weather forecasting
We focus on three recent ML models in this work: FourCastNet, Pangu-Weather, and GraphCast. Table 1 summarizes the main characteristics of these models. More details on the variables predicted by these models are presented in section 1 of the online supplemental material.
Table 1. Summary of key features of recent ML-based weather forecasting models.
Bi et al. (2023) trained four models with different lead times (1, 3, 6, and 24 h). These models are combined during inference to achieve the minimum number of model executions for a given forecast lead time (“hierarchical temporal aggregation strategy”), thereby minimizing error accumulation (Bi et al. 2023). For instance, to forecast the weather state in 36 h, running the 24-h model once followed by two iterations of the 6-h model gives a more accurate forecast than iterating the 1-h model 36 times. Because we only consider lead times that are multiples of 6 h in our study, we use sequences of 6- and 24-h model calls to achieve the smallest possible forecast error for Pangu-Weather.
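This schedule can be illustrated with a short sketch (hypothetical function names; `model_24h` and `model_6h` stand for the pretrained Pangu-Weather networks):

```python
def pangu_schedule(lead_time_h: int) -> list[int]:
    """Greedy decomposition of a lead time (a multiple of 6 h) into the
    minimal number of 24-h and 6-h Pangu-Weather model calls."""
    assert lead_time_h % 6 == 0, "only 6-h multiples are used in this study"
    n24, remainder = divmod(lead_time_h, 24)
    return [24] * n24 + [6] * (remainder // 6)

def rollout(state, lead_time_h, model_24h, model_6h):
    """Autoregressive rollout following the hierarchical temporal
    aggregation strategy of Bi et al. (2023)."""
    for step in pangu_schedule(lead_time_h):
        state = model_24h(state) if step == 24 else model_6h(state)
    return state

# A 36-h forecast uses one 24-h call followed by two 6-h calls:
assert pangu_schedule(36) == [24, 6, 6]
```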
We also note that GraphCast requires the weather states at two consecutive time points as input to each forecast, while the other two models only need one. Furthermore, as shown in Table 1, the dimensionality of the input data at each time step (especially the number of pressure levels) is substantially higher for GraphCast than for Pangu-Weather and FourCastNet. These two distinctions might give GraphCast an advantage over the other two ML models, since additional covariate information in principle tends to increase the achievable predictive accuracy.
Although forecast data from various ML models are available on WeatherBench 2 (Rasp et al. 2024), they do not cover all periods we investigate (e.g., GraphCast forecasts are only available for 2018 and 2020). We thus produced additional forecast data for more recent events by running the ML models ourselves. More precisely, we implemented the inference of the three ML models by directly leveraging their pretrained models released on GitHub. Alternatively, the forecast data can be generated using the ECMWF library “ai-models,” but note that the GraphCast model in the library is GraphCast operational [a smaller version than the one described in Lam et al. (2023)], which was pretrained on ERA5 data from 1979 to 2017 and fine-tuned on HRES data from 2016 to 2021, and only includes atmospheric variables at 13 pressure levels as input and output.
c. Initialization times
ERA5 and HRES-fc0 differ in their assimilation windows (Lam et al. 2023). While observations up to 3 h into the future are included in the assimilation for HRES-fc0, the lookahead for ERA5 varies between initialization times: 3 h for forecasts initialized at 0600/1800 UTC and 9 h for forecasts initialized at 0000/1200 UTC. To ensure an equal lookahead, Lam et al. (2023) compared ML-based and HRES forecasts initialized at 0600/1800 UTC for lead times up to the availability of HRES (3.75 days). Beyond this time limit, ML forecasts initialized at 0600/1800 UTC are compared with HRES forecasts initialized at the preceding 0000/1200 UTC. We follow this mixed initialization methodology in one of our analyses in section 3a.
As shown in supplemental section 5.2 of Lam et al. (2023), the effect of unequal lookahead is small, particularly for long lead times. Therefore, for all analyses except the RMSE comparison in the first case study, we include all forecasts (0000, 0600, 1200, and 1800 UTC initialization times) in our analysis. Additionally, we extend the short HRES forecasts initialized at 0600/1800 UTC beyond lead times of 4 days; these forecasts are augmented with data from the forecasts initialized 6 h prior to the 0600/1800 UTC initialization time while increasing the lead time by 6 h so that the validity time of the forecast remains the same. This filling might disadvantage HRES, but it enables the analysis of a denser set of initialization and lead times.
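A minimal sketch of this filling rule, assuming the forecasts are stored in a mapping keyed by initialization time and lead time (names are illustrative):

```python
import pandas as pd

def hres_field(forecasts, init_time, lead_time_h, max_native_h=90):
    """Return the HRES forecast valid at init_time + lead_time_h.

    Forecasts initialized at 0600/1800 UTC are archived only up to 3.75 days
    (90 h). Beyond that, we substitute the forecast initialized 6 h earlier
    with the lead time increased by 6 h, so the validity time is unchanged.
    `forecasts` is assumed to be a mapping {(init_time, lead_time_h): field}.
    """
    while lead_time_h > max_native_h and init_time.hour in (6, 18):
        init_time -= pd.Timedelta(hours=6)
        lead_time_h += 6
    return forecasts[(init_time, lead_time_h)]
```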
3. Case studies
a. 2021 Pacific Northwest heatwave
In this first case study, we investigate a record-shattering extreme temperature event. In late June 2021, a heatwave of unprecedented magnitude hit the Pacific Northwest, with temperatures reaching up to 49.6°C and beating the all-time record for Canada by 4.6 K (Fig. 1a). Even in hindsight, quantifying the return period of the event is challenging (Bartusek et al. 2022; Philip et al. 2022; Zeder et al. 2023). The impacts of such extreme events can be substantial, and their prediction is particularly challenging for ML models because similar events are scarce in the training data. On the other hand, even though NWP models are more directly bound to physical laws, their forecast accuracy is not guaranteed either.
Fig. 1. Magnitudes of the three events analyzed in this paper. (a) 2021 Pacific Northwest heatwave. Shown is the 2-m temperature anomaly averaged over 27–29 Jun 2021, the peak of the heatwave. (b) 2023 South Asian humid heatwave. Shown is the category of maximum daily HI, as defined in appendix B, section c, averaged over 17–20 Apr 2023 in India and Bangladesh. (c) 2021 North American winter storm. Shown is the wind chill index Twc, as defined in section 3c, at 1200 UTC 15 Feb 2021.
The heatwave impacted ecosystems, infrastructure, and human health considerably (with more than 1400 deaths) and attracted massive public attention and scientific interest (Neal et al. 2022; Schumacher et al. 2022; White et al. 2023; Röthlisberger and Papritz 2023). In the analyzed region, temperatures peaked between 27 and 29 June. We analyze the heatwave in terms of the temperature at 2 m above the surface (T2m), which is a standard variable for studying temperature extremes.
In the grid cells closest to three major population centers affected by the heatwave (Vancouver, Seattle, and Portland), the prediction error of HRES and all tested ML models reaches at least twice the size of a typical HRES 10-day prediction error and exceeds the typical HRES 10-day error in Portland by a factor of 4. This is consistent with the results of Lin et al. (2022), who examined the predictions of subseasonal to seasonal NWP models for the Pacific Northwest heatwave and found that all models failed to predict the magnitude of the heatwave for forecasts initialized on 17 June, i.e., 10 days before temperatures began to peak. We visualize our prediction errors in predictability barrier plots in Fig. 2, using HRES-fc0 as ground truth for the HRES forecasts and ERA5 as ground truth for the ML-based predictions. We aggregate to daily scale by computing RMSEs as described in appendix A, section a. A version of the plot showing T2m,prediction − T2m,groundtruth without this aggregation is presented in the supplemental material. Maps of the temperature anomaly patterns predicted for the peak of the heatwave for forecasts with different initialization times are shown in Fig. A2.
Fig. 2. [a(1)–d(3)] Predictability barrier plots for the grid cells closest to major cities affected by the 2021 heatwave. For HRES, HRES-fc0 is used as ground truth; for the ML models, we use ERA5 instead. In the color bar, D5 and D10 indicate long-term multiyear average HRES 5- and 10-day prediction errors. For the computation of the RMSE, D5, and D10, see appendix A, section a. Numerical values for D5 and D10 are given in Table A1. [e(1)–e(3)] Time series of daily maximum T2m for the datasets used as ground truth.
FourCastNet has the largest errors among all models, while the errors of Pangu-Weather, GraphCast, and HRES appear visually similar in magnitude (Fig. 2). For all models, the prediction errors are largest during the peak of the heatwave. The predictability barrier plots exhibit prominent vertical structures (i.e., forecasts for the same validity day), suggesting that the dominant factor is the predictability of the weather situation rather than the forecast initialization. However, HRES also seems to exhibit hints of diagonal error structures. This structural difference in error patterns is discussed further in section 3c, where it is more pronounced.
For HRES, the prediction errors in the first days of July 2021, when temperatures started to fall again, are larger than for Pangu-Weather and GraphCast, especially for the grid cells closest to Seattle and Portland. For long lead times, however, the HRES errors reach their largest values during the heatwave peak around 27–29 June in all three grid cells. The predictability barrier plots for Pangu-Weather appear very patchy, likely due to the hierarchical temporal aggregation strategy of Pangu-Weather (described in section 2b).
The best- and worst-performing models across various lead and validity times are visualized in Fig. A3. The conclusions match those from Fig. 2: FourCastNet has the largest errors during the heatwave, and HRES has comparatively high errors after the peak of the heatwave. During many of the time steps, especially at short lead times, GraphCast and HRES yield the smallest errors. However, there is no clear best-performing model overall.
To assess the models’ performance in predicting the extreme event, we compute the forecast RMSEs of all models during the peak of the heatwave and compare them to the RMSEs during the summer of 2022, a baseline year without extreme heatwaves in the region. We vary the lead time and study the event in the region defined by Philip et al. (2022). The RMSE aggregation here follows Lam et al. (2023): it includes latitude-based weights, and only forecasts initialized at 0600/1800 UTC and lead times in multiples of 12 h are considered, to ensure equal assimilation windows between ERA5 and HRES-fc0. The results, shown in Fig. 3, again highlight the difficulty of predicting the extreme temperatures during the event: for lead times beyond 1 week, all models perform substantially worse than for the summer 2022 baseline. The errors of the ML models grow to at least three times their baseline values, and those of HRES to about twice the baseline. Given the small sample size, these numerical values should be interpreted with care, however. We also observe that for lead times up to 6.5 days, the forecast errors of HRES are smaller than those of the ML models, contrary to their relative performance in the baseline year. This might be a consequence of extrapolation, as discussed in the introduction. As the evaluated baseline period is rather short, mainly for computational reasons, the baseline might not precisely represent typical performance. However, the baseline results are in line with other studies (Lam et al. 2023) and with additional baseline data from the year 2020 (Fig. A1 in appendix A).
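The latitude-weighted spatial aggregation can be sketched as follows (assuming xarray fields on a regular latitude–longitude grid and the usual cos-latitude weight convention):

```python
import numpy as np
import xarray as xr

def latitude_weighted_rmse(forecast: xr.DataArray, truth: xr.DataArray) -> xr.DataArray:
    """RMSE over latitude/longitude with cos(latitude) area weights, as in
    Lam et al. (2023); any remaining dimensions (e.g., time) are kept."""
    weights = np.cos(np.deg2rad(forecast["latitude"]))
    weights = weights / weights.mean()  # normalize to mean 1
    squared_error = (forecast - truth) ** 2
    return np.sqrt(squared_error.weighted(weights).mean(dim=["latitude", "longitude"]))
```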
Fig. 3. Evolution of the T2m prediction RMSE with lead time for the three ML models and HRES in the event region during (left) the peak of the heatwave (27–29 Jun 2021) compared to (right) summer 2022 as a baseline (20 Jun–10 Jul). Observations in the considered box region, 45°–52°N, 119°–123°W, are weighted to correct for differences in gridcell area. The ML models use 0600/1800 UTC initial conditions and evaluation times only, and the HRES forecasts use the mixed initialization described in section 2c after 3.75 days (dotted line).
A further analysis focusing on the spatial aspect of the event is presented in supplemental section 2. A main conclusion is that FourCastNet underpredicts the area in which temperature anomalies exceed a given threshold, while in some Pangu-Weather forecasts, the area predicted to exceed the thresholds is too large.
b. 2023 South Asian humid heatwave
In April 2023, high temperature and humidity levels were reached simultaneously in South Asia (Fig. 1b). Human tolerance to high temperatures decreases with increasing humidity, mainly because the body loses its ability to regulate its temperature through perspiration. Heat stress associated with this type of event can therefore be particularly harmful to human health (Buzan and Huber 2020; Lo et al. 2023).
The heat index (HI) is an impact metric quantifying this hazard to human health. It estimates the apparent temperature (i.e., how hot the temperature feels) for given values of temperature (T2m) and relative humidity (RH). While many metrics have been proposed to combine the influence of these two variables (Lo et al. 2023), we follow Zachariah et al. (2023), who employ the modified version of the heat index (Rothfusz 1990) used by the NOAA Weather Prediction Center in an attribution study on the 2023 South Asian humid heatwave. The detailed computations, including information on how we convert predicted specific humidity to relative humidity, are given in appendix B, section c.
Following Zachariah et al. (2023), we focus on two study regions in South Asia: Laos–Thailand (for which results are presented in supplemental section 3) and India–Bangladesh. For the latter, a subregion with a dry and semiarid climate is excluded from the analysis (see appendix B, section a, for details). We select a temporal range of 17–20 April 2023 (UTC time, inclusive range) for the India–Bangladesh region, corresponding to the period in which the heat stress peaked.
With existing ML weather prediction models, HI at the surface cannot be computed exactly because humidity is only modeled on upper-air pressure levels, and none of the models predicts a variable from which relative humidity at the surface could be derived. This is a strong limitation on the utility of ML models in forecasting humid heatwaves. While GraphCast and Pangu-Weather forecast variables at the 1000-hPa level, the lowest level at which FourCastNet predicts humidity is 850 hPa. In the following, we exclude FourCastNet from the analysis and use relative humidity at the 1000-hPa level as an approximation for humidity at the surface.
The HI prediction error during the peak of the heatwave in the India–Bangladesh region is shown in Fig. 4. For each day, we select the time when the ground truth HI is maximal and then average the errors over 17–20 April. In all cases, the predicted HI is computed using T2m and RH1000hPa. This setup is the simplest substitute that forecasters could use with the ML models available at the time of writing. The forecasts show deviations from the ground truth datasets, especially over Bangladesh, and the ML models underpredict more strongly than HRES. Looking at the prediction errors of RH1000hPa, we find a matching pattern: predictions of relative humidity over Bangladesh are too low, especially for Pangu-Weather (Fig. B1). For T2m, the values at the time of the HI peak are mostly smaller than the corresponding ground truth for the ML methods, while HRES T2m predictions are larger than the HRES-fc0 ground truth (Fig. B2).
Fig. 4. Error of the HI prediction, for the time step of each day during which HI peaked in the ground truth dataset, averaged over 17–20 Apr 2023. For all forecasting methods and ground truth datasets, HI is computed using RH1000hPa rather than the value at the surface.
For HRES, HRES-fc0, and ERA5, it is possible to compute RH at the surface level (RHsfc) from the 2-m temperature and 2-m dewpoint temperature (see appendix B, section c). However, we found rather large differences in RHsfc between the ground truth datasets ERA5 and HRES-fc0 for the studied event; therefore, we used RH1000hPa in the computations for Fig. 4.
Large fractions of the India–Bangladesh region experienced a mean daily maximum HI during 17–20 April 2023 that falls in the “extreme caution” or “danger” category (Fig. 5; see Table B1 for the definition of the categories). In Fig. 5, the HI distribution computed from ERA5 data using the input variables T2m and RHsfc differs strongly from the other ground truth datasets, mainly because the ERA5 RHsfc values are higher during the daily maximum HI. This may be a consequence of differences in the assimilation procedures used to produce the ground truth data. The ML-based HI forecasts (computed using RH1000hPa) underestimate the ERA5 HI values regardless of whether the ERA5 reference is computed with RH1000hPa or RHsfc. This is especially the case for high values of HI. Results for the Laos–Thailand region are in line with these findings (see supplemental section 3).
Fig. 5. The proportion of the area in the study region with the given mean daily maximum HI during 17–20 Apr 2023, computed using area-weighted kernel density estimation. Shaded areas in the background indicate threat levels (see appendix B, section c). Light gray to dark gray shades indicate low risk, caution, extreme caution, danger, and extreme danger, respectively. Compared are distributions resulting from forecasts initialized 6 days prior to the start of the event and different ground truths: ERA5 and HRES-fc0, each in two versions of computing the HI either using RHsfc or using the substitute RH1000hPa. For HRES forecasts, we show versions computed with RH1000hPa and RHsfc as well.
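The area weighting behind these densities can be sketched with SciPy’s weighted Gaussian kernel density estimate (an illustration; the bandwidth choice and the exact implementation in the study may differ):

```python
import numpy as np
from scipy.stats import gaussian_kde

def area_weighted_kde(hi_values, latitudes, grid):
    """Kernel density estimate of gridded HI values, with each grid cell
    weighted by cos(latitude) to account for differences in cell area."""
    weights = np.cos(np.deg2rad(latitudes))
    kde = gaussian_kde(hi_values, weights=weights / weights.sum())
    return kde(grid)

# Usage (flattened 2D fields of HI and latitude, density on a 1D HI grid):
# density = area_weighted_kde(hi.ravel(), lat2d.ravel(), np.linspace(20, 60, 200))
```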
Figure B3 visualizes the heat index forecast for the peak of the heatwave by forecasts with different initialization times (in terms of threat categories; see Table B1). Results match the other analyses in this section.
c. 2021 North American winter storm
While heatwaves often receive much media attention, especially in light of anthropogenic climate change, cold spells are also hazardous. Under the current climate, cold extremes lead to more human deaths overall than hot extremes (Gasparrini et al. 2015). In mid-February 2021, a winter storm hit large parts of the United States, northern Mexico, and Canada (Fig. 1c). Rapidly falling temperatures were accompanied by snow, sleet, freezing rain, and strong winds, causing damage to human livelihoods and infrastructure (NWS 2021). In Texas, which was strongly affected by the event, pipes burst, interrupting the water distribution, and energy infrastructure failed, resulting in power outages and ordered rolling blackouts. Impacts were amplified by inadequate winterization of energy infrastructure (Gruber et al. 2022).
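The impact metric for this case study is the wind chill index Twc [Eq. (1)]. A minimal sketch of its computation, assuming the standard formulation of Osczevski and Bluestein (2005) with temperature in degrees Celsius and 10-m wind speed in kilometers per hour:

```python
import numpy as np

def wind_chill(t2m_c, wind_kmh, enforce_thresholds=True):
    """Wind chill index Twc (°C), assuming the Osczevski and Bluestein (2005)
    formulation: Twc = 13.12 + 0.6215*T + (0.3965*T - 11.37) * v**0.16,
    with T the 2-m temperature (°C) and v the 10-m wind speed (km/h).
    The index is only defined for T <= 10°C and v > 4.8 km/h; outside these
    thresholds we return NaN (in Figs. 6 and 7 the thresholds are ignored)."""
    v16 = np.power(wind_kmh, 0.16)
    twc = 13.12 + 0.6215 * t2m_c + (0.3965 * t2m_c - 11.37) * v16
    if enforce_thresholds:
        twc = np.where((t2m_c <= 10.0) & (wind_kmh > 4.8), twc, np.nan)
    return twc
```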
Looking at predictions of Twc for the grid cell closest to College Station in Fig. 6, one can see that all models struggle to predict the minimum wind chill index, with forecast errors being largest for FourCastNet (sometimes exceeding 40 K at minimum Twc for large lead times). In general, errors are larger for the winter storm than for the Pacific Northwest heatwave, which might be due to a potential seasonality in prediction errors [as suggested by Fig. 2 in Ben Bouallègue et al. (2024)] or, simply, to the different nature of the events. Errors for Pangu-Weather and GraphCast are substantially lower than for HRES, especially between 9 and 17 February, after the peak of the winter storm.
Fig. 6. (a)–(d) The Twc prediction errors for different validity and lead times. Data are from the grid cell closest to College Station, Texas. Times and dates are given in UTC. (e) Time series of the ground truth datasets used in the computation: ERA5 is used for the ML forecasts, and HRES-fc0 is used for HRES.
The vertical structures in the plot (particularly prominent for GraphCast and Pangu-Weather on 15–17 February) hint at the difficulty of predicting the weather situation itself, while the diagonal structures (strong for FourCastNet and HRES) suggest variation between individual forecasts caused by their initial conditions. In general, HRES seems to produce stronger diagonal error structures, while the ML models tend to exhibit vertical error patterns. While the data filling we use to extend HRES forecasts might affect this finding, it could also result from fundamental differences between ML-based and NWP forecasts. For a very extreme event that is easily predictable following physical laws, the predictions of HRES would steadily improve when approaching the event. ML methods, however, might not be able to extrapolate to such extreme conditions, even at very short lead times, resulting in vertical error patterns. On the other hand, if an extreme event that is difficult to predict with (first order) physical laws is somewhat hidden in the atmospheric state on the initialization day, HRES would have trouble forecasting both the event and its buildup, leading to a diagonal pattern; once the event becomes more apparent from the initial conditions, the prediction improves. The ML methods might be able to model such “hidden” (second order) conditions better because of their flexibility, leading to weaker diagonal patterns.
When the predicted temperatures are too high or the predicted wind speeds too low, the thresholds in the definition of Twc are not exceeded, and Twc is thus not defined. This is the case during the wind chill minimum between 15 and 17 February even though these were the most hazardous days. In Figs. 6 and 7, we ignore the thresholds in the definition and still compute the Twc expression for visual clarity.
Fig. 7. Errors of Twc forecasts for 1200 UTC 15 Feb 2021 (0600 LT, Houston time). The ground truth used to compute results for the ML forecasts is ERA5, while HRES-fc0 is used for HRES.
Patterns in the prediction of T2m look similar to those of Twc (Fig. C2), with FourCastNet errors being the largest and HRES errors during the event being larger than Pangu-Weather and GraphCast errors. For the surface wind speed, which also enters Eq. (1), the patterns are not as clear (Fig. C3). Therefore, the Twc errors seem to be dominated by T2m.
Figure 7 depicts the forecast errors in a bounding box around Texas (107°–93°W, 25°–37°N) for 1200 UTC 15 February 2021 (0600 LT, Houston time), when the average wind chill index in the box reached its minimum in the ground truth data. For long lead times, the forecast errors of GraphCast and Pangu-Weather seem smaller than those of FourCastNet and HRES, although all forecasts appear to be too warm. One can also notice a slightly “patchy” structure in the FourCastNet predictions, with notable discontinuities between 8 × 8 patches, the patch size used internally by FourCastNet.
4. Discussion and conclusions
The three case studies highlight different aspects of the comparison between the ML models GraphCast, Pangu-Weather, and FourCastNet and the NWP model HRES. For the 2021 Pacific Northwest heatwave, the predictions of Pangu-Weather and GraphCast maintained comparable quality to HRES in terms of the evaluated metrics. However, for short lead times, HRES showed smaller forecast errors than the ML models in both the predictability barrier plots and the RMSE plots, contrary to their relative performance in the baseline summers of 2020 and 2022, indicating that the ML models might be more severely affected by the extrapolation to these extreme conditions. We also observe that HRES has more difficulty than the ML models in predicting the sharp drop in temperature after the peak of the heatwave. When analyzing the South Asian humid heatwave with RH1000hPa substituted for RHsfc, the prediction errors show spatial patterns in which the ML models underestimate the highest danger levels over Bangladesh. For many lead times and initial conditions, the North American winter storm is forecast more accurately by Pangu-Weather and GraphCast than by HRES. From these predictions, we observe structurally different error patterns: HRES and FourCastNet are potentially more affected by subtle signals in the initial conditions than GraphCast and Pangu-Weather, leading to errors that build up before the event. We emphasize that our findings are limited to the three case studies, and more systematic analyses need to be conducted to reach definitive conclusions about extreme weather event forecasts in general.
None of the ML models predicts a variable that enables the computation of surface-level humidity, which would have allowed us to better study the effects of the 2023 humid heatwave in South Asia, as surface humidity alters the effect of temperature on the human body. Using substitute variables, the ML models seem to perform worse for this event overall, potentially due to extrapolation. Whether this effect persists for ML models that do predict surface humidity remains to be answered in future research. The rather large differences in relative humidity at the surface level between the ERA5 dataset and HRES-fc0 complicate the estimation of the “true” heat index forecasting errors. One way to resolve this would be to compare directly against station observations.
Comparing forecast systems on only a subset of extreme events incurs the danger of favoring alarmist forecasts, a phenomenon termed the “forecaster’s dilemma” (Lerch et al. 2017). This is a general problem of case studies and even applies to a broader class of analyses. Such results should not be used alone to judge the overall quality of weather forecasting systems. The existing literature on evaluating ML-based weather forecasts described in section 1 can be combined with our findings to obtain a more complete picture.
Our study only uses single forecasts and disregards probabilistic forecasting. In NWP, forecast uncertainty is accounted for by running ensemble forecasts. While including NWP ensemble forecasts is possible, this would have caused further complications in the analyses, e.g., due to differing model resolutions. Furthermore, producing ensemble forecasts with the given ML models is nontrivial. Attempts have been made, e.g., by perturbing initial conditions or model parameters (Weyn et al. 2021; Bi et al. 2023; Bülte et al. 2024), but problems capturing the right scaling of uncertainties have been reported: Selz and Craig (2023) investigated Pangu-Weather and found that the error growth for small perturbations is too small (“no butterfly effect”). Recently, Price et al. (2023) explored generative modeling to obtain better ensembles. Because of the generative training objective, these models can better capture the spectrum of the weather at long lead times, avoiding the oversmoothing that occurs for autoregressive models like GraphCast, Pangu-Weather, and FourCastNet. For the evaluation of generative ML-based weather forecast models, proper scoring rules (Gneiting and Katzfuss 2014), like the class described by Allen et al. (2023) for probabilistic forecasts, will be an important analysis tool.
The comparison of ML models with HRES is also limited by differences between the ground truth datasets ERA5 and HRES-fc0 (differing assimilation times and short forecasts for 0600/1800 UTC initializations). However, this does not affect the comparison among the three ML models. ML weather prediction models are typically trained using ERA5 data, which does not correspond to an “operational setting” and complicates the comparison with HRES. While ML models could be trained or fine-tuned with HRES-fc0 data directly, the IFS version used to produce the forecasts varies over time; therefore, the characteristics and biases of the “ground truth” HRES-fc0 would also vary.
Most ML models employ a large autoregressive time step (6 h for GraphCast and FourCastNet). This coarse temporal resolution might affect the forecast of impacts for which the daily maximum or minimum is relevant, such as short-term heat stress peaks or severe wind gusts. The most extreme values might be missed due to an unfortunate combination of forecast time step and daily cycle or event time. Some important variables for impact assessments are forecast by few or no ML models. These include humidity at the surface, solar radiation reaching the surface (potentially relevant for solar energy production forecasts), and precipitation. While some ML models (including GraphCast and FourCastNet) do predict precipitation, authors have advised caution in the interpretation of this variable, citing issues with the ERA5 precipitation ground truth (Lavers et al. 2022).
While case studies can only provide anecdotal evidence, testing ML models under individual extreme events can reveal unexpected deficiencies (or advantages) of these models in comparison to well-established techniques. The rather small number of meteorological variables predicted by ML models, as well as the available forecast lead time, limits the types of impactful extreme events that can be studied for these models. While longer forecasts would be interesting and would allow the study of more complex types of extreme events (Zscheischler et al. 2020), one would likely need to include new processes and variables in the models, such as feedback from soil moisture and the influence of sea surface temperatures.
Nonlinear combinations of predicted output variables [e.g., wind chill; see Eq. (1)] have the potential to reveal weaknesses of ML models; Price et al. (2023) investigated horizontal surface wind speed (a nonlinear function of the horizontal wind components) and found that GraphCast tends to perform worse in terms of this combined metric than for the individual components. They hypothesized that this might be due to the tendency of a certain type of ML architecture to predict close to the mean under forecast uncertainty and the noncommutativity of nonlinear function applications and averaging. However, in our case studies (Twc during the 2021 North American winter storm and HI during the 2023 South Asian humid heatwave), the large differences in prediction errors for individual input variables (T2m during the winter storm and relative humidity during the humid heatwave) and the need to substitute relative humidity at the surface level seem to outweigh this effect. Nevertheless, the described problem is an interesting target for future work, as impacts often are not simply determined by linear combinations of the variables predicted by the ML models. Price et al. (2023) suggest using generative modeling to overcome this systematic problem.
For theoretically justified extrapolation to extremes, and when interested in risk assessment, a natural approach is the use of extreme value statistics (Coles 2001). Recently, various approaches have combined machine learning and extreme value statistics to improve predictive extrapolation of extreme risk for the predicted variable (Pasche and Engelke 2024; Richards and Huser 2024; Velthoen et al. 2023; Cisneros et al. 2023; Allouche et al. 2024; Gnecco et al. 2024). Methods for extrapolation in the predictor space also exist but require stronger dependence assumptions (Shen and Meinshausen 2023; Pfister and Bühlmann 2024). Including physical domain knowledge in ML-based models, for example, through architectural restrictions or explicit equations, could be another approach to improving generalization (Kochkov et al. 2024).
Releasing raw predictions instead of aggregates or summaries, or even the pretrained models themselves, is valuable (Burnell et al. 2023). As considering all metrics that stakeholders deem important during model development and testing is challenging or impossible, releasing the full predicted data or trained models allows domain-specific model skill to be assessed even after model development. WeatherBench 2 (Rasp et al. 2024) already partially addresses this point. Building a continuously updated database of extreme event case study setups (similar to ECMWF’s Severe Event Catalogue; ECMWF 2024), including domains and impact metrics, might be a valuable contribution to the existing literature. A focus on reusability for new models would be important, potentially through the integration of a framework like WeatherBench 2. One caveat is that selecting extreme events based on their real-world impacts may introduce a selection bias: the largest impacts might partly have been caused by poor forecasts of the operational models, potentially biasing the estimate of the relative performance of HRES.
The evaluation of ML models typically focuses on meteorological variables. Putting a stronger focus directly on impacts has the potential to improve the practical value of ML models. To find suitable impact metrics, researchers could, for instance, look at warnings issued by weather services and analyze how warnings based on NWP compare to those based on ML forecasts. Coupling ML weather forecasts with impact models, such as models for floods (Nearing et al. 2024), crop loss, or fires, might also be valuable, although the analysis would then also depend on the impact model’s fidelity. While ML models have shown impressive skill in forecasting key meteorological variables, it is worth investigating whether their predictions can lead to similarly impressive results when assessing impacts.
Acknowledgments.
We thank Gloria Buriticá and Guohao Li for the discussions in early stages of the project and Lily-Belle Sweet for her feedback on this manuscript. We are grateful to the developers of FourCastNet, Pangu-Weather, and GraphCast for sharing their code and thank ECMWF for making their datasets publicly available, which has allowed us to conduct this study. S. E., O. C. P., and Z. Z. acknowledge funding from the Swiss National Science Foundation Eccellenza grant “Graph structures, sparsity and high-dimensional inference for extremes” (Grant 186858). J. W. acknowledges financial support from the Federal Ministry of Education and Research of Germany and from the Sächsische Staatsministerium für Wissenschaft, Kultur und Tourismus in the program Center of Excellence for AI-research “Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig,” project identification ScaDS.AI. J. W. and J. Z. acknowledge the Helmholtz Initiative and Networking Fund (Young Investigator Group COMPOUNDX, Grant Agreement VH-NG-1537). S. E., O. C. P., J. W., Z. Z., and J. Z. conceptualized the study; O. C. P., J. W., and Z. Z. curated the data; O. C. P., J. W., and Z. Z. participated in the formal analysis; S. E. and J. Z. acquired the funding; O. C. P., J. W., and Z. Z. did the investigation; S. E., O. C. P., J. W., Z. Z., and J. Z. formulated the methodology; S. E. and J. Z. administered the project; S. E. and J. Z. gathered the resources; O. C. P., J. W., and Z. Z. provided the software; S. E. and J. Z. supervised the study; O. C. P., J. W., and Z. Z. validated the results; O. C. P., J. W., and Z. Z. visualized the study; O. C. P., J. W., and Z. Z. prepared the original draft; S. E., O. C. P., J. W., Z. Z., and J. Z. reviewed and edited the manuscript; O. C. P. led the analysis and implementation of Pangu-Weather and FourCastNet; Z. Z. led the analysis and implementation of GraphCast; and J. W. led the case study data analysis.
Data availability statement.
We use data from the ECMWF products ERA5, HRES, and TIGGE, which are published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. ERA5 is available on the Copernicus Climate Data Store. HRES forecasts initialized at 0000/1200 UTC can be accessed through ECMWF’s TIGGE Data Retrieval portal. HRES forecasts initialized at 0600/1800 UTC are accessible through ECMWF’s MARS, which requires access to be granted. Recently, Rasp et al. (2024) published cloud-optimized versions of the ERA5 and HRES data. We used these datasets for the case studies in 2021 and accessed their versions of ERA5, the ERA5 climatology, and the HRES forecasts initialized at 0000/1200 UTC. The code to produce forecasts with GraphCast (https://github.com/google-deepmind/graphcast), Pangu-Weather (https://github.com/198808xc/Pangu-Weather), and FourCastNet (https://github.com/NVlabs/FourCastNet) is publicly available. We published the preprocessed ground truth data and model forecasts for the periods and regions studied (Pasche et al. 2024) under a CC BY 4.0 license. The code to reproduce the analyses and figures discussed in this work is available at https://github.com/jonathanwider/DLWP-eval-extremes (release v1.0).
APPENDIX A
Further Details and Analysis of the 2021 Pacific Northwest Heatwave
a. Computation of the root-mean-square error
In Fig. 2, two contours are determined by the long-term average HRES performance for 120-h (D5) and 240-h (D10) forecasts. To estimate these values, we use all HRES forecasts provided by WeatherBench 2 (Rasp et al. 2024), which at the time of writing contains only initializations at 0000/1200 UTC between 1 January 2016 and 10 January 2023. We use HRES-fc0 as the ground truth and only consider predictions for days within a 45-day window around the day of the year of 28 June 2021. Numerical values for the grid boxes closest to the three investigated cities are provided in Table A1.
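The day-of-year windowing can be sketched as follows (the ±22-day half-width used to center the 45-day window is an assumption):

```python
import pandas as pd

def doy_window_mask(valid_times, center="2021-06-28", half_width_days=22):
    """Select validity times whose day of year lies within a window around the
    reference date, pooling all available years (the +/-22-day half-width is
    an assumption about how the 45-day window is centered)."""
    times = pd.DatetimeIndex(valid_times)
    center_doy = pd.Timestamp(center).dayofyear
    delta = (times.dayofyear - center_doy) % 365  # wrap around year boundary
    return (delta <= half_width_days) | (delta >= 365 - half_width_days)
```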
Table A1. Long-term average RMSE values of HRES predictions for lead times of 120 h (D5) and 240 h (D10), computed as described in appendix A, section a.
b. Additional figures
In this subsection, we show the additional Figs. A1–A3 for our analysis of the 2021 Pacific Northwest heatwave.
Fig. A1. Evolution of the T2m prediction RMSE with lead time for the three ML models and HRES in the 2021 Pacific Northwest heatwave region during the summer of 2020 (1 Jun–31 Jul) as a baseline year. Observations in the considered box region, 45°–52°N, 119°–123°W, are weighted to correct for differences in gridcell area. In contrast to Fig. 3, both the ML models and HRES use 0000/1200 UTC initial conditions and evaluation times only, since the forecasts were downloaded from WeatherBench 2 for computational reasons and only these initializations are available there.
Fig. A2. Average temperature anomaly predicted for 27–29 Jun 2021 (inclusive). All anomalies, including those for HRES, are calculated with respect to the ERA5 climatology given in Rasp et al. (2024). The fact that HRES anomalies are computed against ERA5 data explains the patchy small-scale structure visible in the HRES panels. Forecasts are initialized at 0000 UTC on the day specified in the row title.
Fig. A3. Predictability barrier plots after taking the (left) argmin and (right) argmax over the RMSE (K) of the different models. The color of each pixel in the left panel thus indicates which model had the lowest RMSE for the given lead time and validity date; in the right panel, it shows which model had the largest RMSE.
APPENDIX B
Further Details and Analysis of the 2023 South Asian Humid Heatwave
a. Shapefiles
In section 3b, we use shapefiles to subset the study regions of the South Asian humid heatwave. We use country boundaries from the “World Administrative Boundaries—Countries and Territories” (Open Government License 3.0, https://public.opendatasoft.com/explore/dataset/world-administrative-boundaries/information/) dataset by the World Food Programme, and for the India–Bangladesh region, we additionally use the (1976–2000) map of “World Maps of the Köppen–Geiger Climate Classification” (Creative Commons Attribution 4.0, https://datacatalog.worldbank.org/search/dataset/0042325). We only include grid cells with Köppen–Geiger Class A (“tropical”) in the India–Bangladesh region.
b. Relative humidity
Relative humidity is required to compute the heat index, but Pangu-Weather (Bi et al. 2023) and GraphCast (Lam et al. 2023) only produce specific humidity, and only on atmospheric pressure levels, not at the surface. Therefore, we first need to compute relative humidity from specific humidity. We exclude FourCastNet (Pathak et al. 2022) from the case study in section 3b because it does not provide humidity at any level closer to the surface than 850 hPa.
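This conversion can be sketched as follows (assuming saturation vapor pressure over water from Bolton's approximation; the exact formulation used in the study may differ):

```python
import numpy as np

def relative_humidity(q, t_k, p_pa):
    """Relative humidity (%) from specific humidity q (kg/kg), temperature
    t_k (K), and pressure p_pa (Pa). Saturation vapor pressure over water is
    approximated with Bolton's (1980) formula."""
    t_c = t_k - 273.15
    e_s = 611.2 * np.exp(17.67 * t_c / (t_c + 243.5))  # saturation vapor pressure (Pa)
    # Vapor pressure from specific humidity (epsilon = Rd/Rv ~ 0.622).
    e = q * p_pa / (0.622 + 0.378 * q)
    return 100.0 * e / e_s
```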
c. Heat index
As described in section 3b, we follow Zachariah et al. (2023) in computing the heat index using a modified version of the heat index (Rothfusz 1990) used by the NOAA Weather Prediction Center (WPC). The NOAA WPC formulation is accessible online (https://www.wpc.ncep.noaa.gov/html/heatindex_equation.shtml; last accessed 5 April 2024, page last modified 12 May 2022).
The heat index formula is defined for temperature in degrees Fahrenheit; in the study, we transform the temperature input to degrees Fahrenheit and the heat index output back to degrees Celsius.
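A sketch of the computation, following the formulation on the NOAA WPC page (the simple-formula branch and the dry/humid adjustments are reproduced from that page; edge-case handling is an assumption):

```python
import numpy as np

def heat_index_f(t_f, rh):
    """NOAA/WPC heat index (°F) from temperature t_f (°F) and relative
    humidity rh (%), following the Rothfusz (1990) regression with the
    WPC adjustments; a sketch for scalar inputs."""
    # Simple formula, averaged with the temperature per the WPC page;
    # used whenever the result stays below 80°F.
    hi = 0.5 * (t_f + 61.0 + (t_f - 68.0) * 1.2 + rh * 0.094)
    hi = (hi + t_f) / 2.0
    if hi < 80.0:
        return hi
    # Full Rothfusz regression.
    hi = (-42.379 + 2.04901523 * t_f + 10.14333127 * rh
          - 0.22475541 * t_f * rh - 6.83783e-3 * t_f**2
          - 5.481717e-2 * rh**2 + 1.22874e-3 * t_f**2 * rh
          + 8.5282e-4 * t_f * rh**2 - 1.99e-6 * t_f**2 * rh**2)
    # WPC adjustments for very dry or very humid conditions.
    if rh < 13 and 80 <= t_f <= 112:
        hi -= ((13 - rh) / 4) * np.sqrt((17 - abs(t_f - 95)) / 17)
    elif rh > 85 and 80 <= t_f <= 87:
        hi += ((rh - 85) / 10) * ((87 - t_f) / 2)
    return hi

def heat_index_c(t_c, rh):
    """Convenience wrapper: Celsius in, Celsius out (as in the study)."""
    return (heat_index_f(t_c * 9 / 5 + 32, rh) - 32) * 5 / 9
```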
HI values are classified as in Zachariah et al. (2023); the classes are listed with potential health consequences in Table B1.
d. Additional figures
In this subsection, we show the additional Figs. B1–B3 for our analysis of the 2023 South Asian humid heatwave.
Fig. B1. Forecast error for RH1000hPa at the time step of each day when the observed HI peaked in the corresponding ground truth dataset, averaged over 17–20 Apr 2023.
Fig. B2. Forecast error for T2m at the time step of each day when the observed HI peaked in the corresponding ground truth dataset, averaged over 17–20 Apr 2023.
Fig. B3. Category of mean daily maximum HI (see Table B1) predicted for 17–20 Apr 2023 with different models and for varying initialization times. Forecasts are initialized at 0000 UTC on the specified initial date. The categories in the ground truth datasets are shown in the bottom panels (labeled GT1 and GT3). For the ML models, HI is computed using RH1000hPa, while for all other panels, we use RHsfc.
APPENDIX C
Further Analysis of the 2021 North American Winter Storm
Additional figures
In this subsection, we show the additional Figs. C1–C3 for our analysis of the 2021 North American winter storm.
Fig. C1. ERA5 and HRES-fc0 T2m and wind chill index Twc time series in the grid box closest to College Station, Texas. Weather station data are from Easterwood Field, College Station, Texas, retrieved from the Integrated Surface Dataset (ISD; Smith et al. 2011). This figure ignores the thresholds in the definition of Twc for better readability.
Fig. C2. (a)–(d) The T2m prediction errors for different validity and lead times in the grid cell closest to College Station, Texas (UTC time used). (e) Ground truth T2m time series in the same grid cell.
Fig. C3. (a)–(d) Prediction errors for surface wind speed for different validity and lead times in the grid cell closest to College Station, Texas (UTC time used). (e) Ground truth wind speed time series in the same grid cell.
REFERENCES
Allen, S., D. Ginsbourger, and J. Ziegel, 2023: Evaluating forecasts for high-impact events using transformed kernel scores. SIAM/ASA J. Uncertainty Quantif., 11, 906–940, https://doi.org/10.1137/22M1532184.
Allouche, M., S. Girard, and E. Gobet, 2024: Estimation of extreme quantiles from heavy-tailed distributions with neural networks. Stat. Comput., 34, 12, https://doi.org/10.1007/s11222-023-10331-2.
Andrychowicz, M., L. Espeholt, D. Li, S. Merchant, A. Merose, F. Zyda, S. Agrawal, and N. Kalchbrenner, 2023: Deep learning for day forecasts from sparse observations. arXiv, 2306.06079v3, https://doi.org/10.48550/arXiv.2306.06079.
Bartusek, S., K. Kornhuber, and M. Ting, 2022: 2021 North American heatwave amplified by climate change-driven nonlinear interactions. Nat. Climate Change, 12, 1143–1150, https://doi.org/10.1038/s41558-022-01520-4.
Bauer, P., 2024: What if? Numerical weather prediction at the crossroads. arXiv, 2407.03787v2, https://doi.org/10.48550/arXiv.2407.03787.
Ben Bouallègue, Z., and Coauthors, 2024: The rise of data-driven weather forecasting: A first statistical assessment of machine learning–based weather forecasts in an operational-like context. Bull. Amer. Meteor. Soc., 105, E864–E883, https://doi.org/10.1175/BAMS-D-23-0162.1.
Bi, K., L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian, 2023: Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619, 533–538, https://doi.org/10.1038/s41586-023-06185-3.
Blazejczyk, K., Y. Epstein, G. Jendritzky, H. Staiger, and B. Tinz, 2012: Comparison of UTCI to selected thermal indices. Int. J. Biometeor., 56, 515–535, https://doi.org/10.1007/s00484-011-0453-2.
Bonavita, M., 2024: On some limitations of current Machine Learning weather prediction models. Geophys. Res. Lett., 51, e2023GL107377, https://doi.org/10.1029/2023GL107377.
Bülte, C., N. Horat, J. Quinting, and S. Lerch, 2024: Uncertainty quantification for data-driven weather models. arXiv, 2403.13458v1, https://doi.org/10.48550/arXiv.2403.13458.
Burnell, R., and Coauthors, 2023: Rethink reporting of evaluation results in AI. Science, 380, 136–138, https://doi.org/10.1126/science.adf6369.
Buzan, J. R., and M. Huber, 2020: Moist heat stress on a hotter Earth. Annu. Rev. Earth Planet. Sci., 48, 623–655, https://doi.org/10.1146/annurev-earth-053018-060100.
Charlton-Perez, A. J., and Coauthors, 2024: Do AI models produce better weather forecasts than physics-based models? A quantitative evaluation case study of Storm Ciarán. npj Climate Atmos. Sci., 7, 93, https://doi.org/10.1038/s41612-024-00638-w.
Chen, K., and Coauthors, 2023: FengWu: Pushing the skillful global medium-range weather forecast beyond 10 days lead. arXiv, 2304.02948v1, https://doi.org/10.48550/arXiv.2304.02948.
Chen, L., X. Zhong, F. Zhang, Y. Cheng, Y. Xu, Y. Qi, and H. Li, 2023: FuXi: A cascade machine learning forecasting system for 15-day global weather forecast. npj Climate Atmos. Sci., 6, 190, https://doi.org/10.1038/s41612-023-00512-1.
Cisneros, D., J. Richards, A. Dahal, L. Lombardo, and R. Huser, 2023: Deep graphical regression for jointly moderate and extreme Australian wildfires. arXiv, 2308.14547v2, https://doi.org/10.48550/arXiv.2308.14547.
Coles, S., 2001: An Introduction to Statistical Modeling of Extreme Values. Springer, 209 pp.
ECMWF, 2024: Severe event catalogue – Forecast user – ECMWF confluence wiki. ECMWF, accessed 8 April 2024, https://confluence.ecmwf.int/display/FCST/Severe+Event+Catalogue.
Espeholt, L., and Coauthors, 2022: Deep learning for twelve hour precipitation forecasts. Nat. Commun., 13, 5145, https://doi.org/10.1038/s41467-022-32483-x.
Gasparrini, A., and Coauthors, 2015: Mortality risk attributable to high and low ambient temperature: A multicountry observational study. Lancet, 386, 369–375, https://doi.org/10.1016/S0140-6736(14)62114-0.
Gnecco, N., E. M. Terefe, and S. Engelke, 2024: Extremal random forests. J. Amer. Stat. Assoc., 119, 3059–3072, https://doi.org/10.1080/01621459.2023.2300522.
Gneiting, T., and M. Katzfuss, 2014: Probabilistic forecasting. Annu. Rev. Stat. Appl., 1, 125–151, https://doi.org/10.1146/annurev-statistics-062713-085831.
Gruber, K., T. Gauster, G. Laaha, P. Regner, and J. Schmidt, 2022: Profitability and investment risk of Texan power system winterization. Nat. Energy, 7, 409–416, https://doi.org/10.1038/s41560-022-00994-y.
Hastie, T., R. Tibshirani, and J. Friedman, 2009: The Elements of Statistical Learning. Springer, 745 pp.
Hersbach, H., and Coauthors, 2020: The ERA5 global reanalysis. Quart. J. Roy. Meteor. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803.
Kochkov, D., and Coauthors, 2024: Neural general circulation models for weather and climate. Nature, 632, 1060–1066, https://doi.org/10.1038/s41586-024-07744-y.
Lam, R., and Coauthors, 2023: Learning skillful medium-range global weather forecasting. Science, 382, 1416–1421, https://doi.org/10.1126/science.adi2336.
Lang, S., and Coauthors, 2024: AIFS—ECMWF’s data-driven forecasting system. arXiv, 2406.01465v2, https://doi.org/10.48550/arXiv.2406.01465.
Lavers, D. A., A. Simmons, F. Vamborg, and M. J. Rodwell, 2022: An evaluation of ERA5 precipitation for climate monitoring. Quart. J. Roy. Meteor. Soc., 148, 3152–3165, https://doi.org/10.1002/qj.4351.
Leinonen, J., U. Hamann, D. Nerini, U. Germann, and G. Franch, 2023: Latent diffusion models for generative precipitation nowcasting with accurate uncertainty quantification. arXiv, 2304.12891v1, https://doi.org/10.48550/arXiv.2304.12891.
Lerch, S., T. L. Thorarinsdottir, F. Ravazzolo, and T. Gneiting, 2017: Forecaster’s dilemma: Extreme events and forecast evaluation. Stat. Sci., 32, 106–127, https://doi.org/10.1214/16-STS588.
Lin, H., R. Mo, and F. Vitart, 2022: The 2021 western North American heatwave and its subseasonal predictions. Geophys. Res. Lett., 49, e2021GL097036, https://doi.org/10.1029/2021GL097036.
Lo, Y. T. E., and Coauthors, 2023: Optimal heat stress metric for modelling heat-related mortality varies from country to country. Int. J. Climatol., 43, 5553–5568, https://doi.org/10.1002/joc.8160.
Lopez-Gomez, I., A. McGovern, S. Agrawal, and J. Hickey, 2023: Global extreme heat forecasting using neural weather models. Artif. Intell. Earth Syst., 2, e220035, https://doi.org/10.1175/AIES-D-22-0035.1.
Magnusson, L., 2023: Exploring machine-learning forecasts of extreme weather. ECMWF Newsletter, No. 176, ECMWF, Reading, United Kingdom, 8–9, https://www.ecmwf.int/en/newsletter/176/news/exploring-machine-learning-forecasts-extreme-weather.
Neal, E., C. S. Y. Huang, and N. Nakamura, 2022: The 2021 Pacific Northwest heat wave and associated blocking: Meteorology and the role of an upstream cyclone as a diabatic source of wave activity. Geophys. Res. Lett., 49, e2021GL097699, https://doi.org/10.1029/2021GL097699.
NWS, 2021: Valentine’s week winter outbreak 2021: Snow, ice, & record cold. NOAA’s National Weather Service, accessed 26 January 2024, https://www.weather.gov/hgx/2021ValentineStorm.
Nearing, G., and Coauthors, 2024: Global prediction of extreme floods in ungauged watersheds. Nature, 627, 559–563, https://doi.org/10.1038/s41586-024-07145-1.
Nguyen, T., J. Brandstetter, A. Kapoor, J. K. Gupta, and A. Grover, 2023a: ClimaX: A foundation model for weather and climate. arXiv, 2301.10343v5, https://doi.org/10.48550/arXiv.2301.10343.
Nguyen, T., and Coauthors, 2023b: Scaling transformer neural networks for skillful and reliable medium-range weather forecasting. arXiv, 2312.03876v2, https://doi.org/10.48550/arXiv.2312.03876.
Olivetti, L., and G. Messori, 2024a: Advances and prospects of deep learning for medium-range extreme weather forecasting. Geosci. Model Dev., 17, 2347–2358, https://doi.org/10.5194/gmd-17-2347-2024.
Olivetti, L., and G. Messori, 2024b: Do data-driven models beat numerical models in forecasting weather extremes? A comparison of IFS HRES, Pangu-Weather and GraphCast. Geosci. Model Dev., 17, 7915–7962, https://doi.org/10.5194/gmd-17-7915-2024.
Osczevski, R., and M. Bluestein, 2005: The new wind chill equivalent temperature chart. Bull. Amer. Meteor. Soc., 86, 1453–1458, https://doi.org/10.1175/BAMS-86-10-1453.
Owens, R., and T. Hewson, 2018: ECMWF Forecast User Guide. ECMWF, Reading, United Kingdom, https://doi.org/10.21957/M1CS7H.
Pasche, O. C., and S. Engelke, 2024: Neural networks for extreme quantile regression with an application to forecasting of flood risk. Ann. Appl. Stat., 18, 2818–2839, https://doi.org/10.1214/24-AOAS1907.
Pasche, O. C., J. Wider, Z. Zhang, J. Zscheischler, and S. Engelke, 2024: Data release: Validating deep-learning weather forecast models on recent high-impact extreme events (version 1.0). Zenodo, https://doi.org/10.5281/zenodo.14358212.
Pathak, J., and Coauthors, 2022: FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv, 2202.11214v1, https://doi.org/10.48550/arXiv.2202.11214.
Pfister, N., and P. Bühlmann, 2024: Extrapolation-aware nonparametric statistical inference. arXiv, 2402.09758v2, https://doi.org/10.48550/arXiv.2402.09758.
Philip, S. Y., and Coauthors, 2022: Rapid attribution analysis of the extraordinary heat wave on the Pacific coast of the US and Canada in June 2021. Earth Syst. Dyn., 13, 1689–1713, https://doi.org/10.5194/esd-13-1689-2022.
Price, I., and Coauthors, 2023: GenCast: Diffusion-based ensemble forecasting for medium-range weather. arXiv, 2312.15796v2, https://doi.org/10.48550/arXiv.2312.15796.
Rasp, S., and N. Thuerey, 2021: Data-driven medium-range weather prediction with a Resnet pretrained on climate simulations: A new model for WeatherBench. J. Adv. Model. Earth Syst., 13, e2020MS002405, https://doi.org/10.1029/2020MS002405.
Rasp, S., P. D. Dueben, S. Scher, J. A. Weyn, S. Mouatadid, and N. Thuerey, 2020: WeatherBench: A benchmark data set for data-driven weather forecasting. J. Adv. Model. Earth Syst., 12, e2020MS002203, https://doi.org/10.1029/2020MS002203.
Rasp, S., and Coauthors, 2024: WeatherBench 2: A benchmark for the next generation of data-driven global weather models. J. Adv. Model. Earth Syst., 16, e2023MS004019, https://doi.org/10.1029/2023MS004019.
Richards, J., and R. Huser, 2024: Regression modelling of spatiotemporal extreme U.S. wildfires via partially-interpretable neural networks. arXiv, 2208.07581v4, https://doi.org/10.48550/arXiv.2208.07581.
Rothfusz, L. P., 1990: The heat index equation (or, more than you ever wanted to know about heat index). NOAA/NWS Office of Meteorology Tech. Attachment SR/SSD 90-23.
Röthlisberger, M., and L. Papritz, 2023: Quantifying the physical processes leading to atmospheric hot extremes at a global scale. Nat. Geosci., 16, 210–216, https://doi.org/10.1038/s41561-023-01126-1.
Schumacher, D. L., M. Hauser, and S. I. Seneviratne, 2022: Drivers and mechanisms of the 2021 Pacific Northwest heatwave. Earth’s Future, 10, e2022EF002967, https://doi.org/10.1029/2022EF002967.
Selz, T., and G. C. Craig, 2023: Can artificial intelligence-based weather prediction models simulate the butterfly effect? Geophys. Res. Lett., 50, e2023GL105747, https://doi.org/10.1029/2023GL105747.
Seneviratne, S. I., and Coauthors, 2023: Weather and climate extreme events in a changing climate. Climate Change 2021: The Physical Science Basis, V. Masson-Delmotte et al., Eds., Cambridge University Press, 1513–1766, https://doi.org/10.1017/9781009157896.013.
Shen, X., and N. Meinshausen, 2023: Engression: Extrapolation through the lens of distributional regression. arXiv, 2307.00835v3, https://doi.org/10.48550/arXiv.2307.00835.
Smith, A., N. Lott, and R. Vose, 2011: The integrated surface database: Recent developments and partnerships. Bull. Amer. Meteor. Soc., 92, 704–708, https://doi.org/10.1175/2011BAMS3015.1.
Velthoen, J., C. Dombry, J.-J. Cai, and S. Engelke, 2023: Gradient boosting for extreme quantile regression. Extremes, 26, 639–667, https://doi.org/10.1007/s10687-023-00473-x.
Watson, P. A. G., 2022: Machine learning applications for weather and climate need greater focus on extremes. Environ. Res. Lett., 17, 111004, https://doi.org/10.1088/1748-9326/ac9d4e.
Weyn, J. A., D. R. Durran, R. Caruana, and N. Cresswell-Clay, 2021: Sub-seasonal forecasting with a large ensemble of deep-learning weather prediction models. J. Adv. Model. Earth Syst., 13, e2021MS002502, https://doi.org/10.1029/2021MS002502.
White, R. H., and Coauthors, 2023: The unprecedented Pacific Northwest heatwave of June 2021. Nat. Commun., 14, 727, https://doi.org/10.1038/s41467-023-36289-3.
Zachariah, M., and Coauthors, 2023: Extreme humid heat in South Asia in April 2023, largely driven by climate change, detrimental to vulnerable and disadvantaged communities. Imperial College London Tech. Rep., 45 pp., https://doi.org/10.25561/104092.
Zeder, J., S. Sippel, O. C. Pasche, S. Engelke, and E. M. Fischer, 2023: The effect of a short observational record on the statistics of temperature extremes. Geophys. Res. Lett., 50, e2023GL104090, https://doi.org/10.1029/2023GL104090.
Zscheischler, J., and Coauthors, 2020: A typology of compound weather and climate events. Nat. Rev. Earth Environ., 1, 333–347, https://doi.org/10.1038/s43017-020-0060-z.