Statistical Modeling of 2-m Temperature and 10-m Wind Speed Forecast Errors

Zied Ben Bouallègue,a Fenwick Cooper,b Matthew Chantry,a Peter Düben,a Peter Bechtold,a and Irina Sandua

a European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom
b University of Oxford, Oxford, United Kingdom

Abstract

Based on the principle “learn from past errors to correct current forecasts,” statistical postprocessing consists of optimizing forecasts generated by numerical weather prediction (NWP) models. In this context, machine learning (ML) offers state-of-the-art tools for training statistical models and making predictions based on large datasets. In our study, ML-based solutions are developed to reduce forecast errors of 2-m temperature and 10-m wind speed in ECMWF’s operational medium-range, high-resolution forecasts produced with the Integrated Forecasting System (IFS). IFS forecasts and other spatiotemporal indicators are used as predictors after careful selection with the help of ML interpretability tools. Different ML approaches are tested: linear regression, random forest decision trees, and neural networks. Statistical models of systematic and random errors are derived sequentially, where the random error is defined as the residual error after bias correction. In terms of output, bias correction and forecast uncertainty prediction are made available at any location around the world. All three ML methods show a similar ability to capture situation-dependent biases, leading to noteworthy performance improvements (between 10% and 15% improvement in terms of root-mean-square error for all lead times and variables), and a similar ability to provide reliable uncertainty predictions.

© 2023 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Zied Ben Bouallègue, zied.benbouallegue@ecmwf.com


1. Introduction

Near-surface temperature and wind speed are key variables in many weather applications, but numerical weather prediction (NWP) systems struggle to produce bias-free forecasts of such quantities, even at short lead times. In particular, long-standing biases affect the operational medium-range forecasts of 2-m temperature and 10-m wind speed produced with the Integrated Forecasting System (IFS) of the European Centre for Medium-Range Weather Forecasts (ECMWF), as illustrated in Fig. 1.

Fig. 1.

The 48-h forecast performance over Europe (35°–75°N, 12.5°W–42.5°E) of IFS (left) 2-m temperature and (right) 10-m wind speed over the last decade, for ECMWF’s operational high-resolution forecasts with respect to SYNOP observations. The blue (root-mean-square error) and red (mean error) lines indicate the calculation as performed by Haiden et al. (2021), while the respective dots represent the equivalent calculation performed with the data quality control criteria used here (see text). The black dots indicate the resulting errors after postprocessing by the best ML-based models derived here (see text for details).

Citation: Monthly Weather Review 151, 4; 10.1175/MWR-D-22-0107.1

Recent investigations of ECMWF’s near-surface forecast biases shed new light on potential sources of forecast errors and paved the way for ongoing and future model developments for the IFS (Sandu et al. 2020). At the same time, statistical postprocessing offers a pragmatic way to correct systematic errors. By comparing forecasts with in situ observations, statistical models learn from past errors to derive corrections to be applied to future forecasts. Hemri et al. (2014) showed that the expected benefit of postprocessing does not vary year after year, suggesting that benefits from postprocessing and benefits from NWP model improvements are complementary.

In this study, statistical postprocessing of IFS forecasts is investigated with a focus on the ECMWF’s operational deterministic high-resolution forecasts of 2-m temperature and 10-m wind speed. More specifically, we assess and predict systematic and residual forecast errors using machine learning (ML) tools. The following semantics is used throughout the text: systematic errors refer to differences between forecasts and observations that can be corrected for by postprocessing through bias correction, while residual errors refer to the remaining forecast errors after bias correction.

ML provides a general framework for applying complex statistical methods to large datasets and finds natural applications in the postprocessing of weather forecasts (Düben et al. 2021). Recent developments of ML software libraries such as scikit-learn in the Python programming language (https://scikit-learn.org) greatly facilitate the adoption of state-of-the-art ML methods. Moreover, advances in ML interpretability (McGovern et al. 2019) provide suitable tools to initiate positive feedback loops between NWP model developers and postprocessing experts. Here, our ML-based postprocessing applications intend to:

  • capture bias patterns and estimate forecast uncertainty,

  • compare the performance of different postprocessing methods, and

  • help identify sources of errors,

in the context of global forecasting of surface weather variables.

Postprocessing of global forecasts requires large datasets in order to provide relevant contextual information about the forecast to be corrected. So-called predictors help distinguish between different situations (in a static or a dynamic sense) leading, on average, to over- or underprediction and to a large or a small forecast error. The general strategy consists of including a variety of predictors as input to the ML models: NWP model output (such as the forecast surface pressure), model characteristics (such as the model orography), and spatiotemporal indicators (such as the day of the year). ML algorithms are designed to find useful relationships between the predictand (here the forecast error) and the diverse sets of predictors. The use of such ML approaches for successful weather forecasting applications has been documented in recent years:

  1. linear regression techniques for the postprocessing of ensemble solar radiation forecasts over Germany (Ben Bouallègue 2017),

  2. decision trees from random forests for the postprocessing of temperature and wind speed forecasts over France (Taillardat et al. 2016), and

  3. neural networks for the postprocessing of ensemble temperature forecasts over Germany (Rasp and Lerch 2018),

to cite a few examples in an effervescent field of research. The interested reader can find an overview of postprocessing techniques and recent developments in this research area in Vannitsem et al. (2021).

Here, we propose to test and compare three statistical methods: linear regression (LR), random forests (RF), and neural networks (NN). The goal is to provide statistically postprocessed forecasts at any location over the globe based on 2-m temperature and 10-m wind speed IFS forecasts. In contrast with previous studies, systematic and residual errors are treated sequentially rather than simultaneously. Additionally, we explore the benefit and impact of using postprocessing configurations where the input data consist of static (i.e., non-state-dependent) predictors and time indicators only. Finally, following a suggestion in Hamill (2021), the combination of the different statistical models is also tested.

The remainder of the manuscript is organized as follows: section 2 details the data used in this study, the statistical models are described in section 3, and the selection of predictors in section 4. The results are presented and discussed in section 5 before concluding in section 6.

2. Data

a. Forecasts, observations, and predictors

The forecasts of 2-m temperature and 10-m wind speed used in this study are the operational ECMWF high-resolution (∼9 km) 10-day global weather forecasts produced with ECMWF IFS (ECMWF 2020). The data are taken over a 2-yr period (September 2019–August 2021) from forecasts starting each day at 0000 UTC and with lead times up to 48 h, at 3-h intervals.

Observations are measurements at synoptic weather stations (SYNOP) received through the World Meteorological Organization (WMO) Global Telecommunications System (GTS). A stringent quality control is applied before the use of the observation measurements in both ML training and forecast verification exercises. A detailed description of the quality control process is provided in appendixes A and B. The number of station measurements available varies from day to day, but after quality control the observation dataset relies on 2185 stations reporting 2-m temperature and 1799 stations reporting 10-m wind measurements. The stations are distributed heterogeneously around the world, as illustrated in section 5. For each weather station, the 2-m temperature and 10-m wind forecasts are taken at the model grid point nearest to the station.

There is a difference between the height of the model orography at a station location and the true height of the station. When comparing forecasts and observations, the standard 2-m temperature forecast correction applies a linear reduction in temperature with height (a lapse rate) of 6.5°C km−1 between the model elevation at the nearest grid point and the station elevation. This approach is considered the default bias correction in the following.
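This standard lapse-rate adjustment can be sketched in a few lines (a minimal illustration with elevations in metres; the function name is ours, not part of the IFS code):

```python
# Sketch of the default lapse-rate correction: the 2-m temperature
# forecast is adjusted from the model elevation (nearest grid point)
# to the true station elevation using a fixed lapse rate of 6.5 degC/km.
LAPSE_RATE = 6.5e-3  # degC per metre

def lapse_rate_correction(t2m_forecast, z_model, z_station):
    """Adjust a 2-m temperature forecast (degC) from model elevation
    z_model (m) to station elevation z_station (m)."""
    return t2m_forecast + LAPSE_RATE * (z_model - z_station)
```

For a station 500 m below the model orography, the forecast is warmed by 3.25°C; for a station above it, the forecast is cooled accordingly.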

We consider a variety of potential predictors for our ML experimentations. Among them, four predictors are derived from heat and radiation fields: surface solar radiation, surface thermal radiation, sensible heat, and latent heat fluxes. As these quantities are stored as cumulative quantities, they are converted to instantaneous fluxes using first-order finite differences. The full list of predictors is provided in Table 1.
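The conversion from accumulated to instantaneous fluxes can be sketched as follows (a simplified illustration assuming a constant 3-h archive step; the function and variable names are ours):

```python
import numpy as np

# The IFS stores radiation and heat fluxes accumulated from forecast
# start (J m-2); first-order finite differences over the archive step
# convert them to mean instantaneous fluxes (W m-2) for each interval.
def decumulate(cumulative, step_seconds=3 * 3600):
    """Convert a series of accumulated fluxes to instantaneous fluxes."""
    cumulative = np.asarray(cumulative, dtype=float)
    # Difference with the previous step (zero accumulation at start).
    return np.diff(cumulative, prepend=0.0) / step_seconds
```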

Table 1

Predictors list, classification, and use in different configurations: selected (∘), not selected (–), and not tested (⋅).


We test two types of model configurations:

  1. a “state-dependent” configuration where the current forecast and any other model output can be used as a predictor (i.e., there are no self-imposed restrictions on the use of predictors),

  2. a “state-independent” configuration where only predictors available before the start of the forecast-of-the-day are used.

We distinguish three types of predictors: state-dependent predictors, which are direct model outputs that differ for each forecast, static predictors that describe constant characteristics of the model surface, and time indicators (see the classification in Table 1). In a state-independent configuration, input data only include static predictors (such as the model orography) and time indicators (such as the day of the year) offering a model of the forecast errors independent of the forecast of the day.

b. Training, verification, and test data

We use data over the 2-yr period 1 September 2019–31 August 2021, and the data are split into three segments: training, verification, and test. The training data are the portion of the data that the ML-based models are fit to. We use one year of data for training (1 September 2019–31 August 2020) and half of the stations (even-numbered, with numbers randomly attributed). The verification data are the portion of the data reserved for the optimization of all free model parameters, often called hyperparameters. For example, we do not know a priori the number of trees to use in a random forest, the number of neurons to use in a neural network, or the step size in an iterative descent. We also use even-numbered stations for the verification data. The test data are held back until the end to evaluate model predictions on data that the models have not seen. We conduct four experiments with different verification and test datasets (see Table 2).

Table 2

Data split in terms of verification and test datasets. MAM: March–April–May, SON: September–October–November, DJF: December–January–February, JJA: June–July–August.


In addition to the split as a function of the date, test data are taken from odd-numbered stations while even-numbered stations are used for training and verification. This scheme ensures that there is no overlap between training/verification and test data. When discussing the results in section 5, we focus on a summer season (experiment 2) and a winter season (experiment 4) only.
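The station split described above can be sketched as follows (illustrative only; the station identifiers and counts are invented):

```python
import numpy as np

# Stations receive randomly attributed numbers; even-numbered stations
# go to training/verification, odd-numbered stations are held out for
# testing, so no station appears on both sides of the split.
rng = np.random.default_rng(0)
station_ids = np.arange(100)                 # illustrative identifiers
numbers = rng.permutation(len(station_ids))  # randomly attributed numbers

train_verif_stations = station_ids[numbers % 2 == 0]
test_stations = station_ids[numbers % 2 == 1]
```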

3. ML-based models

We want to model the difference between a forecast denoted f, and the corresponding measurement at a weather station denoted o. We consider three ML models and, for each model, two configurations based on the chosen pool of predictors, and finally a combination of models for each configuration. More explicitly, we perform the following for 2-m temperature and 10-m wind independently:

  1. Test three types of ML methods: linear regression (LR), random forests (RF), and neural networks (NN), see sections 3a, 3b, and 3c, respectively.

  2. For each method, consider two model configurations depending on the pool of predictors we select from: a state-dependent configuration and a state-independent configuration, see section 2a.

  3. For each configuration, select a subset of predictors (optimized by trial-and-error testing and with the help of ML interpretability tools that assess predictor importance) and use this subset consistently with all three ML methods, see section 4.

  4. For each method and configuration, fit two models: the first one to correct the forecast, and the second for uncertainty quantification. In practical terms, “Model 1” is fitted to the raw forecast error f − o to predict the systematic error, while “Model 2” is fitted to the remaining squared error (f̂ − o)² after bias correction, with f̂ the bias-corrected forecast, to predict the residual error, see Fig. 2.

  5. For each method and configuration, take the average of the three ML predictions to make a combined prediction.

Fig. 2.

Illustrative example of the problem at hand based on synthetic data. (a) We are first interested in predicting the forecast error e = f − o as a function of a predictor x. Model 1 provides the best estimate of the error as a function of x, as represented by the solid red line. The residual error corresponds to the squared distance between the red line and the black dots. (b) As a second step, Model 2 is built to capture the residual error ξ = (f̂ − o)² as a function of x. The resulting estimated residual error is represented by the red line. In this simple example, Models 1 and 2 are both linear regression fits with two parameters each.


In our approach, we use two distinct statistical models: one for the representation of systematic errors and one for the representation of residual errors. Model 1 focuses on the raw forecast error, denoted e and defined as e = f − o. The estimated systematic error, i.e., the output of Model 1, is denoted ê. Model 2 focuses on the residual error, denoted ξ and defined as ξ = (ê − e)². The residual error can be expressed as the squared difference between the corrected forecast and the observation:
ξ = (ê − e)²
  = [ê − (f − o)]²
  = (f̂ − o)²,
where f̂ = f − ê is the bias-corrected forecast. The quadratic model for the residual error is chosen for its resemblance to the forecast variance as a measure of forecast uncertainty. Other choices could be appropriate, but this aspect is not explored further here. Using synthetic data, a simple example of how Model 1 and Model 2 work in practice is provided in Fig. 2.

The choice of a two-model approach is motivated by the fact that no assumptions about the form of the underlying forecast probability distribution are required in that case. Such assumptions are for example required when using parametric methods which target the optimization of a probabilistic score such as the continuous ranked probability score (CRPS). Moreover, with our two-model approach, ML interpretability tools can be beneficially applied to each model (Model 1 and Model 2) separately as illustrated below in section 3b. With traditional nonparametric methods relying on analog forecasts, for example, it is difficult to distinguish between sources of different error types (systematic and residual).

Besides the standard state-dependent configuration, we also test here a state-independent configuration for each model. The idea is sparked by the ML interpretability results indicating that static predictors and time indicators are “important” features, i.e., among the top-ranked predictors, in particular for 10-m wind speed predictions (see again section 3b). In addition, the foreseen advantages of using state-independent configurations in research or operational settings are multiple: building large training datasets is simple and fast; no time-critical processing is involved, as all operations can be performed offline; and the statistical model output can be checked before dissemination. Besides, when using a state-independent configuration, the estimated systematic and residual errors are location and time specific but independent of the forecast-of-the-day.

The configuration setup can be summarized as follows. The bias-corrected forecast f̂ is derived as the difference between the raw IFS forecast f and the error estimate ê, applying one of the two following configurations:
State-dependent configuration: f̂ = f − ê(d, s, t)
State-independent configuration: f̂ = f − ê(s, t)
where d are state-dependent predictors, s are static predictors, and t are time indicators. In both cases, the bias-corrected forecast is a function of the IFS forecast of the day, but the error correction is weather dependent only in the state-dependent configuration. The exact form of the function ê(·) depends on the ML method applied, as described below.

Following a global approach, a single model is trained for all stations and predictions can be made at any point on the globe. A single model is also valid for all lead times. The forecast lead time is not included as a predictor, as a result of the predictor selection process (see section 3a). In our experiments, the same type of ML method is applied for both Model 1 and Model 2. We also check whether a simple pooling of the predictions from the three models (LR, RF, and NN) improves the forecast performance. Indeed, different models that represent different types of regression functions are expected to capture different aspects of the error characteristics.

a. Linear regression

To fit a function using linear regression, the functional form of the fit coefficients that we are trying to find must be linear. For example, consider the following:
ê(x) = c₀ + c₁x + c₂x².
We have a list of values of x, the “predictor,” and a list of values of e, the “predictand,” and we want to find the values of the fit coefficients c₀, c₁, and c₂, also known as the “parameters.” The function itself is nonlinear due to the x² term, but each term is linear in the fit coefficients. Given values of x and e(x), ordinary least squares linear regression then finds the fit coefficients that minimize the sum of squared differences between the function ê(x) and the values of e provided. See, for example, Press et al. (2007) for a more detailed explanation.
Here we consider quadratic functions of our predictors. With two predictors x and y, a quadratic model takes the form:
ê(x, y) = c₀ + c₁x + c₂y + c₃x² + c₄xy + c₅y²
with six unknown parameters to fit in this simple example. The precise number of unknown parameters depends on the number of predictors: for example, 66 parameters with 10 predictors, 231 parameters with 20 predictors, and so on. In our experiments, the LR models include quadratic terms and interaction terms; in other words, the product of each predictor with every other predictor is considered as a new predictor.
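The parameter counts quoted above follow from the number of monomials of degree at most two in n predictors, C(n + 2, 2), which can be checked directly:

```python
from math import comb

# Number of fit coefficients in a full quadratic model: the constant,
# n linear terms, n squared terms, and n*(n-1)/2 pairwise interaction
# terms, which together equal C(n + 2, 2).
def n_quadratic_params(n_predictors):
    return comb(n_predictors + 2, 2)
```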

b. Random forest

Random forest (RF) is a nonparametric technique that consists of building a collection of trees. Each tree is a decision tree that partitions a multidimensional dataset into successively smaller subdomains. Each partition of the data splits it into two groups based on a threshold applied to one of the predictors. Predictors and thresholds are chosen so that the response variable is as homogeneous as possible within each resulting group, i.e., as different as possible between groups. Each new group is itself split into two, and so on, until some stopping criterion is reached. In a prediction situation, the current values of the predictors trace a path through the tree to a final leaf. The forecast takes the mean value of the response variable in that final group (leaf), that is, the mean value of the training data within the region defined by the leaf.
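A single tree split, as described above, can be sketched with a variance-based impurity criterion (a common choice for regression trees; this exhaustive search is purely illustrative, not the scikit-learn implementation):

```python
import numpy as np

# Choose the predictor and threshold that minimise the total variance of
# the response within the two resulting groups (weighted by group size).
def best_split(X, y):
    """Return (predictor index, threshold) of the best binary split."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        # Candidate thresholds: all but the largest unique value, so
        # neither side of the split is ever empty.
        for thr in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            cost = left.var() * len(left) + right.var() * len(right)
            if cost < best[2]:
                best = (j, thr, cost)
    return best[0], best[1]
```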

Our implementation makes use of the scikit-learn Python library (Pedregosa et al. 2011). In the interest of computational time, training is performed on a subsample of the dataset by randomly selecting 1% of the total available training data. The models for the prediction of systematic and residual errors are trained on two different randomly selected subsamples. The main hyperparameters associated with the RF models are the number of trees and the maximum depth of each tree. The hyperparameters selected for the different experiments presented in this study are shown in Table 3.

Table 3

Hyperparameter settings for the different random forest models: sd is state dependent, and si is state independent.


c. Neural network

Here we use a multilayer perceptron (MLP), also known as a fully connected neural network, as our neural network design. We choose 4 hidden layers, each with 32 neurons, resulting in approximately 4000 trainable parameters (the precise number depends on the number of predictors used). For the hidden layers, we use the Swish activation function; for the output layer, we use no activation function. We build and train these models using TensorFlow/Keras. Models are fitted using the Adam optimizer with a learning rate of 10−3. We train to minimize the mean-squared error for 20 epochs (passes through the training set) with a batch size of 128. Early stopping, after the validation loss has failed to decrease for 6 epochs, and learning rate reduction (again based on the validation loss) are also employed, but the results were not found to be sensitive to these choices. We also explored increasing the number of trainable parameters, through more hidden neurons and hidden layers, but these increases did not yield a noticeable reduction in losses on the test dataset.
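The quoted count of approximately 4000 trainable parameters can be verified by counting weights and biases in a dense network of this shape (assuming fully connected layers with biases, which is the Keras default):

```python
# Parameter count of an MLP with n_hidden_layers hidden layers of a
# given width and a single output neuron: each dense layer contributes
# (inputs + 1) * outputs parameters (weights plus biases).
def mlp_param_count(n_inputs, n_hidden_layers=4, width=32, n_outputs=1):
    count = (n_inputs + 1) * width                        # input -> first hidden
    count += (n_hidden_layers - 1) * (width + 1) * width  # hidden -> hidden
    count += (width + 1) * n_outputs                      # hidden -> output
    return count
```

With 20 predictors, for instance, this gives 3873 parameters, consistent with the “approximately 4000” stated above.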

4. Predictor selection and ranking

a. Backward stepwise elimination

To select predictors we use backward stepwise elimination and linear regression of quadratic polynomials. The algorithm proceeds as follows:

  1. Each predictor is removed from the full list of n predictors one at a time and the regression is performed. We then have n regression models each fitting n − 1 predictors.

  2. Each of these regression models is then used to predict the training data. The predictor whose removal leads to the smallest increase in RMSE between the model and the training data is discarded. The list of predictors is then one shorter than when we started.

  3. The entire procedure is then repeated to find and remove the next least “important” predictor. This is repeated until there is one predictor left.

Applying the resulting models to predict the validation data indicates that backward elimination is sufficient; see Fig. 3. Note that if two predictors both contribute the same information, removing either one will not increase the RMSE, and one of the two will be randomly selected as unimportant. The algorithm only considers it “important” to include one of these two predictors in the LR model. Some of our predictors are highly correlated: for example, the skin temperature and the temperature on the lowest model level, at the station locations, have a correlation coefficient of 0.97 over the training dataset.
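The elimination loop can be sketched as follows (ordinary least squares via `np.linalg.lstsq` stands in for the quadratic regression used here; the function names are ours):

```python
import numpy as np

def fit_rmse(X, y):
    """RMSE of a least-squares linear fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sqrt(np.mean((A @ coef - y) ** 2))

def backward_elimination(X, y):
    """Return predictor indices ordered from least to most important."""
    remaining = list(range(X.shape[1]))
    order = []
    while len(remaining) > 1:
        # Drop the predictor whose removal increases RMSE the least.
        rmses = [fit_rmse(X[:, [k for k in remaining if k != j]], y)
                 for j in remaining]
        j = remaining[int(np.argmin(rmses))]
        remaining.remove(j)
        order.append(j)
    order.append(remaining[0])
    return order
```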

Fig. 3.

RMSE of (left) 2-m temperature and (right) 10-m wind speed forecasts (against SYNOP observations) as a function of the number of predictors used. Results are based on linear regression of quadratic polynomials for the summer test period only. For each plot, the top point is the RMSE of the default forecast, and the following ones are obtained after bias correction with a linear regression model while incrementally increasing the number of predictors. The order of the predictors is obtained using backward stepwise regression (see text).


Results in Fig. 3 serve as a basis for predictor selection for the models with no restrictions on the choice of predictors (contrary to the state-independent configurations). As a complementary tool, RF impurity importance, discussed below, is also explored. Eventually, the final set of predictors is selected by scrutinizing ML interpretability plots with a critical (human) eye. For example, the wind speed components at the lowest model level are ranked poorly for 2-m temperature predictions in Fig. 3, but we include them in the list because they are considered important by the RF models. Also, the land–sea mask is added to our list as deemed important from a practical point of view. The list of selected predictors for each configuration (state-dependent and state-independent) and each variable (2-m temperature and 10-m wind speed) is detailed in Table 1.

b. Random forest impurity importance

The interpretability of RF models is facilitated by so-called feature importance results. Indeed, the RF algorithm identifies the most valuable predictors in the process of building decision trees. Predictor importance is measured by the mean decrease in impurity, where impurity is measured by a metric proportional to the area under the relative operating characteristic curve: the Gini coefficient [see a definition in section 1.10.7.1 of the scikit-learn user guide (https://scikit-learn.org)]. The mean decrease in impurity corresponds to the total decrease in node impurity averaged over all trees (Louppe et al. 2013). It is worth noting that this measure favors predictors with high cardinality, i.e., predictors with many unique values. In our case, many of our predictors have far fewer unique values than others (e.g., elevation is constant at each station). For this reason, backward stepwise regression is our tool of choice for predictor selection.
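With scikit-learn, mean-decrease-in-impurity importances are exposed through the fitted model’s `feature_importances_` attribute, as in this synthetic sketch (the data and settings are illustrative; in the paper the response is the forecast error):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the response depends strongly on the first predictor,
# weakly on the second, and not at all on the third.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=500)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = rf.feature_importances_  # normalized to sum to 1
```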

Nevertheless, predictor impurity importance is key for the interpretability of RF models. The ranking of the predictors for the models of 2-m temperature and 10-m wind speed errors is shown in Figs. 4 and 5, respectively. Predictor importance for models of systematic errors (Model 1) and of residual errors (Model 2) are provided separately as two different RF models are trained consecutively. Our findings are the following:

Fig. 4.

Predictor importance for the prediction of 2-m temperature (left) systematic errors and (right) residual errors. Importance is estimated with RF mean decrease in impurity. The error bars indicate the intertrees variability.


Fig. 5.

Predictor importance as in Fig. 4, but for 10-m wind speed.


  • For 2-m temperature, boundary layer height forecasts, temperature forecasts at various levels in the atmosphere and the ground, and wind speed forecasts play together a key role in estimating 2-m temperature systematic errors.

  • For 10-m wind speed, besides predictors related to the wind itself, static predictors and time indicators appear particularly important. Verification results presented in section 5 confirm that state-independent configurations offer competitive solutions for 10-m wind speed predictions.

  • Predictor importance for residual errors is dominated by one or two predictors, namely, the forecast itself and, for 2-m temperature, the boundary layer height. These predictors are also important for systematic error prediction. This result suggests that, in a two-model approach, systematic error prediction could serve as a predictor for the prediction of residual errors (not tested here). It also points to residual errors being dominated by local state-dependent conditions.

These results in terms of predictor importance are consistent with previous studies at the regional scale. For example, the importance of temperature forecasts and local characteristics at station locations for the postprocessing of 2-m temperature forecast was already pointed out in Taillardat et al. (2016) and Rasp and Lerch (2018).

5. Verification results

a. Forecast bias

We first focus on the bias to assess the ability of the ML models to correct for systematic errors. Forecast bias (or mean error) is computed as the mean difference between forecasts and observations. We look at two types of score aggregation: one spatial aggregation leading to scoring as a function of the forecast lead time, and one temporal aggregation at each station location. Results of the former are presented in Fig. 6 while results of the latter are presented in Fig. 7 and Fig. 8 for 2-m temperature and 10-m wind speed, respectively.
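The two aggregations can be sketched on a synthetic error array (the dimensions and values are invented; in practice the errors are the forecast-minus-observation pairs described in section 2):

```python
import numpy as np

# errors holds f - o for each (start date, lead time, station).
rng = np.random.default_rng(0)
errors = rng.normal(loc=0.5, scale=1.0, size=(90, 17, 40))

# Spatial aggregation: one bias value per lead time (mean over dates
# and stations), as plotted in Fig. 6.
bias_by_leadtime = errors.mean(axis=(0, 2))

# Temporal aggregation: one bias value per station (mean over dates
# and lead times), as mapped in Figs. 7 and 8.
bias_by_station = errors.mean(axis=(0, 1))
```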

Fig. 6.

Forecast bias as a function of the forecast lead time. Forecast of (left) 2-m temperature and (right) 10-m wind speed, in (top) winter and (bottom) summer. Zero bias is indicated by a black horizontal line.

Citation: Monthly Weather Review 151, 4; 10.1175/MWR-D-22-0107.1

Fig. 7.

Forecast bias of 2-m temperature forecasts (left) before and (right) after correction using the combined model in (top) winter and (bottom) summer. Results are aggregated over all lead times.

Fig. 8.

Forecast bias as in Fig. 7, but for 10-m wind speed.

In Fig. 6, we compare forecast performance when applying the standard lapse rate correction as described in section 5a (default), linear regression (LR), random forest (RF), neural network (NN), and a combination of the three different ML models (Combined). At this stage we only show results for the state-dependent configurations (i.e., with no restriction on the choice of predictors). In Figs. 7 and 8, the maps focus on the results of the combined models only, aggregating results over all lead times, including step 0. For each plot, we distinguish winter and summer results.

The strong diurnal cycle of the bias almost disappears after postprocessing, as shown in Fig. 6. The original daily cycle in these plots reflects the daily cycle of the forecast error over Europe, where most of the stations are located. All three ML methods perform equally well on average. The bias of the 2-m temperature forecast appears to increase slightly with lead time in summer. If models trained at short lead times are to be applied for bias correction at longer lead times, the forecast step could be included as a predictor to capture this potential bias drift with forecast horizon. Interestingly, the bias of the 10-m wind speed forecast exhibits the same pattern before and after postprocessing, but with a significantly lower amplitude.

At the station level, when looking at each ML model separately, bias correction can perform differently for different models (not shown). The combined model approach benefits from this diversity. Overall, large reductions in forecast bias are visible in various regions of the world for both seasons and weather variables. For 2-m temperature in Fig. 7, we note a reduction of the large positive biases dominating the Northern Hemisphere and the negative biases along the tropics. For 10-m wind speed in Fig. 8, we see a clear reduction of the bias in eastern Europe, over the Indian subcontinent, and the South American continent.

The distribution of stations around the world is uneven, as illustrated in Figs. 7 and 8. Data preprocessing, in the form of upscaling, could help to homogenize the data before training for systematic errors. In our experiments, the ML models are biased toward Europe because of the higher number of station measurements available through the GTS in this region. Regions with low data density could benefit from training over a larger area in order to increase the training sample. Conversely, however, training on European stations only improves postprocessed forecasts over Europe but degrades performance on a global scale (not shown).

b. Forecast accuracy

Forecast accuracy is mainly assessed with the root-mean-square error (RMSE). In addition to the LR, RF, NN, and combined predictions for the models with no restriction on the choice of predictors, we also show results for the combined predictions of the simpler state-independent configurations. For the state-independent case, only the combined predictions are shown, as the combination improves on any of the state-independent models taken separately (not shown).

Global RMSE averages as a function of the forecast lead time are shown in Fig. 9. RMSE is reduced by around 10%–15% at all lead times by all ML models with no restriction on the choice of predictors. The differences between the ML models are much smaller than their difference from the default uncorrected forecast. For 2-m temperature, the NN model performs slightly better than the others, while for 10-m wind speed the RF predictions are slightly better. In all cases, the linear combination of models either slightly outperforms any single model or comes extremely close. Changing the size and dates of the training data, the quality control of the training data, and the predictor selection proved more important than the choice of the underlying ML method.
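The benefit of a linear combination of member models can be illustrated with a minimal sketch. The actual combination weights are not specified in the text, so we assume the simplest case, an unweighted mean; the synthetic predictions below are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical bias predictions from three models (e.g., LR, RF, NN),
# each equal to the target plus independent noise of the same amplitude.
truth = rng.standard_normal(1000)
preds = [truth + 0.4 * rng.standard_normal(1000) for _ in range(3)]

def rmse(p):
    return float(np.sqrt(np.mean((p - truth) ** 2)))

# Unweighted mean of the member predictions: when member errors are
# largely independent, the combination beats each member on average.
combined = np.mean(preds, axis=0)
member_rmse = [rmse(p) for p in preds]
combined_rmse = rmse(combined)
```

With independent member errors of equal variance, averaging three members reduces the error standard deviation by a factor of about the square root of three, which is why the combination benefits from model diversity.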

Fig. 9.

Forecast RMSE as a function of the forecast lead time. Forecast of (left) 2-m temperature and (right) 10-m wind speed, in (top) winter and (bottom) summer.

Building models using only static predictors and time indicators emerges as an appealing approach to postprocessing: a substantial share of the expected RMSE improvement is achieved with these simple and cost-effective model configurations. In general terms, there is a trade-off between complexity and applicability in an operational context on the one hand and postprocessed forecast performance on the other. For example, the combination of models leads to better results than any individual model alone, but at the cost of multiplying the models to be trained and maintained, and with a potential risk of inconsistencies in the resulting forecast. Similarly, combining models of different types (as illustrated above) leads to better results than combining variants of the same model, such as RF with different hyperparameters (not shown).

Changes in RMSE are larger where the initial RMSE is larger. There are large geographic variations in the RMSE of the default forecast for both 2-m temperature and 10-m wind speed (see Fig. B1). For example, the Alps are associated with an RMSE of around 4°C, while in northern France the RMSE is 1°–1.5°C. The broad pattern of change is the same for all ML models, with the larger reductions in RMSE occurring in regions of high forecast RMSE. For both variables, RMSE is reduced in most parts of the world, with some exceptions where none of the statistical models performs well, for example East Asia for 2-m temperature in winter or central Europe for 10-m wind speed, also in winter (see Fig. 10). We believe an increase in the size of the training dataset could help in such situations.

Fig. 10.

Change in performance in terms of RMSE after bias correction. Blue colors indicate an improvement achieved with postprocessing. Forecasts of (left) 2-m temperature and (right) 10-m wind speed, in (top) winter and (bottom) summer. Results are aggregated over all lead times.

Finally, we also assess the strength of the linear relationship between forecasts and observations. At each station, the correlation coefficient between forecasts and observations is computed only when observation measurements are available over the whole verification period (to avoid computing correlation coefficients on a small number of forecast/observation pairs). Because the correlation coefficients are estimated at each station separately, a better score after postprocessing indicates that the ML model captures conditional biases at the station level. By contrast, a constant bias correction at each station would have no impact on this performance metric. In Fig. 11, coefficients are aggregated at each lead time. These results are consistent with the RMSE results shown in Fig. 9 and suggest that situation-dependent bias correction with ML techniques improves the ability of the forecast to capture day-to-day weather variations.
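The per-station correlation computation, restricted to stations with complete records, can be sketched as follows on synthetic data (station records, noise levels, and names are ours).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic records: 20 stations, each reporting every day of a
# 120-day verification period.
n_days, n_stations = 120, 20
frames = []
for s in range(n_stations):
    obs = rng.normal(10.0, 4.0, size=n_days)       # day-to-day variations
    fc = obs + rng.normal(0.0, 1.5, size=n_days)   # forecast with noise
    frames.append(pd.DataFrame({"station": s, "obs": obs, "fc": fc}))
df = pd.concat(frames, ignore_index=True)

# Keep only stations reporting over the whole verification period,
# then compute the forecast/observation correlation at each station.
counts = df.groupby("station")["obs"].count()
complete = counts[counts == n_days].index
corr = (
    df[df["station"].isin(complete)]
    .groupby("station")
    .apply(lambda g: g["fc"].corr(g["obs"]))
)
```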

Fig. 11.

As in Fig. 9, but for the aggregated correlation coefficients.

c. Forecast uncertainty

Models for systematic and residual errors are developed sequentially (see section 3). A second model (Model 2) focuses on the residual forecast error after bias correction. The resulting prediction, called the forecast uncertainty, aims to reflect the level of confidence one can have in a forecast. On average, large (small) forecast uncertainty should be associated with large (small) forecast error. Statistical consistency between predicted forecast uncertainty and actual forecast error is called reliability and is checked with the help of reliability plots. Readers not familiar with these concepts can refer to section 2.2 of Leutbecher and Palmer (2008).

In Fig. 12, perfect reliability is indicated by a diagonal line. Results for all three types of ML approaches show good performance of the uncertainty models: overall, the dots lie close to the diagonal. Errors larger than expected can occur when the predicted forecast uncertainty is close to zero (in particular for 2-m temperature in winter with NN and LR). We also see a general tendency of the NN models to underpredict forecast errors. In most cases, the combined model provides a more reliable forecast.
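The binning behind such a reliability plot can be illustrated with a synthetic, perfectly reliable uncertainty model, in which each bias-corrected error is drawn with exactly the predicted standard deviation; the binning scheme below is a generic choice, not necessarily the one used for Fig. 12.

```python
import numpy as np

rng = np.random.default_rng(4)

# A perfectly reliable uncertainty model: each error is drawn from a
# normal distribution with the predicted standard deviation.
sigma = rng.uniform(0.5, 3.0, size=20000)          # predicted uncertainty
error = sigma * rng.standard_normal(20000)         # actual forecast error

# Reliability diagram: bin by predicted uncertainty and compare the mean
# predicted uncertainty with the RMSE of the errors inside each bin.
edges = np.quantile(sigma, np.linspace(0.0, 1.0, 11))
idx = np.clip(np.digitize(sigma, edges) - 1, 0, 9)
mean_sigma = np.array([sigma[idx == b].mean() for b in range(10)])
bin_rmse = np.array(
    [np.sqrt(np.mean(error[idx == b] ** 2)) for b in range(10)]
)
# For a reliable model, the (mean_sigma, bin_rmse) points lie on the
# diagonal, as in the reliability plots of Fig. 12.
```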

Fig. 12.

Reliability plots showing the uncertainty/error relationship for (a),(b) 2-m temperature and (c),(d) 10-m wind speed postprocessed forecasts, verifying over summer in (a) and (c) and winter in (b) and (d). The forecast uncertainty (x axis) corresponds to the square root of the residual error prediction, and the actual forecast error (y axis) corresponds to the RMSE of the bias-corrected forecast. Perfect reliability is indicated with a dashed diagonal line. For each plot, a histogram shows the number of cases in each forecast uncertainty category.

The predicted forecast uncertainty could serve as a basis for delivering probabilistic forecasts: the first and second moments of an underlying forecast probability distribution can be derived from the systematic and residual error models, respectively. The uncertainty is valid at the point scale and so encompasses potential representativeness errors that cannot be captured by ensemble forecasting techniques. As future work, the benefit of ML-based uncertainty models could be demonstrated using simple statistical models of representativeness errors as a benchmark (as proposed in Ben Bouallègue et al. 2020).

6. Conclusions

In this study, we performed statistical postprocessing of ECMWF’s near-surface temperature and wind forecasts using three types of ML methods: linear regression, random forests, and neural networks. After a rigorous selection of predictors, ML models are trained to predict the situation-dependent bias and uncertainty of the high-resolution IFS global forecasts. Two distinct statistical models are used to infer systematic errors on the one hand and residual (random) errors on the other. This two-model approach is applied to all three ML methods, and feature importance analysis is performed for two of the ML models with comparable results. The source of random errors can therefore be explored independently of the source of systematic errors, with first results indicating a close connection between the two types of error sources. The ML-based statistical models make it possible to deliver postprocessed forecasts not only at the locations of observation measurements but also at any other point on the globe. The promising results obtained for deterministic forecasts at short lead times encourage further research involving the ECMWF ensemble forecasts as well as longer lead times. In addition, the discussed ML approaches should be easily transferable to other weather variables such as precipitation.

Weather-dependent bias correction with ML techniques notably improves the forecast, with a reduction of between 10% and 15% in RMSE for all lead times and variables considered here. In essence, our study shows that the accuracy of the postprocessed forecasts does not depend so much on the choice of the ML method as on the selection of predictors, the size of the training and test datasets, and the quality control applied to the data. In this context, we have identified ML-based solutions for forecast postprocessing with different levels of complexity in terms of practical implementation. State-independent postprocessing configurations, which rely only on predictors available before the start of the forecast, are simple to implement and easy to maintain. The reduction in forecast error can be further improved with the help of more complex configurations involving state-dependent predictors and/or combinations of ML models. Finally, the good performance of the forecast uncertainty models opens new horizons for the generation of calibrated probabilistic weather forecasts based on statistical models.

Acknowledgments.

For this work, Fenwick Cooper was partially funded by the International Foundation Big Data and Artificial Intelligence for Human Development (iFAB; www.ifabfoundation.org). Matthew Chantry gratefully acknowledges funding from the MAELSTROM EuroHPC-JU project (JU) under Grant 955513. The authors are also grateful to two anonymous reviewers for their valuable comments.

Data availability statement.

Forecasts used in this study are publicly available. For more information, please visit the website https://www.ecmwf.int/en/forecasts/accessing-forecasts. SYNOP observations used in this study cannot be shared with third parties.

APPENDIX A

Observation Quality Control

Observation quality control is first based on observation metadata. To start with, the elevation of each station is not necessarily fixed over the 2-yr period. Sometimes estimates of a station's location change, perhaps because of rounding errors in the reported latitude and longitude; this changes the station’s elevation if it is automatically read from a map. Sometimes a station’s altitude is measured differently, and sometimes the station actually moves. If a station moves, its elevation does not necessarily change, but the model elevation might. A total of 1882 of 13 573 stations (14%) exhibit elevation changes.

Sometimes, for some of the measurements at a particular station, the elevation is not recorded. In this case, we set the station elevation for the purposes of modeling to an elevation that is recorded for that station, before applying the criteria below.

We adopt the following WMO control criteria (WMO 2019). Independently for wind and temperature observations, measurements are rejected if:

  • The surface pressure is higher than 700 hPa (low elevation), and the measured versus lapse rate corrected forecast 2-m temperature difference is more than 15°C.

  • The surface pressure is lower than 700 hPa (high elevation), and the measured versus lapse rate corrected forecast 2-m temperature difference is more than 10°C.

  • Over the ocean, the mean difference between the measured 2-m temperature and the lapse rate corrected forecast temperature is greater than 4°C.

  • Over the ocean, the standard deviation of the difference between the measured and forecast 2-m temperature, computed over the training and test datasets, is greater than 6°C.

  • Over the ocean, the mean difference between the measured 10-m wind and the forecast wind is greater than 5 m s−1.

  • The surface pressure is higher than 775 hPa (low elevation), and the measured versus forecast 10-m wind difference is more than 35 m s−1.

  • The surface pressure is between 775 and 600 hPa (middle elevation), and the measured versus forecast 10-m wind difference is more than 40 m s−1.

  • The surface pressure is lower than 600 hPa (high elevation), and the measured versus forecast 10-m wind difference is more than 45 m s−1.
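The pressure-dependent rejection test for 2-m temperature (the first two criteria above) can be sketched as a small helper; the function name and signature are ours, for illustration only.

```python
# Sketch of the WMO-style consistency check for 2-m temperature:
# the tolerance depends on surface pressure (a proxy for elevation).
def reject_t2m(surface_pressure_hpa, measured_t2m, corrected_forecast_t2m):
    """True if the measurement fails the pressure-dependent check."""
    diff = abs(measured_t2m - corrected_forecast_t2m)
    if surface_pressure_hpa > 700.0:   # low elevation: 15°C tolerance
        return diff > 15.0
    return diff > 10.0                 # high elevation: 10°C tolerance

# Example: a 16°C mismatch at a low-elevation station is rejected.
low_elev_rejected = reject_t2m(1000.0, 20.0, 4.0)
```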

Contrary to the WMO quality control criteria, 10-m wind measurements were not rejected if there were fewer than 10 measurements in a particular month, or if the RMS forecast error in that month was greater than 15 m s−1. In addition, measurements are rejected for the following reasons:

  • All stations (126) that moved more than 10 km.

  • All forecasts where the observation latitude or longitude changed at all during the validity time of the forecast.

  • All stations with elevations recorded above 10 000 m, or stations where the elevation is never recorded.

  • Four predictors are derived from heat and radiation fields: surface solar radiation, surface thermal radiation, sensible heat, and latent heat fluxes. These are stored as cumulative quantities. They are converted to instantaneous fluxes using first-order finite differences. Where there is a gap between 2-m temperature or 10-m wind measurements of 6 h or more, the finite difference approximation is not sufficiently accurate and the forecast at this station location is rejected.

  • All stations recording a 2-m temperature above 56.7°C (207 stations) or below −78°C (73 stations). (Only 2-m temperature measurements rejected.)

  • All stations (1684 stations) recording a 10-m wind speed above 50 m s−1 (180 km h−1). (Only 10-m wind measurements rejected.)
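The conversion of cumulative heat and radiation fields to instantaneous fluxes by first-order finite differences can be sketched as follows; the helper name is ours, and a real implementation would also handle the gaps of 6 h or more mentioned above.

```python
import numpy as np

# Cumulative fields (e.g., surface solar radiation in J m^-2) are
# accumulated from the forecast start; instantaneous fluxes (W m^-2)
# are recovered with a first-order finite difference over the step.
def cumulative_to_instantaneous(cum, step_seconds):
    inst = np.diff(cum) / step_seconds
    # Repeat the first flux so the output matches the input length.
    return np.concatenate([inst[:1], inst])

# Example: a constant 500 W m^-2 flux accumulated hourly over 6 h.
cum = 500.0 * np.arange(7) * 3600.0
inst = cumulative_to_instantaneous(cum, 3600.0)
```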

APPENDIX B

RMSE of the Raw Forecasts

Figure B1 shows the RMSE of the IFS 2-m temperature and 10-m wind speed forecasts for both a winter and a summer period.

Fig. B1.

RMSE of IFS (left) 2-m temperature and (right) 10-m wind speed forecasts, in (top) winter and (bottom) summer. Results are aggregated over all lead times.

REFERENCES

  • Ben Bouallègue, Z., 2017: Statistical postprocessing of ensemble global radiation forecasts with penalized quantile regression. Meteor. Z., 26, 253–264, https://doi.org/10.1127/metz/2016/0748.

  • Ben Bouallègue, Z., T. Haiden, N. J. Weber, T. M. Hamill, and D. S. Richardson, 2020: Accounting for representativeness in the verification of ensemble precipitation forecasts. Mon. Wea. Rev., 148, 2049–2062, https://doi.org/10.1175/MWR-D-19-0323.1.

  • Düben, P., and Coauthors, 2021: Machine learning at ECMWF: A roadmap for the next 10 years. ECMWF Tech. Memo. 878, 20 pp., https://www.ecmwf.int/node/19877.

  • ECMWF, 2020: IFS documentation CY47R1—Part III: Dynamics and numerical procedures. ECMWF IFS Doc. 3, 31 pp., https://www.ecmwf.int/node/19747.

  • Haiden, T., M. Janousek, F. Vitart, Z. Ben Bouallègue, L. Ferranti, and F. Prates, 2021: Evaluation of ECMWF forecasts, including the 2021 upgrade. ECMWF Tech. Memo. 884, 56 pp., https://www.ecmwf.int/node/20142.

  • Hamill, T. M., 2021: Comparing and combining deterministic surface temperature postprocessing methods over the United States. Mon. Wea. Rev., 149, 3289–3298, https://doi.org/10.1175/MWR-D-21-0027.1.

  • Hemri, S., M. Scheuerer, F. Pappenberger, K. Bogner, and T. Haiden, 2014: Trends in natural calibration of raw ensemble weather forecasts. Geophys. Res. Lett., 41, 9197–9205, https://doi.org/10.1002/2014GL062472.

  • Leutbecher, M., and T. Palmer, 2008: Ensemble forecasting. J. Comput. Phys., 227, 3515–3539, https://doi.org/10.1016/j.jcp.2007.02.014.

  • Louppe, G., L. Wehenkel, A. Sutera, and P. Geurts, 2013: Understanding variable importances in forests of randomized trees. Proc. 26th Int. Conf. on Neural Information Processing Systems (NIPS’13), Vol. 1, Lake Tahoe, NV, Association for Computing Machinery, 431–439.

  • McGovern, A., R. Lagerquist, D. John Gagne, G. E. Jergensen, K. L. Elmore, C. R. Homeyer, and T. Smith, 2019: Making the black box more transparent: Understanding the physical implications of machine learning. Bull. Amer. Meteor. Soc., 100, 2175–2199, https://doi.org/10.1175/BAMS-D-18-0195.1.

  • Pedregosa, F., and Coauthors, 2011: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.

  • Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, 2007: Numerical Recipes in C: The Art of Scientific Computing. 3rd ed. Cambridge University Press, 1256 pp.

  • Rasp, S., and S. Lerch, 2018: Neural networks for post-processing ensemble weather forecasts. Mon. Wea. Rev., 146, 3885–3900, https://doi.org/10.1175/MWR-D-18-0187.1.

  • Sandu, I., and Coauthors, 2020: Addressing near-surface forecast biases: Outcomes of the ECMWF project ‘Understanding uncertainties in surface atmosphere exchange’ (USURF). ECMWF Tech. Memo. 875, 43 pp., https://www.ecmwf.int/node/19849.

  • Taillardat, M., O. Mestre, M. Zamo, and P. Naveau, 2016: Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Mon. Wea. Rev., 144, 2375–2393, https://doi.org/10.1175/MWR-D-15-0260.1.

  • Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world. Bull. Amer. Meteor. Soc., 102, E681–E699, https://doi.org/10.1175/BAMS-D-19-0308.1.

  • WMO, 2019: Manual on the global data-processing and forecasting system. Annex IV to the WMO Technical Regulations, appendix 2.1.2, WMO, 132 pp.
