1. Introduction
The term adaptive blending refers to the search for an optimal lead-time- and position-dependent weighting between two or more forecasts that cover different forecast ranges or are based on different forecast models, for example, different configurations of numerical weather prediction (NWP). Blending is one aspect of so-called seamless prediction, a term originally introduced by Palmer et al. (2008) to describe the combination of weather prediction and climate modeling as a unified topic. Over time, the definition of seamless prediction was extended to also include interactions with biogeophysical components (Hazeleger et al. 2012) and, more generally, to describe the interactions between weather, climate, and the Earth system (Brunet et al. 2010). A recent publication by Ruti et al. (2020) defines seamless prediction as a whole value cycle comprising four parts: the generation of information, the dissemination to users, the perception and decision-making, and the outcomes and values.
The ongoing project Seamless Integrated Forecasting System (SINFONY) of Deutscher Wetterdienst (DWD) belongs to the topic of seamless prediction, since it focuses on the seamless prediction of precipitation within the short-term range up to +12 h ahead. Here, the term seamless refers to the combination of forecasts from observation-based precipitation nowcasting techniques with those from NWP. The goals of SINFONY in achieving this combination are twofold and address two parts of the definition of seamless prediction mentioned above. First, the generation of information is addressed by individually improving both forecasting methods, nowcasting and NWP, such that the gap between them in terms of a verification metric is narrowed. Based on these improvements, the development and implementation of tailor-made combination methods will lead to a single user-oriented forecast that condenses the information of both individual forecasts. Moreover, the interaction with users is addressed by using the feedback of hydrological services and forecasters to further improve the products.
Nowcasting and NWP forecasts can provide valuable guidance for users on different lead-time scales (Heizenreder et al. 2015; Hess 2020). Many common precipitation nowcasting methods rely on the Lagrangian persistence approach, whereby the latest field of observed reflectivities or estimated rain rates is extrapolated in space and time by a previously determined motion vector field (Germann and Zawadzki 2002). Due to this purely advective approach, the dynamic uncertainty induced by growth and decay of precipitation patterns is not considered. Thus, the quality of such forecasts is high as long as the Lagrangian persistence assumption is valid (Zawadzki et al. 1994). The predictability of specific weather events depends on their spatial extent (Venugopal et al. 1999) and ranges from minutes for small-scale phenomena (e.g., single thunderstorms) up to hours at length scales of several hundred kilometers (Foresti and Seed 2014).
The physical evolution of precipitation fields is, on the other hand, explicitly simulated by NWP models. However, one source of forecast errors of the latter can be found in initial and boundary conditions as well as in inexact solutions of approximated physical equations due to finite resolutions in time and space. Nicolis et al. (2009) showed that subgrid parameterizations of cloud microphysics are especially important for precipitation forecasting. However, shortcomings in such a parameterization lead to deficiencies in simulated rainfall intensities (Stephan et al. 2008).
Despite these error sources, NWP forecasts are able to outperform the forecast quality of precipitation nowcasting techniques 2–3 h after initialization, as will be shown in section 4. Therefore, the seamless combination aims to create a single, consistent forecast in which the best skill of each system is retained and the information of both is condensed, regardless of location and lead time (Brunet et al. 2015).
Vannitsem et al. (2021) present, among others, an overview of methods to combine forecasts of nowcasting and NWP and further point out that this combination may take place in physical or probability spaces. The weighting mentioned above can be based on a long-term comparative verification of both initial forecast systems. This is done in physical space by Golding (1998) in Nimrod, one of the first combination schemes, and in a probability space by Kober et al. (2012). In the Integrated Nowcasting through Comprehensive Analysis (INCA) system, Haiden et al. (2011) utilized a simple linear weighting function in which the weight for the NWP forecast increases from 0 at the beginning to 1 at a lead time of +4 h. The Short-Term Ensemble Prediction System (STEPS; Seed 2003; Seed et al. 2013) in its implementation by Bowler et al. (2006) quantifies in real time not only tendencies in a sequence of the latest observations, but also the skill of an NWP forecast, in order to adjust the weights for combining the nowcast extrapolation and the NWP forecast depending on lead time and spatial length scale. A forecast ensemble is then generated by replacing nonpredictable scales with spatially correlated random noise. Moreover, the emergence of nowcasting ensemble techniques allows the use of the ensemble spread as an objective combination metric. Based on this, for example, Nerini et al. (2019) implemented an ensemble Kalman filter for the iterative combination of NWP forecasts and precipitation nowcasting extrapolations. Johnson and Wang (2012) as well as Bouttier and Marchal (2020) carried out combinations of multimodel ensembles.
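The linear weighting described for INCA can be sketched as follows; the function names and the `ramp_h` parameter are illustrative assumptions, not part of the INCA implementation:

```python
def nwp_weight(lead_time_h, ramp_h=4.0):
    """Linear weight for the NWP forecast: 0 at initialization, 1 from ramp_h hours on."""
    return max(0.0, min(1.0, lead_time_h / ramp_h))


def blend(nowcast, nwp, lead_time_h):
    """Lead-time-dependent convex combination of a nowcast and an NWP value."""
    w = nwp_weight(lead_time_h)
    return (1.0 - w) * nowcast + w * nwp
```

With `ramp_h=4.0`, the nowcast dominates early lead times and the NWP forecast takes over completely at +4 h.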
With a focus on nowcasting approaches based on machine learning (ML) techniques, many studies use model information as an additional predictor. In Han et al. (2017), radar observations combined with data of the Variational Doppler Radar Analysis System (VDRAS) are utilized to train a support vector machine (SVM) to answer the question of whether there will be reflectivity > 35 dBZ within a box in the next 30 min, based on the information in the adjacent boxes. Ukkonen et al. (2017) utilize an artificial neural network (ANN) with lightning and reanalysis data as input to evaluate thunderstorm predictors for Finland. An overview of machine-learning approaches with a focus on nowcasting is given by, for example, Prudden et al. (2020) and Cuomo and Chandrasekar (2021).
Besides accuracy, calibration, and spatial consistency, temporal consistency is also desired for operational forecasts. Here and in the following, the notion of temporal consistency is not to be understood as the time-dependent correlation structure of a single forecast. Rather, it describes the variability between a number of model runs for a fixed valid time, which is often referred to as jumpiness. Ideally, early forecasts have a large uncertainty that decreases with time, so that the forecasts converge toward the observations and become more and more confident. In practice, however, updated forecasts for one specific time and location often exhibit spurious jumps due to forecast errors. This is a problem for meteorologists, who want to rely on the most recent numerical forecast and may need to revise their opinion accordingly, especially in the case of weather warnings. It appears highly unreasonable if a warning is issued, canceled soon thereafter, and maybe even reissued again; see, for example, Griffiths et al. (2019).
In the present paper, two forecast systems (nowcasting and NWP) are combined; each has its own characteristics in temporal consistency, which affect the consistency of the combined product. The transition from nowcasting to NWP with increasing forecast lead time is likely to result in additional inconsistencies, since the systematic errors of the two systems differ. Moreover, the method of combination itself may introduce additional inconsistencies, for example, if individual architectures or configurations of the neural networks are used for each forecast step. Therefore, we consider it important to control the temporal consistency of the combined product. Ideally, spurious jumps are reduced by the combination; at the very least, the method of combination should not introduce additional inconsistencies.
Several metrics have been introduced to assess temporal forecast inconsistency of a sequence of forecasts. Zsoter et al. (2009) construct a spatial inconsistency index that consists of the differences of two forecast fields normalized by their variability. They then define a flip-flop as an oscillation of that index of two consecutive forecasts around its mean of the entire sequence of forecasts. The forecast convergence score described by Ruth et al. (2009) comprises the count of forecast oscillations around a significance threshold and includes information about the convergence toward the following forecast as well as the magnitude of the oscillations. The convergence index of Ehret (2010) is a combination of counts of oscillations exceeding a significance threshold and counts of nonconvergent forecasts. The metric introduced in Richardson et al. (2020) is based on the average of all ensemble differences of consecutive model initializations. To compute the difference, the divergence function associated with the continuous ranked probability score (CRPS) is utilized. Griffiths et al. (2019) add up the distances between consecutive forecasts over a forecast sequence and divide the sum by the range of the forecasts.
Running and maintaining an ML-based precipitation forecasting system in daily operations is facilitated if the applied architecture of the ML system is simple and robust against changes in the training dataset. Furthermore, the training dataset should contain only a few predictors that are, moreover, easy to maintain. Therefore, we address the following issues in the present study. First, we want to assess the forecast quality of the set of hyperparameter-optimized ANNs introduced in Schaumann et al. (2021) when they are trained on an alternative high-resolution dataset. The dataset comprises forecasts of DWD's ensemble-based precipitation nowcasting scheme STEPS-DWD (Reinoso-Rondinel et al. 2022) and ensemble forecasts of an experimental setup of the operational high-resolution short-term NWP model Icosahedral Nonhydrostatic (ICON-D2) for the SINFONY project. Second, we want to explore to what extent the forecast inconsistency (jumpiness) can be reduced by the proposed set of ANN architectures, and whether it is further reduced if only one common ANN architecture is applied to all forecast lead times.
The remainder of the paper is structured as follows. Section 2 gives a brief overview of the utilized datasets. In section 3, we briefly review some of our previous work and explain which changes are made to the combination model in the present paper. Then, in section 4, the new combination model is validated, and the effects of each change are discussed. Finally, section 5 summarizes our study and draws some conclusions.
2. Data
The present study assesses the effects on forecast quality when the set of hyperparameter-optimized ANNs introduced in Schaumann et al. (2021) is trained on a dataset with higher resolution and input forecasts from other ensemble forecast models. For this purpose, we utilize DWD's ensemble-based precipitation nowcasting scheme STEPS-DWD as well as ensemble forecasts of an experimental setup of the operational high-resolution short-term NWP model ICON-D2, both developed in the framework of SINFONY. The training dataset considered in the present study focuses on summertime heavy rainfall events and consists of data for three approximately month-long periods (from 26 May to 26 June 2016, from 1 to 23 June 2019, and from 3 June to 16 July 2020). In the following, these datasets are described in more detail.
a. STEPS-DWD
The probabilistic Radarvorhersage [RADVOR (radar forecast)] forecasts from our previous study are replaced by the new ensemble precipitation nowcasting method STEPS-DWD. The latter is based on the well-established STEPS approach (Seed 2003; Bowler et al. 2006; Seed et al. 2013; Foresti et al. 2016) and has been adapted and improved for DWD purposes within the framework of SINFONY. The forecasts are based on composites of radar reflectivities obtained by DWD's radar network, which is depicted in Fig. 1 by the envelope of all radar measuring ranges. Furthermore, rain rates are derived by a method for quantitative precipitation estimation (QPE) that uses individual relations between radar reflectivities and rain rates for different types of hydrometeors (Steinert et al. 2021). For the present study, STEPS-DWD is configured to consist of a cascade of first-order autoregressive processes on 12 spatial scales and to apply a new localization approach (Pulkkinen et al. 2020; Reinoso-Rondinel et al. 2022) for the estimation of the autoregressive parameters on each individual scale. Individual realizations of the ensemble are then generated by imprinting spatially correlated fields of stochastic noise in regions with precipitation. The spatially recomposed fields are then extrapolated by a constant-vector backward scheme (Germann and Zawadzki 2002) based on a predetermined motion vector field. Nowcasts are computed every 30 min out to 6 h ahead with a temporal resolution of 5 min. The original fields with a spatial resolution of 1 × 1 km2 are interpolated onto the coarser NWP grid (2.2 × 2.2 km2). Afterward, the extrapolated rain rates are accumulated to hourly rainfall amounts. For lead times of less than one hour, accumulations are also computed from radar-based QPE products, so that at the start of each extrapolation forecast the hourly rainfall amount consists of the radar-based QPE products of the immediately preceding hour. This precipitation accumulation is used as the ground truth.
DWD’s operational radar network. The positions of radar sites of the network are depicted by white dots. The blueish circle around each site represents the range of the terrain-following precipitation scan of the dual-polarization radars. Darker blue shades represent areas covered by more than one radar. Additionally, the terrain height is illustrated in colors from green (low) to white (high). Note that the radar at Borkum, at the time of the training period, has been an older single-polarization radar with a lower data quality.
Citation: Artificial Intelligence for the Earth Systems 1, 4; 10.1175/AIES-D-22-0020.1
b. ICON-D2-EPS
Compared to the previous study (Schaumann et al. 2021) in which we used statistically postprocessed NWP forecasts as input for the neural network, we now switch to raw NWP ensemble forecasts computed by an experimental setup of the ICON model (Zängl et al. 2015) in limited area mode (LAM) on a central-European domain with a horizontal grid spacing of Δx ≈ 2.2 km and 20 forecast ensemble members. This deep-convection-allowing setup is called ICON-D2. Besides conventional observation data and Mode-S aircraft measurements, 3D volume radar reflectivities and radial winds are assimilated by DWD’s kilometer-scale ensemble data assimilation system (KENDA), which implements a localized ensemble transform Kalman filter (Schraff et al. 2016; Bick et al. 2016). Note that 40 members are used for the assimilation, while the first 20 members serve as initial conditions for the forecasts. Lateral and upper boundary conditions are provided by ICON-EU ensemble forecasts (larger trans-European domain, grid spacing 6.5 km, parameterized deep convection). For cloud microphysics the operational conventional one-moment scheme is used. Ensemble forecasts are initialized every 3 h and run up to 12 h ahead. From these forecasts we use hourly precipitation sums at each forecast hour.
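A common way to derive probabilistic threshold-exceedance forecasts from such an ensemble is the fraction of members reaching each threshold. The following minimal numpy sketch illustrates the idea; the function name and array shapes are our assumptions, not the SINFONY implementation:

```python
import numpy as np


def exceedance_probs(ensemble, thresholds):
    """Fraction of ensemble members reaching each threshold.

    ensemble:   array of shape (n_members, ny, nx), e.g., hourly precipitation sums
    thresholds: sequence of n_thr threshold values
    returns:    array of shape (n_thr, ny, nx) with exceedance probabilities
    """
    ens = np.asarray(ensemble)[None, ...]              # (1, n_members, ny, nx)
    thr = np.asarray(thresholds)[:, None, None, None]  # (n_thr, 1, 1, 1)
    return (ens >= thr).mean(axis=1)                   # average over members
```

For a 20-member ICON-D2 ensemble, each grid point thus yields probabilities in steps of 0.05.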
3. Model and methods
In Schaumann et al. (2021), an ANN-based model for the combination of two probabilistic forecasts, which produces calibrated and consistent probabilities, was proposed as a generalization of the so-called logistic regression, triangular functions, and interaction terms (LTI) model (see Schaumann et al. 2020). The term probabilistic forecast refers to probabilities for the occurrence of binary events, that is, the exceedance of precipitation thresholds.
In this section, we propose a few improvements of the ANN model and call the new version the combined, calibrated, consistent (C3) model.
a. Architecture and properties of the C3 model
In its current form, the C3 model consists of several feed-forward neural networks, each one for the combination of forecasts with respect to a specific lead time.
All neural networks considered in the C3 model consist of four types of layers arranged in the following order: zero to five convolutional layers, one dense layer, one triangular functions layer, and one dense layer with a softmax activation function (see Fig. 2).
The utilized network architecture (green) and the input and output data (blue) as a schematic representation adapted from Schaumann et al. (2021). The arrows depict the flow of information. Note that the input data in this study are forecasts from STEPS-DWD and ICON-D2.
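As an illustration of this layer sequence (omitting the convolutional layers), the following numpy sketch passes input probabilities through a triangular-basis expansion into a dense softmax layer. The node positions, layer sizes, and function names are our assumptions for illustration; the exact definitions are given in Schaumann et al. (2021):

```python
import numpy as np


def triangular_basis(x, nodes):
    """Expand each scalar into piecewise-linear 'hat' functions centered at
    equidistant nodes (an illustrative stand-in for the triangular layer)."""
    nodes = np.asarray(nodes, dtype=float)
    width = nodes[1] - nodes[0]            # equidistant node spacing assumed
    x = np.asarray(x, dtype=float)[..., None]
    return np.clip(1.0 - np.abs(x - nodes) / width, 0.0, None)


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Toy forward pass: 5 samples, each with 2 input probabilities.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(5, 2))
h = triangular_basis(x, np.linspace(0.0, 1.0, 9)).reshape(5, -1)
W = rng.normal(size=(h.shape[1], 10))      # 10 output bins for 9 thresholds
p = softmax(h @ W)                         # discrete distribution per sample
```

Inside the node range, the hat functions form a partition of unity, so each input value activates at most two basis functions.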
The triangular functions layer transforms each scalar x_i of its input in such a way that each neuron j of the following dense layer can be interpreted as a sum of functions
As a loss function, the categorical cross entropy is used. The softmax layer produces a discrete probability distribution over the events that precipitation occurs either between two consecutive thresholds or below/above the lowest/highest threshold. Based on this discrete probability distribution, a probability for the exceedance of each threshold is computed. The combined forecast is calibrated and consists of consistent probabilities for all precipitation thresholds.
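The step from the softmax bin distribution to the exceedance probabilities is a reversed cumulative sum, which by construction yields monotonically nonincreasing (i.e., consistent) probabilities across thresholds. A sketch with assumed names:

```python
import numpy as np


def exceedance_from_bins(bin_probs):
    """Convert bin probabilities for (-inf, t1), [t1, t2), ..., [tK, inf)
    into exceedance probabilities P(X >= t_k) for the K thresholds."""
    # Tail sums via a reversed cumulative sum over the last axis.
    tail = np.cumsum(bin_probs[..., ::-1], axis=-1)[..., ::-1]
    return tail[..., 1:]                   # drop the trivial total probability
```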
The specific hyperparameters of each neural network are determined individually by a hyperparameter optimization algorithm; see Tables 1 and 2. For more details about the triangular layer or the hyperparameter optimization algorithm, see Schaumann et al. (2021).
Selected configurations of hyperparameters for different lead times based on the results of Schaumann et al. (2021).
b. Training and validation
The first 3 weeks of the available dataset are used as a warm-up period on which the model is only trained. For the remaining dataset, a rolling-origin scheme (Armstrong and Grohman 1972) is applied. This is an iterative approach that simulates how new information incrementally becomes available for training over time in an operational setting. The available dataset is divided by a point in time t into two parts, where t represents the present. The part of the dataset before t is considered to be in the past and is therefore available for training, while the data after t are considered to be in the future and are used for the validation of the model. Over the course of the training and validation process, t is iteratively moved from the beginning to the end of the dataset; in each step of the rolling-origin scheme, the model is trained on the past data and validated on the future data.
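A minimal index-based sketch of such a rolling-origin loop (function name and step logic are our assumptions; the actual training schedule may differ):

```python
def rolling_origin_splits(n_samples, warmup, step):
    """Yield (train_indices, validation_indices) pairs: train on everything
    before the moving origin t, validate on the next block, then advance t."""
    t = warmup
    while t < n_samples:
        yield list(range(t)), list(range(t, min(t + step, n_samples)))
        t += step
```

In each iteration, the training set grows while the validation block moves forward, mimicking an operational setting in which new data arrive over time.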
c. Increase in spatial resolution
In the present paper, two new input forecasts are considered in order to increase the resolution of the combined forecast: STEPS-DWD and ICON-D2. Both have a spatial resolution of 2.2 × 2.2 km2, whereas the datasets previously used in Schaumann et al. (2021) have a resolution of 20 × 20 km2. As the input of the ANN consists of data for a fixed number of grid points determined by the convolutional layers, the spatial range of the input data shrinks when the resolution of the datasets is increased. To compensate for the finer grid, one could increase the size or dilation of the convolutional layers. Here, dilation refers to a method in which only every Nth row and column of the input data is passed on to the network; that is, a dilation of 2 doubles the spatial range of the convolutional layers along the x and y axes without increasing the number of data points. However, it turned out that adapting the size of the convolutions to the finer grid did not result in better validation scores. Similarly, dilating the convolutions did not lead to better validation scores and additionally introduced artifacts into the combined forecast: due to the gaps introduced by a dilation of N > 1, smaller structures in the input forecast influence only every Nth grid point in the output, while neighboring grid points are unaffected, which leads to repeating patterns in the combined forecast. Therefore, we did not change the convolutions for the results obtained in the present paper.
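How kernel size and dilation enlarge the spatial range of stacked stride-1 convolutions can be quantified with the standard receptive-field formula; the helper below is a hypothetical illustration, not part of the C3 code:

```python
def receptive_field(layers):
    """One-dimensional receptive field of stacked stride-1 convolutions,
    each layer given as a (kernel_size, dilation) pair:
    rf = 1 + sum((kernel - 1) * dilation)."""
    rf = 1
    for kernel, dilation in layers:
        rf += (kernel - 1) * dilation
    return rf
```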
d. Full utilization of the sampling window with an NaN mask
For technical reasons, the neural network in Schaumann et al. (2021) requires its input to have a rectangular shape and no missing data. Since the precipitation nowcasting forecasts are based on radar composites, which are shifted according to a motion vector field, parts of a composite may be shifted outside of the sampling window. Temporary radar outages further reduce the available data. Depending on the shape and location of the area with available data, the largest usable rectangle might be considerably smaller than that area itself. To utilize the whole dataset in the present paper, all values that are not a number (NaN) are replaced by the value −1, and a Boolean field that flags grid points with missing data is used as an additional input to the neural network. Instead of discarding part of the dataset, this approach allows the model to learn to ignore the values of −1.
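The masking step can be sketched as follows, with the fill value −1 as stated in the text (the function name is an assumption):

```python
import numpy as np


def mask_missing(field, fill_value=-1.0):
    """Replace NaNs by a fill value and return a Boolean availability mask,
    which is fed to the network as an additional input channel."""
    available = ~np.isnan(field)
    filled = np.where(available, field, fill_value)
    return filled, available
```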
e. Forecast persistence and consistency
Repeated runs of a forecast model at different starting times produce a sequence of forecasts with different lead times for the same valid date. In general, these forecasts become increasingly accurate with decreasing lead time, and the ideal evolution would be a trend from inaccurate or climatological values toward a more accurate forecast. However, due to random (nonsystematic) forecast errors, this trend is often not monotonic in individual cases. Sometimes older forecasts are more accurate than newer updates, and spurious jumps in the forecasts appear. These inconsistencies or jumps are especially harmful for warning management. A weather warning that is issued for a specific date and time, canceled later on (based on a new forecast run), and then possibly issued again with the next forecast is not considered trustworthy and can hardly be communicated to the public.
The flip-flop index (FFI) of Griffiths et al. (2019) is evaluated for each grid point i, j individually and averaged over the whole evaluation period, as mentioned in section 4c.
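Following the description of the metric in the introduction (summed distances between consecutive forecasts, divided by the forecast range), a per-grid-point sketch might look as follows; the published definition in Griffiths et al. (2019) should be consulted for the exact formula:

```python
import numpy as np


def flip_flop_index(forecasts):
    """Jumpiness of a forecast sequence for one valid time and grid point:
    sum of absolute consecutive differences divided by the forecast range."""
    f = np.asarray(forecasts, dtype=float)
    wander = np.abs(np.diff(f)).sum()      # total movement between updates
    spread = f.max() - f.min()             # overall range of the sequence
    return wander / spread if spread > 0 else 0.0
```

A monotonically converging sequence yields the minimal value of 1, while oscillating ("flip-flopping") forecasts inflate the index.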
4. Results
a. Lead-time-dependent investigation on model performance
We want to assess whether the C3 model with its new implementations and the high-resolution input datasets is still able to produce high-quality forecasts, with emphasis on the core features of its forecasts: combination, calibration, and consistency. For this, we computed the bias, Brier skill score, reliability, sharpness, and the area under the relative operating characteristic (ROC) curve (AUC) over the whole evaluation period for the C3 model with two different hyperparameter settings, in order to assess the importance of lead-time-dependent hyperparameters (C3: lead-time-dependent hyperparameters; and a second configuration in which the hyperparameters optimized for the +1-h lead time are applied to all lead times). These metrics provide information about the

- systematic model error,

- forecast quality in terms of systematic model errors and random forecast errors,

- conditional frequency bias,

- forecast resolution, and

- discrimination ability.
If we call the exceedance of an arbitrary threshold an event, all of these metrics are based on the gridboxwise actually observed event occurrence and/or the forecasted event probability. Thus, the bias describes the mean error (ME) of the forecasted event probabilities and indicates the unconditional systematic model error. The Brier skill score (BSS) consists of the mean-squared error (MSE), representing the Brier score (BS) itself, divided by a reference BS that is based on the sample climatology. The values of the BSS range from −∞ to 1, where values of 1 and 0 correspond to a perfect forecast and the climatological forecast, respectively. Further, the reliability indicates how far the reliability diagram of a forecast deviates from the ideal line; that is, the reliability is the weighted mean of the squared differences between the reliability diagram and the ideal line for each bin, where the weights are the number of forecasts within each bin. Ideally, the predicted probability is equal to the observed relative frequency, in which case the reliability diagram coincides with the ideal line and the reliability is equal to zero. Therefore, the reliability provides information about the frequency bias of the forecasted event probabilities and represents a measure for the calibration of an ensemble forecast. The sharpness characterizes the unconditional distribution of the probability forecasts and provides information about the forecast resolution, that is, the ability to predict extreme values close to 0 or 1. It is represented by the variance of the forecasts. The last metric considered is the AUC, which provides information about a forecast's ability to discriminate between events and nonevents.
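The bias, BS, and BSS described above can be sketched as follows, with the climatological reference given by the sample mean of the observations (function names are our assumptions):

```python
import numpy as np


def bias(p, o):
    """Mean error of forecast probabilities p against binary outcomes o."""
    return np.mean(np.asarray(p, float) - np.asarray(o, float))


def brier_score(p, o):
    """Mean squared error between forecast probabilities and binary outcomes."""
    return np.mean((np.asarray(p, float) - np.asarray(o, float)) ** 2)


def brier_skill_score(p, o):
    """BSS relative to the sample climatology: 1 = perfect, 0 = climatology."""
    o = np.asarray(o, float)
    bs_ref = brier_score(np.full_like(o, o.mean()), o)
    return 1.0 - brier_score(p, o) / bs_ref
```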
The results obtained for the bias are depicted in the first column of Fig. 3 for both configurations of the C3 model, STEPS-DWD, and ICON-D2 as green, red, yellow, and blue lines, respectively. All four forecast techniques exhibit a nearly lead-time-independent systematic error. However, the event probability for the lowest threshold of 0.1 mm hourly rainfall amount is overestimated by the ICON-D2 forecasts by 1 percentage point, whereas the extrapolations of STEPS-DWD exhibit a slight underestimation of 1 percentage point. For higher thresholds, these systematic errors of STEPS-DWD and ICON-D2 diminish, as the frequency of event occurrences within the evaluation period decreases. The forecasts of both configurations of C3 are bias-free in the first two hours. Afterward, they exhibit only a small deviation from zero, with C3 tending toward ICON-D2.
(first column) Bias (%), (second column) BSS, (third column) reliability, (fourth column) sharpness, and (fifth column) AUC averaged over the evaluation period as validation scores for three of the nine considered thresholds for the combination model with lead-time-dependent hyperparameters (C3; green), the combination model with hyperparameters for a lead time of +1 h (red), STEPS-DWD (yellow), and ICON-D2 (blue).
To assess the forecast quality with due consideration of the combination aspect, the results obtained for the BSS are illustrated in the second column of Fig. 3, in the same way as the bias. The BSS of the precipitation nowcast extrapolations of STEPS-DWD starts high, since they start from the observation, but decreases rapidly, since growth and decay processes of precipitation are not represented. Errors in initial and boundary conditions cause the NWP forecasts of ICON-D2 to start with a lower skill. However, their decrease with increasing lead time is less pronounced due to the explicit simulation of the dynamical evolution. The intersection of both curves marks the point at which the quality of the nowcast extrapolations drops below that of the NWP forecasts; it occurs around 2.5 h after initialization. In terms of the BSS, both C3 models outperform each individual input forecast technique at all lead times. Furthermore, the different hyperparameter settings have only a small effect on forecast quality, so that an optimal combination of the two input forecasts is achieved with both C3 models at all lead times. However, the approximately 0.1 higher BSS values should be treated with caution, since convolution-induced spatial smoothing of the forecasts [as discussed by, e.g., Cuomo and Chandrasekar (2021)] leads to better scores for continuous verification metrics.
If such smoothing effects decisively affected the forecasts of our C3 models, this should be visible in the reliability, since the frequency of predicted high probabilities would be decreased, whereas the frequency of observed events for forecasted intermediate probabilities would increase. The area enclosed by the reliability curve and the ideal line is depicted for each of the forecast systems in the third column of Fig. 3, again in the same manner as the bias. For STEPS-DWD, this area increases with lead time, whereas that of the ICON-D2 forecasts is nearly constant. Note that we utilize raw, uncalibrated NWP output. The area for both C3 models is smaller than those of the two input forecasts, indicating that their curves are closer to the ideal line and, therefore, that the combined forecasts are more reliable than those of the input systems. However, the size of the area fluctuates for the two C3 models at later lead times. This may indicate shortcomings in the calibration due to the choice of triangular functions: the forecast calibration depends on the number of these triangular functions, which is 9 for +1 h but only 5 and 3 for +4 h/+6 h and +5 h, respectively (cf. Table 2).
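The reliability measure described in section 4a (the bin-count-weighted mean squared distance between the reliability curve and the ideal line) can be sketched as follows; the bin count and function name are assumptions:

```python
import numpy as np


def reliability_term(p, o, n_bins=10):
    """Bin the forecasts, then average the squared distance between mean
    forecast probability and observed relative frequency per bin, weighted
    by the bin counts. A value of 0 indicates perfect calibration."""
    p, o = np.asarray(p, float), np.asarray(o, float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        sel = bins == b
        if sel.any():
            total += sel.sum() * (p[sel].mean() - o[sel].mean()) ** 2
    return total / p.size
```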
However, the forecast sharpness of both C3 models is reduced compared to the forecasts of ICON and STEPS-DWD for all exceedance thresholds and lead times, as shown in the fourth column of Fig. 3. This may have several reasons. On the one hand, the raw input forecast ensembles exhibit a wider range of probabilities even for hourly rainfall amounts above 5 mm; thus their sharpness is higher, but at the expense of reliability. On the other hand, the increasing forecast uncertainty at higher lead times and the training on less frequent events lead to a loss of probabilities close to 1, which reduces the sharpness. Nevertheless, the C3 models exhibit a higher AUC than the ICON and STEPS-DWD forecasts, as shown in the fifth column of Fig. 3. This may indicate an improved ability to discriminate between events and nonevents, although this result should be treated with caution: both the missing high probability values and the low event base rate may be misleading. A further investigation of the discrimination ability of the C3 forecasts based on an improved AUC, as described by Ben Bouallègue and Richardson (2022), may provide more reliable results.
To gain a more detailed insight into the conditional bias, the reliability diagrams of the four forecast systems are depicted in Fig. 4 for the lead times +1, +3, and +6 h and five thresholds from 0.1 up to 5 mm. In addition, below each reliability diagram, frequency histograms for each of the forecasts are shown to allow an evaluation of the forecast sharpness. Both the extrapolation nowcasts of STEPS-DWD and the forecasts of ICON-D2 are overconfident over the entire range of thresholds and lead times. The overconfidence of the ICON-D2 forecasts increases especially with the threshold, since the frequency of observed events is not only lower due to the higher threshold but may also be reduced by errors in location. Besides the missing representation of the dynamical evolution of precipitation in the STEPS-DWD forecasts, their small spread contributes to that overconfidence. Both combination models are well calibrated for all depicted thresholds at a lead time of +1 h.
Reliability diagrams for the C3 model with different architectures (C3; green), the C3 model with one architecture, STEPS-DWD, and ICON-D2 for lead times of +1, +3, and +6 h and five thresholds from 0.1 up to 5 mm.
These results show that even with the new dataset, forecasts of both models C3 and
b. Investigation of spatial patterns in systematic model error and forecast quality
We want to investigate possible reasons for systematic model errors in the forecasts of STEPS-DWD and ICON-D2 and how the combined forecasts reduce these errors. Further, we want to explore whether spatial patterns are visible in the forecast quality. For this, the spatially resolved biases for lead times of +1 and +3 h are illustrated in Figs. 5a and 5b, respectively. The spatially resolved BSS is depicted in Figs. 6a and 6b for identical lead times.
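For reference, gridpointwise versions of the two scores can be sketched as follows. The exact bias definition used in the figures is not stated here, so we assume a frequency bias in percent (mean forecast probability relative to the observed event frequency) and a BSS against the per-gridbox sample climatology; both assumptions are ours:

```python
import numpy as np

def gridpoint_bias_percent(prob, obs):
    """Assumed definition: frequency bias in % per grid box, i.e.,
    mean forecast probability relative to the observed event
    frequency over the time axis (axis 0)."""
    return 100.0 * (prob.mean(axis=0) / obs.mean(axis=0) - 1.0)

def gridpoint_bss(prob, obs):
    """Brier skill score per grid box, using the sample climatology
    of that grid box as the reference forecast."""
    clim = obs.mean(axis=0)                      # per-gridbox base rate
    bs = np.mean((prob - obs) ** 2, axis=0)      # Brier score of the forecast
    bs_ref = np.mean((clim - obs) ** 2, axis=0)  # Brier score of climatology
    return 1.0 - bs / bs_ref
```

Mapping these per-gridbox values over the domain yields figures of the kind shown in Figs. 5 and 6, where persistent radar artifacts appear as stationary spatial patterns.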
Spatial distribution of the bias (%) averaged over the considered period for (a) +1- and (b) +3-h lead time. Depicted from left to right are STEPS-DWD, ICON-D2, and the combination model (C3) with different architectures. The right column shows the difference between C3 and
As in Fig. 5, but for the BSS.
The results for STEPS-DWD are shown in the left columns of Figs. 5 and 6. At a lead time of +1 h, the nowcast extrapolations exhibit the slight underestimation discussed in the previous section almost over the entire domain. Only in regions close to radar sites that are covered by a single radar (cf. Fig. 1) is an overestimation visible. The underestimation may be caused by a loss of power induced by the way the spatially correlated stochastic noise fields are generated (see, e.g., Atencia and Zawadzki 2014). The overestimation may be due to rain rates estimated at different heights, for example, when rain rates estimated at the maximum range of a given site are advected and compared to near-surface estimates of the respective radar site. A further reason could be attenuation caused by heavy precipitation directly at the radar site. With increasing lead time, errors due to the shifting of the forecast field lead to an underestimation in the western and southern parts of the domain, whereas the missing dynamical evolution increases the systematic error over the entire domain. The BSS reveals no distinct spatial pattern. However, some of the radar sites and also the Alps are visible, which is caused by a lower event occurrence due to radar outages, attenuation effects, and/or beam blocking. With increasing lead time, the aforementioned strong decrease in the BSS becomes visible.
The spatially resolved results for ICON-D2 forecasts are shown in the second columns from the left of Figs. 5 and 6. Here, the bias depicts not only systematic model errors but also systematic differences between the simulated surface precipitation sum and the QPE used as ground truth. Therefore, the overestimation mentioned above can be attributed to typical radar and compositing shortcomings. First, the strong overestimation over the Alps is caused by beam blocking. Second, range attenuation and ground clutter can be seen at the Borkum radar site in the northwest as a positive bias at long ranges and a local negative bias close to the radar site. For the radar site located at the Feldberg in the southwest, the difference between the heights at which rain rates are simulated and estimated may be a reason for the underestimation of hourly precipitation sums. This underestimation is more distinct for a threshold of 1 mm. Nevertheless, the overestimation in the western part of the domain, which is noticeable especially for the threshold of 0.1 mm, could be attributed to meteorological phenomena. For a higher lead time, the main patterns remain the same, albeit with a higher magnitude. In addition to the aspects mentioned above for the spatially resolved BSS of STEPS-DWD, the range attenuation is more distinct for the BSS of ICON-D2, especially at a threshold of 1 mm. For later lead times, the decrease is not as pronounced as for STEPS-DWD.
The spatially resolved results for the C3 models are shown in the right columns of Figs. 5 and 6. On the one hand, it can be seen that the biases are reduced in magnitude compared to both input forecast systems for +1-h lead time and both thresholds of 0.1 and 1 mm. However, the spatial patterns induced by the shortcomings of the radar-based QPE composite are still visible in the bias. On the other hand, even at this lead time some spatial patterns are comparable to those of ICON-D2, for example, the overestimation induced by beam blocking at the Alps and the range attenuation of the Borkum radar. This means that the deficits of the composite used as ground truth, for example, lower estimated rainfall amounts in regions covered by just one radar and deviations due to ground clutter, are not learned by the C3 model. In addition, the C3 model forecasts are constrained by both input forecasts, so that they are the result of the best possible combination. For a lead time of +3 h, the spatial bias patterns are closer to those of ICON-D2-EPS, although the overestimation in the western part of the domain is reduced and the underestimation in the range of the Feldberg radar is even more pronounced for the threshold of 0.1 mm. Therefore, even with a higher weighting toward ICON-D2-EPS, the systematic overestimation in the western part is reduced by the C3 model. However, the differences in hourly rainfall amount caused by the aforementioned height difference of the Feldberg radar are not reduced, since STEPS-DWD forecasts exhibit an underestimation as well. The systematic error of the
c. Temporal consistency of the combined forecasts
Another question is how the temporal consistency of the combined forecasts compares to that of both initial forecasts, and how it is affected by the hyperparameter choice of the C3 model. The flip-flop index (FFI) given in Eq. (1) is averaged over the evaluation period and visualized in Fig. 7 for both initial forecast systems STEPS-DWD (STEPS) and ICON-D2 (ICON), the combination model (C3), as well as the modified combination model (
Flip-flop index (FFI) averaged (a) over the whole evaluation period and (b) over the evaluation period under the condition that the observed hourly precipitation is at least 0.1 mm. Note that the color scales of each subplot cover different value ranges. The FFI is depicted for both initial forecast systems STEPS-DWD (STEPS) and ICON-D2 (ICON), the combination model (C3) as well as the modified combination model (
The FFI of the probabilities for the events that hourly precipitation exceeds the thresholds of 0.1 and 1 mm is presented in Fig. 7a as an average over the evaluation period. To better understand what the FFI values mean in our case, a brief example is given. A sequence of forecasts is optimal in terms of temporal consistency if it follows the shortest path between its minimum and its maximum. This distance, scaled by the maximum number of possible flip-flops within the sequence, is used as a reference value. An FFI of 0.03 indicates that the difference in event probabilities between two consecutive forecasts at a given grid box is on average 3 percentage points larger than the reference value. The many cases with no precipitation in either forecast or observation reduce the average FFI. To account for this effect, Fig. 7b depicts the average FFI under the condition that the observed hourly precipitation is at least 0.1 mm, which leads to much larger values compared to the unconditional FFI in Fig. 7a.
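The example above can be reproduced with a small sketch of the FFI as we read it from the description and Griffiths et al. (2019): the total revision distance of the forecast sequence, minus the shortest (monotone) path between its minimum and maximum, scaled by the maximum possible number of flip-flops, which we take to be the sequence length minus 2. The exact scaling in Eq. (1) may differ:

```python
import numpy as np

def flip_flop_index(forecasts):
    """Flip-flop index for a sequence of consecutive forecasts
    (event probabilities) valid at the same grid box. Returns 0 for
    monotone sequences; larger values indicate back-and-forth
    revisions. Assumed scaling: len(forecasts) - 2 possible flip-flops."""
    f = np.asarray(forecasts, dtype=float)
    if f.size < 3:
        return 0.0  # no flip-flop possible with fewer than 3 forecasts
    path = np.abs(np.diff(f)).sum()  # total revision distance
    shortest = f.max() - f.min()     # monotone reference path
    return (path - shortest) / (f.size - 2)
```

For the probability sequence 0.1, 0.5, 0.3, 0.6 the revision distance is 0.9, the reference path is 0.5, and with 2 possible flip-flops the FFI is 0.2, i.e., revisions are on average 20 percentage points larger than the reference.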
The technique of STEPS-DWD consists of two main components that may affect the temporal consistency in different ways. First, a set of first-order autoregressive processes is considered that replaces signals on spatial scales that are no longer predictable by spatially correlated noise. Second, an advection scheme is used that extrapolates the forecast fields based on a predetermined motion vector field. Since STEPS-DWD forecasts are overconfident especially for longer lead times and higher thresholds (cf. Fig. 4), a convergence from climatological event probabilities toward observed event frequencies appears less likely. Moreover, differences between estimated motion vector fields (e.g., lower magnitude, errors in direction) for different lead times may lead to spatial shifts of the predicted precipitation pattern. These spatial shifts may lead to a double-penalty problem when the precipitation patterns of two consecutive forecasts do not align, that is, both predict precipitation at two different locations, which results in high absolute differences at both locations. Additionally, the temporal evolution of precipitation is not covered by such an extrapolation forecast, that is, the observed stage of precipitation is extrapolated in time, ignoring growth and decay processes. Therefore, any difference in observed precipitation frequency between two consecutive hours leads to an increase in FFI. Furthermore, due to the prevailing westerly winds and the advection of precipitation, values at the western border of the domain fade out under constant advection, since the precipitation data cover only Germany. At a threshold of 1 mm, effects of beam blocking and range attenuation, as discussed above for bias and BSS, are more apparent.
The temporal consistency of NWP precipitation forecasts in convective situations may be affected by the time of convective initiation, the simulated dynamical evolution of precipitation, and also by the location of airmass boundaries or convergence lines. The unconditional FFI for the ICON-D2 (ICON) forecasts in Fig. 7a reveals two regions in which it is elevated. The first is a region in northwestern Germany that can be attributed to uncertainties in the location of airmass boundaries or convergence lines; this can also be seen in the conditional FFI in Fig. 7b. The second concerns the upland regions and the Alps, which could be an indicator of the prediction uncertainty of orographically induced precipitation. However, this is less pronounced in the conditional FFI, where the largest values can be found in Bavaria. One reason for the reduction of the conditional FFI compared to the unconditional FFI over the Alps may be the frequency bias in observed events due to the previously discussed beam blocking.
Both combination models significantly improve the FFI for both thresholds and with respect to unconditional and conditional averaging. The C3 model has slightly higher FFI values than the
d. Forecast animations
The supplementary material includes animations showing +3-h probabilistic forecasts of the input techniques STEPS-DWD (STEPS; second column from the left) and ICON-D2 (ICON; third column) as well as the combination model (C3; fourth column) and the modified combination model (
Exemplary +3-h probabilistic forecasts of STEPS-DWD (STEPS), ICON-D2 (ICON), the combination model (C3), and the modified combination model (
The STEPS-DWD forecasts exhibit less spread at a threshold of 0.1 mm compared to those of ICON-D2. This corresponds to the results shown in the reliability diagrams of Fig. 4, where STEPS-DWD is overconfident. However, the area covered by probabilities is close to that of the observation, showing that the dynamical evolution of the precipitation field in this case barely affects the extrapolation forecast. Only the precipitation band from Switzerland to Bavaria is less well covered by STEPS-DWD. In contrast, this precipitation band is more pronounced in the ICON-D2 forecast. However, the observed precipitation in the center of Germany is not predicted by the ICON-D2 forecast.
Both combination models, C3 and
5. Conclusions
a. Summary of results
Forecasts of hourly rainfall from an advection-based precipitation nowcasting ensemble and from an NWP ensemble system are subject to changes such as, on the one hand, radar outages or relocations of radar sites and, on the other hand, updates of the NWP model. A simple architecture in combination with a rolling-origin training scheme can make an ML-based seamless precipitation forecasting system robust against such changes in the training dataset and is thus able to support the operational running of a forecasting system. In addition, the training dataset should contain only a few, easily maintainable predictors. To meet these demands, we extended the combination model presented in Schaumann et al. (2021) in order to improve its forecast quality and to make it more suitable for an operational setting. Furthermore, we evaluated the forecast quality of the hyperparameter-optimized combination model when trained on a new high-resolution dataset. This dataset consists, on the one hand, of forecasts of DWD’s ensemble-based precipitation nowcasting algorithm STEPS-DWD (Reinoso-Rondinel et al. 2022) and, on the other hand, of ensemble forecasts produced by an experimental setup of the operational high-resolution short-term NWP model ICON-D2.
The validation results for the new dataset show that the combination model and its modification achieve similar scores as for the previously considered dataset (Schaumann et al. 2021). More precisely, we were able to show that our C3 models are indeed consistent over the whole range of threshold exceedances considered in this study. The forecasts represent an optimal combination of the input forecasts of STEPS-DWD and ICON-D2, which is indicated by a higher Brier skill score over all thresholds and lead times. The impact of spatial smoothing caused by convolutions is reduced by the C3 models. This is achieved, first, through the utilization of probabilities based on hourly rainfall amounts and, second, through the forecast calibration. The reliability diagrams of the combination models are well calibrated for all lead times and at least for the two lowest thresholds. The only diagrams affected by the smoothing mentioned above are those of the thresholds of 1 and 2 mm, for the C3 model at +3 h and for the
In an operational setting, robust and interpretable forecasts are important; that is, a forecast model should not only achieve high aggregate validation scores but also produce spatially and temporally consistent forecasts. For this reason, we investigated the performance of both initial models and the combination models by considering spatially resolved validation scores, to see how well each model performs at single grid points. The spatially resolved bias and BSS reveal typical shortcomings of radar measurements and radar compositing, for example, range attenuation that was not corrected due to the operation of a single-polarization radar, beam blocking, and temporary radar outages. However, the resulting spatial patterns are also visible in the results of the C3 models, indicating that these deficits are not learned by the latter models, since they are also present in the ground truth.
Finally, we considered the flip-flop index as a measure of temporal consistency. The obtained results show that both combination models produce forecasts with spatially more homogeneous validation scores and an improved flip-flop score. However, some spatial artifacts remain along the boundaries of radar coverage areas, which is likely due to the radar composite being used as ground truth. A possible alternative ground truth for verification could be station measurements. Moreover, we tested a modification (
b. Outlook
The current combination model produces probabilities for the exceedance of thresholds at single grid points. However, for weather warnings it would be useful to predict probabilities for the exceedance of thresholds within predefined areas (e.g., river basins or municipal territories). As a next step we will investigate how the current combination model can be modified in order to predict such area-dependent exceedance probabilities.
Additionally, we will extend the underlying dataset by the winter months of 2021/22 to investigate the performance of the combination model for different seasons, and to examine whether additional predictors such as orography, wind information, local forecast variance, or ensemble spread improve the combined forecast.
Acknowledgments.
We thank Susanne Theis for valuable comments and suggestions that helped us to design and perform this study. Furthermore, we are grateful to three anonymous reviewers for their helpful comments on the manuscript. For the implementation and training of the C3 model the Tensorflow library was used.
Data availability statement.
The dataset utilized in the present study consists of ICON-D2 and STEPS-DWD forecasts both in an experimental nonoperational setup. Therefore, the dataset is not publicly available until the forecast systems are operational. Afterwards, the data can be found at https://opendata.dwd.de.
REFERENCES
Armstrong, J. S., and M. C. Grohman, 1972: A comparative study of methods for long-range market forecasting. Manage. Sci., 19, 211–221, https://doi.org/10.1287/mnsc.19.2.211.
Atencia, A., and I. Zawadzki, 2014: A comparison of two techniques for generating nowcasting ensembles. Part I: Lagrangian ensemble technique. Mon. Wea. Rev., 142, 4036–4052, https://doi.org/10.1175/MWR-D-13-00117.1.
Ben Bouallègue, Z., and D. S. Richardson, 2022: On the ROC area of ensemble forecasts for rare events. Wea. Forecasting, 37, 787–796, https://doi.org/10.1175/WAF-D-21-0195.1.
Bick, T., and Coauthors, 2016: Assimilation of 3D radar reflectivities with an ensemble Kalman filter on the convective scale. Quart. J. Roy. Meteor. Soc., 142, 1490–1504, https://doi.org/10.1002/qj.2751.
Bouttier, F., and H. Marchal, 2020: Probabilistic thunderstorm forecasting by blending multiple ensembles. Tellus, 72A, 1696142, https://doi.org/10.1080/16000870.2019.1696142.
Bowler, N. E., C. E. Pierce, and A. W. Seed, 2006: STEPS: A probabilistic precipitation forecasting scheme which merges an extrapolation nowcast with downscaled NWP. Quart. J. Roy. Meteor. Soc., 132, 2127–2155, https://doi.org/10.1256/qj.04.100.
Brunet, G., and Coauthors, 2010: Collaboration of the weather and climate communities to advance subseasonal-to-seasonal prediction. Bull. Amer. Meteor. Soc., 91, 1397–1406, https://doi.org/10.1175/2010BAMS3013.1.
Brunet, G., S. Jones, and P. M. Ruti, Eds., 2015: Seamless Prediction of the Earth System: From Minutes to Months. WMO Doc. 1156, 471 pp., https://library.wmo.int/doc_num.php?explnum_id=3546.
Cuomo, J., and V. Chandrasekar, 2021: Use of deep learning for weather radar nowcasting. J. Atmos. Oceanic Technol., 38, 1641–1656, https://doi.org/10.1175/JTECH-D-21-0012.1.
Ehret, U., 2010: Convergence index: A new performance measure for the temporal stability of operational rainfall forecasts. Meteor. Z., 19, 441–451, https://doi.org/10.1127/0941-2948/2010/0480.
Foresti, L., and A. Seed, 2014: The effect of flow and orography on the spatial distribution of the very short-term predictability of rainfall from composite radar images. Hydrol. Earth Syst. Sci., 18, 4671–4686, https://doi.org/10.5194/hess-18-4671-2014.
Foresti, L., M. Reyniers, A. Seed, and L. Delobbe, 2016: Development and verification of a real-time stochastic precipitation nowcasting system for urban hydrology in Belgium. Hydrol. Earth Syst. Sci., 20, 505–527, https://doi.org/10.5194/hess-20-505-2016.
Germann, U., and I. Zawadzki, 2002: Scale-dependence of the predictability of precipitation from continental radar images. Part I: Description of the methodology. Mon. Wea. Rev., 130, 2859–2873, https://doi.org/10.1175/1520-0493(2002)130<2859:SDOTPO>2.0.CO;2.
Golding, B., 1998: Nimrod: A system for generating automated very short range forecasts. Meteor. Appl., 5, 1–16, https://doi.org/10.1017/S1350482798000577.
Griffiths, D., M. Foley, I. Ioannou, and T. Leeuwenburg, 2019: Flip-flop index: Quantifying revision stability for fixed-event forecasts. Meteor. Appl., 26, 30–35, https://doi.org/10.1002/met.1732.
Haiden, T., A. Kann, C. Wittmann, G. Pistotnik, B. Bica, and C. Gruber, 2011: The Integrated Nowcasting through Comprehensive Analysis (INCA) system and its validation over the eastern Alpine region. Wea. Forecasting, 26, 166–183, https://doi.org/10.1175/2010WAF2222451.1.
Han, L., J. Sun, W. Zhang, Y. Xiu, H. Feng, and Y. Lin, 2017: A machine learning nowcasting method based on real-time reanalysis data. J. Geophys. Res. Atmos., 122, 4038–4051, https://doi.org/10.1002/2016JD025783.
Hazeleger, W., and Coauthors, 2012: EC-Earth V2.2: Description and validation of a new seamless earth system prediction model. Climate Dyn., 39, 2611–2629, https://doi.org/10.1007/s00382-011-1228-5.
Heizenreder, D., P. Joe, T. Hewson, L. Wilson, P. Davies, and E. de Coning, 2015: Development of applications towards a high-impact weather forecast system. Seamless Prediction of the Earth System: From Minutes to Months, WMO Doc. 1156, 419–443.
Hess, R., 2020: Statistical postprocessing of ensemble forecasts for severe weather at Deutscher Wetterdienst. Nonlinear Processes Geophys., 27, 473–487, https://doi.org/10.5194/npg-27-473-2020.
Johnson, A., and X. Wang, 2012: Verification and calibration of neighborhood and object-based probabilistic precipitation forecasts from a multimodel convection-allowing ensemble. Mon. Wea. Rev., 140, 3054–3077, https://doi.org/10.1175/MWR-D-11-00356.1.
Kober, K., G. C. Craig, C. Keil, and A. Dörnbrack, 2012: Blending a probabilistic nowcasting method with a high-resolution numerical weather prediction ensemble for convective precipitation forecasts. Quart. J. Roy. Meteor. Soc., 138, 755–768, https://doi.org/10.1002/qj.939.
Nerini, D., L. Foresti, D. Leuenberger, S. Robert, and U. Germann, 2019: A reduced-space ensemble Kalman filter approach for flow-dependent integration of radar extrapolation nowcasts and NWP precipitation ensembles. Mon. Wea. Rev., 147, 987–1006, https://doi.org/10.1175/MWR-D-18-0258.1.
Nicolis, C., R. A. Perdigao, and S. Vannitsem, 2009: Dynamics of prediction errors under the combined effect of initial condition and model errors. J. Atmos. Sci., 66, 766–778, https://doi.org/10.1175/2008JAS2781.1.
Palmer, T., F. Doblas-Reyes, A. Weisheimer, and M. Rodwell, 2008: Toward seamless prediction: Calibration of climate change projections using seasonal forecasts. Bull. Amer. Meteor. Soc., 89, 459–470, https://doi.org/10.1175/BAMS-89-4-459.
Prudden, R., S. Adams, D. Kangin, N. Robinson, S. Ravuri, S. Mohamed, and A. Arribas, 2020: A review of radar-based nowcasting of precipitation and applicable machine learning techniques. arXiv, 2005.04988, https://arxiv.org/pdf/2005.04988.pdf.
Pulkkinen, S., V. Chandrasekar, A. von Lerber, and A.-M. Harri, 2020: Nowcasting of convective rainfall using volumetric radar observations. IEEE Trans. Geosci. Remote Sens., 58, 7845–7859, https://doi.org/10.1109/TGRS.2020.2984594.
Reinoso-Rondinel, R., M. Rempel, M. Schultze, and S. Trömel, 2022: Nationwide radar-based precipitation nowcasting—A localization filtering approach and its application for Germany. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 15, 1670–1691, https://doi.org/10.1109/JSTARS.2022.3144342.
Richardson, D. S., H. L. Cloke, and F. Pappenberger, 2020: Evaluation of the consistency of ECMWF ensemble forecasts. Geophys. Res. Lett., 47, e2020GL087934, https://doi.org/10.1029/2020GL087934.
Ruth, D. P., B. Glahn, V. Dagostaro, and K. Gilbert, 2009: The performance of MOS in the digital age. Wea. Forecasting, 24, 504–519, https://doi.org/10.1175/2008WAF2222158.1.
Ruti, P. M., and Coauthors, 2020: Advancing research for seamless Earth system prediction. Bull. Amer. Meteor. Soc., 101, E23–E35, https://doi.org/10.1175/BAMS-D-17-0302.1.
Schaumann, P., M. de Langlard, R. Hess, P. James, and V. Schmidt, 2020: A calibrated combination of probabilistic precipitation forecasts to achieve a seamless transition from nowcasting to very short-range forecasting. Wea. Forecasting, 35, 773–791, https://doi.org/10.1175/WAF-D-19-0181.1.
Schaumann, P., R. Hess, M. Rempel, U. Blahak, and V. Schmidt, 2021: A calibrated and consistent combination of probabilistic forecasts for the exceedance of several precipitation thresholds using neural networks. Wea. Forecasting, 36, 1076–1096, https://doi.org/10.1175/WAF-D-20-0188.1.
Schraff, C., H. Reich, A. Rhodin, A. Schomburg, K. Stephan, A. Perianez, and R. Potthast, 2016: Kilometre-scale ensemble data assimilation for the COSMO model (KENDA). Quart. J. Roy. Meteor. Soc., 142, 1453–1472, https://doi.org/10.1002/qj.2748.
Seed, A. W., 2003: A dynamic and spatial scaling approach to advection forecasting. J. Appl. Meteor., 42, 381–388, https://doi.org/10.1175/1520-0450(2003)042<0381:ADASSA>2.0.CO;2.
Seed, A. W., C. E. Pierce, and K. Norman, 2013: Formulation and evaluation of a scale decomposition-based stochastic precipitation nowcast scheme. Water Resour. Res., 49, 6624–6641, https://doi.org/10.1002/wrcr.20536.
Steinert, J., P. Tracksdorf, and D. Heizenreder, 2021: Hymec: Surface precipitation type estimation at the German weather service. Wea. Forecasting, 36, 1611–1627, https://doi.org/10.1175/WAF-D-20-0232.1.
Stephan, K., S. Klink, and C. Schraff, 2008: Assimilation of radar-derived rain rates into the convective-scale model COSMO-DE at DWD. Quart. J. Roy. Meteor. Soc., 134, 1315–1326, https://doi.org/10.1002/qj.269.
Ukkonen, P., A. Manzato, and A. Mäkelä, 2017: Evaluation of thunderstorm predictors for Finland using reanalyses and neural networks. J. Appl. Meteor. Climatol., 56, 2335–2352, https://doi.org/10.1175/JAMC-D-16-0361.1.
Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world. Bull. Amer. Meteor. Soc., 102, E681–E699, https://doi.org/10.1175/BAMS-D-19-0308.1.
Venugopal, V., E. Foufoula-Georgiou, and V. Sapozhnikov, 1999: Evidence of dynamic scaling in space-time rainfall. J. Geophys. Res., 104, 31 599–31 610, https://doi.org/10.1029/1999JD900437.
Zängl, G., D. Reinert, P. Rípodas, and M. Baldauf, 2015: The ICON (ICOsahedral Non-hydrostatic) modelling framework of DWD and MPI-M: Description of the non-hydrostatic dynamical core. Quart. J. Roy. Meteor. Soc., 141, 563–579, https://doi.org/10.1002/qj.2378.
Zawadzki, I., J. Morneau, and R. Laprise, 1994: Predictability of precipitation patterns: An operational approach. J. Appl. Meteor., 33, 1562–1571, https://doi.org/10.1175/1520-0450(1994)033<1562:POPPAO>2.0.CO;2.
Zsoter, E., R. Buizza, and D. Richardson, 2009: “Jumpiness” of the ECMWF and Met Office EPS control and ensemble-mean forecasts. Mon. Wea. Rev., 137, 3823–3836, https://doi.org/10.1175/2009MWR2960.1.