Neighborhood Ensemble Copula Coupling: Smoother and Sharper Calibrated Ensembles

Belinda Trotta, Bureau of Meteorology, Melbourne, Victoria, Australia

https://orcid.org/0009-0007-7082-7029
Open access

Abstract

Ensemble copula coupling (Schefzik et al. 2013) is a widely used method to produce a calibrated ensemble from a calibrated probabilistic forecast. This process improves the statistical accuracy of the ensemble; in other words, the distribution of the calibrated ensemble members at each grid point more closely approximates the true expected distribution. However, the trade-off is that the individual members are often less physically realistic than the original ensemble: there is noisy variation among neighboring grid points, and, depending on the calibration method, extremes in the original ensemble are sometimes muted. We introduce neighborhood ensemble copula coupling (N-ECC), a simple modification of ECC designed to mitigate these problems. We show that, when used with the calibrated forecasts produced by Flowerdew’s reliability calibration (Flowerdew 2014), N-ECC improves both the visual plausibility and the statistical properties of the forecast.

Significance Statement

Numerical weather prediction (NWP) uses physical models of the atmosphere to produce a set of scenarios (called an ensemble) describing possible weather outcomes. These forecasts are used in other models to produce weather forecasts and warnings of extreme events. For example, NWP forecasts of rainfall are used in hydrological models to predict the probability of flooding. However, the raw NWP forecasts require statistical postprocessing to ensure that the range of scenarios they describe accurately represents the true range of possible outcomes. This paper introduces a new method of processing NWP forecasts to produce physically realistic, well-calibrated ensembles.

© 2024 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Belinda Trotta, belinda.trotta@bom.gov.au

1. Introduction

Numerical weather prediction (NWP) ensembles require calibration to produce accurate probabilistic forecasts. There are many postprocessing techniques for producing calibrated probability forecasts. One may use a parametric model, where the forecast variable is assumed to follow some known distribution whose parameters are modeled as a function of the characteristics of the raw ensemble. Alternatively, one may use a nonparametric approach, either a simple one like Flowerdew’s reliability calibration (Flowerdew 2014) (which is the calibration method used in this paper) or a more sophisticated machine learning approach. In any case, the output is a forecast that describes the probability distribution of the variable of interest at either grid points or sites. However, for many applications—in particular, for hydrological modeling using precipitation forecasts—probabilistic forecasts are not sufficient because they do not capture the spatial and temporal correlations between different sites (or grid points) and times. To forecast probabilities of flooding, for example, we need to know the likelihood of a large amount of rain over a given area containing multiple grid points. Since the probability distributions at different grid points are obviously not independent, we cannot derive these area probabilities from the probabilistic forecast. What is required is a calibrated ensemble where each member represents a physically realistic scenario, and the probabilistic forecasts derived from the ensemble are also realistic.

A simple and widely used approach to building calibrated ensembles is ensemble copula coupling (Schefzik et al. 2013). Given a raw ensemble, and a calibrated probabilistic forecast derived from it, a calibrated ensemble is constructed as follows. Extract evenly spaced quantiles from the probabilistic forecast corresponding to the number of members of the original ensemble. Then, at each grid point, reorder the quantiles to match the rank order of the original ensemble at that point. This preserves much of the spatial correlation of the original forecast because nearby grid points will tend to have similar ranks in the raw ensemble and therefore be assigned similar values in the calibrated ensemble. The combination of reliability calibration and ensemble copula coupling (ECC) to generate calibrated ensembles is evaluated in Flowerdew (2014). As we discuss below, there are many other approaches to building calibrated ensembles, but ECC remains popular because of its simplicity: there is no training process, and a gridded prediction can be made using only the raw ensemble and the calibrated probabilistic forecast, without historical analyses.
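The reordering step described above can be illustrated with a minimal NumPy sketch. It is not the IMPROVER implementation; the function name `ecc_reorder` and the array layout (members on the leading axis) are assumptions made for illustration, and ties in the raw ensemble are broken by member index, which is one arbitrary choice among many.

```python
import numpy as np


def ecc_reorder(raw_ensemble, calibrated_quantiles):
    """Reorder calibrated quantiles to follow the rank order of the raw ensemble.

    raw_ensemble: array of shape (m, ...) with the m members on axis 0.
    calibrated_quantiles: array of the same shape holding m quantiles of the
        calibrated forecast at each grid point (in any order).
    """
    raw = np.asarray(raw_ensemble, dtype=float)
    cal = np.sort(np.asarray(calibrated_quantiles, dtype=float), axis=0)
    # Rank of each raw member at each grid point (0 = smallest).
    # The stable sort breaks ties by member index (an arbitrary choice).
    order = np.argsort(raw, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    # The member with rank k receives the k-th smallest calibrated quantile.
    return np.take_along_axis(cal, ranks, axis=0)


# Three members at two grid points.
raw = np.array([[3.0, 0.0], [1.0, 0.0], [2.0, 5.0]])
cal = np.array([[0.5, 0.1], [1.5, 0.2], [4.0, 2.0]])
ecc_members = ecc_reorder(raw, cal)
```

At the second grid point two raw members are tied at zero, so their calibrated values (0.1 and 0.2) are assigned purely by member index; this is the arbitrary ordering of zeros discussed below.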

However, the method has some limitations. The most obvious drawback is that the output is not as spatially correlated as the original ensemble. One reason is a situation common in precipitation forecasts: many members of the original ensemble are zero, but the corresponding quantiles of the calibrated forecast are all different small nonzero amounts. The mapping from the zero members to the calibrated members is then arbitrary, which can cause differences between neighboring points. Additionally, neighboring points with similar values in the raw ensemble can nevertheless have different ranks, which causes discontinuities. Both these problems are described in more detail by Scheuerer and Hamill (2018), who propose techniques for addressing them, which we discuss further below.

A more subtle issue, arising from the probability calibration itself rather than the ECC part of the process, is that the individual ensemble members can have more muted extremes than the original ensemble. This is particularly true for nonparametric calibration methods, such as reliability calibration (as we will see in section 4). This makes the calibrated ensemble less useful for predicting extreme events, such as flooding.

In this paper, we propose two variations on ensemble copula coupling with the aim of improving the spatial correlation and physical plausibility of calibrated ensembles. Both use elements of existing approaches: smoothed-ECC is similar to a method described in Scheuerer and Hamill (2018) and neighborhood-ECC is an adaptation of a technique of Taillardat and Mestre (2020).

There are several other approaches for building calibrated ensembles; here, we briefly review the main types. Some variants of ECC change the way that realizations are sampled from the calibrated probability distribution at each point. In addition to the quantile sampling described above, Schefzik et al. (2013) mention two alternatives. The first is simple random sampling from the calibrated distribution. In the second approach, a parametric distribution is fitted to the raw ensemble, and the quantiles of the raw ensemble members in this distribution are calculated. Then, the values corresponding to these quantiles are extracted from the calibrated distribution. Hu et al. (2016) propose stratified random sampling from the calibrated distribution in order to better represent its extremes.

A well-known alternative to ECC is the Schaake shuffle (Clark et al. 2004), where historic gridded analyses are used to order the calibrated ensemble, rather than the NWP ensemble members. This method was extended by Scheuerer et al. (2017) to select historical dates having a similar spatiotemporal structure to the NWP forecast. Another way to select the historical analyses is to choose those representing similar meteorological states to the NWP forecast, as proposed by Bellier et al. (2017).

Other approaches attempt to model the spatial correlations with parametric copulas. For example, in the Gaussian copula approach of Feldmann et al. (2015), the calibrated ensemble is constructed by sampling from a multivariate distribution having dimension equal to the number of grid cells. The mean parameter of the distribution is derived from the NWP ensemble members at each grid cell via nonhomogeneous Gaussian regression, and the correlation matrix models the spatial correlations, which depend only on pairwise distances.

The above methods provide ways of generating a calibrated ensemble given a calibrated probabilistic forecast at each grid point; the calibration step is completely independent from the ensemble generation step. The methods presented in this paper also fall into this category. However, other work focuses on integrated methodologies that both calibrate the probabilities and generate spatially correlated realizations. In particular, approaches based on generative neural network models have been proposed, for example, by Price and Rasp (2022) and Chen et al. (2024). Such approaches can produce realistic-looking gridded outputs, but training is computationally intensive. A novel non–machine learning (non-ML) generative method proposed recently by Zhao et al. (2022) generates ensemble forecasts as weighted sums of empirical orthogonal functions, where the weights have a random component. The approach of Taillardat and Mestre (2020), discussed further below, uses quantile regression forests as the calibration method, followed by an ensemble-ordering algorithm, both applied over small neighborhoods of grid cells rather than single points. Our work demonstrates that the method can be adapted to decouple the ensemble generation step from the calibration step, so that their method can be applied with any univariate calibration approach.

2. Methods

We present two techniques, smoothed-ECC and neighborhood-ECC. Smoothed-ECC aims to address spatial discontinuities. Neighborhood-ECC applies additional processing to further reduce visual noisiness, while also restoring some of the sharpness lost in the reliability calibration process. The work of Scheuerer and Hamill (2018) aims to address some of the same problems as our approach, and our techniques use some similar ideas, so we briefly describe their approach in section 2b. We have also reimplemented the relevant methods as a benchmark for our approach, and in section 4, we make a detailed comparison.

a. Reliability calibration

The calibrated probabilistic forecast was produced using Flowerdew’s reliability calibration method (Flowerdew 2014) with 20 probability bins, as implemented in the IMPROVER software (Roberts et al. 2023). However, like the standard ECC method, our techniques could be used with any method of producing calibrated probabilities.

Reliability calibration works by fitting a piecewise-linear function mapping a predicted probability of exceeding a threshold to an adjusted probability based on empirical data. We briefly describe the algorithm as implemented in the IMPROVER software. For each threshold, the interval [0, 1] of possible predicted probability values is divided into some number of equal-sized bins. In our case, we use 20 bins. For each bin, the average of the predicted probabilities in the bin is calculated: these are the x values of the linear interpolation. The y values are given by the empirical probability that an observation exceeds the threshold, given that the predicted probability lies within the bin. Thus, the calibration function is essentially a piecewise-linear approximation of the calibration curve of the unadjusted predictions. The first and last sections of the calibration function are extrapolated, using the same slope, to x values outside the range of the bin means, in order to handle the case when the unadjusted probability is smaller than the smallest bin mean or larger than the largest bin mean. Since this can result in adjusted predictions outside the [0, 1] range, outputs are clipped to lie within this range. Finally, after calibrating all thresholds, the outputs are sorted by threshold to ensure that the prediction is monotone relative to the thresholds. The method is very flexible in its ability to model the empirical correction accurately since each threshold is calibrated separately and each threshold’s model has many degrees of freedom, namely, the locations of the bin means and the slopes of the piecewise-linear curve. This makes it well suited to calibrating precipitation forecasts, where the distribution is extremely skewed and has a point mass at zero, making it difficult to model accurately using standard statistical distributions.
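The algorithm for a single threshold can be sketched as follows. This is an illustrative sketch under stated assumptions, not the IMPROVER code: the function names are hypothetical, empty bins are simply skipped, and a predicted probability of exactly 1 is placed in the last bin.

```python
import numpy as np


def fit_reliability_curve(pred_prob, exceeded, n_bins=20):
    """Return knots (xs, ys): bin-mean predicted probability and observed frequency."""
    pred_prob = np.asarray(pred_prob, dtype=float)
    exceeded = np.asarray(exceeded, dtype=float)
    # Equal-width bins on [0, 1]; a probability of exactly 1 goes in the last bin.
    bin_index = np.minimum((pred_prob * n_bins).astype(int), n_bins - 1)
    xs, ys = [], []
    for b in range(n_bins):
        in_bin = bin_index == b
        if in_bin.any():  # skipping empty bins is an assumption of this sketch
            xs.append(pred_prob[in_bin].mean())
            ys.append(exceeded[in_bin].mean())
    return np.array(xs), np.array(ys)


def apply_reliability_curve(prob, xs, ys):
    """Piecewise-linear map with extrapolated end segments, clipped to [0, 1]."""
    prob = np.asarray(prob, dtype=float)
    out = np.interp(prob, xs, ys)
    if len(xs) >= 2:
        lo_slope = (ys[1] - ys[0]) / (xs[1] - xs[0])
        hi_slope = (ys[-1] - ys[-2]) / (xs[-1] - xs[-2])
        out = np.where(prob < xs[0], ys[0] + lo_slope * (prob - xs[0]), out)
        out = np.where(prob > xs[-1], ys[-1] + hi_slope * (prob - xs[-1]), out)
    return np.clip(out, 0.0, 1.0)


def enforce_monotone(prob_by_threshold):
    """Sort exceedance probabilities so they decrease as the threshold (axis 0) increases."""
    return -np.sort(-np.asarray(prob_by_threshold, dtype=float), axis=0)


# Toy data with two bins: forecasts of 0.9 verified only half the time.
xs, ys = fit_reliability_curve([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 0], n_bins=2)
adjusted = apply_reliability_curve([0.0, 0.5, 1.0], xs, ys)
```

In the toy example a raw probability of 1.0 is mapped to about 0.56, which illustrates how the calibrated output can be unable to reach very high probabilities when the raw forecast is overconfident.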

b. Methods of Scheuerer and Hamill

We discuss here two techniques from Scheuerer and Hamill (2018) that relate to our work. The first technique aims to resolve the problem of arbitrary ordering of zeros. Scheuerer and Hamill (2018) propose ordering ensemble members corresponding to zeros in the raw ensemble using a smoothly varying function defined over the forecast grid. Specifically, they construct, for each realization, a smoothly varying function over the grid and replace zeros in the raw ensemble with this smoothed value. Then, this modified ensemble, rather than the raw ensemble, is used to order the calibrated ensemble members. This ensures that neighboring zero values in the raw ensemble are mapped to similar values in the calibrated ensemble. The smooth mapping is a linear combination of radial basis functions centered at each grid point, with distance given by the tricube kernel. The coefficients of the linear combination are sampled randomly from the uniform distribution with range [−1, 0]. More formally, the method is as follows. For each realization, define a grid G by randomly sampling a value at each grid point from the uniform distribution on [−1, 0]. Write G(x, y) to denote the value of G at the point (x, y). For a grid cell (x, y), the value of the smoothed grid G′(x, y) is the weighted average of the grid cells in its N × N neighborhood. Writing ρ = [N/2], the weight of each cell (u, υ) in the neighborhood is given by
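$$w(u,\upsilon)=\left[1-\left(\frac{|x-u|}{\rho}\right)^{3}\right]^{3}\times\left[1-\left(\frac{|y-\upsilon|}{\rho}\right)^{3}\right]^{3},$$
so that
$$G'(x,y)=\frac{\sum_{|u-x|\le\rho}\sum_{|\upsilon-y|\le\rho}w(u,\upsilon)\,G(u,\upsilon)}{\sum_{|u-x|\le\rho}\sum_{|\upsilon-y|\le\rho}w(u,\upsilon)}.$$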
Now, replace each zero value in the current realization of the raw ensemble with its corresponding value from G′. The process is repeated with a new random grid G for each realization. For our verification, we use a neighborhood size of 9, the same as that used for our smoothing methods. As we will see below, the neighborhood size has little effect on the output for the tricube method. Throughout this paper, we refer to this as the “tricube” method.
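The tricube smoothing and zero replacement can be sketched as below. This is a minimal illustration rather than the code of Scheuerer and Hamill (2018) or our reimplementation. The function names are hypothetical, the radius ρ is passed in directly as `rho` (the sketch does not fix how N/2 is rounded), and neighbors falling outside the grid are simply left out of the weighted average, which is an assumption about boundary handling.

```python
import numpy as np


def tricube_smooth(grid, rho):
    """Weighted neighborhood average with separable tricube weights of radius rho."""
    grid = np.asarray(grid, dtype=float)
    ny, nx = grid.shape
    half = int(np.floor(rho))
    num = np.zeros_like(grid)
    den = np.zeros_like(grid)
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            if abs(dy) >= ny or abs(dx) >= nx:
                continue  # this offset never lands inside the grid
            w = (1 - (abs(dy) / rho) ** 3) ** 3 * (1 - (abs(dx) / rho) ** 3) ** 3
            # Destination cells (y, x) and their neighbors (y + dy, x + dx).
            dst_y = slice(max(0, -dy), ny - max(0, dy))
            dst_x = slice(max(0, -dx), nx - max(0, dx))
            src_y = slice(max(0, dy), ny - max(0, -dy))
            src_x = slice(max(0, dx), nx - max(0, -dx))
            num[dst_y, dst_x] += w * grid[src_y, src_x]
            den[dst_y, dst_x] += w
    return num / den


def fill_zeros_with_smooth_noise(member, rho, rng):
    """Replace zeros in one raw realization with smoothed uniform noise on [-1, 0]."""
    member = np.asarray(member, dtype=float)
    noise = rng.uniform(-1.0, 0.0, size=member.shape)  # random grid G
    smooth = tricube_smooth(noise, rho)                 # smoothed grid G'
    return np.where(member == 0.0, smooth, member)


rng = np.random.default_rng(0)
member = np.array([[0.0, 0.0, 1.5], [0.0, 2.0, 0.0]])
filled = fill_zeros_with_smooth_noise(member, rho=4.5, rng=rng)
```

Because the replacement values are all nonpositive, the former zeros still rank below every positive member, but their order among themselves now varies smoothly in space.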
The second technique addresses the problem of rank discontinuities. The proposed solution is to apply further postprocessing to the calibrated ensemble members as follows. A regularized piecewise-linear regression is fitted at each grid cell, mapping the ordered raw ensemble members to the ordered calibrated ensemble members. The loss function is the sum of the squared-error loss between the piecewise-linear function and the target, plus a regularization term consisting of a weighted sum of the squared regression coefficients. The regularization makes it more likely that neighboring grid points in a realization of the raw ensemble having similar values will map to the same rank in the calibrated ensemble. Given the sorted raw ensemble values r1, …, rn and calibrated ensemble members c1, …, cn, the piecewise-linear function ψ is defined as follows. If there is at most one nonzero value, set ψ(ri) = ci for all i; that is, do not modify the calibration. Otherwise, let n0 be the index of the last calibrated ensemble member equal to zero or set n0 = 1 if all calibrated ensemble members are nonzero. Then, ψ is defined to be the function with parameters $\gamma_0, \gamma_1, \ldots, \gamma_{n-n_0}$ such that
$$\psi(x)=\gamma_0+\gamma_1 x+\sum_{i=n_0+1}^{n-1}\gamma_{i-n_0+1}\,(x-r_i)_+,$$
where $(z)_+=\max(z,0)$ denotes the positive part. The loss function for the optimization is
$$\sum_{i=n_0}^{n}\left[\psi(r_i)-c_i\right]^{2}+\lambda\sum_{i=n_0+1}^{n-1}\max(c_i,1)\,\gamma_{i-n_0+1}^{2}.$$
The first term of the loss is a sum of squared errors and the second is a regularization term with regularization parameter λ. In our implementation, we set λ = 0.5, the same as in Scheuerer and Hamill (2018). We briefly review the effect of changing this parameter in section 4c. We refer to this method, when combined with the tricube method above, as the “regularized tricube” method.
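For the simplest case, in which all calibrated members are nonzero (n0 = 1), the fit can be written as a ridge-type least-squares problem. The sketch below is an illustration under explicit assumptions, not the reference implementation: it assumes that ψ is built from the truncated-linear basis max(x − ri, 0) with knots at the interior raw values, that the squared error is evaluated at the sorted raw values ri, and that only the knot coefficients are penalized, with weights max(ci, 1). The function name `fit_regularized_psi` is hypothetical.

```python
import numpy as np


def fit_regularized_psi(r, c, lam=0.5):
    """Fit psi for sorted raw values r and sorted calibrated values c (all nonzero)."""
    r = np.asarray(r, dtype=float)
    c = np.asarray(c, dtype=float)
    knots = r[1:-1]  # interior raw values r_2, ..., r_{n-1} (1-based indexing)
    hinges = np.maximum(r[:, None] - knots[None, :], 0.0)
    design = np.column_stack([np.ones_like(r), r, hinges])
    # Penalty lam * max(c_i, 1) * gamma^2 on the hinge coefficients only,
    # expressed as extra rows of an augmented least-squares system.
    penalty = np.sqrt(lam * np.maximum(c[1:-1], 1.0))
    reg_rows = np.hstack([np.zeros((knots.size, 2)), np.diag(penalty)])
    a = np.vstack([design, reg_rows])
    b = np.concatenate([c, np.zeros(knots.size)])
    gamma, *_ = np.linalg.lstsq(a, b, rcond=None)

    def psi(x):
        x = np.asarray(x, dtype=float)
        basis = np.maximum(x[..., None] - knots, 0.0)
        return gamma[0] + gamma[1] * x + basis @ gamma[2:]

    return psi


# If the calibrated values are exactly linear in the raw values, the penalized
# fit recovers that line, because it attains zero loss with no hinge terms.
r = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
psi = fit_regularized_psi(r, 2.0 * r + 1.0)
```

Larger values of λ shrink the changes in slope at the knots, so that raw values that are close together are mapped to calibrated values that are close together.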

c. Smoothed-ECC

Our first technique is aimed at mitigating the issues described by Scheuerer and Hamill (2018), namely, that the mapping of zero members to nonzero members is arbitrary and that changes in the rank ordering between neighboring grid points cause discontinuities. Similarly to the first technique of Scheuerer and Hamill (2018), we use a modified, smoother version of the raw ensemble to order the members of the calibrated ensemble. We apply the technique to all ensemble members, not just the zeros, so that it addresses not only the problem of arbitrary ordering of zeros but also the problem of rank discontinuities. Our approach is to smooth the raw ensemble by replacing each grid cell’s value with the average of the grid cells in the 9 × 9 square neighborhood surrounding it; in other words, we convolve the original grid with a uniform square kernel of size 9 × 9. This is done separately for each ensemble member. Then, we use this smoothed version of the raw ensemble to provide the ordering for the calibrated ensemble. By choosing a relatively large kernel size, we reduce the probability that neighboring grid points have exactly the same value (the only way this can happen is if the set of 81 grid points in the surrounding neighborhood is the same) while also ensuring that they do not differ by a large amount. Although this does not totally eliminate the possibility of arbitrary ordering of zeros, or of rank discontinuities, we find that it is sufficient to greatly reduce the visual noisiness of the calibrated ensemble. Since we do not modify the values of the calibrated ensemble but only change the ordering in the realization dimension, the calibrated ensemble preserves the statistical properties of the probabilistic forecast.

The method is quite computationally efficient; the only additional step compared to the standard ECC method is computing the smoothed raw ensemble.
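A minimal sketch of S-ECC in NumPy is given below. It is illustrative only: the function names are hypothetical, and the treatment of grid edges (replicating border values before averaging) is an assumption, since the boundary handling is not specified above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def box_smooth(field, size=9):
    """Mean over a size x size window centered on each cell (size must be odd).

    Edges are handled by replicating border values (an assumption of this sketch).
    """
    pad = size // 2
    padded = np.pad(np.asarray(field, dtype=float), pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    return windows.mean(axis=(-2, -1))


def smoothed_ecc(raw_ensemble, calibrated_quantiles, size=9):
    """Order calibrated quantiles by the ranks of the smoothed raw ensemble."""
    raw = np.asarray(raw_ensemble, dtype=float)
    # Smooth each member separately.
    smoothed = np.stack([box_smooth(member, size) for member in raw])
    order = np.argsort(smoothed, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    cal = np.sort(np.asarray(calibrated_quantiles, dtype=float), axis=0)
    return np.take_along_axis(cal, ranks, axis=0)


rng = np.random.default_rng(1)
raw = rng.gamma(0.5, 2.0, size=(4, 12, 10))             # 4 members on a 12 x 10 grid
cal = np.sort(rng.gamma(0.5, 2.0, size=(4, 12, 10)), axis=0)
s_ecc = smoothed_ecc(raw, cal, size=9)
```

Only the assignment of values to members changes; the set of calibrated values at each grid point is untouched, which is why the univariate properties of the forecast are preserved.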

d. Neighborhood-ECC

Our second technique, neighborhood-ECC (N-ECC), is intended to be applied in conjunction with the smoothed-ECC (S-ECC) method just described. It describes a way of reordering the calibrated forecast not just among different realizations but also spatially within a small region. This further reduces the visual noisiness of the forecast and helps restore the high-intensity areas that have been muted by the calibration process.

Our method is as follows:

  1. Calculate the smoothed NWP ensemble as described in section 2c.

  2. Convert the calibrated probabilities to an ensemble having the same number of realizations as the original NWP forecast.

  3. Tile the grid with nonoverlapping neighborhoods of size 9 × 9 cells. We use a neighborhood size of 9, the same as in the smoothing step, but a different size could be used if desired.

  4. Within each neighborhood, consider the set of all members of the smoothed NWP ensemble over all cells [comprising 51 × 9 × 9 values in the case of ECMWF or 18 × 9 × 9 for ACCESS Global Ensemble (ACCESS-GE3)] and calculate the permutation needed to order them.

  5. Apply the reverse of this permutation to the sorted calibrated ensemble; this is the same process as in standard ECC but applied to the whole neighborhood instead of a single grid cell. This ensures that within the neighborhood, the calibrated forecast preserves spatial correlations of the smoothed raw ensemble within each ensemble member and the correlations among different ensemble members.

  6. For each pair of offsets (i, j) with i, j ∈ [0, 9), offset the tiling of neighborhoods vertically and horizontally by (i, j) and repeat steps 4 and 5 to obtain a new calibrated ensemble.

  7. Average the calibrated ensembles produced across the different iterations of the previous step. This avoids discontinuities at the boundaries of the neighborhoods.

Note that this approach preserves the expected value of the calibrated ensemble over the whole grid because each iteration of step 6 just rearranges the members of the calibrated ensemble spatially and among realizations. Allowing the grid points to move within a small area, guided by the smoothed realizations, causes grid cells with similar forecast values to cluster together spatially. This reduces visual noisiness and increases the sharpness of the forecast because high or low forecast areas are more aligned over different realizations. The neighborhood size of 9 was chosen by visually evaluating the gridded outputs for various choices of this parameter. We found that it represents a good compromise between smoothing the output enough to eliminate noise but not so much that it erases too much detail. In section 4c, we show the effect of different choices of this parameter.
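Steps 3 to 7 can be sketched in NumPy as follows. This is a simplified illustration, not the code used to produce the results: the function names are hypothetical, the smoothed raw ensemble and the calibrated ensemble are assumed to be precomputed arrays of shape (members, rows, columns), and the partial tiles that arise at the grid edges when the tiling is shifted are treated as smaller neighborhoods, which is an assumption.

```python
import numpy as np


def necc_single_tiling(smoothed_raw, calibrated, size, off_y, off_x):
    """Reorder calibrated values within each tile of one (shifted) tiling."""
    smoothed_raw = np.asarray(smoothed_raw, dtype=float)
    calibrated = np.asarray(calibrated, dtype=float)
    m, ny, nx = smoothed_raw.shape
    out = np.empty_like(calibrated)
    # Tile boundaries; shifting the tiling leaves partial tiles at the borders.
    y_edges = np.unique(np.concatenate([[0], np.arange(off_y, ny, size), [ny]]))
    x_edges = np.unique(np.concatenate([[0], np.arange(off_x, nx, size), [nx]]))
    for y0, y1 in zip(y_edges[:-1], y_edges[1:]):
        for x0, x1 in zip(x_edges[:-1], x_edges[1:]):
            # Pool all members and all cells of the neighborhood.
            raw_block = smoothed_raw[:, y0:y1, x0:x1].ravel()
            cal_block = np.sort(calibrated[:, y0:y1, x0:x1].ravel())
            order = np.argsort(raw_block, kind="stable")
            ranks = np.argsort(order, kind="stable")
            out[:, y0:y1, x0:x1] = cal_block[ranks].reshape(m, y1 - y0, x1 - x0)
    return out


def neighborhood_ecc(smoothed_raw, calibrated, size=9):
    """Average the reordered ensembles over all size x size tiling offsets."""
    total = 0.0
    for off_y in range(size):
        for off_x in range(size):
            total = total + necc_single_tiling(smoothed_raw, calibrated, size, off_y, off_x)
    return total / size**2


rng = np.random.default_rng(2)
smoothed_raw = rng.random((2, 7, 8))
calibrated = rng.random((2, 7, 8))
one_tiling = necc_single_tiling(smoothed_raw, calibrated, 3, 1, 2)
n_ecc = neighborhood_ecc(smoothed_raw, calibrated, size=3)
```

Each tiling only permutes the calibrated values, so the mean over the whole grid and all members is unchanged; the averaging over offsets is what removes the tile boundaries from the final fields.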

Our approach is similar to the work of Taillardat and Mestre (2020), who describe an integrated method combining both probability calibration and ensemble reordering that operates on neighborhoods of size 5 × 5. In their approach, the probability calibration takes as input all NWP forecasts from the 16-member ensemble in the neighborhood (i.e., 16 × 5 × 5 = 400 points). These are treated as a pseudoensemble, and statistical features are extracted describing their distribution. The computed features constitute the predictors for a quantile regression forest algorithm, where the predictand is neighborhood quantiles, and the ground truth values are computed from a gridded analysis. The calibration produces as output 400 quantiles representing the distribution within the neighborhood. These 400 values are then ordered using the spatial and per-member order of the raw ensemble in the neighborhood. To smooth out the variation due to arbitrary assignment of zero ensemble members, the reordering step is repeated many times with different random assignments and the results are averaged.

Our approach uses the same idea of rearranging points within a neighborhood but it differs from Taillardat and Mestre’s method in that our probability calibration is done on individual grid points, and only the ensemble reordering uses neighborhoods. This means the training for the probability calibration can be done using only site-based (not gridded) forecasts and observations, making it less computationally intensive and avoiding potential problems caused by the analysis being less accurate than observations. Moreover, the ensemble reordering can be applied as a postprocessing step following any univariate calibration method. Like the method of Taillardat and Mestre (2020), our method benefits from averaging many iterations, which smooths the output. However, in our case, the reordering is deterministic on each iteration (since we use a stable sorting method), and the difference between iterations is that the tiling is shifted.

3. Data and test methodology

a. Data sources and preprocessing

The methods were tested on the ECMWF (Molteni et al. 1996) and ACCESS-GE3 (Bureau of Meteorology 2019) ensemble forecasts of total accumulated precipitation over the Australian domain at lead time 24 h. The details of these NWP forecasts are summarized in Table 1. Data were regridded to an Albers equal area grid with a 9600-m resolution before calculating the calibrated probabilistic forecast and applying the ECC techniques.

Table 1. NWP forecasts used for evaluating the methods.

Observation data were preprocessed as follows before applying reliability calibration. Values above 500 mm were removed. Some observations of 0.2 mm were adjusted to zero to account for the likelihood that they were caused by dew. This was done because the Bureau does not consider dew to be true precipitation, so it is desirable that the forecast probabilities for 0.2 mm exclude dew. The data were also filtered to exclude sites not measuring sufficiently small rain amounts: specifically, we excluded sites where the smallest measurement was greater than 0.2 mm. In total, around 2000 observation sites were used in the calibration.

Forecasts were verified against site observations from around 600 automatic weather station (AWS) sites in Australia; this is a smaller set than that used for training, since AWS sites are known to be more accurate. We again excluded sites unable to report small amounts but did not make any correction for dew, since it is not possible to know exactly which 0.2-mm observations are dew. Forecasts at sites, for calculating the reliability calibration, and for verification, were extracted using the nearest-gridpoint method.

b. Data processing steps

To produce a probabilistic forecast suitable for applying reliability calibration, the NWP ensembles were thresholded using the following thresholds (in mm): 0.0, 0.01, 0.05, 0.1, 0.2, 0.4, 0.6, 1, 2, 5, 7, 10, 15, 25, 35, 50, 75, 100, 125, 150, 200, 250, 300, 350, 400, 450, and 500. Specifically, for each of these thresholds, we obtained a probabilistic forecast of exceeding the threshold by calculating the proportion of ensemble members greater than or equal to the threshold. Then, the reliability calibration method was used to calculate a separate correction function for each of these thresholds. The training period for calculating the reliability calibration was August 2020–July 2021. A total of 356 days of data were available for ECMWF and 363 for ACCESS-GE3 during this period. We then applied the calibration function to produce calibrated probabilistic forecasts over the test period, which spans September 2021–August 2022 and consists of 352 days. The different variants of ECC, and the choices of parameters, such as the neighborhood size, were each evaluated on the calibrated forecast over this test period.
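The thresholding step amounts to counting, at each grid point, the fraction of members at or above each threshold. A minimal NumPy sketch follows; the function name and array layout are assumptions made for illustration, and this is not the IMPROVER implementation.

```python
import numpy as np

# Thresholds in mm, as listed above.
THRESHOLDS_MM = np.array([
    0.0, 0.01, 0.05, 0.1, 0.2, 0.4, 0.6, 1, 2, 5, 7, 10, 15, 25, 35, 50,
    75, 100, 125, 150, 200, 250, 300, 350, 400, 450, 500,
], dtype=float)


def exceedance_probabilities(ensemble, thresholds=THRESHOLDS_MM):
    """Proportion of members >= each threshold.

    ensemble: array of shape (m, ...) with members on axis 0.
    Returns an array of shape (n_thresholds, ...).
    """
    ens = np.asarray(ensemble, dtype=float)
    thr = np.asarray(thresholds, dtype=float).reshape((-1,) + (1,) * ens.ndim)
    # Compare every member with every threshold, then average over members.
    return (ens[None, ...] >= thr).mean(axis=1)


# Three members at a single location: 0 mm, 0.3 mm, and 12 mm.
probs = exceedance_probabilities(np.array([0.0, 0.3, 12.0]))
```

Because the comparison is "greater than or equal to", the probability for the 0.0-mm threshold is always 1.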

The inputs for the various ECC methods were obtained by extracting equally spaced quantiles of the calibrated probabilistic forecast by linear interpolation. After applying the ECC methods, the calibrated ensemble was thresholded, using the same set of thresholds as used in the reliability calibration. This resulted in a probabilistic forecast with a piecewise-linear distribution function, which was used for univariate verification at sites.
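The quantile extraction at a single grid point can be sketched as an inversion of the thresholded distribution function by linear interpolation. The sketch is illustrative only: the function name is hypothetical, and the quantile levels i/(m + 1) for i = 1, …, m are an assumption, since the exact levels are not stated above. It also assumes that the distribution function is strictly increasing at the levels requested.

```python
import numpy as np


def quantiles_from_exceedance(prob_exceed, thresholds, n_members):
    """Extract n_members equally spaced quantiles from exceedance probabilities.

    prob_exceed: calibrated probabilities of exceeding each threshold at one
        grid point, nonincreasing in the threshold.
    """
    cdf = 1.0 - np.asarray(prob_exceed, dtype=float)
    # Quantile levels i / (m + 1): an assumption of this sketch.
    levels = np.arange(1, n_members + 1) / (n_members + 1)
    # Invert the piecewise-linear distribution function.
    return np.interp(levels, cdf, np.asarray(thresholds, dtype=float))


q = quantiles_from_exceedance([1.0, 0.6, 0.2, 0.0], [0.0, 1.0, 2.0, 4.0], n_members=3)
```

With this interpolation, any probability mass between the 0.0- and 0.01-mm thresholds is spread over small positive values, so the quantiles that correspond to dry members are small but generally distinct.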

The open-source IMPROVER software version 1.6.1 (Roberts et al. 2023) was used for data processing and probability calibration.

4. Results

Here, we review the gridded forecasts and accuracy metrics at sites for the various forecast methods.

Timings to process one forecast grid for the ECMWF ensemble using a single CPU thread are shown in Table 2. ECC, S-ECC, and tricube ECC are very fast, taking only a few seconds. Regularized tricube ECC takes around 1 min and N-ECC around 2 min.

Table 2. Times (s) to process one ECMWF forecast grid using one thread.

a. Spatial verification

Here, we compare the gridded outputs of the various methods visually and evaluate them statistically using the multivariate energy score.

Note that the arbitrary ordering of zero forecasts does not have a significant effect on our results. This is because in almost all cases where a raw ensemble member is zero, the corresponding calibrated forecast is also very close to zero, as shown in Table 3.

Table 3. Distribution of calibrated forecasts when the raw NWP forecast is zero. The values show the proportion of calibrated forecasts falling into each millimeter range. Almost all the calibrated forecasts are in the [0, 0.01] mm range.

Figure 1 compares the calibration methods for a single ensemble member for an ECMWF forecast. S-ECC greatly reduces the noisiness compared to ECC. N-ECC further smooths the result. It also avoids the sharp transitions between high- and low-rainfall regions that are present in S-ECC and fills some of the white-colored “holes” visible in the rainband over the southeast corner of the grid. Tricube ECC is not noticeably different from ECC, likely due to the fact that our calibration tends to map zeros to very small values, and thus, the arbitrary ordering of zeros is not a significant problem. Regularized tricube ECC slightly increases the intensity of the high-rainfall areas.

Fig. 1. Ensemble realization 40 for ECMWF at valid time 0000 UTC 2 Oct 2021. Plots depict the raw ensemble, standard ECC, S-ECC, N-ECC, tricube ECC, and regularized tricube ECC. The color scale is in millimeters.

Citation: Journal of Hydrometeorology 25, 8; 10.1175/JHM-D-23-0188.1

Figure 2 shows the mean over all realizations for a particular forecast time. Notice that the N-ECC forecast is sharper than ECC and S-ECC: in particular, the center of the high rain region over the southeast coast of the continent has a higher value. As noted in section 2d, the ensemble mean over the whole grid is unchanged compared to ECC/S-ECC; the increase in intensity in a small area is the result of spatially clustering similar-valued points. The tricube ECC and regularized tricube ECC methods are very similar to ECC.

Fig. 2. Ensemble mean forecasts for ECMWF at valid time 0000 UTC 2 Jul 2022. The color scale is in millimeters. Note that the N-ECC forecast has slightly higher intensity for the center of the rain region over the southeast coast.

The energy score (Gneiting et al. 2008) is a multivariate generalization of the continuous ranked probability score (CRPS, discussed below). In contrast to the CRPS, which scores each forecast location independently, the energy score treats forecasts and observations as multivariate quantities, represented by n-dimensional vectors where n is the number of grid cells or sites. For an ensemble forecast consisting of a set of n-dimensional vectors, the score measures how well the distribution of forecasts matches the observations. For a random variable X representing the forecast distribution, and an observed value y, the energy score is defined as follows:
$$\mathrm{ES} = E\|X - y\| - \frac{1}{2} E\|X - X'\|,$$
where X′ is an independent copy of X with the same distribution and ‖x‖ denotes the Euclidean norm of x. In the case where X represents the discrete distribution given by a multivariate m-member ensemble forecast, such as a forecast defined at n grid points or sites, this becomes
$$\mathrm{ES} = \frac{1}{m}\sum_{i=1}^{m} \|x_i - y\| - \frac{1}{2m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} \|x_i - x_j\|.$$
Note that each $x_i$, $x_j$, and $y$ is a vector of length n representing the forecast at all sites or grid cells.
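The discrete form of the energy score can be sketched in a few lines of NumPy. This is a minimal illustration rather than the verification code used for the results reported here; the function name `energy_score` and the array layout (members along the first axis, flattened grid cells along the second) are assumptions.

```python
import numpy as np

def energy_score(ensemble, obs):
    """Energy score of an m-member ensemble against one observation.

    ensemble: array of shape (m, n), one row per member, n grid cells or sites.
    obs:      array of shape (n,).
    """
    ensemble = np.asarray(ensemble, dtype=float)
    obs = np.asarray(obs, dtype=float)
    m = ensemble.shape[0]

    # First term: mean Euclidean distance between each member and the observation.
    term1 = np.linalg.norm(ensemble - obs, axis=1).mean()

    # Second term: sum of Euclidean distances over all member pairs (i, j),
    # scaled by 1 / (2 m^2). The (m, m, n) intermediate array is memory hungry
    # for large grids; a loop over members would avoid that.
    diffs = ensemble[:, None, :] - ensemble[None, :, :]
    term2 = np.linalg.norm(diffs, axis=2).sum() / (2 * m**2)

    return term1 - term2
```

For example, a two-member ensemble with members (0, 0) and (3, 4) scored against the observation (0, 0) gives 2.5 for the first term and 1.25 for the second, so the score is 1.25. Lower values are better.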

Table 4 shows the energy scores for all methods. All methods are much better than the raw ensemble, and the differences between ECC and the other variations are small, with ECC and tricube ECC having the best scores and N-ECC having the worst score.

Table 4.

Energy scores for the methods (mm).

b. Univariate verification

In this section, we examine the statistical properties of the calibrated ensemble at the level of individual grid points, without considering spatial correlations. Note that because S-ECC and tricube ECC are the same as ECC except for the permutation of realizations at each grid point, they have the same ensemble mean output and the same univariate accuracy metrics.
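The invariance noted above can be checked numerically. The sketch below uses synthetic gamma-distributed values as a stand-in for a calibrated ensemble and a random per-point permutation as a stand-in for any reordering scheme; the helper name `permute_members_per_point` and the toy sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 8  # toy sizes: 5 members, 8 grid points

# Synthetic stand-in for a calibrated ensemble (rows are members).
calibrated = rng.gamma(shape=0.5, scale=4.0, size=(m, n))

def permute_members_per_point(ens, rng):
    """Independently shuffle the member axis at every grid point."""
    # Argsort of random numbers yields one random permutation per column.
    order = np.argsort(rng.random(ens.shape), axis=0)
    return np.take_along_axis(ens, order, axis=0)

shuffled = permute_members_per_point(calibrated, rng)

# The set of values at each grid point is unchanged, so the per-point
# distribution (and any univariate score computed from it) is unchanged.
same_marginals = np.allclose(np.sort(calibrated, axis=0), np.sort(shuffled, axis=0))

# The ensemble mean at each grid point is unchanged as well.
same_mean = np.allclose(calibrated.mean(axis=0), shuffled.mean(axis=0))
```

Only the assignment of values to members changes, which is why such reorderings affect multivariate scores like the energy score but not the per-point metrics.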

Figure 3 compares the reliability curves for the 1- and 10-mm thresholds. The raw ensemble is overly confident at high probabilities; that is, the observed frequency is lower than the predicted probability. In contrast, the ECC/S-ECC methods err in the other direction; indeed, they fail to predict any very high probabilities at all. This is an inherent limitation of reliability calibration. This method represents the distribution as a fixed set of thresholds and, for each threshold t, calculates based on observation data the empirical probability of exceeding threshold t given the probability predicted by the NWP ensemble of exceeding the same threshold. But the NWP forecast does not provide enough information to confidently predict high probabilities: as we see from the reliability curve of the raw ensemble, even when almost all ensemble members exceed t, the empirical probability of exceeding t is significantly lower, which limits the maximum probability that the calibrated output can predict. N-ECC corrects this problem: it achieves much better calibration and is able to accurately predict high probabilities, restoring much of the sharpness from the original ensemble. The regularized tricube method also increases sharpness, although not as effectively as N-ECC.

Fig. 3.

(top) Reliability curves for ECMWF at 1- and 10-mm thresholds, with error bars estimated from sample size. The x axis is the forecast probability (grouped into 20 equally spaced bins and averaged within each bin) and the y axis is the observed probability for forecasts in the corresponding bin. (bottom) The distribution of forecast probabilities, grouped into the same bins. Note the logarithmic scale of the y axis.
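The binning behind a reliability curve of this kind can be sketched as follows. This is an illustrative implementation only: the function name `reliability_curve` is an assumption, the error bars are omitted, and the inputs are taken to be flat arrays of forecast probabilities and binary outcomes for a single threshold.

```python
import numpy as np

def reliability_curve(prob, event, n_bins=20):
    """Bin forecast probabilities and compute the observed frequency per bin.

    prob:  forecast probabilities in [0, 1].
    event: 1 where the threshold was exceeded in the observations, else 0.
    Returns (mean forecast probability, observed frequency, count) per bin;
    empty bins yield NaN in the first two arrays.
    """
    prob = np.asarray(prob, dtype=float).ravel()
    event = np.asarray(event, dtype=float).ravel()

    # Equally spaced bin edges; digitizing against the interior edges gives
    # indices 0 .. n_bins - 1, with prob == 1 falling in the last bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(prob, edges[1:-1])

    counts = np.bincount(idx, minlength=n_bins)
    sum_prob = np.bincount(idx, weights=prob, minlength=n_bins)
    sum_event = np.bincount(idx, weights=event, minlength=n_bins)

    with np.errstate(invalid="ignore", divide="ignore"):
        mean_prob = sum_prob / counts   # x axis: average forecast probability
        obs_freq = sum_event / counts   # y axis: observed frequency
    return mean_prob, obs_freq, counts
```

Plotting `obs_freq` against `mean_prob` gives the curve, and `counts` on a logarithmic axis gives the histogram of forecast probabilities. A curve below the diagonal at high probabilities corresponds to the overconfidence described for the raw ensemble.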

We use the Brier score and the CRPS to evaluate our ensemble forecast against observations on a univariate basis. For a rainfall threshold t, a forecast probability P(x) of exceeding t, and an observation y, the Brier score is
$$[P(x) - H_y(x)]^2,$$
where $H_y(x)$ is the Heaviside step function equal to 1 if $x \geq y$ and 0 otherwise. Table 5 shows the Brier scores at selected thresholds for the various methods; we see that N-ECC is in general either the best method or differs only very slightly from the best.
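As a rough illustration of how a Brier score can be computed from an ensemble, the sketch below takes the forecast probability to be the fraction of members exceeding the threshold. That choice, the strict inequality used for "exceeding", and the function name `brier_score` are assumptions; the probabilities scored in Table 5 come from the calibrated forecasts and may be derived differently.

```python
import numpy as np

def brier_score(ensemble, obs, threshold):
    """Mean Brier score over grid points for one rainfall threshold.

    ensemble:  array of shape (m, n), one row per member.
    obs:       array of shape (n,).
    threshold: rainfall threshold t in the same units as the data.
    """
    ensemble = np.asarray(ensemble, dtype=float)
    obs = np.asarray(obs, dtype=float)

    # Assumed forecast probability: fraction of members above the threshold.
    prob = (ensemble > threshold).mean(axis=0)

    # Binary outcome: 1 where the observation exceeded the threshold.
    outcome = (obs > threshold).astype(float)

    # Squared difference, averaged over all grid points.
    return np.mean((prob - outcome) ** 2)
```

A score of 0 means every probability was 0 or 1 and matched the outcome; lower is better.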
Table 5.

Brier scores for the methods. The best value in each row is given in bold.

For a probabilistic forecast F(x), and an observation y, the CRPS is defined as
$$\int_{-\infty}^{\infty} [F(x) - H_y(x)]^2 \, dx,$$
where $H_y(x)$ is the Heaviside step function defined above. Thus, the CRPS measures how well the predicted probability distribution agrees with the observation. In our case, the CRPS was calculated from the piecewise-linear cumulative distribution function obtained by thresholding and interpolating the ensemble forecasts. Table 6 shows the CRPS scores for each forecast. ECC/S-ECC greatly improves on the raw ensemble score, and N-ECC yields a small further improvement. Thus, N-ECC increases sharpness without significantly compromising accuracy; this is possible because our technique takes advantage of information about the nearby points via the neighborhood smoothing, which the probability calibration cannot do because it operates on each cell separately.
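For reference, the CRPS of an ensemble at a single grid point can also be computed directly from the members. The sketch below is not the piecewise-linear calculation described above: it takes F to be the empirical step-function CDF of the members, for which the integral equals the one-dimensional analog of the energy score. The function name `crps_ensemble` is an assumption, and the values it returns will differ slightly from those obtained with an interpolated CDF.

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS at one grid point, using the empirical CDF of the members.

    members: array of shape (m,) with the ensemble values.
    obs:     scalar observation.
    """
    x = np.asarray(members, dtype=float)
    m = x.size

    # Mean absolute error of the members against the observation.
    term1 = np.abs(x - obs).mean()

    # Sum of absolute differences over all member pairs, scaled by 1 / (2 m^2).
    term2 = np.abs(x[:, None] - x[None, :]).sum() / (2 * m**2)

    return term1 - term2
```

For the members 0 and 2 with observation 1, the step CDF differs from the Heaviside function by 0.5 on an interval of length 2, so the integral is 0.5, which matches the value returned by the function. With a single member, the score reduces to the absolute error.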
Table 6.

Continuous ranked probability scores for the methods (mm).

c. Effect of neighborhood size and regularization parameter

Here, we investigate the effect of varying the neighborhood size (for all methods) and the regularization parameter used in Scheuerer and Hamill’s regularized calibration.

Figure 4 shows a sample realization. We see that for S-ECC and N-ECC, the smoothing effect increases with the neighborhood width and that width 9 represents a good compromise between reducing noise and not overly distorting the shape of weather patterns. The tricube smoothing has no visible effect for any neighborhood size, likely because, as noted above, the arbitrary ordering of zeros does not significantly affect our calibration.

Fig. 4.

Ensemble realization 40 for ECMWF at valid time 0000 UTC 2 Oct 2021.

Figure 5 shows an ensemble mean forecast for the various widths. All neighborhood sizes result in increased intensity for the high-rainfall area, but again, width 9 achieves this increased sharpness while not significantly changing the contours of the rainfall, compared to the standard ECC method.

Fig. 5.

Ensemble mean for ECMWF at valid time 0000 UTC 2 Jul 2022.

As noted above, the S-ECC and tricube smoothing methods have the same CRPS as standard ECC since they only rearrange points among realizations. Table 7 shows the effect of the neighborhood size on the CRPS of the N-ECC method; the differences are very small.

Table 7.

Continuous ranked probability scores (mm) for N-ECC with different neighborhood widths.

Table 8 shows the effect of the neighborhood size on the energy score for S-ECC and N-ECC. For S-ECC, larger neighborhoods result in worse scores. For N-ECC, width 9 is slightly better than the other choices. In all cases, the differences are relatively minor.

Table 8.

Energy scores (mm) for S-ECC and N-ECC with different neighborhood widths.

As noted above, our verification of Scheuerer and Hamill’s regularized calibration uses λ = 0.5 for the regularization parameter. We briefly investigated the effect of changing this parameter. The following values were tested: 0.1, 0.5, 1, 2, 5, 10, and 50. The visual effect on the gridded forecasts was not dramatic, although larger values of λ did result in slightly smoother outputs. The effect on the CRPS and energy score metrics was also fairly small, with larger values generally making the scores worse. As noted above, the regularized calibration increases the sharpness of the forecast compared to ECC. This effect is stronger for larger values of λ, although even at λ = 50, the forecast is not as sharp as that produced with N-ECC.

5. Discussion

We have presented two techniques for improving calibrated ensembles. S-ECC rearranges the order of members at each grid point to produce a less noisy output than standard ECC. Because there is no spatial rearrangement, statistical properties of the calibrated probabilistic forecast are preserved. This technique is very fast to apply and does not require any historical training data.

N-ECC extends S-ECC to also rearrange points spatially within small neighborhoods so as to cluster points with similar forecast values. This technique is more computationally intensive than S-ECC but is fast enough to be feasible for operational applications. N-ECC yields a further improvement in the visual smoothness of the forecast while also increasing the sharpness so that areas of high intensity from the original NWP ensemble are better preserved. The quantitative metrics also compare favorably to ECC: N-ECC has much better reliability calibration than standard ECC and slightly better CRPS, while its energy score is only slightly worse. The reason that we can simultaneously increase both reliability and sharpness is that the increase in sharpness is achieved by rearranging the forecast grid points in a systematic manner informed by the local average forecast, meaning that the forecast now incorporates some information from neighboring grid points, which is not the case in the basic ECC approach. While N-ECC does not preserve the statistical properties of the probabilistic calibration, the expected value over the whole grid is preserved. The success of N-ECC in improving the sharpness of the forecast makes it an appealing option for generating calibrated ensembles of precipitation forecasts, where forecasts of extreme values are important for hydrology applications such as predicting flooding.

Like other variants of ECC, our method relies on the spatial correlations present in the raw ensemble, and the performance of our N-ECC technique is limited by the accuracy of the raw ensemble. Additionally, unlike the Schaake shuffle or generative machine learning approaches, which can generate any number of calibrated realizations, our method can generate only the number present in the original ensemble.

This paper focused specifically on improving the calibrated precipitation ensembles produced by ECC in conjunction with Flowerdew’s reliability calibration method. An avenue for further work would be to explore how the techniques presented work with other calibration methods, or with other ways of specifying the spatial correlation structure of the ensemble, such as the Schaake shuffle. Other possible extensions would be to evaluate the performance of the N-ECC method on multivariate forecasts or over multiple lead times. In the latter case, one could either calibrate each lead time separately or use temporal as well as spatial reordering of the points: specifically, in the algorithm described in section 2d, one could use three-dimensional patches (with the third dimension being the lead time), rather than two-dimensional.

1 Due to issues with preprocessing systems, data for some days were unavailable.

Acknowledgments.

I thank Robert Johnson and Benjamin Owen for their helpful feedback on drafts of this paper and Lily Gao for her assistance with searching the literature. The code used for data preprocessing, analysis, and plotting builds on the code developed by past and present members of the Forecast Improvement team at the Bureau of Meteorology. In particular, the verification relies heavily on the Bureau’s internal Veracity software. I thank ECMWF for making available the data for the medium-range ensemble.

Data availability statement.

Code was written in Python 3.9. Code for the various ECC implementations is available in the following repository, which also lists the package dependencies: https://github.com/btrotta-bom/necc-paper-code.

REFERENCES

  • Bellier, J., G. Bontron, and I. Zin, 2017: Using meteorological analogues for reordering postprocessed precipitation ensembles in hydrological forecasting. Water Resour. Res., 53, 10 085–10 107, https://doi.org/10.1002/2017WR021245.

  • Bureau of Meteorology, 2019: APS3 upgrade of the ACCESS-G/GE Numerical Weather Prediction system. NOC Operations Bulletin 125, 68 pp., http://www.bom.gov.au/australia/charts/bulletins/opsbull_G3GE3_external_v3.pdf.

  • Chen, J., T. Janke, F. Steinke, and S. Lerch, 2024: Generative machine learning methods for multivariate ensemble postprocessing. Ann. Appl. Stat., 18, 159–183, https://doi.org/10.1214/23-AOAS1784.

  • Clark, M., S. Gangopadhyay, L. Hay, B. Rajagopalan, and R. Wilby, 2004: The Schaake Shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields. J. Hydrometeor., 5, 243–262, https://doi.org/10.1175/1525-7541(2004)005<0243:TSSAMF>2.0.CO;2.

  • Feldmann, K., M. Scheuerer, and T. L. Thorarinsdottir, 2015: Spatial postprocessing of ensemble forecasts for temperature using nonhomogeneous Gaussian regression. Mon. Wea. Rev., 143, 955–971, https://doi.org/10.1175/MWR-D-14-00210.1.

  • Flowerdew, J., 2014: Calibrating ensemble reliability whilst preserving spatial structure. Tellus, 66A, 22662, https://doi.org/10.3402/tellusa.v66.22662.

  • Gneiting, T., L. I. Stanberry, E. P. Grimit, L. Held, and N. A. Johnson, 2008: Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. TEST, 17, 211–235, https://doi.org/10.1007/s11749-008-0114-x.

  • Hu, Y., M. J. Schmeits, S. J. van Andel, J. S. Verkade, M. Xu, D. P. Solomatine, and Z. Liang, 2016: A stratified sampling approach for improved sampling from a calibrated ensemble forecast distribution. J. Hydrometeor., 17, 2405–2417, https://doi.org/10.1175/JHM-D-15-0205.1.

  • Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF ensemble prediction system: Methodology and validation. Quart. J. Roy. Meteor. Soc., 122, 73–119, https://doi.org/10.1002/qj.49712252905.

  • Price, I., and S. Rasp, 2022: Increasing the accuracy and resolution of precipitation forecasts using deep generative models. Proc. Mach. Learn. Res., 151, 10 555–10 571.

  • Roberts, N., and Coauthors, 2023: IMPROVER: The new probabilistic postprocessing system at the Met Office. Bull. Amer. Meteor. Soc., 104, E680–E697, https://doi.org/10.1175/BAMS-D-21-0273.1.

  • Schefzik, R., T. L. Thorarinsdottir, and T. Gneiting, 2013: Uncertainty quantification in complex simulation models using ensemble copula coupling. Stat. Sci., 28, 616–640, https://doi.org/10.1214/13-STS443.

  • Scheuerer, M., and T. M. Hamill, 2018: Generating calibrated ensembles of physically realistic, high-resolution precipitation forecast fields based on GEFS model output. J. Hydrometeor., 19, 1651–1670, https://doi.org/10.1175/JHM-D-18-0067.1.

  • Scheuerer, M., T. M. Hamill, B. Whitin, M. He, and A. Henkel, 2017: A method for preferential selection of dates in the Schaake shuffle approach to constructing spatiotemporal forecast fields of temperature and precipitation. Water Resour. Res., 53, 3029–3046, https://doi.org/10.1002/2016WR020133.

  • Taillardat, M., and O. Mestre, 2020: From research to applications – Examples of operational ensemble post-processing in France using machine learning. Nonlinear Processes Geophys., 27, 329–347, https://doi.org/10.5194/npg-27-329-2020.

  • Zhao, P., Q. J. Wang, W. Wu, and Q. Yang, 2022: Spatial mode-based calibration (SMoC) of forecast precipitation fields from numerical weather prediction models. J. Hydrol., 613, 128432, https://doi.org/10.1016/j.jhydrol.2022.128432.
Save
  • Bellier, J., G. Bontron, and I. Zin, 2017: Using meteorological analogues for reordering postprocessed precipitation ensembles in hydrological forecasting. Water Resour. Res., 53, 10 08510 107, https://doi.org/10.1002/2017WR021245.

    • Search Google Scholar
    • Export Citation
  • Bureau of Meteorology, 2019: APS3 upgrade of the ACCESS-G/GE Numerical Weather Prediction system. NOC Operations Bulletin 125, 68 pp., http://www.bom.gov.au/australia/charts/bulletins/opsbull_G3GE3_external_v3.pdf.

  • Chen, J., T. Janke, F. Steinke, and S. Lerch, 2024: Generative machine learning methods for multivariate ensemble postprocessing. Ann. Appl. Stat., 18, 159183, https://doi.org/10.1214/23-AOAS1784.

    • Search Google Scholar
    • Export Citation
  • Clark, M., S. Gangopadhyay, L. Hay, B. Rajagopalan, and R. Wilby, 2004: The Schaake Shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields. J. Hydrometeor., 5, 243262, https://doi.org/10.1175/1525-7541(2004)005<0243:TSSAMF>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Feldmann, K., M. Scheuerer, and T. L. Thorarinsdottir, 2015: Spatial postprocessing of ensemble forecasts for temperature using nonhomogeneous Gaussian regression. Mon. Wea. Rev., 143, 955971, https://doi.org/10.1175/MWR-D-14-00210.1.

    • Search Google Scholar
    • Export Citation
  • Flowerdew, J., 2014: Calibrating ensemble reliability whilst preserving spatial structure. Tellus, 66A, 22662, https://doi.org/10.3402/tellusa.v66.22662.

    • Search Google Scholar
    • Export Citation
  • Gneiting, T., L. I. Stanberry, E. P. Grimit, L. Held, and N. A. Johnson, 2008: Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. TEST, 17, 211235, https://doi.org/10.1007/s11749-008-0114-x.

    • Search Google Scholar
    • Export Citation
  • Hu, Y., M. J. Schmeits, S. J. van Andel, J. S. Verkade, M. Xu, D. P. Solomatine, and Z. Liang, 2016: A stratified sampling approach for improved sampling from a calibrated ensemble forecast distribution. J. Hydrometeor., 17, 24052417, https://doi.org/10.1175/JHM-D-15-0205.1.

    • Search Google Scholar
    • Export Citation
  • Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF ensemble prediction system: Methodology and validation. Quart. J. Roy. Meteor. Soc., 122, 73119, https://doi.org/10.1002/qj.49712252905.

    • Search Google Scholar
    • Export Citation
  • Price, I., and S. Rasp, 2022: Increasing the accuracy and resolution of precipitation forecasts using deep generative models. Proc. Mach. Learn. Res., 151, 10 55510 571.

    • Search Google Scholar
    • Export Citation
  • Roberts, N., and Coauthors, 2023: IMPROVER: The new probabilistic postprocessing system at the Met Office. Bull. Amer. Meteor. Soc., 104, E680E697, https://doi.org/10.1175/BAMS-D-21-0273.1.

    • Search Google Scholar
    • Export Citation
  • Schefzik, R., T. L. Thorarinsdottir, and T. Gneiting, 2013: Uncertainty quantification in complex simulation models using ensemble copula coupling. Stat. Sci., 28, 616640, https://doi.org/10.1214/13-STS443.

    • Search Google Scholar
    • Export Citation
  • Scheuerer, M., and T. M. Hamill, 2018: Generating calibrated ensembles of physically realistic, high-resolution precipitation forecast fields based on GEFS model output. J. Hydrometeor., 19, 16511670, https://doi.org/10.1175/JHM-D-18-0067.1.

    • Search Google Scholar
    • Export Citation
  • Scheuerer, M., T. M. Hamill, B. Whitin, M. He, and A. Henkel, 2017: A method for preferential selection of dates in the Schaake shuffle approach to constructing spatiotemporal forecast fields of temperature and precipitation. Water Resour. Res., 53, 30293046, https://doi.org/10.1002/2016WR020133.

    • Search Google Scholar
    • Export Citation
  • Taillardat, M., and O. Mestre, 2020: From research to applications – Examples of operational ensemble post-processing in France using machine learning. Nonlinear Processes Geophys., 27, 329347, https://doi.org/10.5194/npg-27-329-2020.

    • Search Google Scholar
    • Export Citation
  • Zhao, P., Q. J. Wang, W. Wu, and Q. Yang, 2022: Spatial mode-based calibration (SMoC) of forecast precipitation fields from numerical weather prediction models. J. Hydrol., 613, 128432, https://doi.org/10.1016/j.jhydrol.2022.128432.

    • Search Google Scholar
    • Export Citation
  • Fig. 1.

    Ensemble realization 40 for ECMWF at valid time 0000 UTC 2 Oct 2021. Plots depict the raw ensemble, standard ECC, S-ECC, N-ECC, tricube ECC, and regularized tricube ECC. The color scale is in millimeters.

  • Fig. 2.

    Ensemble mean forecasts for ECMWF at valid time 0000 UTC 2 Jul 2022. The color scale is in millimeters. Note that the N-ECC forecast has slightly higher intensity for the center of the rain region over the southeast coast.

  • Fig. 3.

    (top) Reliability curves for ECMWF at 1- and 10-mm thresholds, with error bars estimated from sample size. The x axis is the forecast probability (grouped into 20 equally spaced bins and averaged within each bin) and the y axis is the observed probability for forecasts in the corresponding bin. (bottom) The distribution of forecast probabilities, grouped into the same bins. Note the logarithmic scale of the y axis.

  • Fig. 4.

    Ensemble realization 40 for ECMWF at valid time 0000 UTC 2 Oct 2021.

  • Fig. 5.

    Ensemble mean for ECMWF at valid time 0000 UTC 2 Jul 2022.

All Time Past Year Past 30 Days
Abstract Views 100 100 0
Full Text Views 1308 1308 942
PDF Downloads 167 167 9