Evaluation of Probabilistic Forecasts of Binary Events with the Neighborhood Brier Divergence Skill Score

Joël Stein, Météo-France, Toulouse, France

Fabien Stoop, Météo-France, Toulouse, France

Open access

Abstract

A procedure for evaluating the quality of probabilistic forecasts of binary events has been developed. This is based on a two-step procedure: pooling of forecasts on the one hand and observations on the other hand, on all the points of a neighborhood in order to obtain frequencies at the neighborhood length scale and then to calculate the Brier divergence for these neighborhood frequencies. This score allows the comparison of a probabilistic forecast and observations at the neighborhood length scale, and therefore, the rewarding of event forecasts shifted from the location of the observed event by a distance smaller than the neighborhood size. A new decomposition of this score generalizes that of the Brier score and allows the separation of the generalized resolution, reliability, and uncertainty terms. The neighborhood Brier divergence skill score (BDnSS) measures the performance of the probabilistic forecast against the sample climatology. BDnSS and its decomposition have been used for idealized and real cases in order to show the utility of neighborhoods when comparing at different scales the performances of ensemble forecasts between themselves or with deterministic forecasts or of deterministic forecasts between themselves.

Significance Statement

A pooling of forecasts on the one hand and observations on the other hand, on all the points of a neighborhood, is performed in order to obtain frequencies at the neighborhood scale. The Brier divergence is then calculated for these neighborhood frequencies to compare a probabilistic forecast and observations at the neighborhood scale. A new decomposition of this score generalizes that of the Brier score and allows the separation of the generalized resolution, reliability, and uncertainty terms. This uncertainty term is used to define the neighborhood Brier divergence skill score which is an alternative to the popular fractions skill score, with a more appropriate denominator.

© 2024 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Joël Stein, joel.stein@meteo.fr

1. Introduction

The use of probabilistic forecasts has increased significantly in recent years for numerical weather and climate simulations. To improve these forecasting systems, it is essential to have a robust verification system. In particular, the use of objective scores is an indispensable component of the verification system. The verification reference books of Jolliffe and Stephenson (2011) and Wilks (2011) contain a description of the usual scores for ensemble forecasts that depend on the nature of the parameter being forecast. In the case of continuous parameters such as temperature and rainfall amount, we can quote the continuous rank probability score (CRPS) which compares the predicted and observed cumulative distribution function (CDF). In the case of binary parameters corresponding to events such as the occurrence of fog, hail, tornado, or fixed threshold exceedances for continuous parameters, we can quote the Brier score (BS), the area under the receiver operating characteristic (ROC) curve, and the economic value (Jolliffe and Stephenson 2011).

In the verification of deterministic predictions of binary parameters, most scores are based on 2 × 2 contingency tables, which classify the observed and predicted events into four categories: correct forecasts of occurrence a, false alarms or incorrect forecasts of occurrence b, misses or incorrect forecasts of nonoccurrence c, and correct forecasts of nonoccurrence d (Jolliffe and Stephenson 2011). Contingency scores include the proportion of correct forecasts, the false alarm rate, the hit rate, the Heidke skill score, and the Peirce skill score (Jolliffe and Stephenson 2011). The traditional way to compute these scores is to count each category over the set of points and times where a pair formed by an observation and a forecast is available. In this traditional verification, the prediction at a grid point is checked only against the observation at the same point. If a forecast event is shifted from the observed event by a few tens of kilometers, it produces both a false alarm and a miss in the contingency table; this is known as the double penalty. The conventional use of neighborhoods consists of computing synthetic information (average, maximum, minimum, etc.) from the different points of the neighborhood for the forecasts and/or the observations. If the neighborhood length scale, denoted $l_n$ in the following, is larger than the location error between the forecast and observed events, taking the neighborhood into account allows the false alarm and the miss to compensate each other in the calculation of the synthetic parameters. Roberts and Lean (2008) proposed applying the neighborhood concept by calculating averaged event frequencies in the neighborhood for forecasts and for observations, instead of point values, to remove the double penalties. The deviation between these forecast and observed neighborhood frequencies is measured by a mean-square error, which is used to build the fractions skill score (FSS).

Duc et al. (2013) generalized the neighborhood concept for the temporal dimension by considering forecast and observed frequencies within the different selected time points in a fixed time interval. They also applied the neighborhood concept to the case of ensemble forecasts by replacing the binary occurrences of events with the ensemble forecast probabilities at each point in the neighborhood. They then averaged these forecast probabilities on the different points of the neighborhood and compared them to the observed frequencies of the event in the same neighborhood in the same way as Roberts and Lean (2008). They called this new score the ensemble FSS. It was used by Schwartz et al. (2019) in order to evaluate the NCAR’s real-time convection-allowing ensemble.

Mittermaier (2014) proposed a new verification framework named the High-Resolution Assessment (HiRA) framework which was intended to better evaluate deterministic forecasts: these were transformed into probabilistic forecasts through the creation of an ensemble forecast, the members of which were made up of the forecasts at different points in a neighborhood. The probabilities forecast by this ensemble therefore represented average frequencies predicted in the neighborhood called neighborhood ensemble probabilities (NEPs) by Schwartz and Sobash (2017). The size of the neighborhood and the choice of the probabilistic score were tailored according to the parameter to be verified in HiRA. This framework was then generalized by Mittermaier and Csima (2017) so as to allow comparison between deterministic and ensemble forecasts by replacing the predicted point probabilities with the average of the predicted probabilities at each point in the neighborhood as in Duc et al. (2013). As with the deterministic case, the size of the neighborhood and the choice of the probabilistic score could vary in HiRA depending on the parameter and the model to be verified.

Stein and Stoop (2022) proposed the application of the neighborhood concept to the CRPS. To achieve this, they considered a two-step verification procedure: first, the forecasts of the M members of the ensemble forecast made on the N points of the neighborhood were pooled to form a neighborhood CDF and then either the CRPS was used to measure its deviation from the observation at the center of the neighborhood or the divergence function dIQ associated with the CRPS (integrated quadratic distance) was used to measure its deviation from the neighborhood CDF of the observations (Thorarinsdottir et al. 2013). In the first case, it therefore measured the interest of taking into account the forecasts of the other points to improve the forecast at the center of the neighborhood, while in the second case, it measured the quality of the forecast at the neighborhood scale superior to the grid scale of the forecasts and the observations.

The purpose of this paper is to show how the methodology used for the dIQ can also be used for the Brier divergence which is the divergence function associated with the BS (Thorarinsdottir et al. 2013) so as to reformulate the ensemble FSS of Duc et al. (2013) and then to present the generalization of the BS decomposition to the Brier divergence. The theoretical framework will be developed in section 2. Section 3 will be devoted to the decomposition of the Brier divergence. Section 4 will describe idealized cases not only to show how this score is sensitive to the bias and the dispersion of the probabilistic forecast but also to the spatial range of the correlation between observations. Section 5 will present the results of an intercomparison between two ensemble forecasts at high and low resolutions as well as two deterministic forecasts at a higher resolution than the members of the associated ensemble forecast. It contains a comparison of the new score to the ensemble FSS. Section 6 will contain the main conclusions of this study.

2. Consideration of the neighborhood in the Brier divergence

We consider a probabilistic forecast that provides the empirical expected frequency $f$ of the event, which belongs to the interval [0, 1]. In the case of an ensemble forecast of $M$ members, $f = t/M$ when $t$ members forecast the event. The BS compares $f$ to the observation of the event, denoted $o$ (1 when the event is observed and 0 otherwise), by computing the squared error:
$$\mathrm{BS} = (f - o)^2.$$
This calculation is performed at each point of the spatial verification domain and for each instant of the selected time period. The average value $\overline{\mathrm{BS}}$ is obtained by averaging all these individual calculations.
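
As a minimal illustration of this definition, the following Python sketch computes the mean Brier score of a small ensemble for a threshold-exceedance event. The function name `brier_score`, the toy values, and the threshold are assumptions made for illustration only; they are not taken from the study.

```python
import numpy as np

def brier_score(members, obs, threshold):
    """Mean Brier score of an ensemble for the event 'value > threshold'.

    members: array of shape (M, npoints); obs: array of shape (npoints,).
    """
    f = (members > threshold).mean(axis=0)   # f = t / M at each point
    o = (obs > threshold).astype(float)      # o = 1 if the event is observed, else 0
    return np.mean((f - o) ** 2)             # average of the squared errors

# Hypothetical toy data: 4 members, 3 verification points.
members = np.array([[0.2, 1.5, 2.0],
                    [0.8, 0.4, 2.2],
                    [1.2, 0.1, 1.9],
                    [0.3, 0.2, 0.5]])
obs = np.array([0.1, 1.4, 2.5])
bs = brier_score(members, obs, threshold=1.0)
```

Here the forecast frequencies are 0.25, 0.25, and 0.75 and the observations are 0, 1, and 1, so the three squared errors are 0.0625, 0.5625, and 0.0625.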

The BS is a negatively oriented score: the smaller it is, the better the prediction. It is a proper score, which allows forecasts to be ranked among themselves and ensures that the best forecast is the one whose distribution most closely matches that of the observations (Bröcker and Smith 2007).

To test what information the ensemble forecast brings at a given scale, a preliminary pooling step puts together all the forecasts made at the $N$ points of the neighborhood. The frequency of the event $f$ is therefore replaced by the frequency $f_n$ ($n$ for neighborhood) averaged over the neighborhood:
$$f_n = \frac{1}{N}\sum_{i=1}^{N} f(x_i),$$
where $x_i$ is the $i$th point in the neighborhood and $f(x_i)$ is the empirical frequency forecast at that point. For ensemble forecasts, this amounts to replacing the ensemble of $M$ forecasts by an ensemble of $MN$ forecasts, assuming that the forecasts of the neighboring points are indistinguishable from each other. In the terminology of Ebert (2008), this is a neighborhood forecast in the sense that the forecast $f_n$ has been averaged over the neighborhood; it corresponds to the NEP of the pooled forecast probabilities of Schwartz and Sobash (2017) or to the forecast fraction inside the neighborhood of Duc et al. (2013).
As in Roberts and Lean (2008) for the deterministic case, the same pooling procedure is applied to the observations. We calculate the observed empirical frequency of the event $o_n$ in the same way as $f_n$:
$$o_n = \frac{1}{N}\sum_{i=1}^{N} o(x_i),$$
where $o(x_i)$ is the value of the observation (0 or 1) at the point $x_i$. The $o_n$ is thus a neighborhood observation since it is averaged over the neighborhood (Ebert 2008), and it corresponds to the observation fraction inside the neighborhood of Duc et al. (2013). Because $o_n$ can take fractional values that differ from 0 and 1, the BS must be adapted by switching to the Brier divergence BD, which allows the direct comparison of probability distributions (Thorarinsdottir et al. 2013) and is defined by
$$\mathrm{BD}(F,G) = (f - g)^2,$$
where $F$ and $G$ are two probability distributions and $f$ and $g$ are their local estimates. The BD corresponds exactly to the fraction Brier score (FBS) introduced by Roberts (2008) for deterministic forecasts and generalized to ensemble forecasts by Duc et al. (2013).
It is important to note that the BD is a proper divergence in the sense of Thorarinsdottir et al. (2013), i.e., it is optimized when the forecast distribution is the same as the observed distribution. This Brier divergence is therefore computed so as to compare the observed and predicted empirical distributions obtained by averaging over the neighborhood. To summarize, the evaluation of the prediction is performed in two steps: 1) pooling of the $N$ observed values and the $N$ predicted empirical frequencies and 2) calculation of the Brier divergence from the frequencies averaged over the neighborhood. This expression for the Brier divergence is denoted $\mathrm{BD}_n$ to show that the divergence is computed after pooling the distributions over the neighborhood, and it is written as
$$\mathrm{BD}_n(F,O) = \mathrm{BD}(F_n,O_n) = (f_n - o_n)^2,$$
where $F_n$ and $O_n$ are the probability distributions after the pooling of the forecasts and the observations. The average value $\overline{\mathrm{BD}_n}$ is obtained by averaging this score over all the points and instants available for the verification. The average value $\overline{\mathrm{FSS}}$ of the ensemble FSS of Duc et al. (2013) can be written with our notation as
$$\overline{\mathrm{FSS}}(F,O) = 1 - \frac{\overline{\mathrm{BD}_n}(F,O)}{\overline{f_n^2} + \overline{o_n^2}}. \tag{6}$$
The deterministic limit of the $\mathrm{BD}_n$ is obtained by considering an ensemble forecast of one member, which delivers only binary frequencies. In this case, the $\mathrm{BD}_n$ reduces to the FBS of Roberts (2008). In the case where the neighborhood is also reduced to a single point, $\overline{\mathrm{BD}_n}$ and consequently $\overline{\mathrm{FBS}}$ are equal to
$$\overline{\mathrm{BD}_n} = \frac{b + c}{a + b + c + d} = 1 - \mathrm{PC},$$
where $\mathrm{PC} = (a + d)/(a + b + c + d)$ is the proportion of correct forecasts (Jolliffe and Stephenson 2011).
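
The two-step procedure can be sketched in Python as follows, assuming square, disjoint neighborhoods of ln × ln grid points on a regular grid. The helper names `pool_disjoint` and `bdn_and_fss` and the toy fields are hypothetical. The example places a one-member forecast event one grid point away from the observed event: at the grid scale this produces the double penalty and the mean divergence equals 1 − PC, whereas a 2 × 2 neighborhood containing both events gives a zero divergence.

```python
import numpy as np

def pool_disjoint(field, ln):
    """Average a 2D field over disjoint ln x ln neighborhoods (incomplete edges are dropped)."""
    ny = (field.shape[0] // ln) * ln
    nx = (field.shape[1] // ln) * ln
    blocks = field[:ny, :nx].reshape(ny // ln, ln, nx // ln, ln)
    return blocks.mean(axis=(1, 3))

def bdn_and_fss(f, o, ln):
    """Mean neighborhood Brier divergence and the FSS written with the same pooled fields."""
    fn, on = pool_disjoint(f, ln), pool_disjoint(o, ln)
    bdn = np.mean((fn - on) ** 2)
    fss = 1.0 - bdn / (np.mean(fn ** 2) + np.mean(on ** 2))
    return bdn, fss

# Hypothetical 4 x 4 fields: observed event in column 2, forecast event in column 3.
o = np.zeros((4, 4)); o[:, 2] = 1.0
f = np.zeros((4, 4)); f[:, 3] = 1.0

bdn1, fss1 = bdn_and_fss(f, o, ln=1)   # grid scale: double penalty
bdn2, fss2 = bdn_and_fss(f, o, ln=2)   # neighborhood larger than the shift

# Contingency-table check of the single-point, single-member limit.
a = np.sum((f == 1) & (o == 1))
d = np.sum((f == 0) & (o == 0))
pc = (a + d) / f.size
```

With these fields, `bdn1` equals `1 - pc`, and `bdn2` is zero because the pooled forecast and observed frequencies coincide in every 2 × 2 block.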

To respect the independence of the different draws in the distribution of the observations, it is more rigorous to choose the verification points so that their respective neighborhoods are disjoint. This reduces the number of points used to calculate $\overline{\mathrm{BD}_n}$ compared with using all possible (overlapping) neighborhoods. We test the impact on the estimated value of $\overline{\mathrm{BD}_n}$ of shifting the whole set of disjoint neighborhoods by a few points in both directions. We select two spatial shifts equal to $l_n/3$ and $2l_n/3$ in both spatial directions; together with the unshifted position, this gives a distribution of nine values of $\overline{\mathrm{BD}_n}$. The case where all the neighborhoods are taken into account by moving the neighborhood as a sliding window from point to point (the neighborhoods are then not disjoint) gives a value of $\overline{\mathrm{BD}_n}$ within the range of variation of the nine cases with disjoint neighborhoods (not shown). This last way of calculating $\overline{\mathrm{BD}_n}$ is more expensive in computation time but can be useful for increasing the number of verification cases used to calculate the average value. The points whose neighborhoods would extend beyond the simulation domain are not taken into account.
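
A possible way to reproduce this sensitivity test is sketched below in Python, under the assumption that each tiling of disjoint neighborhoods is obtained by moving its origin by 0, ln/3, or 2ln/3 grid points in each direction. The function name `bdn_disjoint`, the random toy fields, and the domain size are hypothetical; neighborhoods that would extend beyond the domain are dropped.

```python
import numpy as np

def bdn_disjoint(f, o, ln, dy=0, dx=0):
    """Mean BDn over disjoint ln x ln neighborhoods whose tiling starts at (dy, dx)."""
    def pool(x):
        x = x[dy:, dx:]                                   # shift the tiling origin
        ny = (x.shape[0] // ln) * ln                      # drop incomplete neighborhoods
        nx = (x.shape[1] // ln) * ln
        return x[:ny, :nx].reshape(ny // ln, ln, nx // ln, ln).mean(axis=(1, 3))
    return np.mean((pool(f) - pool(o)) ** 2)

# Hypothetical toy fields (not the fields of the study).
rng = np.random.default_rng(1)
o = (rng.random((60, 90)) < 0.2).astype(float)   # binary observations
f = rng.random((60, 90))                         # forecast frequencies in [0, 1]

ln = 9
shifts = [0, ln // 3, 2 * ln // 3]
values = np.array([bdn_disjoint(f, o, ln, dy, dx) for dy in shifts for dx in shifts])
spread = values.std()   # variability of the estimate across the nine tilings
```

The nine entries of `values` play the role of the nine estimates discussed above; their spread gives an idea of the sampling sensitivity to the position of the disjoint neighborhoods.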

3. The $\overline{\mathrm{BD}_n}$ decomposition

The BS decomposition was introduced by Murphy (1973), who decomposed the summation over all verification points that gives $\overline{\mathrm{BS}}$ according to the different subsets where the forecast probability is constant. It is common practice to calculate the $\overline{\mathrm{BS}}$ components by first partitioning the issued probabilities into a fixed set of bins corresponding to $m$ disjoint intervals covering the interval [0, 1], as proposed by Stephenson et al. (2008). We use the subscript $k$ for the $m$ different bins and $n_k$ for the number of issued probabilities belonging to bin $k$. The expression $n = \sum_{k=1}^{m} n_k$ represents the total number of elementary verifications. Finally, $f_{kj}$ is the forecast probability for the $j$th case whose value falls in the $k$th bin. The average of these $n_k$ values for bin $k$ is $\overline{f_k} = (1/n_k)\sum_{j=1}^{n_k} f_{kj}$. Similarly, the observations are denoted $o_{kj}$ for $j = 1, 2, \ldots, n_k$ when they correspond to forecast probabilities $f_{kj}$ belonging to bin $k$. The observed average frequency of the event is $\bar{o} = (1/n)\sum_{k=1}^{m} n_k\,\overline{o_k}$, where $\overline{o_k} = (1/n_k)\sum_{j=1}^{n_k} o_{kj}$.

The average Brier score $\overline{\mathrm{BS}}$ is decomposed (Stephenson et al. 2008) according to
$$\overline{\mathrm{BS}} = \frac{1}{n}\sum_{k=1}^{m}\sum_{j=1}^{n_k}(f_{kj} - o_{kj})^2 = \mathrm{UNC} + \mathrm{REL} - \mathrm{GRES},$$
where
$$\begin{aligned}
\mathrm{UNC} &= \bar{o}\,(1 - \bar{o}), \\
\mathrm{REL} &= \frac{1}{n}\sum_{k=1}^{m} n_k\,(\overline{f_k} - \overline{o_k})^2, \\
\mathrm{GRES} &= \mathrm{RES} - \mathrm{WBV} + \mathrm{WBC}, \\
\mathrm{RES} &= \frac{1}{n}\sum_{k=1}^{m} n_k\,(\overline{o_k} - \bar{o})^2, \\
\mathrm{WBV} &= \frac{1}{n}\sum_{k=1}^{m}\sum_{j=1}^{n_k}(f_{kj} - \overline{f_k})^2, \\
\mathrm{WBC} &= \frac{2}{n}\sum_{k=1}^{m}\sum_{j=1}^{n_k}(f_{kj} - \overline{f_k})(o_{kj} - \overline{o_k}).
\end{aligned}$$
In this decomposition, we find the uncertainty term UNC, which depends only on the observations. The reliability term REL characterizes the agreement between the average value of the probability issued for a given bin and the average value of the observed frequencies in this bin. Small values of REL correspond to good quality forecasts. The term RES characterizes the ability of the ensemble forecast to deviate correctly from the observed mean frequency. RES is completed by two additional terms to form the generalized resolution GRES: these two terms measure the fluctuations within each bin in terms of the variances of the forecasts (WBV for within-bin variance) and their covariances with the observations (WBC for within-bin covariance). Large values of GRES and RES therefore correspond to good quality forecasts.
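Candille and Talagrand (2008) showed that the decomposition of Murphy (1973) can be generalized to the case where the observed frequencies can differ from 0 and 1, on the single condition that the uncertainty term UNC is replaced by
$$\mathrm{UNC} = \overline{o^2} - \bar{o}^2,$$
where $\overline{o^2} = (1/n)\sum_{k=1}^{m} n_k\,\overline{o_k^2} = (1/n)\sum_{k=1}^{m}\sum_{j=1}^{n_k}(o_{kj})^2$ is the mean of the squared observations and $\overline{o_k^2}$ is the mean of the squared observations for bin $k$. This is precisely the case when we consider observed frequencies $o_n$ averaged over the neighborhood. We can then follow the same steps to compute $\overline{\mathrm{BD}_n}$, applying this decomposition according to the $m$ bins of values of $f_n$ instead of $f$. In the formulas, $f_{kj}$ and $o_{kj}$ must however be replaced by their equivalents after the pooling step in the neighborhood, $f_{nkj}$ and $o_{nkj}$. This leads to the final expressions:
$$\begin{aligned}
\overline{\mathrm{BD}_n} &= \mathrm{UNC} + \mathrm{REL} - \mathrm{GRES}, \\
\mathrm{UNC} &= \overline{o_n^2} - \overline{o_n}^{\,2}, \\
\mathrm{REL} &= \frac{1}{n}\sum_{k=1}^{m} n_k\,(\overline{f_{nk}} - \overline{o_{nk}})^2, \\
\mathrm{GRES} &= \mathrm{RES} - \mathrm{WBV} + \mathrm{WBC}, \\
\mathrm{RES} &= \frac{1}{n}\sum_{k=1}^{m} n_k\,(\overline{o_{nk}} - \overline{o_n})^2, \\
\mathrm{WBV} &= \frac{1}{n}\sum_{k=1}^{m}\sum_{j=1}^{n_k}(f_{nkj} - \overline{f_{nk}})^2, \\
\mathrm{WBC} &= \frac{2}{n}\sum_{k=1}^{m}\sum_{j=1}^{n_k}(f_{nkj} - \overline{f_{nk}})(o_{nkj} - \overline{o_{nk}}).
\end{aligned}$$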
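
The decomposition can be checked numerically with the short Python sketch below. It assumes that the pooled frequencies are already available as one-dimensional arrays and that bins are half-open intervals closed on the right only for the last bin; the function name `bdn_decomposition`, the synthetic data, and the ten equal-width bins are illustrative choices, not those of the study.

```python
import numpy as np

def bdn_decomposition(fn, on, edges):
    """Return (BDn, UNC, REL, RES, WBV, WBC) for pooled frequencies fn and on (1D arrays)."""
    n = fn.size
    # Bin index of each forecast: interior edges define half-open bins [e_k, e_{k+1}[.
    k = np.digitize(fn, edges[1:-1])
    unc = np.mean(on ** 2) - np.mean(on) ** 2
    rel = res = wbv = wbc = 0.0
    for b in np.unique(k):                       # loop over populated bins only
        fb, ob = fn[k == b], on[k == b]
        rel += fb.size * (fb.mean() - ob.mean()) ** 2
        res += fb.size * (ob.mean() - on.mean()) ** 2
        wbv += np.sum((fb - fb.mean()) ** 2)
        wbc += 2.0 * np.sum((fb - fb.mean()) * (ob - ob.mean()))
    bdn = np.mean((fn - on) ** 2)
    return bdn, unc, rel / n, res / n, wbv / n, wbc / n

# Hypothetical pooled frequencies: observations are a noisy copy of the forecasts.
rng = np.random.default_rng(2)
fn = rng.random(2000)
on = np.clip(fn + 0.2 * rng.standard_normal(2000), 0.0, 1.0)

edges = np.linspace(0.0, 1.0, 11)
bdn, unc, rel, res, wbv, wbc = bdn_decomposition(fn, on, edges)
gres = res - wbv + wbc
```

Up to rounding errors, `bdn` equals `unc + rel - gres` for any choice of bins, which is a convenient sanity check when implementing the score.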
When we consider the trivial forecast provided by the climatology of the sample, for which the forecast probability is constant and equal to the mean observed frequency $\overline{o_n}$, we find
$$\overline{\mathrm{BD}_n}(\text{climatology}) = \mathrm{UNC}.$$
We can use this trivial forecast to build the neighborhood Brier divergence skill score $\overline{\mathrm{BD}_n\mathrm{SS}}$. This skill score and its decomposition are
$$\overline{\mathrm{BD}_n\mathrm{SS}} = 1 - \frac{\overline{\mathrm{BD}_n}}{\overline{\mathrm{BD}_n}(\text{climatology})} = \frac{\mathrm{GRES}}{\mathrm{UNC}} - \frac{\mathrm{REL}}{\mathrm{UNC}}.$$
The skill score $\overline{\mathrm{BD}_n\mathrm{SS}}$ uses the climatology of the sample as the reference forecast, which depends only on the observations, unlike the denominator of the $\overline{\mathrm{FSS}}$ given by Eq. (6), which also involves the forecasts. A strong advantage of this decomposition is that it provides this reference forecast, which allows $\overline{\mathrm{BD}_n\mathrm{SS}}$ to be used to compare different models directly with one another. $\overline{\mathrm{BD}_n\mathrm{SS}}$ can take values between $-\infty$ and 1. A positive (respectively, negative) value corresponds to a better (respectively, worse) forecast than the climatological forecast. $\overline{\mathrm{BD}_n\mathrm{SS}} = 1$ corresponds to a perfect, error-free forecast. Moreover, the reliability and resolution terms keep the same meaning as in the Brier score decomposition. The reliability term REL measures, through a weighted mean-square difference over the bins, the agreement between the predicted and observed neighborhood mean frequencies $\overline{f_{nk}}$ and $\overline{o_{nk}}$. The resolution term measures the ability of the ensemble forecast to deviate appropriately from the sample climatological forecast through a weighted mean-square difference between $\overline{o_n}$ and $\overline{o_{nk}}$ over the bins. The complementary terms WBV and WBC measure the impact on the resolution of taking into account the values of the forecast neighborhood frequency $f_{nkj}$ that differ from the average value $\overline{f_{nk}}$ in bin $k$.
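
The behavior of the skill score for a few limiting forecasts can be illustrated with the following Python sketch. The function name `bdnss` and the synthetic pooled observations are assumptions made for illustration; the reference is the sample climatology, whose mean divergence is the variance of the pooled observations.

```python
import numpy as np

def bdnss(fn, on):
    """Skill score of pooled forecasts fn against pooled observations on."""
    bdn = np.mean((fn - on) ** 2)
    unc = np.mean(on ** 2) - np.mean(on) ** 2   # BDn of the sample climatology
    return 1.0 - bdn / unc

# Hypothetical pooled observed frequencies.
rng = np.random.default_rng(3)
on = rng.random(500)

score_clim = bdnss(np.full_like(on, on.mean()), on)   # constant climatological forecast
score_perfect = bdnss(on, on)                         # forecast identical to observations
score_bad = bdnss(1.0 - on, on)                       # forecast anticorrelated with observations
```

The climatological forecast scores zero, the perfect forecast scores one, and the anticorrelated forecast obtains a negative score, consistent with the interpretation given above.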

The parameter $m$ is arbitrary in the decomposition. We choose bins centered on the values that the expected frequencies would take in the case of an ensemble forecast with $M$ members: $[0, 0.5/M[$, $[0.5/M, 1.5/M[$, $\ldots$, $[(M - 1.5)/M, (M - 0.5)/M[$, $[(M - 0.5)/M, 1]$. In this case, the number of bins is $m = M + 1$. To compare two ensemble forecasts with different numbers of members, we choose the larger of these two numbers for $m$. To compare an ensemble forecast of $M$ members with a deterministic forecast, we choose $m = M + 1$ bins for both forecasts so that the different terms of the decomposition can easily be compared.
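
As an illustration of this binning, the Python sketch below builds the bin edges for an ensemble of M members and verifies that each possible unpooled frequency t/M falls in its own bin. The function name `member_bin_edges` is hypothetical.

```python
import numpy as np

def member_bin_edges(M):
    """Edges of the M + 1 bins built around the frequencies 0, 1/M, ..., 1."""
    inner = (np.arange(M) + 0.5) / M               # 0.5/M, 1.5/M, ..., (M - 0.5)/M
    return np.concatenate(([0.0], inner, [1.0]))

M = 35
edges = member_bin_edges(M)
n_bins = edges.size - 1                            # m = M + 1

# Each frequency t/M (t = 0, ..., M) should be assigned to bin number t.
freqs = np.arange(M + 1) / M
bin_index = np.digitize(freqs, edges[1:-1])
```

For M = 35 this gives the 36 bins used in the experiments; after pooling, the neighborhood frequencies take intermediate values but are still assigned to one of these bins.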

4. Application to idealized cases

a. Numerical setup

We study the behavior of BDnSS¯ in a series of realistic idealized experiments controlled by explicit key parameters. This more theoretical framework will enable us to verify that the results are consistent and provide relevant information on the quality of ensemble predictions.

The idealized cases are all built following the same scheme. We use simulations of Gaussian random fields to generate the part that is perfectly predictable and therefore a common background for observation and forecast fields. We choose an exponential covariance model without the nugget effect, for which the variance parameter σc2 and range parameter Lc vary from experiment to experiment. The covariance C between the points x1 and x2 of the simulation domain is given by
$$C(x_1,x_2) = \sigma_c^2 \exp\left(-\frac{\lVert x_1 - x_2\rVert^2}{L_c^2}\right),$$
where $\lVert\cdot\rVert$ is the Euclidean norm.

The spatial domain contains 400 × 800 points. Data are simulated using the R software (R Core Team 2019) and the geoR library (Ribeiro et al. 2020). The observations are obtained by adding to the common background a complement that is more difficult to predict and is drawn randomly from a centered normal distribution $N(0, \sigma_o^2)$, where $\sigma_o$ is its standard deviation. The draws of the observation complement are performed independently at each point and time from the single distribution $N(0, \sigma_o^2)$. Figure 1 shows two examples of observation fields constructed with $\sigma_c = 1$ and $\sigma_o = 0.2$ and for values of $L_c$ corresponding to 80 and 2 grid points. The visible structures have sizes on the order of a few $L_c$. Thus, the field for $L_c = 2$ looks like white noise and appears much noisier than the one obtained for $L_c = 80$, which plausibly resembles a meteorological field of rain or cloud.

Fig. 1.

Observation spatial field at one time constructed with a statistical distribution corresponding to σc = 1, σo = 0.2, and (a) Lc = 80 and (b) Lc = 2. A black circle with a radius equal to Lc is drawn at the center of the domain. Forecast spatial field of one member of the ensemble forecast built at the same time and with the statistical distribution corresponding to σc = 1, μf = 0.0, σf = 1.0, and (c) Lc = 80 and (d) Lc = 2.

Citation: Monthly Weather Review 152, 5; 10.1175/MWR-D-22-0235.1

The probabilistic forecasts consist of an ensemble forecast of 35 members. Each of the 35 forecasts is obtained in the same way as the observations: we add to the common background a complement drawn randomly from a normal distribution $N(\mu_f, \sigma_f^2)$, where $\mu_f$ is its mean and $\sigma_f$ its standard deviation. The draws of the forecast complement are performed independently at each point and time from the single distribution $N(\mu_f, \sigma_f^2)$. For the sake of simplicity, the forecasting error is modeled using a homoscedastic approach to limit the number of parameters needed to control it ($\mu_f$ and $\sigma_f$ are the same for all grid points at all times). The forecast of one member of the ensemble forecast is shown in Figs. 1c and 1d for the same day as the observation discussed above, using the same common part and a complement drawn from the $N(\mu_f = 0, \sigma_f^2 = 1.0)$ distribution. The forecast complement is more visible on the total field than the observation complement, since its standard deviation is 5 times larger. For $L_c = 80$, we can visually find the common part in Figs. 1a and 1c, whereas the fields are too noisy to do so for $L_c = 2$ in Figs. 1b and 1d. To obtain statistically stable results, we iterate the process of generating observations and forecasts to constitute a set of 100 realizations.
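
The construction of one idealized case can be sketched in Python as shown below. This is not the R/geoR code of the study: it is a small stand-in that assumes the covariance has the form $\sigma_c^2 \exp(-\lVert x_1 - x_2\rVert^2 / L_c^2)$, samples the common background from the eigendecomposition of the full covariance matrix (only feasible on a small grid), and uses a hypothetical function name `simulate_case` and a reduced domain of 10 × 12 points.

```python
import numpy as np

def simulate_case(ny, nx, Lc, sigma_c, sigma_o, mu_f, sigma_f, M, rng):
    """Draw one observation field and M member fields sharing a common background."""
    yy, xx = np.meshgrid(np.arange(ny), np.arange(nx), indexing="ij")
    pts = np.column_stack([yy.ravel(), xx.ravel()]).astype(float)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)   # squared distances
    cov = sigma_c ** 2 * np.exp(-d2 / Lc ** 2)                     # assumed covariance form
    w, v = np.linalg.eigh(cov)
    w = np.clip(w, 0.0, None)                                      # remove tiny negative eigenvalues
    background = (v @ (np.sqrt(w) * rng.standard_normal(ny * nx))).reshape(ny, nx)
    obs = background + rng.normal(0.0, sigma_o, size=(ny, nx))     # observation complement
    members = background + rng.normal(mu_f, sigma_f, size=(M, ny, nx))  # forecast complements
    return obs, members

rng = np.random.default_rng(5)
obs, members = simulate_case(10, 12, Lc=4.0, sigma_c=1.0, sigma_o=0.2,
                             mu_f=0.0, sigma_f=0.2, M=35, rng=rng)

threshold = 1.72
f = (members > threshold).mean(axis=0)    # forecast frequency of the event at each point
o = (obs > threshold).astype(float)       # observed occurrence at each point
```

The fields `f` and `o` can then be pooled over neighborhoods and scored; larger domains would require a dedicated random-field simulator rather than a dense covariance matrix.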

b. Reference experiment EXP_ref

The reference experiment EXP_ref corresponds to the choices $L_c = 80$, $\sigma_c = 1$, and $\sigma_o = 0.2$ for the observations and $\mu_f = 0$ and $\sigma_f = 0.2$ for the forecasts. It corresponds to observations correlated over large distances, compared with an unbiased probabilistic forecasting system whose forecast errors are on average nearly 5 times smaller than the variability of the observations. Note that the forecasts of this reference experiment are drawn from the same distribution as the observations since $\mu_f = 0$ and $\sigma_f = \sigma_o$. The chosen event is “the parameter being greater than 1.72”; this threshold has been chosen so that the observed mean frequency of occurrence is equal to 4.5%.

We compute $\overline{\mathrm{BD}_n}$ over the whole domain of 400 × 800 points for 100 realizations and for different values $l_n$ of the neighborhood length scale (related to the number of points in a square neighborhood by $N = l_n^2$). We use the forecast frequencies $\overline{f_{nk}}$ to build a sharpness diagram (Jolliffe and Stephenson 2011) for the 36 bins (Fig. 2a).

Fig. 2.

Experiment EXP_ref: (a) sharpness diagram formed by plotting the number $n_k$ of forecast cases falling in bin $k$ against the central predicted frequency of bin $k$ for different neighborhood length scales $l_n$: 1, 3, 5, 9, 15, 21, 35, and 49 grid points. The number of bins is 36 since the ensemble forecast contains 35 members. The vertical scale of the plot is logarithmic. (b) Variations of the terms of the decomposition of $\overline{\mathrm{BD}_n}$ as a function of the neighborhood length scale $l_n$: $\overline{\mathrm{BD}_n}$ (black solid line), RES (green dotted line), WBV (brown dotted line), WBC (orange dotted line), REL (blue dashed line), and UNC (red solid line). (c) Variations as a function of the neighborhood length scale $l_n$ of $\overline{\mathrm{BD}_n\mathrm{SS}}$ (solid black line), GRES/UNC (dotted red line), and REL/UNC (dashed red line). Each symbol (disk or “x”) corresponds to a numerical experiment.

The sharpness diagram shows that all bins of the predicted probability of the event are used. When considering larger and larger neighborhood sizes, the number of cases in each bin decreases because it is nearly proportional to the inverse of the number of points in the neighborhood, $l_n^2$. This is due to the use of disjoint neighborhoods. We also note that the variation of $n_k$ with $k$ becomes noisier for large values of $l_n$ because the bins are less populated.

The term $\overline{\mathrm{BD}_n}$ and the different terms of its decomposition are plotted in Fig. 2b. UNC, which can also be written as $\mathrm{UNC} = \overline{(o_n - \overline{o_n})^2}$, represents the variance of $o_n$. UNC decreases with $l_n$ because the pooling operation reduces the variability of the frequencies averaged over the neighborhood. The resolution term RES follows exactly the same decrease and quantifies the quality of a forecast that, by construction, is able to predict the variation of the observations. We can also note that the within-bin variability terms WBV and WBC become relatively more and more important with respect to RES as $l_n$ increases but remain negligible compared with RES. REL is also negligible compared with the main terms of the decomposition for all neighborhoods used because drawing forecasts and observations from the same statistical distribution prevents reliability errors. The term $\overline{\mathrm{BD}_n}$ decreases with $l_n$ because GRES and UNC become closer to each other as $l_n$ increases.

The observed data are spatially correlated over distances on the order of $L_c$, and this correlation is fully reflected in the ensemble forecast by construction. Thus, the forecasts at the neighboring points are informative for the forecast at the central point. This is the basic postulate of the neighborhood technique, which applies here: for a neighborhood of size less than $L_c$, the frequencies $f_n$ and $o_n$ averaged over the neighborhood are closer to each other than the local values $f$ and $o$ at the center of the neighborhood. This leads to a reduction of $\overline{\mathrm{BD}_n}$, which is the mean of the squares of these differences, and therefore to an improvement of the forecast accuracy at the neighborhood scale.

We turn to $\overline{\mathrm{BD}_n\mathrm{SS}}$ and its decomposition in terms of resolution and reliability. For EXP_ref, $\overline{\mathrm{BD}_n\mathrm{SS}}$ increases with $l_n$ (Fig. 2c) and is always nearly equal to GRES/UNC, while REL/UNC is negligible. This presentation in terms of relative Brier distances highlights the relative importance of REL/UNC and GRES/UNC in the variation of $\overline{\mathrm{BD}_n\mathrm{SS}}$ as a function of $l_n$ for EXP_ref.

We tested the impact on the computation of $\overline{\mathrm{BD}_n}$ of shifting all disjoint neighborhoods by a few points in both directions, for nine pairs of shift values, and found a standard deviation of $\overline{\mathrm{BD}_n}$ and of the decomposition terms on the order of 1% over the nine realizations when choosing $l_n = 21$. The use of a sliding window (with overlapping neighborhoods) gives estimates of $\overline{\mathrm{BD}_n}$ and of all the decomposition terms that are close to the averages of the nine realizations using disjoint neighborhoods and differ from them by less than half of the standard deviation of the nine realizations.

c. Variation according to the standard deviation σf

In a series of experiments, we vary the standard deviation of the forecast from σf = 0 to σf = 1. The other parameters are similar to those of EXP_ref: Lc = 80, σc = 1.0, σo = 0.2, and μf = 0. The σf = 0 corresponds to a deterministic forecast, since the dispersion of the ensemble is zero and the forecast is always equal to the background common to observations and forecasts. The σf = 0.2 corresponds to EXP_ref. The σf = 1 corresponds to a greater dispersion of forecasts than for EXP_ref and is on the same order of magnitude as the variability of the common background.

We define the frequency bias Bn at the neighborhood scale by Bn = fn¯/on¯. The term Bn is independent of the neighborhood size ln. It increases monotonically with σf, as shown in Fig. 3b, from Bn = 0.93 for σf = 0 to Bn = 2.39 for σf = 1. To understand this underprediction (respectively, overprediction) of the event for σf < 0.2 (respectively, σf > 0.2) compared with the unbiased EXP_ref forecast, we must return to its construction: The forecasts are obtained at each point by adding two random variables, one being the common background drawn from the distribution using an exponential covariance model with the spatial distance and the other being the forecast complement drawn from the distribution N(0, σf²). The forecast then follows a narrower (respectively, wider) distribution than that of the observed values. This leads to a systematic underprediction (respectively, overprediction) of events since the predicted distribution tends to 0 more (respectively, less) quickly for high values of the parameter than the observed distribution.

Fig. 3.

(a) Variation of the BDnSS¯ as a function of the neighborhood length scale ln for experiments using the same parameters as EXP_ref except σf = 0 (orange line), σf = 0.1 (yellow line), σf = 0.2 (black line for EXP_ref), σf = 0.3 (green line), σf = 0.6 (red line), and σf = 1.0 (blue line). Variations as a function of σf for BDn¯/UNC (black solid line), 1 − GRES/UNC (red dotted line), REL/UNC (blue dashed line), and Bn (green dotted-dashed line) are plotted for (b) ln = 1 and for (c) ln = 49. The vertical scale of (b) and (c) is logarithmic.

Citation: Monthly Weather Review 152, 5; 10.1175/MWR-D-22-0235.1

The evolutions of BDnSS¯ with ln are superimposed for all σf values in Fig. 3a. As with EXP_ref, BDnSS¯ increases with ln but taking into account the neighborhood does not allow the improvement of BDnSS¯ (Fig. 3a) beyond a maximum value which corresponds to σf = 0.2 or to EXP_ref. The EXP_ref corresponds to using the same statistical distribution for the forecasts as for the observations, since σo = 0.2. The BDn¯ is therefore minimal in this configuration because BDn is a proper score and BDnSS¯ is maximal for the same reason.

To analyze in greater depth the causes of the deterioration in forecasts in relation to EXP_ref, we have superimposed, for ln = 1 in Fig. 3b and for ln = 49 in Fig. 3c, the evolutions of 1 − BDnSS¯ = BDn¯/UNC as a function of σf. We also add its decomposition into two terms, 1 − (GRES/UNC) and REL/UNC, which is deduced from Eq. (12):
$$\frac{\overline{\mathrm{BD}_n}}{\mathrm{UNC}} = \left(1 - \frac{\mathrm{GRES}}{\mathrm{UNC}}\right) + \frac{\mathrm{REL}}{\mathrm{UNC}}.$$
All three terms are negatively oriented: a decrease in their value corresponds to an improvement in the forecast. For ln = 1, we can see that the deterioration in forecasts is due to the reliability term, which is greater than in the EXP_ref case. The cause is the widening of the predicted distribution with respect to the observed distribution for σf > 0.2 and the narrowing of the predicted distribution for σf < 0.2. For ln = 49, REL/UNC varies with σf in a similar way to ln = 1. On the other hand, 1 − (GRES/UNC) is no longer constant when σf varies but also shows a minimum for EXP_ref. Thus, the relative minimum for BDn¯/UNC is more pronounced for EXP_ref, showing that at the neighborhood scale ln = 49, reliability and resolution vary in the same way as a function of σf to degrade the forecast in relation to EXP_ref.
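The normalized decomposition can be checked numerically on a small sample. The sketch below assumes the usual binned definitions of the terms (REL, RES, and the within-bin terms WBV and WBC, with GRES = RES − WBV + WBC); Eq. (12) is not reproduced in this section, so these definitions, the equal-width binning, the function name `decompose`, and the sample values are assumptions for illustration only.

```python
def decompose(fn, on, n_bins):
    """Return (REL, GRES, UNC) such that mean BDn = REL - GRES + UNC."""
    n = len(fn)
    o_bar = sum(on) / n
    unc = sum((o - o_bar) ** 2 for o in on) / n

    # Group (forecast, observation) pairs into equal-width forecast bins on [0, 1].
    bins = {}
    for f, o in zip(fn, on):
        k = min(int(f * n_bins), n_bins - 1)
        bins.setdefault(k, []).append((f, o))

    rel = res = wbv = wbc = 0.0
    for pairs in bins.values():
        fk = sum(f for f, _ in pairs) / len(pairs)  # bin-mean forecast frequency
        ok = sum(o for _, o in pairs) / len(pairs)  # bin-mean observed frequency
        rel += len(pairs) * (fk - ok) ** 2
        res += len(pairs) * (ok - o_bar) ** 2
        wbv += sum((f - fk) ** 2 for f, _ in pairs)
        wbc += 2.0 * sum((f - fk) * (o - ok) for f, o in pairs)

    rel, res, wbv, wbc = rel / n, res / n, wbv / n, wbc / n
    return rel, res - wbv + wbc, unc


# Hypothetical neighborhood frequencies (forecast and observed).
fn = [0.10, 0.20, 0.15, 0.80, 0.90, 0.60, 0.55, 0.30]
on = [0.00, 0.30, 0.10, 1.00, 0.70, 0.50, 0.60, 0.20]

rel, gres, unc = decompose(fn, on, n_bins=4)
bd = sum((f - o) ** 2 for f, o in zip(fn, on)) / len(fn)

lhs = bd / unc
rhs = (1.0 - gres / unc) + rel / unc
print(lhs, rhs)
```

Under these definitions the identity holds for any binning, because the within-bin terms absorb whatever the bin means do not capture.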

d. Variation according to the range Lc

The range parameter Lc was equal to 80 grid points for EXP_ref; we now vary this parameter by taking smaller values: 2, 8, 16, and 32 grid points while keeping the other parameters equal to those of EXP_ref: σc = 1, σo = 0.2, μf = 0, and σf = 0.2. Decreasing Lc corresponds to taking observations correlated over smaller distances. The BDnSS¯ is plotted as a function of ln for these five experiments in Fig. 4a. The five BDnSS¯ variations are similar with a saturation of BDnSS¯ toward a value close to 1 except for Lc = 2 where the asymptotic value is lower.

Fig. 4.

(a) Variation of the BDnSS¯ as a function of the neighborhood length scale ln for experiments where the range parameter Lc is varied: Lc = 80 for EXP_ref (black line), Lc = 32 (green line), Lc = 16 (purple line), Lc = 8 (orange line), and Lc = 2 (blue line). Variations as a function of Lc for BDn¯/UNC (black solid line), 1 − GRES/UNC (red dotted line), REL/UNC (blue dashed line), and Bn (green dotted-dashed line) are plotted for (b) ln = 1 and (c) ln = 49. The vertical scale of (b) and (c) is logarithmic.

We use the decomposition of BDn¯/UNC in order to discuss the variation of the resolution and reliability terms with Lc for the extreme values ln = 1 and ln = 49 (Figs. 4b,c). REL/UNC remains negligible compared with 1 − GRES/UNC for all these experiments, as they are all unbiased for the event under consideration (Bn = 1 is also plotted in Figs. 4b,c). In fact, forecasts and observations are drawn from the same statistical distribution for each experiment. When ln = 1, there is no variation in the normalized scores with Lc. This is no longer the case for ln = 49, since taking into account forecasts and observations correlated over greater or lesser distances influences the value of 1 − GRES/UNC and therefore BDn¯/UNC and BDnSS¯. Note that it is the GRES resolution term that improves faster than UNC as Lc increases. This shows that these forecasts are more informative than climatology when Lc increases for a given verification scale ln.

e. Variation according to the bias of the forecast

Two complementary experiments EXP_B+ and EXP_B− were performed in order to document the impact of an additive prediction bias by taking distributions of the prediction complement given by N(+0.1, 0.2²) and N(−0.1, 0.2²), respectively, and a distribution of the observation complement given by N(0.0, 0.2²). All the other parameters are the same as for EXP_ref. The positive or negative additive bias corresponds nearly to ±10% of the standard deviation of the observations. We quantify the impact of these additive biases by calculating for the three experiments the frequency bias Bn for the event corresponding to the threshold 1.72: Bn = 1.22 for EXP_B+, Bn = 1.0 for EXP_ref, and Bn = 0.81 for EXP_B−.
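The frequency biases quoted above can be approximated analytically. The sketch below assumes that the common background is Gaussian with zero mean and standard deviation σc and is independent of the complements, so that forecast and observed values are Gaussian with variances σc² + σf² and σc² + σo²; the zero background mean and the function names are assumptions, not taken from the text.

```python
import math


def exceedance(threshold, mean, std):
    """P(X > threshold) for X ~ N(mean, std**2)."""
    return 0.5 * math.erfc((threshold - mean) / (std * math.sqrt(2.0)))


def frequency_bias(mu_f, sigma_f, sigma_c=1.0, sigma_o=0.2, threshold=1.72):
    """Ratio of forecast to observed event frequency under the Gaussian assumption."""
    p_fcst = exceedance(threshold, mu_f, math.hypot(sigma_c, sigma_f))
    p_obs = exceedance(threshold, 0.0, math.hypot(sigma_c, sigma_o))
    return p_fcst / p_obs


for label, mu_f in (("EXP_B+", 0.1), ("EXP_ref", 0.0), ("EXP_B-", -0.1)):
    print(label, round(frequency_bias(mu_f, 0.2), 2))
```

Under these assumptions the ratios come out close to the sample values reported for the three experiments, and calling `frequency_bias(0.0, 0.0)` gives a value near the Bn reported for the deterministic case σf = 0.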

For the biased experiments, we observe that BDnSS¯ grows as a function of ln as for EXP_ref but the maximum values reached are lower (Fig. 5a). The BDnSS¯ values of experiments EXP_B+ and EXP_B− are close. We have also plotted in Fig. 5a the evolution of FSS¯ for the three experiments as a function of the length scale of the neighborhood ln. We see that FSS¯ has the same behavior as BDnSS¯ with better scores for the unbiased experiment and also a saturation for the largest neighborhoods. The FSS¯ values are higher than those of BDnSS¯ because the normalization used by FSS¯ is easier to beat than the climatological forecast used by BDnSS¯.

Fig. 5.

(a) Variation of the BDnSS¯ (full lines) and FSS¯ (dashed-dotted lines) as a function of the neighborhood length scale ln for experiments EXP_ref (black line), EXP_B+ (orange line), and EXP_B− (green line) for Lc = 80. (b) The same parameters are plotted for the same experiments except Lc = 2. Variations of the terms of the BDnSS¯ decomposition as a function of the neighborhood length scale ln: (c) GRES/UNC and (d) REL/UNC in full lines for Lc = 80 and dotted lines for Lc = 2. Each symbol corresponds to a numerical experiment.

The BDnSS¯ decomposition reveals which term is responsible for this improvement as ln increases. The GRES/UNC for the three experiments remains very close when ln increases (Fig. 5c) because the three predictions follow by construction the observation when it varies, in spite of the systematic shift when the bias is different from 0. The GRES/UNC improves as ln increases, while REL/UNC remains almost constant (Fig. 5d). We can conclude that it is still the resolution term that is the source of the improvement in forecast performance over climatological forecasting, thanks to the correlation of forecast values within the neighborhood when ln < Lc.

We now consider the case of biased forecasts for which we take a neighborhood size ln for verification that is large compared with Lc. We choose Lc = 2 and plot BDnSS¯ as a function of ln in Fig. 5b. Compared with the case Lc = 80, the BDnSS¯ variations differ markedly only for the biased experiments: the saturation of BDnSS¯ toward a value close to 1 is replaced, for the range of ln explored, by a growth toward a maximum for low values of ln followed by a decrease for high values of ln. For Lc = 2, the maximum of BDnSS¯ is reached for ln = 7 (value obtained from experiments complementary to those plotted in Fig. 5). The decrease in BDnSS¯ for high ln values is explained by the inclusion in the large neighborhoods of uncorrelated observations, whose effect on the neighborhood-averaged observed frequency on is to reduce its variability by bringing it closer to the climatological value. The expression BDn¯(climatology) = UNC therefore tends toward 0 for very large neighborhoods for Lc = 2. This tendency to get closer to the climatology also applies to forecasts that follow observations by construction even when they are systematically offset. This is quite visible in the sharpness diagram for Lc = 2 (Fig. 6), which gives the distribution of the predicted frequencies fn averaged in the neighborhood; these cluster around the climatological frequency of the considered event.

Fig. 6.

Sharpness diagram with the same legend as Fig. 2a but for an experiment with Lc = 2, σc = 1, σo = 0.2, μf = 0, and σf = 0.2.

The experiment with μf = 0.1 is of lower quality than the equivalent experiment with μf = −0.1 when comparing BDnSS¯ for Lc = 2 because BDn is not an equitable score as has been noted by Ben Bouallègue et al. (2018) in their Fig. 4. This is a very different situation when compared with the case where Lc = 80, for which the difference between BDnSS¯ for positively and negatively biased experiments remained very small. On the contrary, FSS¯ shows little change from the Lc = 80 case with monotonic growth of FSS¯ with ln. Since both BDnSS¯ and FSS¯ use BDn¯ as the numerator, the difference in behavior comes from normalization. The bias of the prediction is taken into account in the normalization term of the FSS¯ [Eq. (6)], and this counterbalances part of the variation of the numerator (BDn¯) with the bias.

The BDnSS¯ decomposition shows that REL/UNC increases strongly with ln for the biased experiments (Fig. 5d) with Lc = 2, whereas it remains negligible for the unbiased experiment. Even if averaging over the spatial neighborhood strengthens the correspondence between fn and on, its impact on REL is weaker than its impact on UNC for the biased experiments. The evolutions of GRES/UNC as a function of ln are comparable for the three experiments for Lc = 2 and similar to the case Lc = 80. This explains the increase of BDnSS¯ from ln = 1 to ln = 7. The subsequent decrease in BDnSS¯ follows the increase in REL/UNC for larger ln.

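To go further, we use an ergodicity assumption to find the asymptotic values of the observed and predicted mean frequencies over the neighborhood for the case of very large neighborhoods (ln ≫ Lc) and then derive BDn¯, FSS¯, and BDnSS¯. We then have the following asymptotic expressions for ln → ∞:
$$
\begin{aligned}
o_n &\to \overline{o}, &\quad f_n &\to \overline{o}\,B_n, \\
\overline{\mathrm{BD}_n} &\to \overline{o}^{\,2}(B_n - 1)^2, &\quad \overline{\mathrm{FSS}} &\to \frac{2B_n}{1 + B_n^2}, \\
\overline{\mathrm{BD}_n}(\text{climatology}) &\to 0, &\quad \overline{\mathrm{BD}_n\mathrm{SS}} &\to 1 - \frac{\overline{o}^{\,2}(B_n - 1)^2}{\overline{\mathrm{BD}_n}(\text{climatology})}.
\end{aligned}
\tag{15}
$$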
We use the previous experiments with Lc = 2 to compute explicitly the scores for the different values of μf and compare them in Table 1 with their asymptotic formulations.
Table 1.

Different scores computed for the five experiments labeled by the μf value. The other parameters are the same for the five experiments: Lc = 2, σc = 1, σo = 0.2, σf = 0.2, and ln = 49. The asymptotic scores refer to Eq. (15).

Table 1 shows that the asymptotic behavior for large neighborhoods reproduces the large variations of these five experiments. The term FSS¯ depends only on the bias Bn of the forecast, and its formula [Eq. (15)] is invariant under the substitution of Bn by 1/Bn. This corresponds to a symmetrical penalty in FSS¯ for the cases μf = 0.1 and μf = −0.1 because their biases are the inverse of each other.
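The symmetry of the asymptotic FSS¯ can be verified directly from Eq. (15). The following sketch only evaluates the asymptotic formulas; the function names are hypothetical, and the bias and base-rate values passed in are taken from the experiments discussed in the text.

```python
def asymptotic_bdn(o_bar, bias):
    """Asymptotic mean BDn for very large neighborhoods: o_bar^2 (Bn - 1)^2."""
    return o_bar ** 2 * (bias - 1.0) ** 2


def asymptotic_fss(bias):
    """Asymptotic FSS for very large neighborhoods: 2 Bn / (1 + Bn^2)."""
    return 2.0 * bias / (1.0 + bias ** 2)


for bias in (0.81, 1.0, 1.22):
    # The last two columns are equal: FSS is unchanged when Bn becomes 1/Bn.
    print(bias, asymptotic_bdn(0.045, bias),
          asymptotic_fss(bias), asymptotic_fss(1.0 / bias))
```

By contrast, the asymptotic BDn¯ is not symmetric: a bias Bn and its inverse 1/Bn give different values of (Bn − 1)².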

The BDn¯(climatology) measures the residual variability of on around the value of o¯ and is still nonzero for ln = 49. We can thus see by comparing it with the asymptotic or explicitly calculated values of BDn¯ that the forecast deteriorates faster than the climatological forecast when it is biased. These conclusions are confirmed by the decomposition of BDnSS¯, which shows that it is the reliability term that degrades with ln (Fig. 5d). It should be borne in mind that for a given scale, ranking forecasts according to their BDn¯ or BDnSS¯ values is possible because BDn¯ is a proper score. The values of BDnSS¯ will tend toward −∞ when the size of the neighborhood tends toward infinity, as climatological forecasting tends toward perfect forecasting for our numerical setup. This simply indicates that the forecasts considered in this case are still subject to error. This behavior for large neighborhoods is not relevant for real meteorological cases because the ergodic assumption does not hold, as there is variability on all spatial and temporal scales.

To illustrate the difference in behavior between FSS¯ and BDnSS¯, we consider the extreme cases with μf = 10 and μf = −10. They correspond in fact to, respectively, “always forecast the event” and “never forecast the event” for all the members of the ensemble. The strongly positively biased forecast is much more penalized by BDnSS¯ than the strongly negatively biased forecast in the case of our event with a base rate 0.045 less than 0.5. The bias values of these extreme forecasts also explain the FSS¯ values [refer to Mittermaier (2021) for her discussion of FSS¯ for extreme cases]. We can see also that the ranking of the two forecasts is reversed with respect to that given by BDn¯ and BDnSS¯ due to normalization.
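The reversal of ranking between the two extreme forecasts can be reproduced on a toy sample. The sketch assumes the standard form of the fractions skill score, FSS = 1 − mean((fn − on)²) / (mean(fn²) + mean(on²)), for Eq. (6), which is not reproduced in this section; the observed frequencies are hypothetical and chosen to have a mean of 0.045.

```python
def scores(fn, on):
    """Return (mean BDn, BDnSS, FSS) for paired neighborhood frequencies."""
    n = len(on)
    bd = sum((f - o) ** 2 for f, o in zip(fn, on)) / n
    o_bar = sum(on) / n
    unc = sum((o - o_bar) ** 2 for o in on) / n  # BDn of the sample climatology
    norm = sum(f * f for f in fn) / n + sum(o * o for o in on) / n
    return bd, 1.0 - bd / unc, 1.0 - bd / norm


# Hypothetical observed neighborhood frequencies with base rate 0.045.
on = [0.0, 0.0, 0.1, 0.0, 0.2, 0.0, 0.0, 0.05, 0.0, 0.1]

always = scores([1.0] * len(on), on)  # "always forecast the event"
never = scores([0.0] * len(on), on)   # "never forecast the event"

print("always:", always)
print("never: ", never)
```

For this rare event, "never" has the smaller BDn¯ and the higher BDnSS¯, while FSS¯ ranks "always" above "never", whose FSS¯ is zero.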

To conclude, we have illustrated by these three series of experiments that BDnSS¯ allows the quantification of the quality of the ensemble forecast with a proper score when they are used at a larger scale than the original forecasts and observations. The term BDnSS¯ rewards event predictions close to the locations where they were observed as has been illustrated for cases where spatially correlated observations were to be forecast. The decomposition of this score is useful for the understanding of how it is split between the reliability and the generalized resolution terms as in the case of the classical Brier skill score without neighborhood.

5. Application to real cases

a. Description of the models and observations

For this study, we use the same four forecasting systems operational at Météo-France as in Stein and Stoop (2022):

  1. The global deterministic hydrostatic model ARPEGE uses a stretched horizontal grid ranging from 5 km over France to 24 km over New Zealand (Courtier et al. 1991). Its initial conditions are produced by 4DVAR assimilation cycles over a 6-h time window (Desroziers et al. 2003). In this study, we use outputs on a regular latitude–longitude grid oversampled to 0.025° over Europe.

  2. The ensemble version of the ARPEGE model, called Prévision d’Ensemble Action de Recherche Petite Echelle Grande Echelle (PEARP), consists of 35 members with a horizontal resolution reduced to 7.5 km over France and 32 km over New Zealand (Descamps et al. 2015). The different initial conditions are obtained by a mixture of singular vectors and perturbed members of a set of 50 assimilations. The model error is represented by a random draw for each member among a set of 10 different coherent physics. We use in this study the outputs on a regular latitude–longitude grid oversampled to 0.025° over Europe.

  3. The deterministic nonhydrostatic limited-area model AROME (Seity et al. 2011) uses a 1.3-km horizontal grid over western Europe. The lateral boundary conditions are provided by the ARPEGE model, and the initial conditions stem from a cycle of hourly 3DVAR assimilations. In this study, we use the outputs on a regular latitude–longitude grid at 0.025° over western Europe.

  4. The ensemble version of the AROME model, called Prévision d’Ensemble AROME (PEAROME), consists of an ensemble of 16 perturbed members with a horizontal resolution of 2.5 km over western Europe. Their lateral boundary conditions are provided by a selection of PEARP members. Their initial conditions come from an assimilation ensemble of 25 members at 3.5 km centered around the 3DVAR operational analysis of the deterministic AROME model (Bouttier and Raynaud 2018). The tendencies of some physical parameterizations are perturbed by a random multiplicative factor in order to represent the model error. Outputs on a regular latitude–longitude grid at 0.025° over western Europe are used in this study.

Ground rainfall observations cumulated over 6 h are provided by the Analyse par Spatialisation Horaire des Précipitations (ANTILOPE) data fusion product (Laurantin 2008) between radar data from the French radar network and rain gauges. The horizontal resolution of this analysis is 1 km, and the data are averaged on the scale of the forecast outputs to be comparable, i.e., on the latitude–longitude grid at 0.025°.

The comparison between observations and forecasts is performed with the 0000 UTC runs of ARPEGE and PEARP as well as the 0300 UTC runs of AROME and PEAROME, which use asynchronous coupling files from the 0000 UTC runs of ARPEGE and PEARP. The two events chosen for this study are cumulative rainfall over 6 h above the 0.5- and 5-mm thresholds. We will use two different neighborhoods: 0.025° corresponding to a neighborhood reduced to one point of the verification grid and 0.525° which involves 21 × 21 = 441 points. The verification period covers the 12 months from 1 January 2020 to 31 December 2020.

No restriction on the minimum number of observations present in the chosen neighborhood is imposed, but the mask of observed missing rainfall data is also applied to the forecasts in order to keep the same control points for forecasts and observations at all times. The verification domain therefore corresponds to metropolitan France, extended from the range of the radars (Fig. 7). The observed and forecast frequencies averaged in a neighborhood are therefore computed only for points which are not masked in this neighborhood. To evaluate the statistical significance of the observed differences between two forecasting systems, a block-bootstrap technique (Hamill 1999) is applied to the time series by randomly drawing blocks of 3 days from the 366 days of the verification period and the starting point among the nine possible ones so as to calculate with disjoint neighborhoods.
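A block bootstrap of this kind can be sketched as follows. This is a simplified illustration under stated assumptions: the input is a hypothetical series of 366 daily score differences between two systems, blocks are non-overlapping 3-day segments resampled with replacement, and the random choice of the neighborhood starting point described above is omitted. The function name and defaults are not from the source.

```python
import math
import random


def block_bootstrap_ci(daily_diff, block_len=3, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for the mean of a daily difference series."""
    rng = random.Random(seed)
    # Cut the series into consecutive blocks to preserve short-range correlation.
    blocks = [daily_diff[i:i + block_len]
              for i in range(0, len(daily_diff), block_len)]
    means = []
    for _ in range(n_boot):
        # Draw as many blocks as in the original series, with replacement.
        sample = [x for _b in blocks for x in rng.choice(blocks)]
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[round(alpha / 2 * n_boot)]
    upper = means[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical daily differences in mean BDn (system A minus system B).
daily_diff = [-0.003 + 0.001 * math.sin(day) for day in range(366)]

lower, upper = block_bootstrap_ci(daily_diff)
print(lower, upper)
```

If the interval excludes zero, the difference between the two systems would be declared significant at the chosen level; here every daily difference is negative, so the whole interval lies below zero.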

Fig. 7.

Occurrences of the event: (a) accumulated rainfall greater than 0.5 mm in 6 h observed by the ANTILOPE analysis at 1200 UTC 27 Jan are colored purple. Masked areas where no radar data are available are hatched. (b) The mean frequencies over the 0.525° neighborhood of this event computed from the ANTILOPE analysis are plotted with the color palette given at the bottom of the figure. The mean probabilities of the event predicted by PEARP are plotted with the same color palette (c) without neighborhood and (d) with a neighborhood of 0.525°. (e),(f) The same mean probabilities are plotted in the same way for PEAROME. Rectangular boxes are drawn in black to facilitate comparison of the different panels.

b. Comparison on a single case: 27 January 2020

In this section, we illustrate the calculation of BDn¯ on a real case. We choose to look at the rainfall of 27 January 2020, accumulated between 0600 and 1200 UTC and greater than 0.5 mm. Figure 7 shows the occurrence of this event analyzed by ANTILOPE (frequency of occurrence 40%). On this day, a cold front passed rapidly over France, bounded by zones A-1 and A-2; behind it, a rear sky approached the English Channel and England. The probabilities predicted for this event by the PEARP at 0000 UTC 27 January 2020 and the PEAROME at 0300 UTC 27 January 2020 are also shown in Fig. 7, and BDn with no neighborhood or with a neighborhood of 0.525° is presented in Fig. 8. Overall, both models forecast the occurrence of this event well. PEARP predicts higher probabilities on average (47%) than PEAROME (42%). This is related to the tendency of ARPEGE physics, also used in PEARP, to predict light rainfall too often (Stein and Stoop 2019). In detail,

  • Ahead (A-1) and behind (A-2) the cold front, the spatial dispersion of PEARP members is more marked than that of PEAROME. PEAROME predicts finer structures than PEARP. Without a neighborhood, PEAROME’s BDn is worse than PEARP’s BDn, PEAROME being penalized by the double penalty. Applying a neighborhood of 0.525° reduces the impact of the double penalty more for PEAROME than for PEARP.

  • Over the English Channel (B), PEAROME predicts higher probabilities with finer structures than PEARP, but PEARP probabilities are closer to observed frequencies than those of PEAROME. PEAROME therefore has poorer BDn than PEARP because its probabilities are too high. The application of a neighborhood preserves this ranking between PEAROME and PEARP.

  • Over Brittany (C), PEAROME models well a zone with little probability of precipitation, corresponding to a respite between the passage of the front and the arrival of the rear sky. Conversely, the majority of PEARP members forecast precipitation over this zone and therefore high probabilities. PEAROME is closer to the observed frequency than PEARP, as can be seen on BDn with and without neighborhood.

  • Over the Pisa region (D), PEARP predicts significant probabilities over a fairly widespread area. PEAROME predicts high probabilities in a more localized manner with a slight northward shift compared with reality. PEAROME is closer to reality than PEARP, as BDn shows. PEAROME’s small location errors are totally redeemed by the 0.525° neighborhood, in contrast to PEARP’s false alarms.

  • Over the northeast of the control domain, both models predict high probabilities over an extended area, but nothing is observed. Both models are in outright false alarm with a BDn of 1. The application of a neighborhood does not redeem these false alarms.

Fig. 8.

BDn for the accumulated rainfall event greater than 0.5 mm in 6 h at 1200 UTC 27 Jan for PEARP calculated (a) without neighborhood and (b) with a neighborhood of 0.525°. The same BDn fields are plotted for PEAROME (c) without neighborhood and (d) with a neighborhood of 0.525°. The observations are provided by the ANTILOPE analysis. Masked areas where no radar data are available are hatched, and the same rectangular boxes are drawn as in Fig. 7.

Over the whole domain, the BDn¯ of PEARP without a neighborhood is 0.088, while that of PEAROME is 0.092; PEAROME is more affected by the double penalty than PEARP, particularly in zones A-1 and B. After applying a neighborhood, BDn¯ for PEARP is 0.052, while BDn¯ for PEAROME is 0.049. The neighborhood corrected the double penalty more for PEAROME than for PEARP, allowing PEAROME to have a better BDn¯ than PEARP.

c. Comparison of two ensemble forecasts during 1 year

The PEAROME forecasts evaluated without neighborhood against the ANTILOPE analysis degrade with the forecast time as shown by the temporal evolution of BDn¯ (Figs. 9a,c). These temporal evolutions also show a very pronounced diurnal cycle for light (rain accumulated during 6 h > 0.5 mm) and moderate (rain accumulated during 6 h > 5 mm) rainfall. The maximum error is at 1800 UTC which corresponds to the daily maximum of convective rainfall over France. The uncertainty term UNC for light rainfall is higher than for moderate rainfall by a factor of 4 (Fig. 9) in about the same proportion as the frequencies of occurrence of these rainfalls (not shown). The decrease in UNC also drives the decrease in BDn¯ for moderate rainfall relative to light rainfall.

Fig. 9.

The BDn¯ (full lines) and uncertainty (dashed black lines) averaged over the year 2020 as a function of the lead time for the rainfall thresholds 0.5 mm in 6 h (a) using no neighborhood and (b) using a neighborhood of ln = 0.525°. The lower panels correspond to the thresholds 5 mm in 6 h (c) using no neighborhood and (d) using a neighborhood of ln = 0.525°. The PEAROME results are drawn with purple lines and the PEARP results with orange lines. D and D1 correspond to the first and second day’s forecast. The observations are provided by the ANTILOPE analysis.

Taking into account a neighborhood of 0.525° in BDn¯ leads to a decrease in UNC because of pooling, which reduces the spatial variability of observations at the verification points and also the diurnal cycle of the convection at that scale, more for moderate rain than for light rain (which is more widespread). The term BDn¯ follows this decrease (Fig. 9) and amplifies it as can be seen in the temporal evolution of BDnSS¯ (Fig. 10): the skill is therefore improved. The BDnSS¯ of PEAROME for ln = 0.525° is higher than that for ln = 0.025°. Thus, the forecasts are more skillful when considered at scales larger than those of the forecast calculation grid.

Fig. 10.

The BDnSS¯ (full lines), normalized reliability term (dashed lines), and normalized generalized resolution term (dotted lines) averaged over the year 2020 as a function of the lead time for the rainfall threshold 0.5 mm in 6 h (a) using no neighborhood and (b) using a neighborhood of ln = 0.525°. The lower panels correspond to the threshold 5 mm in 6 h (c) using no neighborhood and (d) using a neighborhood of ln = 0.525°. The PEAROME results are drawn with purple lines and the PEARP results with orange lines. The significance level of the statistical test on the difference in BDn¯ is set at 5%, and the result is represented on the BDnSS¯ curve of PEAROME by the symbols: purple triangle up if PEAROME is better than PEARP, orange triangle down if PEARP is better than PEAROME, and circle if the difference is not significant at the 5% level. The observations are provided by the ANTILOPE analysis.

The decomposition of BDnSS¯ [Eq. (12)] using 17 bins shows that PEAROME is almost reliable for the four cases since the reliability error REL/UNC is small for this ensemble forecast when compared to the generalized resolution term GRES/UNC. The impact of the neighborhood is significant with a gain of BDnSS¯ which is due to the generalized resolution term. Another advantage of the decomposition is that we can construct a reliability diagram connecting the different coordinate points (fnk¯, onk¯) as in Duc et al. (2013). We plot it at 1800 UTC in Fig. 11. We can thus visualize the small deviations of the PEAROME forecast from the diagonal except for the least frequent values where fnk¯ is close to 1. We also note that the PEAROME reliability is improved for the case ln = 0.525° compared to the case ln = 0.025° for each bin. This allows us to verify once again that PEAROME is almost reliable even if there remains a small conditional bias (Jolliffe and Stephenson 2011). We can also see that the sharpness diagram (Fig. 11) shows the strong reduction of the population of the different decomposition bins of BDn¯ when we take a neighborhood of 0.525° but keeps roughly the same shape as that without a neighborhood (no accumulation around the observed frequency for the large neighborhoods as in the previous idealized cases with small Lc). This reduction results in stronger sampling noise for ln = 0.525° than for ln = 0.025° in the reliability diagram.

Fig. 11.

(a) Reliability diagram (including a sharpness diagram) over the year 2020 at 1800 UTC on the first day of simulation for the threshold 0.5 mm in 6 h for PEAROME (purple lines) and PEARP (orange lines) using no neighborhood (full lines) and using a neighborhood of ln = 0.525° (dashed lines). The horizontal and vertical colored lines correspond to on¯ and fn¯, respectively. (b) The threshold 5 mm in 6 h. The observations are provided by the ANTILOPE analysis.

PEARP presents lower quality rainfall forecasts against the ANTILOPE analysis than PEAROME since its BDn¯ is higher than PEAROME’s (Fig. 9) and its BDnSS¯ is lower than PEAROME’s (Fig. 10) for all lead times for both thresholds and both neighborhoods. The decomposition of BDnSS¯ using 36 bins shows that PEARP’s GRES/UNC is as good as PEAROME’s GRES/UNC, and sometimes slightly better, but that PEARP is not as reliable as PEAROME. PEARP’s REL/UNC improves with lead time and is better for moderate rainfall than for light rainfall. This point is related to the parameterization of convection used in the PEARP global model, which overestimates light rainfall and underestimates heavy rainfall. This bias in the forecast can be seen in the PEARP reliability plot (Fig. 11a) which for light rainfall shows that neighborhood-averaged probabilities fnk¯ are higher than neighborhood-averaged observations onk¯ for all bins in the decomposition or averaged over all bins (fn¯ > on¯). It should also be noted that for both thresholds, the reliability error for PEARP is increased by taking into account a neighborhood of ln = 0.525° but to a lesser extent than the improvement of the generalized resolution term (Fig. 10). We thus obtain an improvement of BDnSS¯ with the increase of the neighborhood for PEARP. Figure 11 shows that the reliability of PEARP is improved by applying a neighborhood only for high probabilities, but these occur much less often than lower probabilities. On average, the reliability of PEARP is therefore made worse by applying a neighborhood as shown also in Fig. 10. A remarkable point in the evolution of PEARP scores with lead time is the better quality of the rain forecast at 0600 UTC of day D1 compared to that at 0600 UTC of day D (Fig. 10). The decomposition of BDnSS¯ attributes this improvement to the reduction of the reliability term for longer lead times (Fig. 10). It is explained by the overdispersion of the PEARP for light rainfall brought about by the perturbation of the initial conditions, which is very strong (not shown) and which decreases during the forecast.

Comparing these two ensemble forecasts against the ANTILOPE analysis, we can conclude that PEAROME provides better rainfall forecasts for these two thresholds. In the case without neighborhood averaging, the difference is larger for light rainfall than for moderate rainfall and is mainly due to the reliability error of PEARP, which is negligible for PEAROME. It is also important to note that the difference between the two ensemble forecasts increases when they are compared at the scale of a 0.525° neighborhood averaging. For light rainfall, we find that the impact of the neighborhood on the PEARP reliability term of BDnSS¯ is negative, while it remains negligible for PEAROME. For moderate rainfall, we have the same growth with the neighborhood of the PEARP reliability error as for light rainfall. We can also note that the generalized resolution terms for PEAROME and PEARP are close, while the generalized resolution term for PEARP was slightly better without a neighborhood. These variations correspond to the expected impact of the mitigation of the double penalty by the spatial tolerance brought by the use of a neighborhood which is maximal for an unbiased forecast as illustrated previously on idealized cases.

We return to 1800 UTC on the first day of simulation in order to analyze the variation of BDnSS¯ as a function of the neighborhood size ln. The distribution of BDnSS¯ is characterized by plotting its minimum, maximum, and mean values over the nine realizations obtained by varying the starting point of the disjoint neighborhood calculations (Fig. 12). We note that BDnSS¯ is less well estimated for large values of ln because fewer verification points are used. Both ensemble forecasts improve continuously with ln. Nevertheless, the conclusions of the comparison between PEAROME and PEARP are confirmed for both thresholds.
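The spread over realizations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synthetic fields, the 3 × 3 set of tiling offsets, and the helper names `pooled_means` and `bdn_skill_score` are hypothetical and are not taken from the operational verification code.

```python
import random

def pooled_means(field, size, i0, j0):
    """Mean of a 2D field over disjoint size x size blocks; the tiling starts at (i0, j0)."""
    means = []
    for bi in range(i0, len(field) - size + 1, size):
        for bj in range(j0, len(field[0]) - size + 1, size):
            block = [field[bi + i][bj + j] for i in range(size) for j in range(size)]
            means.append(sum(block) / len(block))
    return means

def bdn_skill_score(fcst, obs, size, i0, j0):
    """Skill score of the pooled Brier divergence against the sample climatology."""
    f_n = pooled_means(fcst, size, i0, j0)
    o_n = pooled_means(obs, size, i0, j0)
    n = len(o_n)
    bd = sum((f - o) ** 2 for f, o in zip(f_n, o_n)) / n
    o_bar = sum(o_n) / n
    unc = sum((o - o_bar) ** 2 for o in o_n) / n  # variance of the pooled observations
    return 1.0 - bd / unc

# Synthetic data (hypothetical): binary observations and a probability
# forecast that is higher where the event occurred.
rng = random.Random(0)
obs = [[1 if rng.random() < 0.3 else 0 for _ in range(30)] for _ in range(30)]
fcst = [[0.7 if o else 0.2 for o in row] for row in obs]

# Assumed choice: one realization per starting point of the tiling (3 x 3 = 9).
size = 3
scores = [bdn_skill_score(fcst, obs, size, i0, j0)
          for i0 in range(size) for j0 in range(size)]
print(min(scores), sum(scores) / len(scores), max(scores))
```

As the neighborhood grows, each realization contains fewer disjoint blocks, which is why the minimum-to-maximum envelope is expected to widen at large ln.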

Fig. 12.

(a) The BDnSS¯ over the year 2020 at 1800 UTC on the first day of simulation as a function of the neighborhood length scale ln for PEAROME (purple lines) and PEARP (orange lines) for the threshold 0.5 mm in 6 h. (b) As in (a), but for the threshold 5 mm in 6 h. The result of the statistical test is represented in the same way as in Fig. 10. The observations are provided by the ANTILOPE analysis.

Citation: Monthly Weather Review 152, 5; 10.1175/MWR-D-22-0235.1

d. Comparison of an ensemble forecast with a deterministic forecast during 1 year

We now use the limiting case of the deterministic AROME forecast. In the neighborhood-free case and in the deterministic limit,
$$\overline{\mathrm{BD}_n\mathrm{SS}} = 1 - \frac{1 - \mathrm{PC}}{\bar{o}\,(1 - \bar{o})}.$$
The BDnSS¯ for AROME against the ANTILOPE analysis is clearly worse than the BDnSS¯ of PEAROME against the ANTILOPE analysis (Fig. 13) especially for moderate rainfall where AROME is even less accurate than the forecast made by the sample climatology.
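The deterministic, neighborhood-free limit can be checked numerically. The sketch below uses a short hypothetical series of binary forecasts and observations; it computes the skill score directly from the mean squared difference and compares it with the closed form based on the proportion correct PC and the observed base rate.

```python
def brier_skill_vs_climatology(forecasts, observations):
    """Skill score of the mean squared forecast-observation difference
    against the sample climatology (no neighborhood)."""
    n = len(observations)
    bd = sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / n
    o_bar = sum(observations) / n
    unc = sum((o - o_bar) ** 2 for o in observations) / n  # o_bar * (1 - o_bar) for binary o
    return 1.0 - bd / unc

# Hypothetical binary observations and deterministic (0/1) forecasts.
obs = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
fcst = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

n = len(obs)
pc = sum(1 for f, o in zip(fcst, obs) if f == o) / n   # proportion correct
o_bar = sum(obs) / n                                   # observed base rate
closed_form = 1.0 - (1.0 - pc) / (o_bar * (1.0 - o_bar))
direct = brier_skill_vs_climatology(fcst, obs)
print(direct, closed_form)
```

For binary forecasts the squared error is 1 for each miss or false alarm and 0 otherwise, so the mean squared difference equals 1 − PC; a negative value simply means that more than a fraction ō(1 − ō) of the points are wrong.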
Fig. 13.

As in Fig. 10, but for PEAROME (purple lines) and AROME (blue lines).


In the case of the 0.525° neighborhood, the AROME forecast is transformed into a pooled probabilistic forecast in the neighborhood corresponding to a 441-member ensemble forecast. The BDnSS¯ decomposition of AROME is then performed on the 17 bins also used for PEAROME. We note a significant increase in BDnSS¯ as a function of the neighborhood size for AROME, which is much faster than that for PEAROME and leads to a reduction in the relative gap between the two forecasts. This shows that the correction of the double penalty by introducing a neighborhood is more active for a deterministic forecast than for an ensemble forecast. Both reliability and generalized resolution terms are improved by a factor of two by taking into account the neighborhood for AROME. Nevertheless, PEAROME remains of better quality for the two thresholds and the two neighborhoods.
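To illustrate how pooling turns a deterministic forecast into a frequency forecast, the following Python sketch averages a binary field over disjoint blocks. It assumes that the 441 "members" correspond to a 21 × 21 block of grid points (since 441 = 21²); the random field and the function name `pool_disjoint` are hypothetical.

```python
import random

def pool_disjoint(field, size):
    """Average a 2D field over disjoint size x size blocks.
    Incomplete blocks at the domain edges are dropped."""
    n_rows = len(field) // size
    n_cols = len(field[0]) // size
    pooled = []
    for bi in range(n_rows):
        row = []
        for bj in range(n_cols):
            total = sum(field[bi * size + i][bj * size + j]
                        for i in range(size) for j in range(size))
            row.append(total / (size * size))
        pooled.append(row)
    return pooled

# Hypothetical deterministic forecast of a binary event on a 42 x 42 grid.
rng = random.Random(1)
size = 21  # assumed neighborhood width in grid points (21 * 21 = 441)
det_fcst = [[1 if rng.random() < 0.25 else 0 for _ in range(2 * size)]
            for _ in range(2 * size)]

pooled = pool_disjoint(det_fcst, size)
# Each pooled value is a count of event points divided by 441.
counts = [[round(p * size * size) for p in row] for row in pooled]
print(pooled)
print(counts)
```

Each pooled value is a multiple of 1/441, so the pooled deterministic forecast takes the same set of values as the event frequency of a 441-member ensemble at a single point.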

The growth of BDnSS¯ with ln (Fig. 14) shows that PEAROME is better than AROME for all verification scales even though the difference in quality measured by BDnSS¯ decreases with ln. This indicates that deterministic forecasts are more sensitive to the double penalty than forecast ensembles. It also validates a posteriori the method of neighborhoods which makes the assumption that there is probabilistic information of high quality to be recovered from the forecasts of the points close to the observation point as was shown in Mittermaier (2014).

Fig. 14.

As in Fig. 12, but for PEAROME (purple lines) and AROME (blue lines).


e. Comparison of two deterministic forecasts during 1 year

For the 0.5-mm threshold, we observe in Fig. 15 that AROME obtains better results than ARPEGE in terms of BDnSS¯ against the ANTILOPE analysis when no neighborhood is taken into account. The decomposition using 17 bins shows that the generalized resolution terms are very close but that the reliability is better for AROME than for ARPEGE. The likely explanation for this shortcoming of ARPEGE is its positive bias for light convective rainfall, similar to that of PEARP, which is due to its precipitating convection scheme (Stein and Stoop 2019). If we compare their performance at a scale of 0.525°, we see that both BDnSS¯ improve significantly. The generalized resolution terms improve slightly for both models, but the AROME model is now always better than the ARPEGE model. Moreover, the reliability terms are also greatly improved (and thus reduced). The difference between the BDnSS¯ of AROME and ARPEGE increases when they are compared at a larger scale because the effects of the double penalty are better corrected by the spatial tolerance introduced by pooling in the case of an unbiased model like AROME.

Fig. 15.

As in Fig. 10, but for AROME (blue lines) and ARPEGE (red lines).


For the 5-mm threshold, the neighborhood-free BDnSS¯ values of AROME and ARPEGE are both negative (Fig. 15), which shows that the sample climatology is a harder forecast to beat for this rarer event than for the 0.5-mm threshold. The decomposition in this case provides resolution and reliability terms which remain close for both models. This probably corresponds to the fact that ARPEGE does not have a strong bias in the case of moderate rainfall as does PEARP and is thus closer to AROME (Stein and Stoop 2019). The comparison at a larger scale of 0.525° shows that the correction of the double penalty in the calculation of BDnSS¯ is more effective for AROME than for ARPEGE, improving both the resolution and reliability terms.

In the end, AROME predicts both light and moderate rainfall better than ARPEGE at both comparison scales.

f. Comparison of BDnSS¯ and FSS¯ during 1 year

The BDnSS¯ and FSS¯ differ only in the normalization used to form a skill score. The first normalization, $\overline{o_n^2} - \overline{o_n}^2$, corresponds to the climatology of the sample, and the second normalization, $\overline{o_n^2} + \overline{f_n^2}$, uses a forecast increasing the forecast error. These two scores as well as the values of the two reference forecasts are plotted for comparison on real cases of ensemble forecasts of light and moderate rainfall (Fig. 16). We see that on our real cases, BDnSS¯ is always smaller than FSS¯ because the reference provided by the sample climatology forecast makes an error (measured by BDn¯) smaller than the reference forecast used for FSS¯, since their difference in BDn¯ is equal to $-(\overline{o_n}^2 + \overline{f_n^2})$ and is therefore always negative. The sample climatology is always a harder-to-beat forecast than the reference forecast for FSS¯. We also notice that the reference for FSS¯ is not the same for PEAROME and PEARP because of the $\overline{f_n^2}$ term. As a result, the ranking of the models according to FSS¯ is not the same as the one provided by BDn¯ (Fig. 9), on which it is based and which is a proper score. The rankings provided by BDnSS¯ and BDn¯ are identical since the normalization by the BDn of the climatology does not depend on the forecast. The reversal of the ranking of these ensemble forecasts by FSS¯ depends on their biases Bn. One way of getting around the problem is to remove these biases, but it is not clear how a debiasing of an ensemble forecast should be done or whether it will restore consistency between the rankings based on FSS¯ and BDn¯. On the other hand, we can see that both scores FSS¯ and BDnSS¯ limit the impact of the double penalty more significantly for PEAROME than for PEARP. Nevertheless, this relative improvement of FSS¯ for PEAROME compared to PEARP is not sufficient to restore the ranking provided by BDn¯.
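The effect of the two normalizations can be reproduced on a toy example. The Python sketch below uses hypothetical pooled frequencies (not data from this study) for one set of observations and two forecasts, one that never predicts the event and one that over-forecasts it; the function name `skill_scores` is also an assumption made for illustration.

```python
def skill_scores(f_n, o_n):
    """Pooled Brier divergence and its two normalizations (sample climatology and FSS reference)."""
    n = len(o_n)
    bd = sum((f - o) ** 2 for f, o in zip(f_n, o_n)) / n
    mean_o = sum(o_n) / n
    mean_o2 = sum(o * o for o in o_n) / n
    mean_f2 = sum(f * f for f in f_n) / n
    unc = mean_o2 - mean_o ** 2   # BD of the sample climatology (forecast independent)
    fss_ref = mean_o2 + mean_f2   # FSS reference (depends on the forecast)
    return {"BDn": bd, "UNC": unc, "FSS_ref": fss_ref,
            "BDnSS": 1.0 - bd / unc, "FSS": 1.0 - bd / fss_ref}

# Hypothetical neighborhood frequencies.
o_n = [0.0, 0.2, 0.4, 0.0]
f_a = [0.0, 0.0, 0.0, 0.0]   # forecast A: never predicts the event
f_b = [0.3, 0.5, 0.7, 0.3]   # forecast B: systematic over-forecast

a = skill_scores(f_a, o_n)
b = skill_scores(f_b, o_n)
for name, s in (("A", a), ("B", b)):
    print(name, {k: round(v, 3) for k, v in s.items()})
```

In this constructed case, forecast A has the smaller Brier divergence and the higher skill against climatology, whereas the FSS normalization ranks forecast B first because its larger mean squared forecast frequency inflates its own reference.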

Fig. 16.

The BDnSS¯ (full lines), FSS¯ (dashed lines), UNC (dotted black lines), and $\overline{o_n^2} + \overline{f_n^2}$ (dotted-dashed lines) averaged over the year 2020 as a function of the lead time for the rainfall threshold 0.5 mm in 6 h (a) using no neighborhood and (b) using a neighborhood of ln = 0.525°. The lower panels correspond to the threshold 5 mm in 6 h (c) using no neighborhood and (d) using a neighborhood of ln = 0.525°. The PEAROME results are drawn with purple lines and the PEARP results with orange lines. The significance level of the statistical test on the difference in BDn¯ is set at 5%, and the result is represented on the BDnSS¯ curve of PEAROME by symbols: purple upward triangle if PEAROME is better than PEARP, orange downward triangle if PEARP is better than PEAROME, and circle if the difference is not significant at the 5% level.


The comparison between BDnSS¯ and FSS¯ is applied to the deterministic forecasts of AROME and ARPEGE (Fig. 17). We find the same conclusions as for the case of ensemble forecasts with a reference forecast that is easier to beat for FSS¯ and which depends on the forecast as already noted by Amodei et al. (2015). It can be seen that in this case, the correction of the double penalty through the use of a 0.525° neighborhood is sufficiently effective for AROME to achieve better forecasts of light and moderate rainfall whereas this was not the case without taking into account the neighborhood.

Fig. 17.

As in Fig. 16, but for AROME (blue lines) and ARPEGE (red lines).


6. Conclusions

A two-step procedure was developed: 1) pooling of forecasts and observations of a binary event taken at all points in a neighborhood and 2) use of Brier divergence to measure the deviation between the predicted and observed neighborhood distributions. The score BDn thus constructed allows the reward of event forecasts close to the location where this event was observed. This methodology reformulates the ensemble FSS of Duc et al. (2013), which generalizes the FBS proposed by Roberts (2008) for the case of forecast ensembles. The BDn admits a deterministic limit obtained for a one-member ensemble forecasting binary frequencies. We can thus measure the contribution of ensemble forecasts compared to deterministic forecasts to better understand the uncertainty of forecasts.
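The two steps can be sketched on a toy grid. The following Python example is an illustration under simplifying assumptions (a 4 × 4 domain, disjoint 2 × 2 neighborhoods, a single deterministic forecast, hypothetical function names); it shows how a forecast displaced by one grid point is penalized twice at the grid scale but not at the neighborhood scale.

```python
def pool_disjoint(field, size):
    """Step 1: average a 2D field over disjoint size x size neighborhoods."""
    return [[sum(field[bi + i][bj + j] for i in range(size) for j in range(size)) / size ** 2
             for bj in range(0, len(field[0]) - size + 1, size)]
            for bi in range(0, len(field) - size + 1, size)]

def brier_divergence(f, o):
    """Step 2: mean squared difference between forecast and observed frequencies."""
    pairs = [(fv, ov) for frow, orow in zip(f, o) for fv, ov in zip(frow, orow)]
    return sum((fv - ov) ** 2 for fv, ov in pairs) / len(pairs)

def field_with_event(i, j, n=4):
    """n x n field of zeros with a single event at (i, j)."""
    return [[1 if (r, c) == (i, j) else 0 for c in range(n)] for r in range(n)]

obs = field_with_event(0, 0)
fcst_near = field_with_event(1, 1)   # displaced, but inside the same 2 x 2 block
fcst_far = field_with_event(2, 2)    # displaced into another block

point_bd = brier_divergence(fcst_near, obs)   # double penalty: one miss and one false alarm
pooled_near = brier_divergence(pool_disjoint(fcst_near, 2), pool_disjoint(obs, 2))
pooled_far = brier_divergence(pool_disjoint(fcst_far, 2), pool_disjoint(obs, 2))
print(point_bd, pooled_near, pooled_far)
```

The displacement is forgiven only when the forecast and observed events fall within the same neighborhood, which is the spatial tolerance set by the neighborhood size.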

This score has been used in idealized cases to show its ability to detect nonlocal spatial correlations between forecasts and observations. It was then used in real cases of quantitative rainfall forecasts for four deterministic and probabilistic forecasting systems operational at Météo-France. The term BDn¯ is a proper score that allows forecasts to be ranked correctly for a given neighborhood scale in all cases.

The BS decomposition proposed by Stephenson et al. (2008) has been generalized to the BDn. It highlights a new uncertainty term which is adapted to the case where the observed frequency is different from 0 or 1. This term is equal to the Brier divergence of the sample climatology and represents the variance of the observations pooled in the neighborhood. This trivial forecast is independent of the forecast being evaluated and is used as a reference to build a skill score for the Brier divergence. Next, we find a reliability term and three terms which together form the generalized resolution term: the resolution term, the within-bin forecast variance, and the within-bin forecast-observation covariance. All the terms keep the same interpretation as in the original BS decomposition. This decomposition can be performed for an arbitrary number of bins and allows a better analysis of the strengths and weaknesses of the forecasts as a function of the verification scale. The decomposition quantifies the reliability error and its variation when the verification scale is increased.

A comparison of BDnSS¯ with the ensemble FSS¯ of Duc et al. (2013) shows that both BDn-based scores reduce the impact of the double penalty on real cases when the verification scale is increased. The term BDnSS¯ has the advantage of using a reference forecast which is independent of the forecast to compare two ensemble or deterministic forecasts. The term BDnSS¯ is proper like BDn¯ and allows forecasts to be ranked against each other for a given neighborhood scale. The terms FSS¯ and BDnSS¯ do not have the same attributes that may be desirable for a verification metric. The primary purpose of FSS¯ is “to understand the evolution of skill as a function of length scale” (e.g., Mittermaier 2021), whereas the main purpose of BDnSS¯ is to provide a proper scoring rule that permits a fair comparison of competing forecast systems while relaxing the gridpoint-to-gridpoint type of verification to a neighborhood-to-neighborhood comparison.
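The structure of the decomposition can be verified numerically. The Python sketch below is a minimal illustration, assuming equal-width probability bins and a within-bin covariance term that carries a factor of 2 (the convention of Stephenson et al. 2008); the synthetic pooled frequencies and the function name `decompose` are hypothetical. It checks that the pooled Brier divergence equals REL − GRES + UNC, with GRES = RES − WBV + WBC.

```python
import random

def decompose(f_n, o_n, n_bins):
    """Reliability, generalized resolution and uncertainty of the pooled Brier divergence."""
    n = len(o_n)
    o_bar = sum(o_n) / n
    bins = [[] for _ in range(n_bins)]
    for f, o in zip(f_n, o_n):
        k = min(int(f * n_bins), n_bins - 1)   # equal-width bins on [0, 1] (assumption)
        bins[k].append((f, o))
    rel = res = wbv = wbc = 0.0
    for pairs in bins:
        if not pairs:
            continue
        nk = len(pairs)
        f_k = sum(f for f, _ in pairs) / nk    # bin-mean forecast frequency
        o_k = sum(o for _, o in pairs) / nk    # bin-mean observed frequency
        rel += nk * (f_k - o_k) ** 2
        res += nk * (o_k - o_bar) ** 2
        wbv += sum((f - f_k) ** 2 for f, _ in pairs)
        wbc += 2.0 * sum((f - f_k) * (o - o_k) for f, o in pairs)
    rel, res, wbv, wbc = rel / n, res / n, wbv / n, wbc / n
    unc = sum((o - o_bar) ** 2 for o in o_n) / n   # variance of the pooled observations
    return {"REL": rel, "GRES": res - wbv + wbc, "UNC": unc}

# Synthetic pooled frequencies in [0, 1] (hypothetical data).
rng = random.Random(2)
o_n = [rng.random() for _ in range(200)]
f_n = [min(1.0, max(0.0, 0.1 + 0.8 * o + rng.uniform(-0.2, 0.2))) for o in o_n]

terms = decompose(f_n, o_n, n_bins=17)
bd = sum((f - o) ** 2 for f, o in zip(f_n, o_n)) / len(o_n)
skill = 1.0 - bd / terms["UNC"]
print(bd, terms["REL"] - terms["GRES"] + terms["UNC"], skill)
```

Dividing the identity by UNC gives the skill score as GRES/UNC − REL/UNC, the two normalized terms discussed in the text; the identity holds for any number of bins because the within-bin terms absorb the effect of the binning.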

Future work concerns the use of this decomposition in order to generate ROC and economic value diagrams including the notion of neighborhood. The term BDnSS¯ will also be used to define a performance indicator for the high-resolution forecasting ensemble PEAROME, taking up most of the characteristics of the indicator that has been followed for the AROME model during the last few years (Amodei and Stein 2009). These performance indicators will be followed as administrative indicators of the operational numerical weather prediction suite at Météo-France. A test framework for measuring the extent to which positioning error is taken into account by different scores has been proposed by Gilleland et al. (2009) and Ahijevych et al. (2009) for deterministic forecasts. Its generalization to probabilistic forecast scores and its application to BDnSS¯ are interesting perspectives for extending this work.

Acknowledgments.

We are grateful to Naomi Riviere for reviewing this manuscript, to Olivier Mestre for generating the correlated observations with, and to the three anonymous reviewers and our editor M. Scheuerer for their valuable comments.

Data availability statement.

The operational model outputs come from the Météo-France archive. The deterministic forecasts can be obtained through the portal http://dcpcpnp-int-p.meteo.fr/openwis-user-portal/srv/en/main.home, but the ensemble forecasts are not public and are unavailable.

REFERENCES

  • Ahijevych, D., E. Gilleland, B. G. Brown, and E. E. Ebert, 2009: Application of spatial verification methods to idealized and NWP-gridded precipitation forecasts. Wea. Forecasting, 24, 1485–1497, https://doi.org/10.1175/2009WAF2222298.1.

  • Amodei, M., and J. Stein, 2009: Deterministic and fuzzy verification methods for a hierarchy of numerical models. Meteor. Appl., 16, 191–203, https://doi.org/10.1002/met.101.

  • Amodei, M., I. Sanchez, and J. Stein, 2015: Verification of the French operational high-resolution model AROME with the regional Brier probability score. Meteor. Appl., 22, 731–745, https://doi.org/10.1002/met.1510.

  • Ben Bouallègue, Z., T. Haiden, and D. S. Richardson, 2018: The diagonal score: Definition, properties, and interpretations. Quart. J. Roy. Meteor. Soc., 144, 1463–1473, https://doi.org/10.1002/qj.3293.

  • Bouttier, F., and L. Raynaud, 2018: Clustering and selection of boundary conditions for limited-area ensemble prediction. Quart. J. Roy. Meteor. Soc., 144, 2381–2391, https://doi.org/10.1002/qj.3304.

  • Bröcker, J., and L. A. Smith, 2007: Scoring probabilistic forecasts: The importance of being proper. Wea. Forecasting, 22, 382–388, https://doi.org/10.1175/WAF966.1.

  • Candille, G., and O. Talagrand, 2008: Impact of observational error on the validation of ensemble prediction systems. Quart. J. Roy. Meteor. Soc., 134, 959–971, https://doi.org/10.1002/qj.268.

  • Courtier, P., C. Freydier, J. Geleyn, F. Rabier, and M. Rochas, 1991: The Arpege project at Meteo France. ECMWF Seminar Proc., Reading, United Kingdom, ECMWF, 193–231, https://www.ecmwf.int/en/elibrary/74049-arpege-project-meteo-france.

  • Descamps, L., C. Labadie, A. Joly, E. Bazile, P. Arbogast, and P. Cébron, 2015: PEARP, the Météo-France short-range ensemble prediction system. Quart. J. Roy. Meteor. Soc., 141, 1671–1685, https://doi.org/10.1002/qj.2469.

  • Desroziers, G., G. Hello, and J.-N. Thépaut, 2003: A 4D-Var re-analysis of FASTEX. Quart. J. Roy. Meteor. Soc., 129, 1301–1315, https://doi.org/10.1256/qj.01.182.

  • Duc, L., K. Saito, and H. Seko, 2013: Spatial-temporal fractions verification for high-resolution ensemble forecasts. Tellus, 65A, 18171, https://doi.org/10.3402/tellusa.v65i0.18171.

  • Ebert, E. E., 2008: Fuzzy verification of high-resolution gridded forecasts: A review and proposed framework. Meteor. Appl., 15, 51–64, https://doi.org/10.1002/met.25.

  • Gilleland, E., D. Ahijevych, B. G. Brown, B. Casati, and E. E. Ebert, 2009: Intercomparison of spatial forecast verification methods. Wea. Forecasting, 24, 1416–1430, https://doi.org/10.1175/2009WAF2222269.1.

  • Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167, https://doi.org/10.1175/1520-0434(1999)014<0155:HTFENP>2.0.CO;2.

  • Jolliffe, I. T., and D. B. Stephenson, Eds., 2011: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 304 pp., https://doi.org/10.1002/9781119960003.

  • Laurantin, O., 2008: Antilope: Hourly rainfall analysis merging radar and rain gauge data. Proc. Int. Symp. on Weather Radar and Hydrology, Grenoble, France, Laboratoire d’étude des Transferts en Hydrologie et Environnement (LTHE), 2–8.

  • Mittermaier, M. P., 2014: A strategy for verifying near-convection-resolving model forecasts at observing sites. Wea. Forecasting, 29, 185–204, https://doi.org/10.1175/WAF-D-12-00075.1.

  • Mittermaier, M. P., 2021: A “meta” analysis of the fractions skill score: The limiting case and implications for aggregation. Mon. Wea. Rev., 149, 3491–3504, https://doi.org/10.1175/MWR-D-18-0106.1.

  • Mittermaier, M. P., and G. Csima, 2017: Ensemble versus deterministic performance at the kilometer scale. Wea. Forecasting, 32, 1697–1709, https://doi.org/10.1175/WAF-D-16-0164.1.

  • Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600, https://doi.org/10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.

  • R Core Team, 2019: R: A language and environment for statistical computing. R Foundation for Statistical Computing, https://www.R-project.org/.

  • Ribeiro, P. J., Jr., P. J. Diggle, M. Schlather, R. Bivand, and B. Ripley, 2020: geoR: Analysis of Geostatistical Data, version 1.8-1. R package, https://CRAN.R-project.org/package=geoR.

  • Roberts, N., 2008: Assessing the spatial and temporal variation in the skill of precipitation forecasts from an NWP model. Meteor. Appl., 15, 163–169, https://doi.org/10.1002/met.57.

  • Roberts, N. M., and H. W. Lean, 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Wea. Rev., 136, 78–97, https://doi.org/10.1175/2007MWR2123.1.

  • Schwartz, C. S., and R. A. Sobash, 2017: Generating probabilistic forecasts from convection-allowing ensembles using neighborhood approaches: A review and recommendations. Mon. Wea. Rev., 145, 3397–3418, https://doi.org/10.1175/MWR-D-16-0400.1.

  • Schwartz, C. S., G. S. Romine, R. A. Sobash, K. R. Fossell, and M. L. Weisman, 2019: NCAR’s real-time convection-allowing ensemble project. Bull. Amer. Meteor. Soc., 100, 321–343, https://doi.org/10.1175/BAMS-D-17-0297.1.

  • Seity, Y., P. Brousseau, S. Malardel, G. Hello, P. Bénard, F. Bouttier, C. Lac, and V. Masson, 2011: The AROME-France convective-scale operational model. Mon. Wea. Rev., 139, 976–991, https://doi.org/10.1175/2010MWR3425.1.

  • Stein, J., and F. Stoop, 2019: Neighborhood-based contingency tables including errors compensation. Mon. Wea. Rev., 147, 329–344, https://doi.org/10.1175/MWR-D-17-0288.1.

  • Stein, J., and F. Stoop, 2022: Neighborhood-based ensemble evaluation using the CRPS. Mon. Wea. Rev., 150, 1901–1914, https://doi.org/10.1175/MWR-D-21-0224.1.

  • Stephenson, D. B., C. A. S. Coelho, and I. T. Jolliffe, 2008: Two extra components in the Brier score decomposition. Wea. Forecasting, 23, 752–757, https://doi.org/10.1175/2007WAF2006116.1.

  • Thorarinsdottir, T. L., T. Gneiting, and N. Gissibl, 2013: Using proper divergence functions to evaluate climate models. SIAM/ASA J. Uncertainty Quantif., 1, 522–534, https://doi.org/10.1137/130907550.

  • Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed. Elsevier, 676 pp.

Save
  • Ahijevych, D., E. Gilleland, B. G. Brown, and E. E. Ebert, 2009: Application of spatial verification methods to idealized and NWP-gridded precipitation forecasts. Wea. Forecasting, 24, 14851497, https://doi.org/10.1175/2009WAF2222298.1.

    • Search Google Scholar
    • Export Citation
  • Amodei, M., and J. Stein, 2009: Deterministic and fuzzy verification methods for a hierarchy of numerical models. Meteor. Appl., 16, 191203, https://doi.org/10.1002/met.101.

    • Search Google Scholar
    • Export Citation
  • Amodei, M., I. Sanchez, and J. Stein, 2015: Verification of the French operational high-resolution model AROME with the regional Brier probability score. Meteor. Appl., 22, 731745, https://doi.org/10.1002/met.1510.

    • Search Google Scholar
    • Export Citation
  • Ben Bouallègue, Z., T. Haiden, and D. S. Richardson, 2018: The diagonal score: Definition, properties, and interpretations. Quart. J. Roy. Meteor. Soc., 144, 14631473, https://doi.org/10.1002/qj.3293.

    • Search Google Scholar
    • Export Citation
  • Bouttier, F., and L. Raynaud, 2018: Clustering and selection of boundary conditions for limited-area ensemble prediction. Quart. J. Roy. Meteor. Soc., 144, 23812391, https://doi.org/10.1002/qj.3304.

    • Search Google Scholar
    • Export Citation
  • Bröcker, J., and L. A. Smith, 2007: Scoring probabilistic forecasts: The importance of being proper. Wea. Forecasting, 22, 382388, https://doi.org/10.1175/WAF966.1.

    • Search Google Scholar
    • Export Citation
  • Candille, G., and O. Talagrand, 2008: Impact of observational error on the validation of ensemble prediction systems. Quart. J. Roy. Meteor. Soc., 134, 959971, https://doi.org/10.1002/qj.268.

    • Search Google Scholar
    • Export Citation
  • Courtier, P., C. Freydier, J. Geleyn, F. Rabier, and M. Rochas, 1991: The Arpege project at Meteo France. ECMWF Seminar Proc., Reading, United Kingdom, ECMWF, 193–231, https://www.ecmwf.int/en/elibrary/74049-arpege-project-meteo-france.

  • Descamps, L., C. Labadie, A. Joly, E. Bazile, P. Arbogast, and P. Cébron, 2015: PEARP, the Météo-France short-range ensemble prediction system. Quart. J. Roy. Meteor. Soc., 141, 16711685, https://doi.org/10.1002/qj.2469.

    • Search Google Scholar
    • Export Citation
  • Desroziers, G., G. Hello, and J.-N. Thépaut, 2003: A 4D-Var re-analysis of FASTEX. Quart. J. Roy. Meteor. Soc., 129, 13011315, https://doi.org/10.1256/qj.01.182.

    • Search Google Scholar
    • Export Citation
  • Duc, L., K. Saito, and H. Seko, 2013: Spatial-temporal fractions verification for high-resolution ensemble forecasts. Tellus, 65A, 18171, https://doi.org/10.3402/tellusa.v65i0.18171.

    • Search Google Scholar
    • Export Citation
  • Ebert, E. E., 2008: Fuzzy verification of high-resolution gridded forecasts: A review and proposed framework. Meteor. Appl., 15, 5164, https://doi.org/10.1002/met.25.

    • Search Google Scholar
    • Export Citation
  • Gilleland, E., D. Ahijevych, B. G. Brown, B. Casati, and E. E. Ebert, 2009: Intercomparison of spatial forecast verification methods. Wea. Forecasting, 24, 14161430, https://doi.org/10.1175/2009WAF2222269.1.

    • Search Google Scholar
    • Export Citation
  • Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155167, https://doi.org/10.1175/1520-0434(1999)014<0155:HTFENP>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • Jolliffe, I. T., and D. B. Stephenson, Eds., 2011: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 304 pp., https://doi.org/10.1002/9781119960003.

  • Laurantin, O., 2008: Antilope: Hourly rainfall analysis merging radar and rain gauge data. Proc. Int. Symp. on Weather Radar and Hydrology, Grenoble, France, Laboratoire d’étude des Transferts en Hydrologie et Environnement (LTHE), 2–8.

  • Mittermaier, M. P., 2014: A strategy for verifying near-convection-resolving model forecasts at observing sites. Wea. Forecasting, 29, 185204, https://doi.org/10.1175/WAF-D-12-00075.1.

    • Search Google Scholar
    • Export Citation
  • Mittermaier, M. P., 2021: A “meta” analysis of the fractions skill score: The limiting case and implications for aggregation. Mon. Wea. Rev., 149, 34913504, https://doi.org/10.1175/MWR-D-18-0106.1.

    • Search Google Scholar
    • Export Citation
  • Mittermaier, M. P., and G. Csima, 2017: Ensemble versus deterministic performance at the kilometer scale. Wea. Forecasting, 32, 16971709, https://doi.org/10.1175/WAF-D-16-0164.1.

    • Search Google Scholar
    • Export Citation
  • Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595600, https://doi.org/10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.

    • Search Google Scholar
    • Export Citation
  • R Core Team, 2019: R: A language and environment for statistical computing. R Foundation for Statistical Computing, https://www.R-project.org/.

  • Ribeiro, P. J., Jr., P. J. Diggle, M. Schlather, R. Bivand, and B. Ripley, 2020: geoR: Analysis of Geostatistical Data, version 1.8-1. R package, https://CRAN.R-project.org/package=geoR.

  • Roberts, N., 2008: Assessing the spatial and temporal variation in the skill of precipitation forecasts from an NWP model. Meteor. Appl., 15, 163169, https://doi.org/10.1002/met.57.

    • Search Google Scholar
    • Export Citation
  • Roberts, N. M., and H. W. Lean, 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Wea. Rev., 136, 7897, https://doi.org/10.1175/2007MWR2123.1.

    • Search Google Scholar
    • Export Citation
  • Schwartz, C. S., and R. A. Sobash, 2017: Generating probabilistic forecasts from convection-allowing ensembles using neighborhood approaches: A review and recommendations. Mon. Wea. Rev., 145, 33973418, https://doi.org/10.1175/MWR-D-16-0400.1.

    • Search Google Scholar
    • Export Citation
  • Schwartz, C. S., G. S. Romine, R. A. Sobash, K. R. Fossell, and M. L. Weisman, 2019: NCAR’s real-time convection-allowing ensemble project. Bull. Amer. Meteor. Soc., 100, 321343, https://doi.org/10.1175/BAMS-D-17-0297.1.

    • Search Google Scholar
    • Export Citation
  • Seity, Y., P. Brousseau, S. Malardel, G. Hello, P. Bénard, F. Bouttier, C. Lac, and V. Masson, 2011: The AROME-France convective-scale operational model. Mon. Wea. Rev., 139, 976991, https://doi.org/10.1175/2010MWR3425.1.

    • Search Google Scholar
    • Export Citation
  • Stein, J., and F. Stoop, 2019: Neighborhood-based contingency tables including errors compensation. Mon. Wea. Rev., 147, 329344, https://doi.org/10.1175/MWR-D-17-0288.1.

    • Search Google Scholar
    • Export Citation
  • Stein, J., and F. Stoop, 2022: Neighborhood-based ensemble evaluation using the CRPS. Mon. Wea. Rev., 150, 19011914, https://doi.org/10.1175/MWR-D-21-0224.1.

    • Search Google Scholar
    • Export Citation
  • Stephenson, D. B., C. A. S. Coelho, and I. T. Jolliffe, 2008: Two extra components in the brier score decomposition. Wea. Forecasting, 23, 752757, https://doi.org/10.1175/2007WAF2006116.1.

    • Search Google Scholar
    • Export Citation
  • Thorarinsdottir, T. L., T. Gneiting, and N. Gissibl, 2013: Using proper divergence functions to evaluate climate models. SIAM/ASA J. Uncertainty Quantif., 1, 522534, https://doi.org/10.1137/130907550.

    • Search Google Scholar
    • Export Citation
  • Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed. Elsevier, 676 pp.

  • Fig. 1.

    Observation spatial field at one time constructed with a statistical distribution corresponding to σc = 1, σo = 0.2, and (a) Lc = 80 and (b) Lc = 2. A black circle with a radius equal to Lc is drawn at the center of the domain. Forecast spatial field of one member of the ensemble forecast built at the same time and with the statistical distribution corresponding to σc = 1, μf = 0.0, σf = 1.0, and (c) Lc = 80 and (d) Lc = 2.

  • Fig. 2.

    Experiment EXP_ref: (a) sharpness diagram formed by plotting the number nk of forecast cases falling in bin k against the central predicted frequency of bin k for different neighborhood length scales ln: 1, 3, 5, 9, 15, 21, 35, and 49 grid points. The number of bins is 36 since the ensemble forecast contains 35 members. The vertical scale of the plot is logarithmic. (b) Variations of the terms of the decomposition of BDn¯ are plotted as a function of the neighborhood length scale ln: the BDn¯ in a black solid line, RES in a green dotted line, WBV in a brown dotted line, WBC in an orange dotted line, REL in blue dashed line, and UNC in a red solid line. (c) Variations as a function of the neighborhood length scale ln of BDnSS¯ (full black line), GRES/UNC (dotted red line), and REL/UNC (dashed red lines) are plotted. Each symbol (disk or “x”) corresponds to a numerical experiment.

  • Fig. 3.

    (a) Variation of the BDnSS¯ as a function of the neighborhood length scale ln for experiments using the same parameters as EXP_ref except σf = 0 (orange line), σf = 0.1 (yellow line), σf = 0.2 (black line for EXP_ref), σf = 0.3 (green line), σf = 0.6 (red line), and σf = 1.0 (blue line). Variations as a function of σf for BDn¯/UNC (black solid line), 1 − GRES/UNC (red dotted line), REL/UNC (blue dashed line), and Bn (green dotted-dashed line) are plotted for (b) ln = 1 and for (c) ln = 49. The vertical scale of (b) and (c) is logarithmic.

  • Fig. 4.

    (a) Variation of the BDnSS¯ on as a function of the neighborhood length scale ln for experiments where the range parameter Lc is varied: Lc = 80 for EXP_ref (black line), Lc = 32 (green line), Lc = 16 (purple line), Lc = 8 (orange line), and Lc = 2 (blue line). Variations as a function of Lc for BDn¯/UNC (black solid line), 1 − GRES/UNC (red dotted line), REL/UNC (blue dashed line), and Bn (green dotted-dashed line) are plotted for (b) ln = 1 and (c) ln = 49. The vertical scale of (b) and (c) is logarithmic.

  • Fig. 5.

    (a) Variation of the BDnSS¯ (full lines) and FSS¯ (dashed-dotted lines) as a function of the neighborhood length scale ln for experiments EXP_ref (black line), EXP_B+ (orange line), and EXP_B− (green line) for Lc = 80. (b) The same parameters are plotted for the same experiments except Lc = 2. Variations of the terms of the BDnSS¯ decomposition as a function of the neighborhood length scale ln: (c) GRES/UNC and (d) REL/UNC in full lines for Lc = 80 and dotted lines for Lc = 2. Each symbol corresponds to a numerical experiment.

  • Fig. 6.

    Sharpness diagram with the same legend as Fig. 2a but for an experiment with Lc = 2, σc = 1, σo = 0.2, μf = 0, and σf = 0.2.

  • Fig. 7.

    Occurrences of the event: (a) accumulated rainfall greater than 0.5 mm in 6 h observed by the ANTILOPE analysis at 1200 UTC 27 Jan are colored purple. Masked areas where no radar data are available are hatched. (b) The mean frequencies over the 0.525° neighborhood of this event computed from the ANTILOPE analysis are plotted with the color palette given at the bottom of the figure. The mean probabilities of the event predicted by PEARP are plotted with the same color palette (c) without neighborhood and (d) with a neighborhood of 0.525°. (e),(f) The same mean probabilities are plotted in the same way for PEAROME. Rectangular boxes are drawn in black to facilitate comparison of the different panels.

  • Fig. 8.

    BDn for the accumulated rainfall event greater than 0.5 mm in 6 h at 1200 UTC 27 Jan for PEARP calculated (a) without neighborhood and (b) with a neighborhood of 0.525°. The same BDn fields are plotted for PEAROME (c) without neighborhood and (d) with a neighborhood of 0.525°. The observations are provided by the ANTILOPE analysis. Masked areas where no radar data are available are hatched, and the same rectangular boxes are drawn as in Fig. 7.

  • Fig. 9.

    The BDn¯ (full lines) and uncertainty (dashed black lines) averaged over the year 2020 as a function of the lead time for the rainfall threshold 0.5 mm in 6 h (a) using no neighborhood and (b) using a neighborhood of ln = 0.525°. The lower panels correspond to the threshold 5 mm in 6 h (c) using no neighborhood and (d) using a neighborhood of ln = 0.525°. The PEAROME results are drawn with purple lines and the PEARP results with orange lines. D and D1 correspond to the first- and second-day forecasts, respectively. The observations are provided by the ANTILOPE analysis.

  • Fig. 10.

    The BDnSS¯ (full lines), normalized reliability term (dashed lines), and normalized generalized resolution term (dotted lines) averaged over the year 2020 as a function of the lead time for the rainfall threshold 0.5 mm in 6 h (a) using no neighborhood and (b) using a neighborhood of ln = 0.525°. The lower panels correspond to the threshold 5 mm in 6 h (c) using no neighborhood and (d) using a neighborhood of ln = 0.525°. The PEAROME results are drawn with purple lines and the PEARP results with orange lines. The significance level of the statistical test on the difference in BDn¯ is set at 5%, and the test result is represented on the BDnSS¯ curve of PEAROME by symbols: purple upward triangle if PEAROME is better than PEARP, orange downward triangle if PEARP is better than PEAROME, and circle if the difference is not significant at the 5% level. The observations are provided by the ANTILOPE analysis.

  • Fig. 11.

    (a) Reliability diagram (including a sharpness diagram) over the year 2020 at 1800 UTC on the first day of simulation for the threshold 0.5 mm in 6 h for PEAROME (purple lines) and PEARP (orange lines) using no neighborhood (full lines) and using a neighborhood of ln = 0.525° (dashed lines). The horizontal and vertical colored lines correspond to on¯ and fn¯, respectively. (b) As in (a), but for the threshold 5 mm in 6 h. The observations are provided by the ANTILOPE analysis.

  • Fig. 12.

    (a) The BDnSS¯ over the year 2020 at 1800 UTC on the first day of simulation as a function of the neighborhood length scale ln for PEAROME (purple lines) and PEARP (orange lines) for the threshold 0.5 mm in 6 h. (b) As in (a), but for the threshold 5 mm in 6 h. The result of the statistical test is represented in the same way as in Fig. 10. The observations are provided by the ANTILOPE analysis.

  • Fig. 13.

    As in Fig. 10, but for PEAROME (purple lines) and AROME (blue lines).

  • Fig. 14.

    As in Fig. 12, but for PEAROME (purple lines) and AROME (blue lines).

  • Fig. 15.

    As in Fig. 10, but for AROME (blue lines) and ARPEGE (red lines).

  • Fig. 16.

    The BDnSS¯ (full lines), FSS¯ (dashed lines), UNC (dotted black lines), and on2¯+fn2¯ (dotted-dashed lines) averaged over the year 2020 as a function of the lead time for the rainfall threshold 0.5 mm in 6 h (a) using no neighborhood and (b) using a neighborhood of ln = 0.525°. The lower panels correspond to the threshold 5 mm in 6 h (c) using no neighborhood and (d) using a neighborhood of ln = 0.525°. The PEAROME results are drawn with purple lines and the PEARP results with orange lines. The significance level of the statistical test on the difference in BDn¯ is set at 5%, and the test result is represented on the BDnSS¯ curve of PEAROME by symbols: purple upward triangle if PEAROME is better than PEARP, orange downward triangle if PEARP is better than PEAROME, and circle if the difference is not significant at the 5% level.

  • Fig. 17.

    As in Fig. 16, but for AROME (blue lines) and ARPEGE (red lines).
