1. Introduction
The use of probabilistic forecasts in numerical weather and climate prediction has increased significantly in recent years. To improve these forecasting systems, a robust verification system is essential, and objective scores are an indispensable component of it. The verification reference books of Jolliffe and Stephenson (2011) and Wilks (2011) describe the usual scores for ensemble forecasts, which depend on the nature of the parameter being forecast. For continuous parameters such as temperature and rainfall amount, we can cite the continuous ranked probability score (CRPS), which compares the forecast and observed cumulative distribution functions (CDFs). For binary parameters corresponding to events such as the occurrence of fog, hail, or tornadoes, or to fixed threshold exceedances of continuous parameters, we can cite the Brier score (BS), the area under the receiver operating characteristic (ROC) curve, and the economic value (Jolliffe and Stephenson 2011).
In the verification of deterministic predictions of binary parameters, most scores are based on 2 × 2 contingency tables, which classify the observed and predicted events into four categories: correct forecasts of occurrence a, false alarms or incorrect forecasts of occurrence b, misses or incorrect forecasts of nonoccurrence c, and correct forecasts of nonoccurrence d (Jolliffe and Stephenson 2011). Among the contingency scores, we can mention the proportion of correct forecasts, the false alarm rate, the hit rate, the Heidke skill score, and the Peirce skill score (Jolliffe and Stephenson 2011). These scores are traditionally computed by counting each category over the set of points and times where a pair formed by an observation and a forecast is available. The important point is that in traditional verification, the prediction at a grid point is checked only against the observation at the same point. If a forecast event is shifted from the observed event by a few tens of kilometers, this produces both a false alarm and a miss in the contingency table; this is known as the double penalty. The conventional use of neighborhoods consists of computing synthetic information (average, maximum, minimum, etc.) from the different points of the neighborhood for the forecasts and/or the observations. If the neighborhood length scale, denoted ln in the following, is larger than the location error between the forecast and observed event, taking the neighborhood into account allows the false alarm and the miss to compensate each other in the calculation of these synthetic parameters. Roberts and Lean (2008) proposed applying the neighborhood concept by calculating averaged event frequencies in the neighborhood for forecasts and for observations, instead of point values, to remove the double penalty. The deviation between these forecast and observed neighborhood frequencies is measured by a mean-square error used to build the fractions skill score (FSS).
Duc et al. (2013) generalized the neighborhood concept to the temporal dimension by considering forecast and observed frequencies within the selected time points of a fixed time interval. They also applied the neighborhood concept to ensemble forecasts by replacing the binary occurrences of events with the ensemble forecast probabilities at each point of the neighborhood. They then averaged these forecast probabilities over the points of the neighborhood and compared them with the observed frequencies of the event in the same neighborhood, in the same way as Roberts and Lean (2008). They called this new score the ensemble FSS. It was used by Schwartz et al. (2019) to evaluate NCAR's real-time convection-allowing ensemble.
Mittermaier (2014) proposed a new verification framework named the High-Resolution Assessment (HiRA) framework, intended to better evaluate deterministic forecasts: these were transformed into probabilistic forecasts by creating an ensemble whose members were the forecasts at the different points of a neighborhood. The probabilities forecast by this ensemble therefore represent average frequencies predicted in the neighborhood, called neighborhood ensemble probabilities (NEPs) by Schwartz and Sobash (2017). In HiRA, the size of the neighborhood and the choice of the probabilistic score are tailored to the parameter being verified. This framework was then generalized by Mittermaier and Csima (2017) to allow comparison between deterministic and ensemble forecasts by replacing the predicted point probabilities with the average of the predicted probabilities at each point of the neighborhood, as in Duc et al. (2013). As in the deterministic case, the size of the neighborhood and the choice of the probabilistic score can vary in HiRA depending on the parameter and the model to be verified.
Stein and Stoop (2022) proposed applying the neighborhood concept to the CRPS. To achieve this, they considered a two-step verification procedure: first, the forecasts of the M members of the ensemble at the N points of the neighborhood were pooled to form a neighborhood CDF; then, either the CRPS was used to measure its deviation from the observation at the center of the neighborhood, or the divergence function dIQ associated with the CRPS (the integrated quadratic distance) was used to measure its deviation from the neighborhood CDF of the observations (Thorarinsdottir et al. 2013). The first case measures the benefit of taking the forecasts at the other points into account to improve the forecast at the center of the neighborhood, while the second measures the quality of the forecast at the neighborhood scale, which is coarser than the grid scale of the forecasts and the observations.
The purpose of this paper is to show how the methodology used for dIQ can also be applied to the Brier divergence, the divergence function associated with the BS (Thorarinsdottir et al. 2013), so as to reformulate the ensemble FSS of Duc et al. (2013), and then to present the generalization of the BS decomposition to the Brier divergence. The theoretical framework will be developed in section 2. Section 3 will be devoted to the decomposition of the Brier divergence. Section 4 will describe idealized cases showing that this score is sensitive not only to the bias and dispersion of the probabilistic forecast but also to the spatial correlation range of the observations. Section 5 will present the results of an intercomparison between two ensemble forecasts at high and low resolution, as well as two deterministic forecasts at higher resolution than the members of the associated ensemble forecasts; it also contains a comparison of the new score with the ensemble FSS. Section 6 will contain the main conclusions of this study.
2. Consideration of the neighborhood in the Brier divergence
The BS is a negatively oriented score: the smaller it is, the better the prediction. It is a proper score, which allows forecasts to be ranked among themselves and ensures that the best forecast is the one whose distribution most closely resembles that of the observations (Bröcker and Smith 2007).
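With notation assumed here for illustration (K verification points, M members, a neighborhood V_i of N points centered on point i, and a threshold s defining the event; not necessarily the paper's symbols), the BS and its neighborhood divergence counterpart can be sketched as

$$\mathrm{BS}=\frac{1}{K}\sum_{i=1}^{K}\left(p_i-o_i\right)^2,\qquad o_i\in\{0,1\},$$

and, pooling the forecasts and the observations over the neighborhood as described in the introduction,

$$f_n(i)=\frac{1}{NM}\sum_{j\in V_i}\sum_{m=1}^{M}\mathbf{1}\{x_{j,m}>s\},\qquad o_n(i)=\frac{1}{N}\sum_{j\in V_i}\mathbf{1}\{y_j>s\},\qquad \mathrm{BD}_n=\frac{1}{K}\sum_{i=1}^{K}\big(f_n(i)-o_n(i)\big)^2,$$

where the binary observation of the BS is replaced by the observed neighborhood frequency, in the same spirit as the ensemble FSS of Duc et al. (2013).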
To respect the independence of the different draws from the distribution of the observations, it is more rigorous to choose the verification points so that their respective neighborhoods are disjoint. This results in a reduction in the number of points used to calculate BDn.
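To make the two-step computation concrete, here is a minimal sketch in Python/NumPy (hypothetical function and variable names; not the paper's code), using disjoint square neighborhoods of side ln as just described:

```python
import numpy as np

def bd_n(ens, obs, threshold, ln):
    """Neighborhood Brier divergence on disjoint ln x ln neighborhoods (sketch).

    ens: array (M, ny, nx) of ensemble member fields
    obs: array (ny, nx) of observed values
    """
    # Point probabilities: fraction of members exceeding the threshold.
    f = (ens > threshold).mean(axis=0)
    o = (obs > threshold).astype(float)
    # Crop so the domain tiles exactly into disjoint ln x ln blocks.
    ny, nx = o.shape
    ny2, nx2 = (ny // ln) * ln, (nx // ln) * ln

    def block_mean(a):
        return a[:ny2, :nx2].reshape(ny2 // ln, ln, nx2 // ln, ln).mean(axis=(1, 3))

    f_n = block_mean(f)  # neighborhood ensemble probabilities (NEPs)
    o_n = block_mean(o)  # observed neighborhood frequencies
    # Brier divergence: mean squared deviation between pooled frequencies.
    return np.mean((f_n - o_n) ** 2)
```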
3. The decomposition
The BS decomposition was introduced by Murphy (1973) by decomposing the summation over all verification points into a sum of reliability, resolution, and uncertainty terms.
The parameter m is arbitrary in the decomposition. We will choose bins centered on the values that the forecast frequencies can take in the case of an ensemble forecast with M members, i.e., k/M for k = 0, …, M, which gives M + 1 bins.
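By analogy with Stephenson et al. (2008), and with assumed notation (pooled pairs (f_{n,i}, o_{n,i}) at K verification points; bins k containing n_k cases with bin means \bar f_k and \bar o_k; overall observed mean \bar o), the generalized decomposition plausibly takes the form

$$\mathrm{BD}_n=\underbrace{\frac{1}{K}\sum_{k}n_k\big(\bar f_k-\bar o_k\big)^2}_{\text{reliability}}
-\underbrace{\frac{1}{K}\sum_{k}n_k\big(\bar o_k-\bar o\big)^2}_{\text{resolution}}
+\underbrace{\frac{1}{K}\sum_{i=1}^{K}\big(o_{n,i}-\bar o\big)^2}_{\text{uncertainty}}
+\underbrace{\frac{1}{K}\sum_{k}\sum_{i\in k}\big(f_{n,i}-\bar f_k\big)^2}_{\text{within-bin variance}}
-\underbrace{\frac{2}{K}\sum_{k}\sum_{i\in k}\big(f_{n,i}-\bar f_k\big)\big(o_{n,i}-\bar o_k\big)}_{\text{within-bin covariance}}.$$

The uncertainty term is the Brier divergence of the sample climatology (the variance of the pooled observations), while the resolution, within-bin variance, and within-bin covariance terms together form the generalized resolution term discussed in the conclusions.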
4. Application to idealized cases
a. Numerical setup
We study the behavior of BDn on simulated observations and ensemble forecasts.
The spatial domain contains 400 × 800 points. Data are simulated using the R language (R Core Team 2019) and the geoR library (Ribeiro et al. 2020). The observations are obtained by adding, to a common background, a complement that is more difficult to predict and is drawn randomly from a centered normal distribution with standard deviation σo.
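To make the setup concrete, here is a minimal sketch (Python/NumPy with hypothetical names; the paper's fields were generated with R and geoR) that draws a common correlated background, assuming an exponential covariance of range Lc and independent point-wise complements, on a reduced grid for tractability:

```python
import numpy as np

def correlated_field(ny, nx, sigma_c, Lc, rng):
    """Centered Gaussian field with assumed covariance sigma_c**2 * exp(-d/Lc),
    sampled exactly by Cholesky factorization (feasible on small grids only)."""
    yy, xx = np.mgrid[0:ny, 0:nx]
    pts = np.column_stack([yy.ravel(), xx.ravel()]).astype(float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    cov = sigma_c**2 * np.exp(-d / Lc)
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(len(cov)))  # jitter for stability
    return (chol @ rng.standard_normal(len(cov))).reshape(ny, nx)

rng = np.random.default_rng(0)
ny, nx = 40, 40                       # reduced domain; the paper uses 400 x 800
background = correlated_field(ny, nx, sigma_c=1.0, Lc=8, rng=rng)
obs = background + rng.normal(0.0, 0.2, (ny, nx))                # sigma_o = 0.2
members = np.stack([background + rng.normal(0.0, 0.2, (ny, nx))  # mu_f = 0, sigma_f = 0.2
                    for _ in range(35)])                         # 35-member ensemble
```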
Fig. 1. Observation spatial field at one time constructed with a statistical distribution corresponding to σc = 1, σo = 0.2, and (a) Lc = 80 and (b) Lc = 2. A black circle with a radius equal to Lc is drawn at the center of the domain. Forecast spatial field of one member of the ensemble forecast built at the same time and with the statistical distribution corresponding to σc = 1, μf = 0.0, σf = 1.0, and (c) Lc = 80 and (d) Lc = 2.
The probabilistic forecasts consist of an ensemble forecast of 35 members. Each of the 35 forecasts is obtained in the same way as the observations: we add to the common background a complement drawn randomly from a normal distribution N(μf, σf²).
b. Reference experiment EXP_ref
The reference experiment EXP_ref corresponds to the choices Lc = 80, σc = 1, and σo = 0.2 for the observations and μf = 0 and σf = 0.2 for the forecasts. It corresponds to observations correlated over large distances, compared with an unbiased probabilistic forecasting system giving forecast errors on average nearly 5 times smaller than the variability of the observations. Note that the forecasts of this reference experiment are drawn from the same distribution as the observations since μf = 0 and σf = σo. The chosen event is "the parameter is greater than 1.72"; this threshold has been chosen so that the mean observed frequency of occurrence equals 4.5%.
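As a quick consistency check (assuming the background and the complement are independent, so that each observation is distributed as N(0, σc² + σo²) = N(0, 1.04)):

$$P(y>1.72)=1-\Phi\!\left(\frac{1.72}{\sqrt{1.04}}\right)\approx 1-\Phi(1.69)\approx 0.046,$$

in line with the quoted mean observed frequency of 4.5%.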
We compute BDn and the terms of its decomposition for different neighborhood length scales ln.
Fig. 2. Experiment EXP_ref: (a) sharpness diagram formed by plotting the number nk of forecast cases falling in bin k against the central predicted frequency of bin k for different neighborhood length scales ln: 1, 3, 5, 9, 15, 21, 35, and 49 grid points. The number of bins is 36 since the ensemble forecast contains 35 members. The vertical scale of the plot is logarithmic. (b) Variations of the terms of the decomposition of BDn as a function of the neighborhood length scale ln.
The sharpness diagram shows that all bins of the predicted probability of the event are used. When considering larger and larger neighborhood sizes, the number of cases in each bin decreases because it is nearly proportional to the inverse of the number of points of the neighborhood.
The term
The observed data are spatially correlated over distances of the order of Lc, and this correlation is fully reflected in the ensemble forecast by construction. Thus, the forecasts at the neighboring points are informative for the forecast at the central point. This is the basic postulate that allows the neighborhood technique to be applied: the frequencies fn and on averaged over a neighborhood of size smaller than Lc will then be closer to each other than their local values f and o at the center of the neighborhood. This leads to a reduction of BDn.
We turn to
We tested the impact on the computation of BDn of shifting all the disjoint neighborhoods by a few points in both directions.
c. Variation according to the standard deviation σf
In a series of experiments, we vary the standard deviation of the forecast from σf = 0 to σf = 1. The other parameters are the same as in EXP_ref: Lc = 80, σc = 1.0, σo = 0.2, and μf = 0. The case σf = 0 corresponds to a deterministic forecast, since the dispersion of the ensemble is zero and the forecast is always equal to the background common to observations and forecasts. The case σf = 0.2 corresponds to EXP_ref. The case σf = 1 corresponds to a larger dispersion of the forecasts than in EXP_ref, of the same order of magnitude as the variability of the common background.
We define the frequency bias Bn at the neighborhood scale as the ratio of the mean forecast neighborhood probability of the event to its mean observed neighborhood frequency.
Fig. 3. (a) Variation of the BDn as a function of the neighborhood length scale ln for the different values of σf.
The evolutions of
d. Variation according to the range Lc
The range parameter Lc was equal to 80 grid points for EXP_ref; we now vary this parameter by taking smaller values of 2, 8, 16, and 32 grid points while keeping the other parameters equal to those of EXP_ref: σc = 1, σo = 0.2, μf = 0, and σf = 0.2. Decreasing Lc corresponds to observations correlated over smaller distances. The resulting BDn curves are plotted in Fig. 4.
Fig. 4. (a) Variation of the BDn as a function of the neighborhood length scale ln for the different values of Lc.
We use the decomposition of BDn to analyze these variations.
e. Variation according to the bias of the forecast
Two complementary experiments, EXP_B+ and EXP_B−, were performed in order to document the impact of an additive bias of the prediction, by taking distributions of the prediction complement given by N(+0.1, 0.2²) and N(−0.1, 0.2²), respectively, with a distribution of the observation complement given by N(0.0, 0.2²). All the other parameters are the same as for EXP_ref. The positive or negative additive bias corresponds to roughly ±10% of the standard deviation of the observations. We quantify the impact of these additive biases by calculating, for the three experiments, the frequency bias Bn for the event corresponding to the threshold 1.72: Bn = 1.22 for EXP_B+, Bn = 1.0 for EXP_ref, and Bn = 0.81 for EXP_B−.
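Under the same Gaussian assumptions as above, and reading Bn as the ratio of the mean forecast frequency of the event to its mean observed frequency (an assumed interpretation, consistent with the definition of section 4c), the quoted values are recovered analytically:

$$B_n(\mu_f)=\frac{1-\Phi\big((1.72-\mu_f)/\sqrt{1.04}\big)}{1-\Phi\big(1.72/\sqrt{1.04}\big)},\qquad B_n(+0.1)\approx\frac{0.056}{0.046}\approx 1.22,\qquad B_n(-0.1)\approx\frac{0.037}{0.046}\approx 0.81.$$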
For the biased experiments, we observe a degradation of BDn with respect to EXP_ref.
Fig. 5. (a) Variation of the BDn as a function of the neighborhood length scale ln for EXP_B+, EXP_ref, and EXP_B−.
The
We now consider the case of biased forecasts for which the neighborhood size ln used for verification is large compared with Lc. We choose Lc = 2 and plot the corresponding sharpness diagram in Fig. 6.
Fig. 6. Sharpness diagram with the same legend as Fig. 2a, but for an experiment with Lc = 2, σc = 1, σo = 0.2, μf = 0, and σf = 0.2.
The experiment with μf = 0.1 is of lower quality than the equivalent experiment with μf = −0.1 when comparing their BDn values.
The
Table 1. Different scores computed for the five experiments labeled by the μf value. The other parameters are the same for the five experiments: Lc = 2, σc = 1, σo = 0.2, σf = 0.2, and ln = 49. The asymptotic scores refer to Eq. (15).
Table 1 shows that the asymptotic behavior for large neighborhoods reproduces the large variations of these five experiments. The term
The BDn(climatology) measures the residual variability of on around the sample climatology, as given by the uncertainty term of the decomposition.
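The corresponding skill score then takes the usual form (the name BDSSn is assumed here for illustration):

$$\mathrm{BDSS}_n = 1-\frac{\mathrm{BD}_n}{\mathrm{BD}_n(\text{climatology})},$$

which is positive whenever the forecast improves on the sample climatology used as a reference.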
To illustrate the difference in behavior between
To conclude, these three series of experiments have illustrated that BDn is sensitive to the dispersion and the bias of the probabilistic forecast, as well as to the spatial correlation range of the observations.
5. Application to real cases
a. Description of the models and observations
For this study, we use the same four forecasting systems operational at Météo-France as in Stein and Stoop (2022):
- The global deterministic hydrostatic model ARPEGE uses a stretched horizontal grid ranging from 5 km over France to 24 km over New Zealand (Courtier et al. 1991). Its initial conditions are produced by 4DVAR assimilation cycles over a 6-h time window (Desroziers et al. 2003). In this study, we use outputs on a regular latitude–longitude grid oversampled to 0.025° over Europe.
- The ensemble version of the ARPEGE model, called Prévision d'Ensemble Action de Recherche Petite Echelle Grande Echelle (PEARP), consists of 35 members with a horizontal resolution reduced to 7.5 km over France and 32 km over New Zealand (Descamps et al. 2015). The different initial conditions are obtained by a mixture of singular vectors and perturbed members of a set of 50 assimilations. The model error is represented by a random draw, for each member, among a set of 10 different coherent physics packages. In this study, we use the outputs on a regular latitude–longitude grid oversampled to 0.025° over Europe.
- The deterministic nonhydrostatic limited-area model AROME (Seity et al. 2011) uses a 1.3-km horizontal grid over western Europe. The lateral boundary conditions are provided by the ARPEGE model, and the initial conditions stem from a cycle of hourly 3DVAR assimilations. In this study, we use the outputs on a regular latitude–longitude grid at 0.025° over western Europe.
- The ensemble version of the AROME model, called Prévision d'Ensemble AROME (PEAROME), consists of an ensemble of 16 perturbed members with a horizontal resolution of 2.5 km over western Europe. Their lateral boundary conditions are provided by a selection of PEARP members. Their initial conditions come from an assimilation ensemble of 25 members at 3.5 km centered around the 3DVAR operational analysis of the deterministic AROME model (Bouttier and Raynaud 2018). The tendencies of some physical parameterizations are perturbed by a random multiplicative factor in order to represent the model error. Outputs on a regular latitude–longitude grid at 0.025° over western Europe are used in this study.
Ground rainfall observations accumulated over 6 h are provided by the Analyse par Spatialisation Horaire des Précipitations (ANTILOPE) data fusion product (Laurantin 2008), which merges radar data from the French radar network with rain gauges. The horizontal resolution of this analysis is 1 km, and the data are averaged to the scale of the forecast outputs to be comparable, i.e., onto the latitude–longitude grid at 0.025°.
The comparison between observations and forecasts is performed with the 0000 UTC runs of ARPEGE and PEARP as well as the 0300 UTC runs of AROME and PEAROME, which use asynchronous coupling files from the 0000 UTC ARPEGE and PEARP runs. The two events chosen for this study are 6-h accumulated rainfall above the 0.5- and 5-mm thresholds. We use two different neighborhoods: 0.025°, corresponding to a neighborhood reduced to one point of the verification grid, and 0.525°, which involves 21 × 21 = 441 points. The verification period spans 12 months, from 1 January 2020 to 31 December 2020.
No restriction on the minimum number of observations present in the chosen neighborhood is imposed, but the mask of missing observed rainfall data is also applied to the forecasts in order to keep the same control points for forecasts and observations at all times. The verification domain therefore corresponds to metropolitan France, extended to the range of the radars (Fig. 7). The observed and forecast frequencies averaged in a neighborhood are computed only for points that are not masked in this neighborhood. To evaluate the statistical significance of the differences observed between two forecasting systems, a block-bootstrap technique (Hamill 1999) is applied to the time series by randomly drawing blocks of 3 days from the 366 days of the verification period, and the starting point of the tiling is drawn among the nine possible ones so as to calculate BDn with disjoint neighborhoods.
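A minimal sketch of the block-bootstrap test (Python/NumPy with hypothetical names; blocks of 3 consecutive days are drawn with replacement from the daily score series, in the spirit of Hamill 1999):

```python
import numpy as np

def block_bootstrap_ci(daily_a, daily_b, block=3, nboot=1000, seed=0, alpha=0.05):
    """Bootstrap confidence interval for the mean daily score difference A - B,
    resampling blocks of consecutive days to respect temporal correlation."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(daily_a) - np.asarray(daily_b)       # e.g., 366 daily BDn values
    nblocks = len(diff) // block
    blocks = diff[:nblocks * block].reshape(nblocks, block)
    idx = rng.integers(0, nblocks, size=(nboot, nblocks))  # blocks drawn with replacement
    means = blocks[idx].mean(axis=(1, 2))                  # one mean difference per replicate
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])
```

The difference between two systems is then deemed significant when this interval excludes zero.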
Fig. 7. Occurrences of the event: (a) accumulated rainfall greater than 0.5 mm in 6 h observed by the ANTILOPE analysis at 1200 UTC 27 Jan are colored purple. Masked areas where no radar data are available are hatched. (b) The mean frequencies over the 0.525° neighborhood of this event computed from the ANTILOPE analysis are plotted with the color palette given at the bottom of the figure. The mean probabilities of the event predicted by PEARP are plotted with the same color palette (c) without neighborhood and (d) with a neighborhood of 0.525°. (e),(f) The same mean probabilities are plotted in the same way for PEAROME. Rectangular boxes are drawn in black to facilitate comparison of the different panels.
b. Comparison on a single case: 27 January 2020
In this section, we illustrate the calculation of BDn on the single case of 1200 UTC 27 January 2020 for the event of 6-h accumulated rainfall greater than 0.5 mm (Figs. 7 and 8). Several zones, marked by the rectangular boxes in these figures, deserve comment:
- Ahead (A-1) and behind (A-2) the cold front, the spatial dispersion of the PEARP members is more marked than that of PEAROME, and PEAROME predicts finer structures than PEARP. Without a neighborhood, PEAROME's BDn is worse than PEARP's, PEAROME being penalized by the double penalty. Applying a neighborhood of 0.525° reduces the impact of the double penalty more for PEAROME than for PEARP.
- Over the English Channel (B), PEAROME predicts higher probabilities with finer structures than PEARP, but the PEARP probabilities are closer to the observed frequencies than those of PEAROME. PEAROME therefore has a poorer BDn than PEARP because its probabilities are too high. The application of a neighborhood preserves this ranking between PEAROME and PEARP.
- Over Brittany (C), PEAROME correctly models a zone with little probability of precipitation, corresponding to a respite between the passage of the front and the arrival of the post-frontal showery sky. Conversely, the majority of PEARP members forecast precipitation over this zone and therefore high probabilities. PEAROME is closer to the observed frequency than PEARP, as can be seen on BDn with and without neighborhood.
- Over the Pisa region (D), PEARP predicts significant probabilities over a fairly widespread area. PEAROME predicts high probabilities in a more localized manner, with a slight northward shift compared with reality. PEAROME is closer to reality than PEARP, as BDn shows. PEAROME's small location errors are totally redeemed by the 0.525° neighborhood, in contrast to PEARP's false alarms.
- Over the northeast of the control domain, both models predict high probabilities over an extended area, but nothing is observed. Both models are in outright false alarm, with a BDn of 1. The application of a neighborhood does not redeem these false alarms.
Fig. 8. BDn for the accumulated rainfall event greater than 0.5 mm in 6 h at 1200 UTC 27 Jan for PEARP, calculated (a) without neighborhood and (b) with a neighborhood of 0.525°. The same BDn fields are plotted for PEAROME (c) without neighborhood and (d) with a neighborhood of 0.525°. The observations are provided by the ANTILOPE analysis. Masked areas where no radar data are available are hatched, and the same rectangular boxes are drawn as in Fig. 7.
Over the whole domain, the
c. Comparison of two ensemble forecasts during 1 year
The quality of the PEAROME forecasts evaluated without neighborhood against the ANTILOPE analysis degrades with the forecast lead time, as shown by the temporal evolution of BDn (Fig. 9).
Fig. 9. The temporal evolution of BDn evaluated without neighborhood against the ANTILOPE analysis.
Taking into account a neighborhood of 0.525° in the computation of BDn increases the difference between the two ensemble forecasts (Fig. 10).
Fig. 10. The temporal evolution of BDn for PEAROME (purple lines) and PEARP (orange lines).
The decomposition of BDn allows a more detailed analysis of these differences (Fig. 11).
Fig. 11. (a) Reliability diagram (including a sharpness diagram) over the year 2020 at 1800 UTC on the first day of simulation for the threshold 0.5 mm in 6 h for PEAROME (purple lines) and PEARP (orange lines), using no neighborhood (full lines) and using a neighborhood of ln = 0.525° (dashed lines). The horizontal and vertical colored lines correspond to the mean observed frequencies and mean forecast probabilities of each system.
PEARP presents lower-quality rainfall forecasts against the ANTILOPE analysis than PEAROME, since its BDn is larger.
Comparing these two ensemble forecasts against the ANTILOPE analysis, we can conclude that PEAROME provides better rainfall forecasts for these two thresholds. In the case without neighborhood averaging, the difference is larger for light rainfall than for moderate rainfall and is mainly due to the reliability error of PEARP, which is negligible for PEAROME. It is also important to note that the difference between the two ensemble forecasts increases when they are compared at the scale of a 0.525° neighborhood averaging. For light rainfall, we find that the impact of the neighborhood on the reliability term of PEARP on the
We return to 1800 UTC on the first day of simulation in order to analyze the variation of BDn as a function of the neighborhood length scale ln (Fig. 12).
Fig. 12. (a) The BDn at 1800 UTC on the first day of simulation as a function of the neighborhood length scale ln.
d. Comparison of an ensemble forecast with a deterministic forecast during 1 year
Fig. 13. As in Fig. 10, but for PEAROME (purple lines) and AROME (blue lines).
In the case of the 0.525° neighborhood, the AROME forecast is transformed into a pooled probabilistic forecast in the neighborhood, corresponding to a 441-member ensemble forecast.
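In this deterministic limit (a one-member ensemble, as noted in the conclusions), the pooled neighborhood probability reduces, with the notation assumed in section 2, to the fraction of neighborhood points where the forecast exceeds the threshold:

$$f_n(i)=\frac{1}{N}\sum_{j\in V_i}\mathbf{1}\{x_j>s\},\qquad N=441 \text{ for the } 0.525^{\circ} \text{ neighborhood}.$$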
The growth of
Fig. 14. As in Fig. 12, but for PEAROME (purple lines) and AROME (blue lines).
e. Comparison of two deterministic forecasts during 1 year
For the 0.5-mm threshold, we observe in Fig. 15 that AROME obtains better results than ARPEGE in terms of BDn.
Fig. 15. As in Fig. 10, but for AROME (blue lines) and ARPEGE (red lines).
For the 5-mm threshold, the neighborhood-free BDn also favors AROME.
In the end, AROME predicts both light and moderate rainfall better than ARPEGE at both comparison scales.
f. Comparison of BDn and the ensemble FSS during 1 year
The
Fig. 16. The ensemble FSS for PEAROME (purple lines) and PEARP (orange lines).
The comparison between
Fig. 17. As in Fig. 16, but for AROME (blue lines) and ARPEGE (red lines).
6. Conclusions
A two-step procedure was developed: 1) pooling of the forecasts and observations of a binary event taken at all points of a neighborhood and 2) use of the Brier divergence to measure the deviation between the forecast and observed neighborhood distributions. The score BDn thus constructed rewards event forecasts close to the location where the event was observed. This methodology reformulates the ensemble FSS of Duc et al. (2013), which generalizes the fractions Brier score (FBS) proposed by Roberts (2008) to ensemble forecasts. The BDn admits a deterministic limit, obtained for a one-member ensemble forecasting binary frequencies. We can thus measure the contribution of ensemble forecasts relative to deterministic forecasts, to better account for the uncertainty of forecasts.
This score has been used in idealized cases to show its ability to detect nonlocal spatial correlations between forecasts and observations. It was then used in real cases of quantitative rainfall forecasts for four deterministic and probabilistic forecasting systems operational at Météo-France. The term
The BS decomposition proposed by Stephenson et al. (2008) has been generalized to the BDn. It highlights a new uncertainty term adapted to the case where the observed frequency differs from 0 or 1: this term is equal to the Brier divergence of the sample climatology and represents the variance of the observations pooled in the neighborhood. The sample climatology is independent of the forecast and is used as a reference to build a skill score for the Brier divergence. Next, we find a reliability term and three terms which together form the generalized resolution term: the resolution term, the within-bin forecast variance, and the within-bin forecast-observation covariance. All the terms keep the same interpretation as in the original BS decomposition. This decomposition can be performed for an arbitrary number of bins and allows a better analysis of the strengths and weaknesses of the forecasts as a function of the verification scale. The decomposition quantifies the reliability error and its variation when the verification scale is increased. A comparison of BDn with the ensemble FSS of Duc et al. (2013) was also performed.
Future work concerns the use of this decomposition to generate ROC curves and economic value diagrams that include the notion of neighborhood. The term
Acknowledgments.
We are grateful to Naomi Riviere for reviewing this manuscript, to Olivier Mestre for generating the correlated observations with geoR, and to the three anonymous reviewers and our editor M. Scheuerer for their valuable comments.
Data availability statement.
The operational model outputs come from the Météo-France archive. The deterministic forecasts can be obtained through the portal http://dcpcpnp-int-p.meteo.fr/openwis-user-portal/srv/en/main.home, but the ensemble forecasts are not publicly available.
REFERENCES
Ahijevych, D., E. Gilleland, B. G. Brown, and E. E. Ebert, 2009: Application of spatial verification methods to idealized and NWP-gridded precipitation forecasts. Wea. Forecasting, 24, 1485–1497, https://doi.org/10.1175/2009WAF2222298.1.
Amodei, M., and J. Stein, 2009: Deterministic and fuzzy verification methods for a hierarchy of numerical models. Meteor. Appl., 16, 191–203, https://doi.org/10.1002/met.101.
Amodei, M., I. Sanchez, and J. Stein, 2015: Verification of the French operational high-resolution model AROME with the regional Brier probability score. Meteor. Appl., 22, 731–745, https://doi.org/10.1002/met.1510.
Ben Bouallègue, Z., T. Haiden, and D. S. Richardson, 2018: The diagonal score: Definition, properties, and interpretations. Quart. J. Roy. Meteor. Soc., 144, 1463–1473, https://doi.org/10.1002/qj.3293.
Bouttier, F., and L. Raynaud, 2018: Clustering and selection of boundary conditions for limited-area ensemble prediction. Quart. J. Roy. Meteor. Soc., 144, 2381–2391, https://doi.org/10.1002/qj.3304.
Bröcker, J., and L. A. Smith, 2007: Scoring probabilistic forecasts: The importance of being proper. Wea. Forecasting, 22, 382–388, https://doi.org/10.1175/WAF966.1.
Candille, G., and O. Talagrand, 2008: Impact of observational error on the validation of ensemble prediction systems. Quart. J. Roy. Meteor. Soc., 134, 959–971, https://doi.org/10.1002/qj.268.
Courtier, P., C. Freydier, J. Geleyn, F. Rabier, and M. Rochas, 1991: The Arpege project at Meteo France. ECMWF Seminar Proc., Reading, United Kingdom, ECMWF, 193–231, https://www.ecmwf.int/en/elibrary/74049-arpege-project-meteo-france.
Descamps, L., C. Labadie, A. Joly, E. Bazile, P. Arbogast, and P. Cébron, 2015: PEARP, the Météo-France short-range ensemble prediction system. Quart. J. Roy. Meteor. Soc., 141, 1671–1685, https://doi.org/10.1002/qj.2469.
Desroziers, G., G. Hello, and J.-N. Thépaut, 2003: A 4D-Var re-analysis of FASTEX. Quart. J. Roy. Meteor. Soc., 129, 1301–1315, https://doi.org/10.1256/qj.01.182.
Duc, L., K. Saito, and H. Seko, 2013: Spatial-temporal fractions verification for high-resolution ensemble forecasts. Tellus, 65A, 18171, https://doi.org/10.3402/tellusa.v65i0.18171.
Ebert, E. E., 2008: Fuzzy verification of high-resolution gridded forecasts: A review and proposed framework. Meteor. Appl., 15, 51–64, https://doi.org/10.1002/met.25.
Gilleland, E., D. Ahijevych, B. G. Brown, B. Casati, and E. E. Ebert, 2009: Intercomparison of spatial forecast verification methods. Wea. Forecasting, 24, 1416–1430, https://doi.org/10.1175/2009WAF2222269.1.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167, https://doi.org/10.1175/1520-0434(1999)014<0155:HTFENP>2.0.CO;2.
Jolliffe, I. T., and D. B. Stephenson, Eds., 2011: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 304 pp., https://doi.org/10.1002/9781119960003.
Laurantin, O., 2008: Antilope: Hourly rainfall analysis merging radar and rain gauge data. Proc. Int. Symp. on Weather Radar and Hydrology, Grenoble, France, Laboratoire d’étude des Transferts en Hydrologie et Environnement (LTHE), 2–8.
Mittermaier, M. P., 2014: A strategy for verifying near-convection-resolving model forecasts at observing sites. Wea. Forecasting, 29, 185–204, https://doi.org/10.1175/WAF-D-12-00075.1.
Mittermaier, M. P., 2021: A “meta” analysis of the fractions skill score: The limiting case and implications for aggregation. Mon. Wea. Rev., 149, 3491–3504, https://doi.org/10.1175/MWR-D-18-0106.1.
Mittermaier, M. P., and G. Csima, 2017: Ensemble versus deterministic performance at the kilometer scale. Wea. Forecasting, 32, 1697–1709, https://doi.org/10.1175/WAF-D-16-0164.1.
Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600, https://doi.org/10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
R Core Team, 2019: R: A language and environment for statistical computing. R Foundation for Statistical Computing, https://www.R-project.org/.
Ribeiro, P. J., Jr., P. J. Diggle, M. Schlather, R. Bivand, and B. Ripley, 2020: geoR: Analysis of Geostatistical Data, version 1.8-1. R package, https://CRAN.R-project.org/package=geoR.
Roberts, N., 2008: Assessing the spatial and temporal variation in the skill of precipitation forecasts from an NWP model. Meteor. Appl., 15, 163–169, https://doi.org/10.1002/met.57.
Roberts, N. M., and H. W. Lean, 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Wea. Rev., 136, 78–97, https://doi.org/10.1175/2007MWR2123.1.
Schwartz, C. S., and R. A. Sobash, 2017: Generating probabilistic forecasts from convection-allowing ensembles using neighborhood approaches: A review and recommendations. Mon. Wea. Rev., 145, 3397–3418, https://doi.org/10.1175/MWR-D-16-0400.1.
Schwartz, C. S., G. S. Romine, R. A. Sobash, K. R. Fossell, and M. L. Weisman, 2019: NCAR’s real-time convection-allowing ensemble project. Bull. Amer. Meteor. Soc., 100, 321–343, https://doi.org/10.1175/BAMS-D-17-0297.1.
Seity, Y., P. Brousseau, S. Malardel, G. Hello, P. Bénard, F. Bouttier, C. Lac, and V. Masson, 2011: The AROME-France convective-scale operational model. Mon. Wea. Rev., 139, 976–991, https://doi.org/10.1175/2010MWR3425.1.
Stein, J., and F. Stoop, 2019: Neighborhood-based contingency tables including errors compensation. Mon. Wea. Rev., 147, 329–344, https://doi.org/10.1175/MWR-D-17-0288.1.
Stein, J., and F. Stoop, 2022: Neighborhood-based ensemble evaluation using the CRPS. Mon. Wea. Rev., 150, 1901–1914, https://doi.org/10.1175/MWR-D-21-0224.1.
Stephenson, D. B., C. A. S. Coelho, and I. T. Jolliffe, 2008: Two extra components in the Brier score decomposition. Wea. Forecasting, 23, 752–757, https://doi.org/10.1175/2007WAF2006116.1.
Thorarinsdottir, T. L., T. Gneiting, and N. Gissibl, 2013: Using proper divergence functions to evaluate climate models. SIAM/ASA J. Uncertainty Quantif., 1, 522–534, https://doi.org/10.1137/130907550.
Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed. Elsevier, 676 pp.