1. Introduction
The verification of meteorological forecasts is an unavoidable subject for anyone who wants to improve a forecast system. Numerous methods exist (Jolliffe and Stephenson 2011), each with its own application area among a wide range of forecast types: deterministic or probabilistic, forecasts of binary or multicategory events, and forecasts of continuous variables. In most numerical weather prediction centers, deterministic forecasts are evaluated with scores such as the root-mean-square error, the mean absolute error (AE hereafter), and the anomaly correlation. In addition to these overall quality measures, evaluations of binary events, such as the presence of fog, rain, snow, or wind stronger than 100 km h^{−1}, can be included by using specific scores deduced from contingency tables, which classify the forecasts into four categories: hits, false alarms, misses, and correct rejections. A desirable property for this type of score is equitability, which guarantees that constant forecasts and random forecasts all receive the same score (Jolliffe and Stephenson 2011). We invite the reader to consult the reference books of Jolliffe and Stephenson (2011) and Wilks (2011) for a more thorough presentation of these scores.
A new stage was reached with the introduction of neighborhoods in the verification of binary events by Roberts and Lean (2008). When an event is correctly forecast but slightly displaced, a strict point-by-point control counts a miss at the observed location and a false alarm at the forecast location. This counting of a double error is called the double penalty. The double penalty particularly affects forecasts from high-resolution models, which often explicitly represent difficult-to-predict small-scale phenomena with small location errors. Larger-scale models tend to spread this type of phenomenon across large zones and underestimate its intensity. Introducing a neighborhood amounts to creating a tolerance on the spatial error of the forecast, thus reducing the impact of this double penalty on the scores. Ebert (2008) classifies the different neighborhood-based verification methods according to whether the observations are considered only at the center of the neighborhood (single observation, or "so" hereafter) or across the whole neighborhood (neighborhood observation, or "no" hereafter). Roberts and Lean (2008) use the neighborhood frequency of events, defined as the frequency of the event in a neighborhood. They compare the forecast neighborhood frequency to the observed neighborhood frequency of events in the same neighborhood and use a mean squared error to quantify the difference between the fields of neighborhood frequencies. Several normalizations have been introduced to transform this absolute score into a relative score: Roberts and Lean (2008) use a linearly decorrelated forecast of the observation, leading to the fractions skill score (FSS), while Amodei and Stein (2009) normalize the same score by a persistence forecast, providing a neighborhood Brier skill score.
These indicators are used operationally at the Met Office (Mittermaier et al. 2013) and at Météo-France (Amodei et al. 2015). Different attempts to include the notion of neighborhood directly in the contingency tables have been reviewed by Schwartz (2017), with more or less symmetrical treatments of forecasts and observations at the local and neighborhood scales. Stein and Stoop (2019) propose an original method of error compensation between false alarms and misses inside the neighborhood, which allows the building of a contingency table that takes the neighborhood into account and from which all the associated contingency scores can be calculated.
Probabilistic forecasts have existed for decades, but the growing use of ensemble forecasts (based on perturbed initial conditions and perturbed versions of the numerical model) means that probabilistic forecasts are now in more common and routine use than in earlier decades. Traditional measures of their quality include, for probabilistic forecasts of continuous outcome variables, the continuous ranked probability score (CRPS hereafter) (Matheson and Winkler 1976; Hersbach 2000), and, for binary events, the Brier score, the ROC curve (Mason 1982; Stanski et al. 1989), and the economic value (Richardson 2000). The part devoted to the verification of ensemble forecasts expands in each new edition of the reference works on forecast verification (Jolliffe and Stephenson 2011; Wilks 2011), which discuss properties of these scores such as propriety (the fact of being proper), ensuring that a score cannot be improved by any forecast strategy other than improving the forecast itself (Gneiting and Raftery 2007). Promising new scores have recently been proposed by Ben Bouallègue et al. (2018), such as the forecast skill card and the diagonal score, which combines the equitability and propriety properties. Fricker et al. (2013) and Ferro (2014) introduced the notion of fair scores so as to better reward ensemble forecasts whose members are drawn from the same probability distribution as the observations. Ferro (2014) proposes fair versions of the Brier score and of the CRPS. Zamo and Naveau (2018) verified for an idealized case that only the fair estimator of the CRPS was unbiased for ensemble forecasts of small size.
The aim of this article is to present how to integrate the notion of neighborhood into the verification of ensemble forecasts by pooling the forecasts or the observations over all the points of the neighborhood. The comparison of the forecast or observed pooled cumulative distribution functions (CDFs) is performed in section 2 using either the CRPS or the score divergence associated with the CRPS (Thorarinsdottir et al. 2013). These scores are applied to idealized cases in section 3. Section 4 presents a verification of quantitative precipitation and 2-m temperature forecasts by two ensemble forecasts at high and low resolution, as well as by two deterministic forecasts of higher resolution than the members of the associated ensemble forecasts. Section 5 contains a discussion of the results and section 6 presents the conclusions.
2. Neighborhood pooling strategy to evaluate ensemble forecasts
a. Verification against single observations
It should be clear that the CRPSuso evaluates the possible improvement brought by using a neighborhood as a postprocessing of the original ensemble forecast, through the computation of the original proper score CRPS given by Eq. (2) or (6). The CRPSuso is then averaged in the same way as the CRPS via Eq. (3). We can see that the energy formula can easily be applied to calculate the CRPS in the case described by Schwartz et al. (2010) or Ben Bouallègue and Theis (2014). This score quantifies the improvement brought by the neighborhood pooling before calculating local probabilities, as Mittermaier and Csima (2017) do in the HiRa framework.
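As a concrete illustration, the energy form of the CRPS and its single-observation neighborhood variant can be sketched in a few lines of NumPy. The names (`crps_energy`, `crps_uso`), the `(M, ny, nx)` field layout, and the square neighborhood of half-width `half` are our own illustrative choices, not the authors' code:

```python
import numpy as np

def crps_energy(members, obs):
    """Plain (unfair) energy-form CRPS of an ensemble vs a scalar observation:
    CRPS = mean_i |x_i - y| - 0.5 * mean_{i,j} |x_i - x_j|."""
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

def crps_uso(fcst, obs, i, j, half):
    """CRPSuso sketch: pool the M members over the (2*half+1)**2 points of the
    neighborhood centered on (i, j), then score the pooled sample against the
    single observation at the center. fcst has shape (M, ny, nx)."""
    pooled = fcst[:, i - half:i + half + 1, j - half:j + half + 1].ravel()
    return crps_energy(pooled, obs[i, j])
```

For a neighborhood reduced to one point (`half = 0`), the CRPSuso falls back to the ordinary CRPS of the raw ensemble, which is the baseline against which the pooling postprocessing is judged.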
b. Verification against neighborhood observations
To use neighborhood observations, we have to switch from a comparison between a forecast CDF and point observations represented by a Heaviside function to the comparison of two CDFs. This is the purpose of Eq. (5): its lhs corresponds to the definition of a divergence function d(F, G) of the Cramér–von Mises type and its rhs is the more general energy distance, a metric that measures the distance between the distributions of random vectors (Rizzo and Székely 2016). They are equal for real-valued random variables, which is our case. Both distances have the right properties to rank different forecasting systems against the empirical distribution of the observations (Rizzo and Székely 2016). Thorarinsdottir et al. (2013) call the divergence function associated with the CRPS the integrated quadratic distance d_{IQ} and show that it is equivalent to a proper score.
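Under the same conventions (an illustrative NumPy sketch, not the authors' routine), the divergence between two empirical CDFs can be computed directly from its energy form, d(F, G) = E|X − Y| − ½E|X − X′| − ½E|Y − Y′|:

```python
import numpy as np

def energy_divergence(xs, ys):
    """Integrated quadratic distance between the empirical CDFs of two samples,
    via the energy form: E|X-Y| - 0.5*E|X-X'| - 0.5*E|Y-Y'| (plain estimator)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    cross = np.mean(np.abs(xs[:, None] - ys[None, :]))
    disp_x = 0.5 * np.mean(np.abs(xs[:, None] - xs[None, :]))
    disp_y = 0.5 * np.mean(np.abs(ys[:, None] - ys[None, :]))
    return cross - disp_x - disp_y
```

When `ys` holds a single observation, the last term vanishes and the expression reduces to the usual energy-form CRPS, which is how the verification against neighborhood observations generalizes the single-observation case.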
The energy formulation of the CRPSuno requires considerably more computing resources when the neighborhood is large, but its algorithm can be parallelized or optimized to exploit the redundancy of calculations from one point to the next (a Fortran 90 routine parallelized through standard OpenMP directives is available from the authors).
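A direct, unoptimized grid implementation makes the cost explicit: every point evaluates pairwise absolute differences over N_n M pooled forecast values and N_n pooled observations, and neighboring points redo almost the same work, which is the redundancy an optimized implementation can exploit. This sketch assumes a square neighborhood and uses our own illustrative names:

```python
import numpy as np

def crps_uno_field(fcst, obs, half):
    """Naive CRPSuno over a 2-D grid. At each interior point, pool the
    M*(2*half+1)**2 forecast values and the (2*half+1)**2 observed values of
    the neighborhood, then evaluate the energy-form divergence between the
    two pooled samples. fcst: (M, ny, nx); obs: (ny, nx). The pairwise terms
    make this O((N_n*M)**2) per point, hence the interest of parallel or
    redundancy-aware versions."""
    M, ny, nx = fcst.shape
    out = np.full((ny, nx), np.nan)
    for i in range(half, ny - half):
        for j in range(half, nx - half):
            xs = fcst[:, i - half:i + half + 1, j - half:j + half + 1].ravel()
            ys = obs[i - half:i + half + 1, j - half:j + half + 1].ravel()
            cross = np.mean(np.abs(xs[:, None] - ys[None, :]))
            dx = 0.5 * np.mean(np.abs(xs[:, None] - xs[None, :]))
            dy = 0.5 * np.mean(np.abs(ys[:, None] - ys[None, :]))
            out[i, j] = cross - dx - dy
    return out
```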
By construction, the CRPSuno quantifies the ability of the ensemble to forecast the neighborhood distribution of the observed field y at the scale of the neighborhood. We should therefore naturally expect a reduction of the double penalty for the forecasts, particularly high-resolution ones.
The CRPSuno is based on the comparison of the observed and forecast CDFs, which take into account the whole range of values of the outcome variable, whereas the SCRPS (Zhao et al. 2021) refers to a binary event. The CRPSuno and the SCRPS are therefore very different in nature. The link between the SCRPS [Eq. (1)] and the CRPS is provided by the energy formulation of the CRPS [Eq. (6)], but applied to the neighborhood frequencies I_{o} and I_{n} instead of the CDFs. Moreover, the treatment of the observation and forecast neighborhood frequencies is not symmetric for the SCRPS, as it is for the CRPSuno. This symmetry was possible for the CRPSuno by taking into account the probabilistic aspect of the observations, which leads to the addition of a supplementary term related to the observation dispersion in the neighborhood [Eq. (5)].
Relative to the CRPSuso_{det}, we obtain a third term corresponding to the dispersion of observed values in the neighborhood. This score is not classically used to evaluate deterministic forecasts, but it introduces spatial tolerance in a very natural way into a score based on the AE. The neighborhood frequencies of observed and forecast events compared by Roberts and Lean (2008) for binary outcomes are replaced by the CDFs of forecasts and observations pooled in the neighborhood for continuous outcomes.
c. Fair formulations
The correction of the denominators to remove the bias of the estimators of the dispersion terms is all the more important as the number of members is small. This is clearly the case for the dispersion of the observations in small neighborhoods.
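The fair correction itself is a one-line change to the energy form: the dispersion term is divided by M(M − 1) rather than M² (Ferro 2014). A minimal sketch, with our own illustrative names:

```python
import numpy as np

def crps_unfair(members, obs):
    """Plain energy-form CRPS: dispersion term averaged over all M**2 pairs."""
    x = np.asarray(members, dtype=float)
    return np.mean(np.abs(x - obs)) - 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))

def crps_fair(members, obs):
    """Fair CRPS: same two terms, but the sum of pairwise member differences
    is divided by 2*M*(M-1), which removes the finite-ensemble bias."""
    x = np.asarray(members, dtype=float)
    M = x.size
    pair_sum = np.abs(x[:, None] - x[None, :]).sum()
    return np.mean(np.abs(x - obs)) - pair_sum / (2 * M * (M - 1))
```

The smaller the sample, the larger the gap between the two estimators, which is why the correction matters most when few values are pooled, as for the observation-dispersion term of small neighborhoods.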
3. Applications to idealized cases
a. Cases with perfect but misplaced forecasts
We consider the following idealized case: the observed field is zero everywhere except at the central point, where its value is drawn randomly from the standard normal distribution on each day of the 10 000-day verification period (Fig. 1a). This field is forecast by an ensemble of 16 members, which also forecast a zero field across the whole 22 × 22 point forecasting area except at one point, where the 16 values are also drawn from the same standard normal distribution. The point where the forecast is nonzero coincides with the observed point for EXP0 and is shifted by N points for EXPN, with N ranging from 1 to 4 (Fig. 1a).
We evaluate the four neighborhood scores for these five experiments for a neighborhood reduced to one point over the verification area corresponding to the 20 × 20 central points. We then consider a neighborhood of N_{n} = 3 × 3 points, for which we keep the same verification domain so that pooled forecasts and observations can be computed without additional assumptions at the edge points of the verification area.
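One day of the EXPN setup can be generated as follows (a sketch of the configuration described above; the function name, the shift direction, and the random seed are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def expn_day(shift, npts=22, M=16):
    """One verification day of EXPN: fields of zeros except one active point,
    at the domain center for the observation and displaced by `shift` grid
    points (here along the first axis) for all M members."""
    c = npts // 2
    obs = np.zeros((npts, npts))
    obs[c, c] = rng.standard_normal()
    fcst = np.zeros((M, npts, npts))
    fcst[:, c + shift, c] = rng.standard_normal(M)
    return fcst, obs
```

Iterating this over 10 000 days and averaging a neighborhood score over the central 20 × 20 points reproduces the kind of comparison discussed below.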
If we verify these forecasts against the local observations in the case of 3 × 3 neighborhoods, the average CRPSuso and the average CRPSfso remain very close, as the correction applies only to the forecast-dispersion term, which is weak because it involves N_{n}M members, a large ensemble. EXP0 and EXP1 have the same average CRPSfso and average CRPSuso for a neighborhood of 3 × 3 points because the offset of 1 point changes nothing in the neighborhood forecasts on the 9 points surrounding the center, where the errors are nonzero. This value is greater than CRPSu_{1 × 1}(EXP0) because the neighborhood forecast is not as good as the local forecast when compared to the local observation for EXP0. The neighborhood forecast is, on the contrary, better for EXP1. For the experiments EXP2 to EXP4, we obtain the same error, a little higher than for EXP1, since no compensation occurs within a neighborhood of 3 × 3 points.
We also observe that most of the score reduction in this example comes from using neighborhood observations rather than from moving from raw forecasts to neighborhood-pooled forecasts, as shown for instance by the differences for EXP0 to EXP4 between the average CRPSuno and the average CRPSuso for 3 × 3 neighborhoods and the average CRPSu_{1×1}.
To conclude, these experiments show that the CRPSuso and the CRPSfso quantify the error reduction of pooled ensemble forecasts relative to the case without neighborhood, thanks to the spatial tolerance introduced by mixing the forecasts of the different neighboring points. Moreover, a supplementary reduction is quantified by the CRPSuno and CRPSfno when the central observations are also replaced by the neighborhood-pooled observations.
b. Case with a uniform perfect forecast
In the idealized UNIF experiment, a uniform numerical setup is obtained by considering a simulation domain of 110 × 110 points, where an ensemble forecast of 16 members is compared to a reference: the observation and the 16 forecast values are drawn randomly and independently at each point from the same standard normal distribution. The verification domain corresponds to the 100 × 100 points at the center of the simulation area, so as to eliminate the effects of the lateral boundary conditions on the score variation when the neighborhood size changes. The process is iterated 1000 times. When the neighborhood is reduced to one point, the CRPSno (for either CRPSuno or CRPSfno) and the CRPSso (for either CRPSuso or CRPSfso) are equal by construction. As in the previous experiments, the fair values of the average CRPS are smaller than the unfair values (Fig. 1c). This difference decreases as the neighborhood grows.
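The UNIF setup is easy to reproduce approximately and gives a useful sanity check: for a perfectly calibrated N(0, 1) ensemble, the expected fair CRPS without neighborhood is E|X − Y| − ½E|X − X′| = 1/√π ≈ 0.564. The sketch below (our own code, a single realization rather than the paper's 1000 iterations) recovers this value:

```python
import numpy as np

rng = np.random.default_rng(0)

def fair_crps_field(fcst, obs):
    """Pointwise fair CRPS for an (M, ny, nx) ensemble vs an (ny, nx) field."""
    M = fcst.shape[0]
    t1 = np.mean(np.abs(fcst - obs[None]), axis=0)
    pair_sum = np.abs(fcst[:, None] - fcst[None, :]).sum(axis=(0, 1))
    return t1 - pair_sum / (2 * M * (M - 1))

# One UNIF-like realization: observation and 16 members drawn i.i.d.
# from N(0, 1) at each of the 110 x 110 points.
obs = rng.standard_normal((110, 110))
fcst = rng.standard_normal((16, 110, 110))
mean_fair_crps = fair_crps_field(fcst, obs).mean()
# mean_fair_crps should be close to 1/sqrt(pi) ~ 0.564
```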
For the verification against neighborhood observations, we note a sharp drop of the CRPSuno and the CRPSfno in Fig. 1c, because the CDF of the neighborhood-pooled forecasts (from N_{n}M values) compares increasingly well with the CDF of the neighborhood-pooled observations (from N_{n} values) as the number of neighborhood points increases. The unbiased estimator of the neighborhood-based CRPS can be slightly negative due to sampling errors, because only the unfair versions of the CRPS are numerically equal to the integral form of the CRPS [Eq. (2)], which is always positive. The bias removal in the energy formulation of the CRPS can lead to very small negative values when the limit value for an ensemble of infinite size is zero (Zamo and Naveau 2018).
For the verification against observations at the center of the neighborhood, we find far smaller variations with the size of the neighborhood. In fact, the mean value of the CRPSuso for the neighborhood N_{n} = n × n can be recovered by calculating the average CRPSu without neighborhood for an experiment with ensemble forecasts of 16N_{n} members. This result is consistent with the definition of the CRPSuso [Eq. (9)]. It leads to a very small variation of the average CRPSuso with increasing neighborhood size, equivalent to an increase of the ensemble size as in Fig. 1 of Zamo and Naveau (2018). Finally, we note again a smaller variation for the average CRPSfso than for the average CRPSuso as a function of N_{n}, as the CRPSfso is an unbiased estimator of the unknown value of the CRPS for this experiment and thus less sensitive to the effective size of the ensemble.
In conclusion, we have here quantified the impact of enlarging the effective size of the ensemble forecasts by pooling the forecasts at neighboring points, either to improve (or not) the forecast CDF at the center of the neighborhood or to better fit the neighborhood-pooled observed CDF.
4. Applications to real cases
a. Description of models and observations
For this study, we use four operational forecast systems at Météo-France:

The global deterministic hydrostatic model ARPEGE, using a stretched calculation grid that goes from 5 km across France to 24 km across New Zealand (Courtier et al. 1991). Its initial conditions are provided by 4DVAR assimilation cycles over 6-h time windows (Desroziers et al. 2003). In this study, outputs are oversampled on the regular 0.025° latitude–longitude grid across western Europe common to all models.

The ensemble version of the ARPEGE model, known as PEARP, which consists of 35 members with a horizontal resolution reduced to 7.5 km across France and 32 km across New Zealand (Descamps et al. 2015). Its initial states are obtained by a mix of singular vectors and perturbed members of an ensemble of fifty assimilations at 40 km. The model error is represented by a random draw, for each member, among a set of ten different coherent physics packages. In this study, outputs are oversampled on the common grid in the same way as the ARPEGE outputs.

The deterministic nonhydrostatic limited-area model AROME (Seity et al. 2011), using a 1.3-km grid across western Europe. Its lateral boundary conditions are provided by the ARPEGE model and its initial conditions come from a cycle of hourly 3DVAR assimilations. In this study, outputs are bilinearly interpolated on the regular 0.025° latitude–longitude grid across Europe common to all models.

The ensemble version of the AROME model, known as PEAROME. It is made up of 16 members with a horizontal resolution of 2.5 km across western Europe. Their lateral boundary conditions are provided by a selection of PEARP members. Their initial conditions come from an ensemble of assimilations of 25 members at 3.5 km, centered on the 3DVAR operational analysis of the deterministic AROME model (Bouttier and Raynaud 2018). The tendencies of certain physical parameterizations are randomly perturbed by a multiplicative factor to represent the model error. In this study, outputs are interpolated on the common grid in the same way as AROME.
Rain observations are provided by ANTILOPE (Laurantin 2008), a data fusion product combining data from the French radar network with rain gauges (Fig. 2). This analysis has a 1-km resolution, and the data are averaged on the common 0.025° forecast grid in order to be comparable. The verification domain common to all four models corresponds to all the points where ANTILOPE is defined (Fig. 2). The comparison between observations and forecasts is performed with the forecasts starting at 0000 UTC for ARPEGE and PEARP, and with the forecasts starting at 0300 UTC for AROME and PEAROME, which use asynchronous hourly coupling files coming from the 0000 UTC ARPEGE and PEARP runs. The verification period covers 3 months, from October to December 2019. No minimum number of observations present in the neighborhood around the central point is imposed to compute the CRPSno at this point. The mask for missing observed rainfall data is also applied to the forecasts, so as to always keep the same number of points for the forecasts and the observations in every neighborhood. This means that the number of points in the neighborhood varies when masked observations are present. The computation of the neighborhood scores is performed only when all the points of the neighborhood are inside the common verification domain.
Fig. 2. ANTILOPE observation for the rain accumulated during 3 h at 1800 UTC 14 Oct 2019. The masked areas where no radar data are available are dashed. The verification domain for the APROFUS analysis of 2-m temperature is plotted with a bold black line.
Citation: Monthly Weather Review 150, 8; 10.1175/MWR-D-21-0224.1
To assess the impact of the distribution of the outcome variable, we also consider the 2-m temperature. The observations are provided by another data fusion product, named APROFUS (Marcel and Laurantin 2019), which mixes the 2-m temperature observations of the operational observation network across France with the AROME analysis on a kilometric grid. The observed value at the center of each forecast grid box is obtained by bilinear interpolation of the four surrounding APROFUS points. The verification domain common to all four models for 2-m temperature corresponds to the APROFUS domain plotted in Fig. 2, except at the border, while the neighborhood scores are computed only if all the points of the neighborhood are inside the APROFUS domain.
To evaluate the statistical significance of the observed differences between two forecast systems, a block-bootstrap technique (Hamill 1999) is applied to the time series, randomly drawing, 1000 times and with replacement, 3-day blocks from the 92 days of the verification period. The level of the test is fixed at 5% for all six possible comparisons of the four systems.
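The significance test can be sketched as a moving-block bootstrap on the daily score differences (our own illustrative implementation of the Hamill 1999 procedure, with block length 3 days, 1000 resamples, and a 5% level as in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def block_bootstrap_significant(score_a, score_b, block=3, n_boot=1000, level=0.05):
    """Resample 3-day blocks of the daily difference series with replacement,
    rebuild n_boot series of the original length, and declare the mean
    difference significant if 0 lies outside the central confidence interval."""
    diff = np.asarray(score_a, dtype=float) - np.asarray(score_b, dtype=float)
    n = diff.size
    n_blocks = int(np.ceil(n / block))
    # Random block start indices, then expand each start into a full block.
    starts = rng.integers(0, n - block + 1, size=(n_boot, n_blocks))
    idx = (starts[:, :, None] + np.arange(block)).reshape(n_boot, -1)[:, :n]
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [level / 2, 1 - level / 2])
    return not (lo <= 0.0 <= hi)
```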
b. Quarterly verification of cumulative rainfall over 3 h
We start by analyzing the results of these comparisons with the CRPSfno when the neighborhood is reduced to one grid point. The CRPSfno is then equal to the fair version of the CRPS (Ferro 2014), and the deterministic version thus corresponds to an AE. The deterministic models AROME and ARPEGE have similar average scores (Fig. 3a). The AROME analysis seems more accurate than that of ARPEGE as an initial condition for short-term rain forecasts, which can probably be explained by the assimilation of radar data, present only in AROME. However, beyond an 18-h lead time, the mean AEs are equivalent, and we even observe an advantage for ARPEGE at the last lead times. The average CRPS of the ensemble forecasts is significantly better than the mean AE of the deterministic forecasts, showing a real benefit of ensemble forecasts in estimating the observed CDF relative to a deterministic forecast, even one at higher resolution. The growth rate of the forecast error measured by the average CRPS is between 2 and 3 times greater for the deterministic forecasts than for the ensemble forecasts. Finally, PEARP and PEAROME provide close values of the CRPS.
Fig. 3. (bottom) The average CRPSfno (in mm) for the rain accumulated during 3 h as a function of the lead time for PEARP (light gray full line), PEAROME (black full line), ARPEGE (light gray dashed line), and AROME (dark gray dashed line). The average CRPSfno are computed (a) without neighborhood (or 1 × 1 point), (b) with a neighborhood size of 0.125° (5 × 5 points), and (c) of 0.275° (11 × 11 points). (top) The results of the bootstrap tests for the six possible pairs of models to be compared. (top left) Names of the models of the comparison. A filled black downward triangle indicates that the first model of the comparison performs significantly better at the 5% level than the second model, an empty upward triangle indicates the opposite case, and no symbol is present if the difference between both models is not significant at the 5% level. These symbols are plotted as a function of the lead time for the same three cases reported in the bottom panels.
Citation: Monthly Weather Review 150, 8; 10.1175/MWR-D-21-0224.1
If we introduce a neighborhood of size 0.125° (5 × 5 points) in the comparison between these four forecasts, all the models have better average CRPSfno (Fig. 3b). This indicates that the double penalty is reduced when the neighborhood-pooled CDFs replace their local estimations, allowing compensations between members and between locations inside the neighborhood. This effect is clearly stronger for deterministic forecasts than for ensemble forecasts and leads to a reduction of the gap between their average CRPSfno curves. The neighborhood methodology is therefore validated again for deterministic forecasts of continuous outcomes, in the same way as with the FSS (Mittermaier et al. 2013) or neighborhood-based contingency tables (Stein and Stoop 2019) for categorical outcomes. The average CRPSfno of AROME is now significantly better than that of ARPEGE for all lead times, indicating that the double-penalty correction by neighborhood treatments is more active for a high-resolution model, as expected. The neighborhood treatment also works well for ensemble forecasts and helps to quantify how much better neighborhood probabilities are forecast by ensembles than local probabilities: the average CRPSfno of the ensemble forecasts is reduced by 0.07 mm at all lead times when the neighborhood includes 5 × 5 points (Fig. 3b) instead of 1 × 1 point (Fig. 3a). Moreover, the average CRPSfno of the deterministic forecasts grows faster with lead time than that of the ensemble forecasts, in the same proportion as with a neighborhood reduced to one point.
When we enlarge the neighborhood to 0.275° (11 × 11 points), all the models have better scores (Fig. 3c). We nonetheless reach the same conclusions as for the previous neighborhood: the AROME analysis improves the AROME forecasts at short lead times, bringing them closer to the quality of the ensemble forecasts, but the latter are then significantly better thanks to their slower error growth. According to this indicator, PEAROME and PEARP are of equivalent quality at this scale, even if the small but significant difference is in favor of PEAROME. AROME performs significantly better than ARPEGE at all lead times.
To obtain a more synthetic view of the dependence of the scores on the size of the neighborhood, we select two lead times, 6 and 42 h, so as to have one lead time strongly influenced by the initial conditions and one farther away (Fig. 4). We can measure the benefit of the AROME analysis across all scales when we compare the ARPEGE and AROME forecasts at a short lead time (Fig. 4a). This is no longer the case at longer lead times (Fig. 4b), where the forecasts become very close, even if the use of neighborhoods larger than one point reduces the average CRPSno significantly more for AROME than for ARPEGE. The predictive content of AROME also outperforms that of PEARP at 6 h for scales greater than 0.525° (21 × 21 points) according to this score (Fig. 4a). However, at a long lead time, the ensemble forecasts provide an undeniable benefit, converging with each other at all scales, because the error growth is far slower than for the deterministic forecasts (Fig. 4b).
Fig. 4. (bottom) The average CRPSfno (in mm) for the rain accumulated during 3 h as a function of the neighborhood size (in °) for PEARP (light gray full line), PEAROME (black full line), ARPEGE (light gray dashed line), and AROME (dark gray dashed line). The average CRPSfno values are computed (a) at 0600 UTC on day D and (b) at 1800 UTC on day D + 1. (top) The results of the bootstrap tests as in Fig. 3, but as a function of neighborhood size (written in 0.001° in the second line of the table).
Citation: Monthly Weather Review 150, 8; 10.1175/MWR-D-21-0224.1
Figure 5 shows the time series of the daily average CRPSfno and of the daily rain observed on average across the verification area. We note the very strong temporal correlation between these two quantities, which shows that the rainiest cases are also the least well predicted. By contrast, the ranking of the four forecasts remains very stable: it depends little on the observed rain and is consistent with the mean ranking shown at the 18-h lead time in Fig. 4b. This is a good property of this score, which demonstrates its capacity to compare deterministic and probabilistic forecasts at a given scale even though the forecasts use different resolutions, as well as the possibility of exploiting this score on a daily basis.
Daily temporal series of the average CRPSfno (in mm) for the rain accumulated during 3 h for PEARP (light gray full line), PEAROME (black full line), ARPEGE (light gray dashed line), and AROME (dark gray dashed line). The average CRPSfno are computed with neighborhood sizes equal to 0.125° at 1800 UTC on day D. The daily temporal series of the rain accumulated during 3 h observed by ANTILOPE and averaged over the whole domain is also superimposed with a dotted line.
Any improvement brought by the neighborhood pooling postprocessing can be quantified by comparing the average CRPSfso to the average CRPS of the raw ensemble forecast. Figure 6 shows, for the average CRPSfso, the equivalent of Fig. 4 for the average CRPSfno. Comparing these two figures shows that the dependence on the neighborhood is very different: the average CRPSfso quasi-saturates for large neighborhoods, whereas the average CRPSfno decreases steadily with the size of the neighborhood. The difference between these two behaviors arises, by design, from the dispersion term of the observations in the neighborhood, which reduces the CRPS [Eqs. (9) and (13)] and allows the average CRPSfno to be smaller than the average CRPSfso for large neighborhoods. This corresponds to a profound process in verification: taking the same spatial scale into account for the observations and the forecasts. The approach is in the same spirit as accounting for observation errors (Ferro 2017) or for representativeness errors (Ben Bouallègue et al. 2020). Future studies will analyze how these approaches can merge within this formalism. The quasi-saturation of the average CRPSfso for large neighborhoods is not easy to interpret: one could imagine that faraway points, being decorrelated from the central point where the observation is taken, should contribute negatively to the CDF of the ensemble forecasts, whereas their contribution remains almost zero. One possible interpretation is that the observed and predicted precipitation fields are already poorly correlated for small neighborhoods because of the large variance.
As a result, the correlation errors are already saturated at small scales, and any decorrelations in the large-scale precipitation resulting from large forecast neighborhoods do not affect the average CRPSfso. Reductions in the forecast variance reduce the average CRPSfso for small to medium-sized neighborhoods, but the variance becomes vanishingly small for large neighborhoods, limiting its effect. It can be noted that deterministic and ensemble forecasts give very close values of the average CRPSfso for large neighborhoods, showing the relevance of replacing the local deterministic information by a smoother flow of information corresponding to neighborhoodized frequencies built with neighboring points.
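To make the "so"/"no" distinction concrete, the two pooled scores can be sketched as below. This is only a minimal illustration using the standard (unfair) ensemble CRPS estimator and the energy-distance form of the d_{IQ} divergence, with synthetic gamma-distributed data standing in for one neighborhood of grid points; it is not the operational code used here.

```python
import numpy as np

def crps_ens(x, y):
    """Unfair ensemble CRPS of sample x against one observation y:
    mean|x - y| - 0.5 * mean|x - x'|."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs(x - y)) - 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))

def d_iq(x, y):
    """Integrated quadratic divergence between samples x and y
    (energy-distance form; reduces to crps_ens when y has one member)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (np.mean(np.abs(x[:, None] - y[None, :]))
            - 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
            - 0.5 * np.mean(np.abs(y[:, None] - y[None, :])))

# One hypothetical neighborhood: 9 grid points, 10 members each.
rng = np.random.default_rng(0)
fc = rng.gamma(2.0, 1.0, size=(9, 10))   # forecasts (points, members)
obs = rng.gamma(2.0, 1.0, size=9)        # one observation per point

pooled_fc = fc.ravel()                   # superensemble of 90 pooled members
crps_so = crps_ens(pooled_fc, obs[4])    # vs the single central observation
crps_no = d_iq(pooled_fc, obs)           # vs all observations pooled together
```

With a single observation, d_iq reduces exactly to crps_ens, so the "so" score is the special case in which only the central observation is retained; the additional observation dispersion term is what can make the "no" score fall below the "so" score for large neighborhoods.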
As in Fig. 4, but for the average CRPSfso for the rain accumulated during 3 h.
The plots of the other two unfair average scores CRPSuno and CRPSuso (not shown) differ from their fair equivalents CRPSfno and CRPSfso by at most 1%. This shows the small impact of correcting the normalization of the dispersion terms in real cases. These variations are therefore completely negligible compared with those generated by the passage from "so" to "no," by the variation of the neighborhood, or by the variation of the lead time, at least for the ensemble sizes considered.
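The fair correction in question replaces the dispersion normalization 1/(2M²) of the plain ensemble CRPS estimator with 1/(2M(M − 1)), following Ferro (2014). A sketch under that assumption, with an arbitrary synthetic 90-member superensemble:

```python
import numpy as np

def crps_unfair(x, y):
    """Plain ensemble CRPS: mean|x - y| - (1/(2M^2)) * sum_ij |x_i - x_j|."""
    x = np.asarray(x, dtype=float)
    M = x.size
    spread = np.abs(x[:, None] - x[None, :]).sum()
    return np.mean(np.abs(x - y)) - spread / (2 * M**2)

def crps_fair(x, y):
    """Fair version (Ferro 2014): the dispersion term is normalized by
    M(M - 1), removing the finite-ensemble bias when forecasts and
    observation are drawn from the same law."""
    x = np.asarray(x, dtype=float)
    M = x.size
    spread = np.abs(x[:, None] - x[None, :]).sum()
    return np.mean(np.abs(x - y)) - spread / (2 * M * (M - 1))

# The two estimators differ by spread/(2 M^2 (M-1)), which shrinks as M grows.
x = np.random.default_rng(1).normal(size=90)  # hypothetical 90-member superensemble
gap = crps_unfair(x, 0.3) - crps_fair(x, 0.3)
```

The gap is always positive (the fair score is the smaller one) and scales like 1/M relative to the dispersion term, which is consistent with the small differences reported above for pooled superensembles of this size.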
c. Quarterly verification of 2-m temperatures
The aim of this section is to show the impact of the distribution of the outcome variable, or of its relation to orography, on the variation of the scores with neighborhood size. For this, we plot the variations of the CRPSfno and the CRPSfso with the size of the neighborhood for the 2-m temperature (Figs. 7 and 8), to be compared with their equivalents for the rain accumulated during 3 h (Figs. 4 and 6). For the CRPSfno, the impact of the resolution is clearer for the forecast of 2-m temperature than for the rain: even with large neighborhoods, we observe neither the convergence of the scores of the ensemble forecasts PEAROME and PEARP at long lead times nor the convergence of the AROME and ARPEGE deterministic forecasts. Instead, we observe a grouping by system resolution, with the high-resolution PEAROME and AROME systems significantly outperforming the low-resolution PEARP and ARPEGE systems at the 5% significance level. AROME and PEAROME certainly have an advantage over ARPEGE and PEARP, since APROFUS uses AROME analyses as the background to be combined with station data. Moreover, the ensemble systems perform substantially better than their respective deterministic counterparts when the lead time is long and the neighborhood size is small.
As in Fig. 4, but for the average CRPSfno for the 2-m temperature. The reference is provided by the APROFUS temperature analysis.
As in Fig. 4, but for the average CRPSfso for the 2-m temperature. The reference is provided by the APROFUS temperature analysis.
The dependence on the size of the neighborhood is very different for the CRPSfso of the 2-m temperature, since taking the neighboring points into account improves the local forecast only for small neighborhoods and clearly degrades it for larger ones (Fig. 8). This can be explained by the decorrelation between the 2-m temperature at a given point and at its neighbors, which can lie at very different heights and are thus little representative of any uncertainty other than the local one. It illustrates that neighborhood pooling, if not done carefully, can add biased members to the ensemble. Such behavior also exists for the CRPSfno, since the possibility of finding, among the neighboring points, forecasts closer to the 2-m temperature observed at a given location is reduced as soon as the orography varies significantly. In this case, however, we treat the forecasts and the observations coherently, which keeps the score informative at the scale analyzed.
5. Discussion of the results
The CRPSso could be preferred by an end user who is mainly interested in the forecast at a given location. In this case, the CRPSso helps to quantify the improvement brought by the forecasts at neighboring points to the forecast at the central point. It has been shown that this improvement can be negative when the observations depend strongly on the location, as in the temperature case, and that the CRPSso then warns against a bad use of the neighborhood as a postprocessing step in this configuration. Moreover, the CRPSso could reward aggregated and smoothed forecasts built from neighboring points more than forecasts drawn from the statistical distributions of the observations at those neighboring points, leading to counterintuitive results. Indeed, the propriety of the CRPS only says that the best forecast is the one drawn from the statistical distribution of the observations at the central point itself. It says nothing about the optimality of a forecast built from the different statistical distributions of the observations at the different points of the neighborhood.
The CRPSso has the disadvantage of comparing ensemble forecasts and observations at two different scales. Following Mittermaier and Csima (2017), the size of the neighborhood should be limited to a maximum value below which the assumption of equiprobability of the observations in the neighborhood is not broken, in order to use the neighborhood methodology to reduce the double penalty. This is clearly not the case for temperature over France, where high mountains are present. A further question comes from the overlapping of neighborhoods when the CRPSso is used as a metric to minimize in order to statistically postprocess the forecasts: the forecasts at one point belong to multiple neighborhoods and could be subjected to contradictory constraints from observations drawn from different statistical distributions. For this reason, it is prudent to limit the use of the CRPSso to diagnostic postprocessing with reasonable neighborhood sizes.
The CRPSno compares the ensemble forecasts aggregated over the neighborhood to observations aggregated in the same way. The propriety of the d_{IQ} guarantees that the best ensemble forecast at the scale of the neighborhood is the one drawn from the statistical distribution of the observations aggregated over the neighborhood. Even if the statistical distribution of the observations varies inside the neighborhood, the CRPSno is able to indicate how well the neighborhood distribution of the forecasts fits the neighborhood distribution of the observations. This property argues for the CRPSno, which allows one to objectively compare, at a given scale, a forecast with observations, and even forecasts of different resolutions with the same reference. One needs to be careful, however, when comparing the "no" scores across neighborhood sizes: since the type of observations changes with the neighborhood size, the best achievable score also changes. Thus, a smaller "no" score for a neighborhood size S2 than for a neighborhood size S1 can be due either to a better pooled ensemble forecast at the scale S2 than at the scale S1 or to a better predictability of the pooled observations at the scale S2 than at the scale S1.
The choice of a meaningful neighborhood size for the CRPSno is not easy because it clearly depends on the outcome variable. There is no systematic asymptotic value of the CRPSno for large neighborhoods, unlike what Roberts and Lean (2008) reported for the spatial verification of deterministic forecasts of binary events. Nevertheless, the neighborhood size can be fixed by an external constraint, as in Amodei et al. (2015), to be representative of an administrative department, or as in Taillardat and Mestre (2020), to strike a balance between the representativeness of rain and the numerical cost.
6. Conclusions
We perform an implicit postprocessing step by linearly pooling the forecasts of the different members at the different points of the neighborhood to build a superensemble providing a forecast at the neighborhood scale. The CRPS is then used to compare the pooled forecast CDF with that provided by the single observation at the center of the neighborhood. This CRPS expression for pooled ensemble forecasts is called the CRPSso, and it quantifies the relevance of using the neighboring forecast points to improve the local probabilistic forecast. We also compare this pooled forecast CDF to that corresponding to all the observations collected in the neighborhood by using the divergence function d_{IQ} derived from the CRPS. This second CRPS expression for pooled ensemble forecasts is called the CRPSno; it quantifies the quality of the forecast at the scale of the neighborhood considered. Fair versions of these two neighborhood scores have also been constructed, so as to reduce the bias for ensembles of small size when forecasts and observations are drawn from the same statistical law, following the recommendations of Ferro (2014) and Zamo and Naveau (2018). The extensions of these neighborhood scores to the deterministic case generalize the AE by taking the neighborhoods into account coherently in the calculation of the scores. We can thus directly compare the generalized AE of deterministic forecasts with these CRPS expressions for pooled ensemble forecasts.
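As a sanity check on the deterministic limit described above: with a single member, the (unfair) ensemble CRPS collapses to the absolute error, so pooling the deterministic values of the neighborhood points yields a generalized AE expressed as a small-ensemble score. A hedged sketch, with made-up numerical values for illustration:

```python
import numpy as np

def crps_ens(x, y):
    """Unfair ensemble CRPS: mean|x - y| - 0.5 * mean|x - x'|."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs(x - y)) - 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))

# With a single member, the dispersion term vanishes and the CRPS
# collapses to the absolute error |x - y|.
assert abs(crps_ens([2.5], 1.0) - 1.5) < 1e-12

# A deterministic forecast pooled over a neighborhood becomes an ensemble
# of the neighboring deterministic values, directly comparable with the
# CRPS of a pooled ensemble forecast (hypothetical neighborhood values).
det_neigh = np.array([1.9, 2.1, 2.4, 2.0])
generalized_ae = crps_ens(det_neigh, 2.0)
```

This is why the generalized AE of a deterministic model and the pooled CRPS of an ensemble can be placed on the same scale for a given neighborhood size.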
A series of idealized cases has allowed us to show the sensitivity of these scores to spatial placement errors of structures and the relevance of treating observations and forecasts identically to improve their comparison. These four neighborhood scores have also been used to compare the forecasts of rain accumulated during 3 h and of 2-m temperature from four deterministic and probabilistic forecast systems operational at Météo-France with different resolutions. We were able to show the superiority of the ensemble forecasts over the deterministic forecasts at long lead times, owing to the slower growth of their forecast errors. The CRPSso helped us show that the benefit of using neighboring points to improve the forecast of the local probability at the center of the neighborhood depends strongly on the forecast outcome variable, with a very rapid deterioration in the case of 2-m temperature and, in contrast, a rapid saturation for rain for large neighborhoods. Because the CRPSno treats the forecasts and the observations coherently in the neighborhood aggregation, it should be preferred to the CRPSso for the comparison of models at the scale of the neighborhood. The variation of the CRPSno as a function of the neighborhood size will help to quantify the influence of the double penalty on the relative quality of two ensemble forecasts pooled at the neighborhood scale.
It is very important to evaluate precisely the strengths and weaknesses of a forecast system not only over the whole range of values but also for selected thresholds. To this end, work is in progress to introduce the neighborhood pooling as a first step before the computation of the Brier score, in the same spirit as for the CRPS expressions. In addition, the calculation of these CRPS expressions for pooled ensemble forecasts will be further optimized so as to reduce the high computational cost for large neighborhoods.
Acknowledgments.
We are grateful to Maxime Taillardat and Michaël Zamo for discussions on the divergence functions, to Naomi Riviere for reviewing this manuscript, and to the three anonymous reviewers for their valuable comments.
REFERENCES
Amodei, M., and J. Stein, 2009: Deterministic and fuzzy verification methods for a hierarchy of numerical models. Meteor. Appl., 16, 191–203, https://doi.org/10.1002/met.101.
Amodei, M., I. Sanchez, and J. Stein, 2015: Verification of the French operational highresolution model AROME with the regional Brier probability score. Meteor. Appl., 22, 731–745, https://doi.org/10.1002/met.1510.
Baringhaus, L., and C. Franz, 2004: On a new multivariate two-sample test. J. Multivar. Anal., 88, 190–206, https://doi.org/10.1016/S0047-259X(03)00079-4.
Ben Bouallègue, Z., and S. E. Theis, 2014: Spatial techniques applied to precipitation ensemble forecasts: From verification results to probabilistic products. Meteor. Appl., 21, 922–929, https://doi.org/10.1002/met.1435.
Ben Bouallègue, Z., T. Haiden, and D. S. Richardson, 2018: The diagonal score: Definition, properties, and interpretations. Quart. J. Roy. Meteor. Soc., 144, 1463–1473, https://doi.org/10.1002/qj.3293.
Ben Bouallègue, Z., T. Haiden, N. J. Weber, T. M. Hamill, and D. S. Richardson, 2020: Accounting for representativeness in the verification of ensemble precipitation forecasts. Mon. Wea. Rev., 148, 2049–2062, https://doi.org/10.1175/MWR-D-19-0323.1.
Bouttier, F., and L. Raynaud, 2018: Clustering and selection of boundary conditions for limited area ensemble prediction. Quart. J. Roy. Meteor. Soc., 144, 2381–2391, https://doi.org/10.1002/qj.3304.
Courtier, P., C. Freydier, J. Geleyn, F. Rabier, and M. Rochas, 1991: The ARPEGE project at Météo-France. Proc. ECMWF Workshop on Numerical Methods in Atmospheric Models, Reading, United Kingdom, ECMWF, 193–231.
Descamps, L., C. Labadie, A. Joly, E. Bazile, P. Arbogast, and P. Cébron, 2015: PEARP, the Météo-France short-range ensemble prediction system. Quart. J. Roy. Meteor. Soc., 141, 1671–1685, https://doi.org/10.1002/qj.2469.
Desroziers, G., G. Hello, and J.N. Thépaut, 2003: A 4DVar reanalysis of FASTEX. Quart. J. Roy. Meteor. Soc., 129, 1301–1315, https://doi.org/10.1256/qj.01.182.
Ebert, E. E., 2008: Fuzzy verification of high-resolution gridded forecasts: A review and proposed framework. Meteor. Appl., 15, 51–64, https://doi.org/10.1002/met.25.
Ferro, C. A. T., 2014: Fair scores for ensemble forecasts. Quart. J. Roy. Meteor. Soc., 140, 1917–1923, https://doi.org/10.1002/qj.2270.
Ferro, C. A. T., 2017: Measuring forecast performance in the presence of observation error. Quart. J. Roy. Meteor. Soc., 143, 2665–2676, https://doi.org/10.1002/qj.3115.
Fricker, T. E., C. A. T. Ferro, and D. B. Stephenson, 2013: Three recommendations for evaluating climate predictions. Meteor. Appl., 20, 246–255, https://doi.org/10.1002/met.1409.
Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, https://doi.org/10.1198/016214506000001437.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167, https://doi.org/10.1175/1520-0434(1999)014<0155:HTFENP>2.0.CO;2.
Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559–570, https://doi.org/10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.
Jolliffe, I. T., and D. B. Stephenson, Eds., 2011: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 292 pp., https://doi.org/10.1002/9781119960003.
Laurantin, O., 2008: Antilope: Hourly rainfall analysis merging radar and rain gauge data. Proc. Int. Symp. on Weather Radar and Hydrology, Grenoble, France, Laboratoire d’étude des Transferts en Hydrologie et Environnement (LTHE), 2–8.
Marcel, E., and O. Laurantin, 2019: New infrahourly frequency analyses of basic parameters (temperature, humidity, wind, sea level pressure). Research Report 2019, Météo-France Research Rep., 44–45, http://www.umr-cnrm.fr/IMG/pdf/r_r_2019_gb_web.pdf.
Mason, I. B., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303.
Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. Manage. Sci., 22, 1087–1096, https://doi.org/10.1287/mnsc.22.10.1087.
Mittermaier, M. P., 2014: A strategy for verifying near-convection-resolving model forecasts at observing sites. Wea. Forecasting, 29, 185–204, https://doi.org/10.1175/WAF-D-12-00075.1.
Mittermaier, M. P., and G. Csima, 2017: Ensemble versus deterministic performance at the kilometer scale. Wea. Forecasting, 32, 1697–1709, https://doi.org/10.1175/WAF-D-16-0164.1.
Mittermaier, M. P., N. Roberts, and S. A. Thompson, 2013: A long-term assessment of precipitation forecast skill using the fractions skill score. Meteor. Appl., 20, 176–186, https://doi.org/10.1002/met.296.
Richardson, D. S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 126, 649–667, https://doi.org/10.1002/qj.49712656313.
Rizzo, M. L., and G. J. Székely, 2016: Energy distance. Wiley Interdiscip. Rev. Comput. Stat., 8, 27–38, https://doi.org/10.1002/wics.1375.
Roberts, N. M., and H. W. Lean, 2008: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Wea. Rev., 136, 78–97, https://doi.org/10.1175/2007MWR2123.1.
Schwartz, C. S., 2017: A comparison of methods used to populate neighborhood-based contingency tables for high-resolution forecast verification. Wea. Forecasting, 32, 733–741, https://doi.org/10.1175/WAF-D-16-0187.1.
Schwartz, C. S., and R. A. Sobash, 2017: Generating probabilistic forecasts from convection-allowing ensembles using neighborhood approaches: A review and recommendations. Mon. Wea. Rev., 145, 3397–3418, https://doi.org/10.1175/MWR-D-16-0400.1.
Schwartz, C. S., and Coauthors, 2010: Toward improved convection-allowing ensembles: Model physics sensitivities and optimizing probabilistic guidance with small ensemble membership. Wea. Forecasting, 25, 263–280, https://doi.org/10.1175/2009WAF2222267.1.
Seity, Y., P. Brousseau, S. Malardel, G. Hello, P. Bénard, F. Bouttier, C. Lac, and V. Masson, 2011: The AROME-France convective-scale operational model. Mon. Wea. Rev., 139, 976–991, https://doi.org/10.1175/2010MWR3425.1.
Stanski, H. R., L. Wilson, and W. R. Burrows, 1989: Survey of common verification methods in meteorology. 2nd ed. Research Rep. MSRB 895, WWW Tech. Rep. 8, WMO/TD 358, World Meteorological Organization, http://www.cawcr.gov.au/projects/verification/Stanski_et_al/Stanski_et_al.html.
Stein, J., and F. Stoop, 2019: Neighborhood-based contingency tables including errors compensation. Mon. Wea. Rev., 147, 329–344, https://doi.org/10.1175/MWR-D-17-0288.1.
Székely, G. J., and M. L. Rizzo, 2005: A new test for multivariate normality. J. Multivar. Anal., 93, 58–80, https://doi.org/10.1016/j.jmva.2003.12.002.
Taillardat, M., and O. Mestre, 2020: From research to applications—Examples of operational ensemble postprocessing in France using machine learning. Nonlinear Processes Geophys., 27, 329–347, https://doi.org/10.5194/npg-27-329-2020.
Thorarinsdottir, T. L., T. Gneiting, and N. Gissibl, 2013: Using proper divergence functions to evaluate climate models. SIAM/ASA J. Uncertainty Quantif., 1, 522–534, https://doi.org/10.1137/130907550.
Wilks, D. S., 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed. International Geophysics Series, Vol. 100, Academic Press, 704 pp.
Zamo, M., and P. Naveau, 2018: Estimation of the continuous ranked probability score with limited information and applications to ensemble weather forecasts. Math. Geosci., 50, 209–234, https://doi.org/10.1007/s11004-017-9709-7.
Zhao, B., B. Zhang, and Z.-L. Li, 2021: A CRPS-based spatial technique for the verification of ensemble precipitation forecasts. J. Trop. Meteor., 27, 24–33, https://doi.org/10.46267/j.1006-8775.2021.003.