## 1. Introduction

Probabilistic forecasts are nowadays widely used in the meteorological community, since they provide a useful estimate of the predictive uncertainty. In an operational context, these forecasts generally take the form of ensembles representing possible scenarios. Despite progress in the verification field since their emergence, the complexity of their behavior still represents a great challenge for verification practitioners (Casati et al. 2008). In a few words, verification is the action of assessing the quality of forecasts by comparing them to their corresponding observations (Jolliffe and Stephenson 2003). Since a complete picture of forecast quality cannot be obtained from a single measure, different verification measures have been proposed, which evaluate different attributes (i.e., aspects) of forecast quality (Murphy 1973). All measures, though, have in common that they require a large number of forecast–observation pairs in the verification sample to be statistically robust. To help increase the sample size, various forecasts may be pooled together in the same sample, for example, for different locations, for various ranges of predictands, or from different model versions. However, computing a verification measure over an inhomogeneous sample faces the risk of having different forecast behaviors that average out. Stratification, the process of partitioning the verification sample into different subsets, aims at conditioning the verification measure on specific conditions, so as to minimize this risk and lead to more insightful verification case studies.

It is difficult to trace back the origin of the term *stratification*, since the concept probably emerged soon after the first meteorological forecasts were verified. Indeed, authors very often present performance measures for different locations or seasons, which is an implicit way of stratifying the complete verification sample. Such an approach aims at making measures of forecast skill independent of the climatological frequency of the events to be verified, which varies both in space and time (Hamill and Juras 2006). Moreover, modelers are accustomed to conditioning verification case studies on specific meteorological conditions when improving numerical weather prediction models. However, it appears that the term *stratification* has mostly been used in the literature with the purpose of assessing the significance of different subsets in terms of their contribution to the overall verification measure. Murphy (1995), in the first paper devoted to the subject, extended his general framework for forecast verification (Murphy and Winkler 1987) to stratification along different meteorological conditions, in the case of probability forecasts of dichotomous events.

This article concentrates on the field of ensemble forecasts of continuous scalar variables. Hereafter, we consider an *ensemble* as a discrete approximation of a full forecast distribution. This definition encompasses forecasts issued by meteorological ensemble forecasting techniques (Buizza et al. 1999) but also probabilistic forecasts issued by other forecasting techniques such as statistical adaptations, like the analog method (Obled et al. 2002; Hamill and Whitaker 2006), or single-value forecast dressings (Schaake et al. 2007). Two widely used verification measures for ensemble forecasts are the continuous ranked probability score (CRPS) (Matheson and Winkler 1976; Hersbach 2000; Gneiting and Raftery 2007) and the rank histogram (Anderson 1996; Hamill and Colucci 1997; Talagrand et al. 1997). As a measure of forecast calibration, the rank histogram has been subject to stratification in past studies, as advocated by Hamill (2001). He suggests stratification along a statistic of the ensemble in order to detect conditional biases that would be hidden when computing the rank histogram over the whole sample. As a stratification criterion, authors have used the mean and the standard deviation of the ensemble (Hamill and Colucci 1997), or well-correlated quantities (Hamill and Colucci 1998; Bröcker 2008). A substantial contribution to the underlying theory has been made by Siegert et al. (2012), who described the risk of statistical artifacts that may affect the interpretation of rank histograms when stratifying along a statistic of a finite-size ensemble. Alternatively, Mullen and Buizza (2002) and, indirectly, Bellier et al. (2016), have stratified rank histograms along the observation. Although Siegert et al. (2012) have mentioned the risk of similar artifacts, theoretical aspects related to the latter approach have, to the authors' knowledge, not yet been studied.

The rank histogram, though, does not evaluate how accurate a forecast is. Verification reports of ensemble forecasts very often include the average CRPS as a summary measure of the overall forecast accuracy. Previous contributions have mostly focused on its decomposition into different parts corresponding to specific attributes of the forecast (Hersbach 2000; Bontron 2004; Candille and Talagrand 2005). Only a few studies (Gneiting and Ranjan 2011; Lerch et al. 2017) have tackled the CRPS under the stratification approach, by studying the properties of the score when it is averaged over a restricted subset of the verification sample.

In this article, we propose a general stratification framework for ensemble forecasts of continuous scalar variables, and detail different ways of stratifying: along a function of the observation, of the forecast or of an external criterion. Within this framework, a new formulation of the average CRPS is derived. Concerning the rank histogram, the work done by Bröcker (2008) and Siegert et al. (2012) is extended to the problematic case in which stratification is made along a function of the observation, where it is shown that calibrated forecasts do not lead to flat histograms over each stratum. For both the CRPS and the rank histogram, new graphical representations are proposed that synthesize the information coming from stratification into low-complexity charts. To evaluate their potential benefits, two real datasets of probabilistic precipitation forecasts, having similar skills but different behaviors, are verified: ensemble forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF) and analog-derived forecasts, statistically adapted from the ECMWF control forecast.

The article is organized as follows. Observation and forecast datasets are presented in section 2. Section 3 describes the stratification formalism, while sections 4 and 5 detail the application on the CRPS and the rank histogram, respectively. Section 6 presents results from a numerical example. Key points related to stratification are discussed in section 7. Section 8 concludes.

## 2. Observation and forecast data

Some aspects of the stratification framework benefit from illustrations based on real data; for ease of understanding, these data are presented first. The weather variable of interest (i.e., the predictand) is the mean areal precipitation (MAP) accumulated at a 6-h time step over hydrological catchments, with hydrological forecasting in mind. Note that the formalism in sections 3, 4, and 5 applies to any other continuous scalar weather variable.

Figure 1 shows the 10 considered catchments located in France just downstream from Lake Geneva, with areas ranging from 290 to 3760 km^{2}. MAP observations used for verification were processed by Météo-France by kriging hourly and daily rain gauge data from the Météo-France network. The considered period is from 1 January 2010 to 31 December 2014.

Two forecast datasets are examined, both coming from the 0000 UTC cycle. The first dataset (labeled ECMWF-Ens) contains 50-member ensemble forecasts produced by the ECMWF Ensemble Prediction System (EPS) (Buizza et al. 1999) and downloaded from the TIGGE database (Park et al. 2008). Only the perturbed members are considered here. Thiessen-based averaging (Tabios and Salas 1985) has been used to transform grid-based forecasts into MAP forecasts. The second dataset (labeled ECMWF-Ana) contains 40-member forecasts produced by statistical downscaling of the ECMWF control forecast using an analog method developed successively by Obled et al. (2002), Bontron (2004), Ben Daoud et al. (2011), Marty et al. (2012), and Ben Daoud et al. (2016). In a nutshell, the synoptic forecast situation, characterized by means of large-scale predictors (geopotential height, temperature, and humidity), is taken from the control member. Then, the most *analog* synoptic situations are selected from an archive of reanalyses. Finally, the MAP observations recorded on these dates are selected and constitute the forecast, in the form of an ensemble. Although generally associated with EPSs, the terms *ensemble* and *member* are here used to describe analog-based forecasts as well. More information about this dataset can be found in Bellier et al. (2016).

## 3. General stratification framework

### a. Overview

*Stratification*, within the verification context, is defined as the action of dividing the sample of historical forecast–observation pairs into different subsets, according to a *stratification criterion*. The different subsets are called *strata* (singular: stratum). Hereafter, it is considered as implicit that this is done with the intent of computing performance measures over each of these subsets. The underlying objective is to partition the complete sample into strata that show different behaviors, in order to better understand strengths and weaknesses of the studied forecasting system. In this article, verification concerns forecasts of weather variables that are *continuous*, such as precipitation or temperature, and *scalar* (as opposed to multivariate), which presupposes a given location and time. It does not exclude, however, the possibility of pooling in the same sample forecasts from different locations and/or times if it is operationally justified. In previous dedicated studies (Bröcker 2008; Siegert et al. 2012), the formalism has been developed for stratification along a function of the forecast only. What follows is an extension where the criterion may depend on any characteristics of the forecast–observation pair.

Consider a probabilistic forecast in the form of a predictive probability density function (PDF) *f*, or equivalently its cumulative distribution function (CDF) *F*. Generally, an operational probabilistic forecast is available in the form of an ensemble of values instead of a full distribution. Thus, consider a finite-size ensemble drawn independently from *f* and sorted in ascending order,

$$x_1 \le x_2 \le \cdots \le x_M,$$

where *M* is the size of the ensemble, assumed to be constant. An ensemble constructed this way is called a Monte Carlo ensemble (Siegert et al. 2012). We suppose that operational forecasts behave as such. Finally, let *y* denote the verifying observation.

Over a verification sample, *y* is a random variable and *f* is a random distribution. Let us denote by the superscript (*n*) the index of the possible *outcomes*. Therefore, one can constitute a verification sample,

$$T = \left\{ \left( x^{(n)},\, y^{(n)} \right),\ n = 1, \ldots, N \right\},$$

which contains *N* forecast–observation pairs with forecasts in the form of ensembles *x*^{(n)}, supposed to be drawn from the latent (and unavailable) distributions *f*^{(n)}.

On the sample *T* can be defined a stratification criterion *θ*. Depending on the application, *θ* may be categorical, a scalar, a vector of size *k*, or a field. The possible values of *θ* are mapped, through a stratification function, to one of the *S* discrete indices corresponding to the different strata. After stratification, the *s*th stratum contains all forecast–observation pairs satisfying the corresponding condition on *θ*. The *S* strata are mutually exclusive but collectively exhaustive (i.e., every pair belongs to one and only one stratum), and *N*_{s} hereafter denotes the number of pairs in the *s*th stratum.

In the following subsections, stratification approaches are classified as *observation-*, *forecast-*, or *external*-based, depending on the origin of the data the criterion *θ* is taken as a function of. Possible reasons justifying each of the three approaches are suggested. For the first two, a distinction is made between *statistic-* and *meteorology*-oriented strategies: in the former, *θ* is a direct function of the values themselves (the observation *y* or the ensemble members), while in the latter, *θ* represents meteorological covariates that are not strictly contained in them.
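
The partitioning formalism above can be sketched in a few lines of code. This is a minimal illustration with made-up names and data (not code from the article): a sample of (ensemble, observation) pairs, a criterion function giving *θ*, and an assignment function mapping *θ* to a stratum index.

```python
from collections import defaultdict

def stratify(sample, criterion, assign):
    """Partition `sample` into strata.

    sample    : list of (ensemble, y) pairs, ensemble being a sorted tuple
    criterion : function (ensemble, y) -> theta, the stratification criterion
    assign    : function theta -> stratum index s in {0, ..., S-1}
    """
    strata = defaultdict(list)
    for ensemble, y in sample:
        s = assign(criterion(ensemble, y))
        strata[s].append((ensemble, y))
    return dict(strata)

# Example: observation-based stratification along y, with two thresholds
# giving S = 3 mutually exclusive, collectively exhaustive strata.
sample = [((0.0, 0.1, 0.4), 0.2), ((1.0, 2.0, 5.0), 4.0), ((0.0, 0.0, 0.1), 0.0)]
theta = lambda ensemble, y: y                       # statistic-oriented: theta = y
assign = lambda t: 0 if t == 0.0 else (1 if t < 1.0 else 2)
strata = stratify(sample, theta, assign)

# Every pair belongs to exactly one stratum.
assert sum(len(v) for v in strata.values()) == len(sample)
```

Swapping the `criterion` and `assign` functions yields the forecast-based or external-based variants without changing the partitioning logic.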

### b. Observation-based stratification

A forecaster may wonder: *how did the forecasts behave when specific events have occurred?* Such a question calls for an observation-based stratification. Within the statistic-oriented strategy, the criterion is taken as the verifying observation, that is, *θ* = *y*. For precipitation, for instance, one may want to examine specifically the cases in which heavy amounts [in mm (6 h)^{−1}] have occurred. The sample *T* will then be divided using thresholds on the observed values.

Within the meteorology-oriented strategy, *θ* is taken as one or several meteorological covariate(s) of the observation. A typical example is the weather regime, defined as a large-scale spatial atmospheric pattern that has been identified among a finite set of possible ones (Michelangeli et al. 1995; Vrac and Yiou 2010). The information about the observed weather regime is not strictly contained in *y*, but close links may exist between both. For precipitation, for instance, an observed anticyclonic weather regime is strongly associated with outcomes where *y* = 0. Two situations may then arise. In the first, the meteorological situations associated with the *N* elements of *T* have already been identified and classified into one of the possible weather regimes, and stratification is straightforward. In the second, *θ* contains information about the weather regime in the form of, for example, a vector of different meteorological variables or a spatial field of a given variable (e.g., geopotential height). In this case, stratification requires the definition of a distance metric, like the Euclidean distance if *θ* is a vector, or the S1 score (Teweles and Wobus 1954) if *θ* is a field. Based on the computation of this distance over all couples of elements, a clustering technique can then gather similar situations into the same strata.

### c. Forecast-based stratification

If the question is now: *when given forecasts are issued, how do they behave?*, a forecast-based stratification is justified. Considering first the statistic-oriented strategy, one would ideally define the criterion from the latent distribution *f* from which the ensemble is supposed to be drawn. However, since *f* is unknown, an estimation has to be computed from the ensemble itself. The simplest option is to take *θ* as a single statistic of the ensemble, such as its mean or its standard deviation. Alternatively, the forecast distribution can be considered as a whole, using a distance between distributions such as the *integrated quadratic distance* (Thorarinsdottir et al. 2013), defined for two CDFs *F* and *G* as

$$d(F, G) = \int_{-\infty}^{+\infty} \left[ F(x) - G(x) \right]^2 dx, \qquad (1)$$

which satisfies all axioms of a metric. Its formulation can be seen as an extension of the CRPS as defined later in Eq. (2), where the step function associated with the observation is replaced by a second forecast CDF. Based on such a distance, a clustering technique can then divide the sample *T* into different strata. Such a stratification approach is applied on the rank histogram in the numerical example in section 6.
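
As a minimal sketch (our own naming, assuming ensembles are represented as lists of values), the integrated quadratic distance of Eq. (1) between two empirical CDFs can be computed exactly, since both are step functions that are constant between consecutive breakpoints:

```python
def iqd(xs, ys):
    """Integrated quadratic distance between the empirical CDFs
    of two ensembles (sketch of Eq. (1); names are ours)."""
    xs, ys = sorted(xs), sorted(ys)
    pts = sorted(set(xs) | set(ys))          # breakpoints of both step CDFs

    def ecdf(sample, t):
        return sum(v <= t for v in sample) / len(sample)

    total = 0.0
    for a, b in zip(pts[:-1], pts[1:]):
        # On [a, b) both empirical CDFs are constant, so the squared
        # difference can be integrated in closed form.
        diff = ecdf(xs, a) - ecdf(ys, a)
        total += diff * diff * (b - a)
    return total

# Identical ensembles are at distance zero.
assert iqd([0.0, 1.0], [0.0, 1.0]) == 0.0
```

Outside the smallest and largest breakpoints both CDFs are equal (0 or 1), so the integral over those unbounded tails vanishes and only the finite intervals need to be summed.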

Within the meteorology-oriented strategy, forecasts generated by similar meteorological situations are gathered into the same stratum. This is, however, more complicated than in the observation-based case, where a single meteorological situation is associated with each element of *T*. Here, if forecasts come from a meteorological EPS, each member is associated with its own meteorological situation. In other words, considering for example that *θ* corresponds to the weather regime, there are possibly up to *M* different weather regimes associated with a given forecast. Possible workarounds include taking the regime of the majority of the *M* members, or that of the control member. Nevertheless, none of these methods seems entirely satisfactory, and this aspect is reserved for future studies.

### d. External-based stratification

Finally, one may consider the following: *in a given forecasting environment, how do the forecasts behave?* Here, the forecasting environment refers to information that is *external* to the forecasts and observations themselves (either predictand values or meteorological covariates). As a consequence, this approach can be combined with either of the two previously presented. For example, *θ* can be taken as the location, if spatial disparities are suspected among forecasts for different locations, or as the month of the year, in order to detect seasonal biases in the forecasting model. Note that stratifying along the season is different from stratifying along the weather regime (either observed or predicted), the former considering only the forecast date while the latter is a flow-dependent approach. Furthermore, samples of operational forecasts can cover a period that includes one or several model upgrades. To assess their impact on verification measures, one can therefore take *θ* as the model version. For any of these criteria, stratification is straightforward since the criterion is categorical.

As a concluding remark of this section, the classification of stratification approaches we propose can also be viewed under the perspective of the time at which the criterion *θ* is available. In an observation-based stratification, *θ* is unknown at the forecast time. In a forecast-based stratification, *θ* is known at the forecast time since it directly depends on the forecast (either the ensemble itself or some meteorological covariates). Finally, in an external-based stratification the criterion *θ* is known before the forecast time, as it does not depend on the forecast but only on the forecasting environment discussed above. This perspective is essential for an appropriate usage of stratification in the verification process, as will be discussed in section 7.

## 4. Application on the CRPS

The CRPS is computed, for each element *n* of the verification sample, as

$$\mathrm{crps}^{(n)} = \int_{-\infty}^{+\infty} \left[ F^{(n)}(x) - H\!\left(x - y^{(n)}\right) \right]^2 dx, \qquad (2)$$

where *H* is the Heaviside function, equal to 0 when its argument is negative and to 1 otherwise. It is negatively oriented, meaning that smaller values are better. Since the forecast is in the form of an ensemble, the CDF *F*^{(n)} is in practice replaced by the empirical CDF of the *M* members. The CRPS is usually averaged over the whole sample *T*, yielding what we refer to as the *overall* CRPS:

$$\mathrm{CRPS}_{T} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{crps}^{(n)}. \qquad (3)$$

This quantity can be subject to stratification from two different perspectives.
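
For an ensemble forecast, the integral of Eq. (2) with the empirical CDF admits a well-known closed form, which makes Eqs. (2) and (3) straightforward to compute. The sketch below uses that standard identity (names are ours, not the article's):

```python
def crps_ensemble(members, y):
    """CRPS of an ensemble forecast against observation y, via the
    standard closed form of Eq. (2) for an empirical CDF of M members:
        crps = mean_i |x_i - y| - (1 / (2 M^2)) * sum_{i,j} |x_i - x_j|
    """
    M = len(members)
    term1 = sum(abs(x - y) for x in members) / M
    term2 = sum(abs(xi - xj) for xi in members for xj in members) / (2 * M * M)
    return term1 - term2

def overall_crps(sample):
    """Overall CRPS: average over the N pairs of the sample (Eq. (3))."""
    return sum(crps_ensemble(ens, y) for ens, y in sample) / len(sample)

# A single-member ensemble reduces the CRPS to the absolute error.
assert crps_ensemble([2.0], 5.0) == 3.0
```

The double sum makes this O(*M*²) per pair; a sorted-members formulation is cheaper for large ensembles, but the quadratic version keeps the sketch short.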

### a. Interpretation of the restricted CRPS

Consider first the *restriction* perspective. As suggested by Lerch et al. (2017), we define the *restricted* CRPS, denoted by $\mathrm{CRPS}_{T_s}$, as the score averaged over the *s*th stratum only:

$$\mathrm{CRPS}_{T_s} = \frac{1}{N_s} \sum_{n \in T_s} \mathrm{crps}^{(n)}, \qquad (4)$$

where *N*_{s} is the number of pairs in the *s*th stratum. A central issue is the *propriety* property of the restricted CRPS. As a desirable property, a verification score is *proper* if it rewards forecasters who issue forecasts that correspond to their true belief, and if it does not suggest any explicit hedging strategy (Gneiting and Raftery 2007). Gneiting and Ranjan (2011) have shown that the restricted CRPS is improper under an observation-based stratification with *θ* = *y*: a forecaster who knows that only observations falling in a given range will be verified can improve the expected score by shifting probability mass toward that range.

### b. Decomposition of the overall CRPS using stratification

Consider now the *decomposition* perspective. Using the stratification of the sample into the subsets *T*_{s}, the overall CRPS in Eq. (3) can be rewritten as a weighted sum of the restricted CRPS values:

$$\mathrm{CRPS}_{T} = \sum_{s=1}^{S} \frac{N_s}{N}\, \mathrm{CRPS}_{T_s}. \qquad (5)$$

We propose a graphical representation of this decomposition, defined as the *accumulated stratified* CRPS, which enables one to easily assess the significance of each stratum in terms of its contribution to the overall CRPS.

Note that each stratification approach defined in section 3 can potentially be applied on this decomposition, since the equality in Eq. (5) remains true irrespective of the criterion *θ*.
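
The decomposition of Eq. (5) can be checked numerically. The per-pair scores and stratum assignments below are made-up numbers for illustration:

```python
# Numerical check of Eq. (5): the overall CRPS equals the weighted average
# of the restricted CRPS values, with weights N_s / N.
scores = [0.2, 1.4, 0.1, 3.0, 0.6]           # crps^(n), n = 1..N (made up)
stratum = [0, 1, 0, 1, 0]                     # stratum index of each pair
N = len(scores)

overall = sum(scores) / N                     # Eq. (3)

restricted = {}                               # s -> (N_s, CRPS over stratum s)
for s in set(stratum):
    idx = [n for n in range(N) if stratum[n] == s]
    restricted[s] = (len(idx), sum(scores[n] for n in idx) / len(idx))  # Eq. (4)

decomposed = sum(Ns / N * c for Ns, c in restricted.values())           # Eq. (5)
assert abs(overall - decomposed) < 1e-9
```

The identity holds for any partition of the sample, which is why the decomposition perspective is free of the propriety issue affecting the restricted CRPS taken in isolation.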

## 5. Application on the rank histogram

The rank histogram (RH) is a diagnostic tool aimed at assessing the *calibration* of the forecasts (Anderson 1996; Hamill and Colucci 1997; Talagrand et al. 1997). Note that, in the literature, other words for calibration are sometimes used: *reliability* or *statistical consistency*. Unlike the CRPS, the RH is constructed on a collective basis, meaning over a sufficiently large set of forecast–observation pairs.

### a. Assessing calibration using the rank histogram

Following Jolliffe and Stephenson (2003), a forecasting system is *calibrated* if, and only if, the conditional probability distribution of the observation, given the forecast distribution, equals that forecast distribution:

$$P\!\left( y^{(n)} \le x \,\middle|\, f^{(n)} \right) = F^{(n)}(x) \quad \text{for all } x \text{ and } n, \qquad (6)$$

where *n* is the index of a possible outcome. Since forecasts are in the form of ensembles, Eq. (6) can be formulated as the *M* ensemble members and the observation *y* being drawn from the same distribution *f*^{(n)}, for every *n*. In what follows, we express mathematically how the RH verifies this property.

Consider *y* as a realization from *f*. The rank of the observation within the ensemble indicates between which members *y* falls. If the forecast is calibrated, *y* is therefore just one more draw from *f*, and its rank is uniformly distributed over the *M* + 1 possible positions. Since each individual forecast provides a single rank, it is over a large number of different forecast–observation pairs that uniformity can be assessed. The RH is constructed by counting, over the sample, how often the observation falls into each of the *M* + 1 *bins*, which are said to be populated when the observation falls between the corresponding pair of consecutive members. Under calibration, each bin is expected to be populated equally, relative to the size *N* of the sample. We express such an instance by

$$\frac{N_i}{N} \simeq \frac{1}{M+1}, \quad i = 1, \ldots, M+1, \qquad (7)$$

where *N*_{i} is the number of pairs whose observation populates the *i*th bin, and which tends to the strict equality as *N* approaches infinity. A significant nonflatness indicates a miscalibration. The appealing feature of the RH is that one can graphically learn, from the shape of the histogram, where the deficiencies in the forecasting system lie: a ∪ shape typically indicates underdispersive forecasts, a ∩ shape overdispersive ones, and a sloped histogram a systematic bias.
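
The construction described above can be sketched as follows (a toy illustration with hypothetical names; the uniform toy forecasts are calibrated by construction, so Eq. (7) is expected to hold up to sampling noise):

```python
import random

def rank_of_observation(members, y, rng=random):
    """Rank of y within the M members (1-based bin index among M+1 bins).
    Ties between y and members are resolved at random, as recalled in
    section 6 for variables with a point mass at zero."""
    below = sum(x < y for x in members)
    ties = sum(x == y for x in members)
    return 1 + below + rng.randint(0, ties)

def rank_histogram(sample, M):
    """Bin counts N_i, i = 1..M+1, over a sample of (ensemble, y) pairs."""
    counts = [0] * (M + 1)
    for members, y in sample:
        counts[rank_of_observation(members, y) - 1] += 1
    return counts

# Calibrated toy forecasts: members and observation drawn from the same
# uniform distribution, so each of the M + 1 bins should be populated
# with frequency close to 1 / (M + 1), as in Eq. (7).
rng = random.Random(1)
M, N = 4, 20000
sample = [(sorted(rng.random() for _ in range(M)), rng.random()) for _ in range(N)]
counts = rank_histogram(sample, M)
assert all(abs(c / N - 1 / (M + 1)) < 0.02 for c in counts)
```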

According to the definition in Eq. (6) of calibration from Jolliffe and Stephenson (2003), the proper way to assess calibration would be to construct the RH on a sample containing only forecasts drawn from the same distribution. Constructing it over the whole sample instead assesses the equality *on average*, while the strict definition of calibration would imply the equality holding *for each different distribution*. Murphy and Epstein (1967), Yates (1982), and Bontron (2004) have referred to the former definition as *in-the-large* calibration and to the latter as *in-the-small* calibration. It is important to highlight that in-the-small calibration implies in-the-large calibration, while the contrary is not true. Thus, flatness of the overall RH is a necessary but not sufficient condition for calibration, as first mentioned by Hamill (2001). While the assessment of in-the-small calibration is in practice infeasible, since datasets hardly ever contain several ensemble forecasts drawn from the same distribution, an insight can be obtained with a forecast-based stratification, which gathers into the same stratum forecasts that are similar.

The present framework for forecast calibration differs from the theoretical framework proposed by Gneiting et al. (2007), although connections between the two exist. Gneiting et al. (2007) defined several modes of calibration, namely *probabilistic*, *exceedance*, and *marginal* calibration, with *strong* calibration when all three hold. Their probabilistic calibration is equivalent to the above-defined in-the-large calibration, and is assessed by checking the flatness of the RH constructed over a nonstratified sequence of forecast–observation pairs. Furthermore, they introduced the concept of *completeness*: complete calibration (regarding one or several modes) is verified if the calibration mode(s) holds for any possible subsequence of forecast–observation pairs. This concept, though only loosely defined, shares with the in-the-small calibration definition the idea that assessing calibration over a set of forecasts whose distributions differ from each other faces the risk of having different behaviors that average out. The other modes defined by Gneiting et al. (2007), namely exceedance and marginal calibration, are not considered in our present framework, but it seems reasonable to assume that in-the-small calibration should imply both; at least we cannot think of a counterexample.

### b. The concept of stratified rank histograms

Suppose now that each element of *T* has been assigned to one of the *S* strata according to the criterion *θ*. Equation (7) can then be rewritten as

$$\frac{N_i}{N} = \sum_{s=1}^{S} \frac{N_s}{N}\,\frac{N_{i,s}}{N_s} \simeq \frac{1}{M+1}, \qquad (8)$$

where *N*_{i,s} is the number of pairs of the *s*th stratum whose observation populates the *i*th bin. A RH constructed over the *s*th stratum only (hereafter a *stratified* RH) is then represented by *N*_{i,s}/*N*_{s} as a function of *i*, and its flatness corresponds to

$$\frac{N_{i,s}}{N_s} \simeq \frac{1}{M+1}, \quad i = 1, \ldots, M+1. \qquad (9)$$

We propose a graphical representation in which the *overall* RH is represented as the sum of the *S stratified* RH colored differently. We define this representation as the *accumulated stratified RH*, which enables one to easily assess the contribution of each stratum to the overall RH.

Stratification of the RH is relevant provided that flatness is expected over each stratum for calibrated forecasts. If this condition is not satisfied, one could hardly infer from the shape of the histograms what comes from the stratification process and what is due to miscalibration of the forecasts. Flatness of stratified RHs requires the equality in Eq. (9) to hold for all *s*, which is guaranteed when the criterion *θ* is independent of the rank of the observation within the ensemble.

### c. Forecast-based stratified rank histograms

Bröcker (2008) and Siegert et al. (2012) have shown that, under calibration, flatness is expected when the stratification criterion is a function of the latent distribution *f*, but not when it is a statistic computed from the ensemble itself. Indeed, the finite size of the ensemble causes random sampling errors in any statistic computed from it, so that the assignment of a pair to a stratum depends on the particular draw of the members. Since the observation *y* is also drawn from *f*, these sampling errors are not independent of the rank of the observation, and systematic deviations from flatness can appear in stratified RHs even for perfectly calibrated forecasts.

Two factors play a role in this undesirable artifact. The first one is the frequency at which forecasts overlap the bounds delineating the different strata. This is linked to their relative sharpness (i.e., their sharpness compared with their own climatology): sharp forecasts are less likely to overlap the bounds of the strata, so the sharper the ensembles are, the weaker the artifact. In the limiting case where all ensembles overlap the bounds, the artifact is maximized. The second factor is the ensemble size *M*. The more members the ensembles have, the smaller the random sampling errors of the computed statistic, and thus the weaker the artifact. A way to eliminate the artifact consists of, for each *n*, splitting randomly each ensemble into two parts, one used to compute the stratification criterion and the other to compute the rank of the observation; this, however, comes at the cost of a reduced effective ensemble size.

As an alternative to trying to eliminate the artifact, this article proposes a graphical test to evaluate its impact, so as to take it into account when interpreting histogram shapes. The objective is to construct stratified RHs for the forecast dataset at hand, but under a *perfect-model* assumption. The procedure is as follows: for each element of *T*, one member is randomly withdrawn from the ensemble forecast and considered as the new verifying observation. The so-obtained forecasts are perfectly calibrated with respect to these *pseudo-observations*, since both the forecast members and the pseudo-observation are drawn each time from the same distribution. They also retain the two characteristics of the original forecasts that control the undesirable artifact. Indeed, the ensemble size is not much changed (only one member less), and the relative forecast sharpness should remain equivalent. Note, though, that this assumption is reasonable for large ensembles like ECMWF-Ens but might not hold for much smaller ensembles. The second step consists in applying the forecast-based stratification on this dataset. If the stratified RHs, for each stratum, do not show any significant deviation from flatness, then no undesirable artifact is likely to occur with the same stratification applied on the original forecast–observation pairs. If discrepancies appear for some strata, they have to be taken into account when interpreting the stratified RHs of the original data. If necessary, one can even consider abandoning this stratification. Finally, because the test uses the same sample size as the original data, it also allows one to assess graphically, a priori, how random fluctuations will affect the interpretation of RH shapes.
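
The perfect-model test can be sketched as follows (hypothetical names; the toy forecasts below are sharp relative to their climatology, so near-flat stratified RHs are expected):

```python
import random

def perfect_model_pairs(sample, rng):
    """Perfect-model version of a sample: for each pair, one member is
    withdrawn at random and used as the pseudo-observation, so the
    remaining members are calibrated with respect to it by construction."""
    out = []
    for members, _y in sample:
        members = list(members)
        pseudo_obs = members.pop(rng.randrange(len(members)))
        out.append((sorted(members), pseudo_obs))
    return out

def stratified_rank_histograms(sample, assign):
    """Counts N_{i,s}: one histogram per stratum, with `assign` mapping an
    ensemble to its stratum index (a forecast-based stratification)."""
    hists = {}
    for members, y in sample:
        s = assign(members)
        rank = sum(x < y for x in members)       # ties have probability 0 here
        hists.setdefault(s, [0] * (len(members) + 1))[rank] += 1
    return hists

# Toy usage: forecasts sharp relative to climatology, stratified along the
# ensemble mean. Flatness of each perfect-model stratified RH suggests the
# stratification is free of the finite-ensemble artifact.
rng = random.Random(0)
M, N = 20, 5000
sample = []
for _ in range(N):
    mu = rng.uniform(0.0, 10.0)                  # climatology much wider than f
    sample.append((sorted(rng.gauss(mu, 0.3) for _ in range(M)),
                   rng.gauss(mu, 0.3)))
pm = perfect_model_pairs(sample, rng)
hists = stratified_rank_histograms(pm, lambda e: sum(e) / len(e) > 5.0)
for h in hists.values():
    assert all(abs(c / sum(h) - 1 / len(h)) < 0.03 for c in h)
```

Replacing the sharp toy forecasts with ensembles that all overlap the stratum bound would instead reproduce the artifact described above.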

Results of such a graphical test are given in Fig. 4. The original forecasts are the 50-member ECMWF-Ens MAP forecasts described in section 2. Stratification is done along the ensemble mean, using thresholds in mm (6 h)^{−1} to delineate the strata. Vertically, the effect of the ensemble size is tested, with 49 and 5 (randomly selected) members. Horizontally, the effect of the relative sharpness is tested by considering different lead times: 18–24 and 114–120 h. Indeed, ensemble forecasts become less sharp as lead time increases, because of the limited predictability of the atmosphere. As expected, all overall RHs are flat, as a consequence of the perfect-model assumption. The top-left histogram does not exhibit any visible deviation from flatness for any of the strata, meaning that the stratification applied on this forecast dataset is relevant regarding the artifact described above. However, one can detect slight slope compensations between the strata when reducing the ensemble size from 49 to 5, an effect that is amplified for the 114–120-h lead time. As a consequence, care must be taken in the interpretation of stratified RHs of original data with such characteristics.

### d. Observation-based stratified rank histograms

Although forecast-based stratified RHs are justified for the assessment of *in-the-small* calibration, observation-based stratified RHs look attractive to answer the question: *how did the forecasts behave when specific events have occurred?* In the following, we extend the work of Bröcker (2008) and Siegert et al. (2012) to demonstrate, however, that calibrated forecasts do not lead to flat stratified RHs under an observation-based stratification.

Consider an observation-based stratification that divides the sample into *S* strata according to the value of *y*. Within a given stratum, the verifying observations are, by construction, restricted to a subinterval of the predictand's range. For a calibrated forecast whose distribution overlaps the bounds of this subinterval, the rank of the observation is then no longer uniformly distributed: the probability that the observation populates the *i*th bin depends on *i*, as evidenced by the graphical interpretation in the bottom panel of Fig. 3. In this specific case, the expected shape of the stratified RH depends on the stratum *s*.

As a consequence, flat RHs over the different strata are not expected with calibrated forecasts when stratifying along the observation. The sharper the ensembles are compared with the climatology of the observation, the weaker the deviations from flatness will be. The artifact vanishes only in the limit of forecasts that are sharp enough to never overlap the bounds delineating the strata.
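
The artifact can be reproduced with a small simulation (made-up toy distributions; the forecasts are calibrated by construction, yet the stratified histograms are visibly sloped):

```python
import random

# Forecasts and observations are drawn from the same Gaussian each time,
# so the forecasts are calibrated. Stratifying the ranks along the
# observation (y below or above 5) nevertheless yields nonflat strata.
rng = random.Random(42)
M, N = 9, 30000
low, high = [0] * (M + 1), [0] * (M + 1)
for _ in range(N):
    mu = rng.uniform(0, 10)                       # varying forecast location
    members = sorted(rng.gauss(mu, 1.0) for _ in range(M))
    y = rng.gauss(mu, 1.0)                        # same distribution as members
    rank = sum(x < y for x in members)            # bin index 0..M
    (low if y < 5.0 else high)[rank] += 1

# The low-observation stratum is over-populated in the leftmost bins and
# the high-observation stratum in the rightmost ones.
assert low[0] > low[M] and high[M] > high[0]
```

Intuitively, for forecasts centered near the stratum bound, conditioning on a low (high) observation selects draws from the lower (upper) tail of the forecast distribution, hence the systematic slopes.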

## 6. Numerical example

In this section, we illustrate the potential benefits of stratification through the verification of the two forecast datasets presented in section 2. An *observation*-based stratification along the observed MAP value [in mm (6 h)^{−1}] is first applied, in order to decompose the overall CRPS following Eq. (5). High observed amounts are rare, so the corresponding strata contain comparatively few pairs, which has to be kept in mind when interpreting their contributions.

Then, a *forecast*-based stratification is carried out for the assessment of calibration using RHs. The 42–48-h lead time is considered here. As a preliminary step for both datasets, forecast–observation pairs for the 10 catchments were pooled together, since they were found to behave similarly (according to stratified RHs within an external-based stratification along the catchments, not shown). This enables us to enlarge the size of the sample *T*; in this example, *N* = 18 200 pairs. Then, *T* is stratified using a clustering technique, with the *integrated quadratic distance* [cf. Eq. (1)] as the metric for the dissimilarity between two distributions and the *Ward* distance (Murtagh and Legendre 2014) as the distance between two clusters. Like all other data handling, this process has been done within the R environment (R Development Core Team 2014), using the hclust function from the R package stats. The bottom panels of Figs. 6 and 7 show the distributions populating each stratum of the ECMWF-Ens and ECMWF-Ana forecasts, respectively. To detect whether this stratification is subject to a statistical artifact affecting the interpretation of the forecast-based stratified RHs, the graphical test proposed in section 5c has been applied (not shown), and no significant deviations from flatness due to stratification are expected.
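
A minimal sketch of this clustering step follows (our own implementation; the article uses Ward linkage via R's hclust, while for brevity this sketch uses average linkage on the integrated quadratic distance):

```python
def iqd(xs, ys):
    """Integrated quadratic distance between two empirical CDFs (Eq. (1))."""
    pts = sorted(set(xs) | set(ys))
    f = lambda s, t: sum(v <= t for v in s) / len(s)
    return sum((f(xs, a) - f(ys, a)) ** 2 * (b - a)
               for a, b in zip(pts[:-1], pts[1:]))

def cluster(ensembles, S):
    """Agglomerative clustering of ensembles down to S strata
    (average linkage, pairwise IQD distances)."""
    clusters = [[i] for i in range(len(ensembles))]
    d = {(i, j): iqd(ensembles[i], ensembles[j])
         for i in range(len(ensembles)) for j in range(i + 1, len(ensembles))}

    def linkage(a, b):      # average linkage between two clusters of indices
        return sum(d[min(i, j), max(i, j)] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > S:
        a, b = min(((a, b) for ai, a in enumerate(clusters)
                    for b in clusters[ai + 1:]), key=lambda p: linkage(*p))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters

# Two clearly separated groups of toy two-member ensembles.
ensembles = [[0.0, 0.1], [0.0, 0.2], [5.0, 5.1], [5.0, 5.2]]
parts = cluster(ensembles, 2)
assert sorted(map(sorted, parts)) == [[0, 1], [2, 3]]
```

This naive implementation is quadratic in the number of pairs per merge and is meant only to make the procedure concrete; for *N* = 18 200 pairs, an optimized routine such as hclust is required.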

The accumulated stratified RH is represented in the top panels of Figs. 6 and 7. To ease the interpretation of stratified RH shapes, individual RHs for each stratum are also plotted in the middle panels. Several insights about forecast behavior can be obtained from this stratification. First, recall that when several members take the same value as the verifying observation, the corresponding bins are populated randomly. This occurs very frequently when dealing with variables such as precipitation that have a point mass at zero. Stratum (a) represents forecasts with all members equal to zero: zero observations then populate any of the bins with equal probability, whereas nonzero observations can only populate the last bin.

## 7. Discussion

### a. Should we consider observation-based stratification?

As discussed in section 4 dedicated to the CRPS, previous studies have shown that observation-based stratification is problematic if forecasters need to compute restricted CRPS over specific strata and to interpret them individually from one stratum to another. This would consist, for instance, in using the restricted CRPS for ranking different forecasting systems, or as the objective function of an optimization process within the forecast postprocessing step. For such needs, forecast-based stratification is recommended instead, as the restricted CRPS remains proper. Note also that weighted versions of the CRPS (Gneiting and Ranjan 2011) that emphasize, inside the integral of Eq. (2), a specific region of the predictand’s range are alternative possibilities.

The second approach discussed in section 4 that decomposes the overall CRPS into the contributions coming from the different strata is nonetheless free of theoretical barriers to any stratification strategies. We remind the reader that it is a way to better understand the sensitivity of the overall CRPS to specific subsets of the verification sample, but not to evaluate the forecast accuracy over each subset individually. Possible reasons for advocating an observation-based rather than a forecast-based approach would be, for instance, the desire to learn more about the CRPS behavior on climatological forecasts (widely used as reference forecasts in skill scores), or to ensure same sample sizes in strata when comparing different forecast datasets.

The case of the RH is intrinsically different as it is, unlike the CRPS, constructed and interpreted on a collective basis. It has been shown in section 5 that the stratification process can impact this interpretation, as evidenced by artifacts yielding nonflat RHs constructed with forecasts under the perfect-model assumption. Both observation- and forecast-based stratification approaches are concerned, although to different extents. In the former case, the artifact comes from a misuse of the RH as a way to assess calibration. Calibration (or miscalibration) is indeed a forecast property that one wants to be aware of before observations occur. This is the underlying principle of postprocessing, where forecast biases can be identified and conditioned on forecast or external characteristics so as to be corrected at forecast time. The assessment of calibration has therefore no reason to be conditioned on the future, which would face the risk of drawing erroneous conclusions about forecast behavior. In a past study within a hydrological forecasting context, Bellier et al. (2016) constructed RHs over a sample containing high-flow events selected according to observed peak flow values. Strong deviations from flatness were found, part of which likely reflects the artifact described in section 5d rather than actual miscalibration of the forecasts.

Instead, a forecast-based stratification is perfectly justified, as it tends to approach the “true” assessment of forecast calibration by gathering similar forecasts into the same stratum. The potential artifact in forecast-based stratified RHs is purely statistical and results from the fact that ensemble forecasts have a finite number of members; its magnitude therefore decreases markedly for large ensembles. Moreover, the graphical test we propose, based on the perfect-model assumption, enables one to assess a priori whether or not the stratification is reasonable. We therefore advocate, as long as this care is taken, for forecast-based stratification when computing RHs.
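The spirit of such a perfect-model test can be sketched as follows: replace each observation by a randomly drawn ensemble member, apply the targeted forecast-based stratification, and check the flatness of the stratified RHs. In the toy Python example below (an illustrative setup of our own, stratifying by a case-level spread parameter), the stratified histograms remain near flat, which indicates a stratification free of the statistical artifact:

```python
import numpy as np

rng = np.random.default_rng(7)
n_cases, m = 30000, 10

# Heterogeneous forecast distributions: the spread varies across cases
spread = rng.uniform(0.5, 2.0, size=n_cases)
members = rng.normal(scale=spread[:, None], size=(n_cases, m))

# Perfect-model test: the "observation" is one member drawn at random,
# ranked among the other members (ranks 0..m-1, uniform if no artifact)
pick = rng.integers(m, size=n_cases)
synthetic_obs = members[np.arange(n_cases), pick]
ranks = np.sum(members < synthetic_obs[:, None], axis=1)

# Forecast-based strata (here: by spread); near-flat histograms in every
# stratum indicate that this stratification passes the graphical test
for stratum in (spread < 1.0, spread >= 1.0):
    hist = np.bincount(ranks[stratum], minlength=m) / stratum.sum()
```

A stratification criterion that correlates with the sampling noise of the finite ensemble would instead produce visibly nonflat histograms in this test.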

### b. Connection between CRPS and rank histogram

Hersbach (2000) has proposed a decomposition of the CRPS into a reliability part and a resolution/uncertainty part. The reliability part is closely connected to the RH: for each bin *i*, it involves the squared difference between the average frequency with which the observation falls below the middle of the bin and the corresponding forecast probability *i/M* (for an *M*-member ensemble), so that it vanishes for a flat RH as the size *N* of the sample approaches infinity. It is essential to note, however, that this decomposition does not apply to individual forecast–observation pairs, as the reliability part is defined only at the level of the whole verification sample.
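For reference, the reliability part can be sketched as follows (our notation, which may differ slightly from that of Hersbach 2000):

$$\mathrm{Reli} \;=\; \sum_{i=0}^{M} \bar{g}_i \,\bigl(\bar{o}_i - p_i\bigr)^2, \qquad p_i = \frac{i}{M},$$

where $M$ is the ensemble size, $\bar{g}_i$ is the average width of bin $i$, $\bar{o}_i$ is the average frequency with which the observation lies below the middle of bin $i$, and $p_i$ is the nominal forecast probability. A flat RH makes $\bar{o}_i \approx p_i$ for every bin, and hence a small reliability part.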

### c. The issue of sample size

As mentioned earlier, the verification of ensemble forecasts requires a *sufficiently* large sample of forecast–observation pairs. Otherwise, the average CRPS will fluctuate with each pair added to or withdrawn from the sample, and the RH’s bins will not be populated enough for the shape to be correctly interpreted. Quantifying what *sufficiently* means is beyond the scope of this paper, yet it has been tackled by several authors. Not exhaustively, Candille et al. (2007) propose to account for sample size in the CRPS with bootstrap methods (Efron and Tibshirani 1994). Goodness-of-fit tests for RH flatness exist (Elmore 2005; Jolliffe and Primo 2008), and Bröcker (2008) also suggests plotting each RH on a probability paper in order to give quantitative information as to whether deviations from flatness are due to sample size or indicate a systematic bias. Nevertheless, it is important to highlight that sample size is the major constraint on the stratification process. This is especially true for the assessment of calibration using forecast-based stratified RHs, which would in theory require a large number of ensemble forecasts drawn from the same distribution. Therefore, a compromise has to be found between the need for strata large enough for a robust verification and the desire to learn more about how forecasts behave. For example, the forecast-based stratification in the numerical example was constrained to six strata because of sample size. With such a restricted number of strata in the case of precipitation, it hardly differs from a stratification along the ensemble mean. Nevertheless, the authors have found it interesting to present this more sophisticated method, which can potentially be more worthwhile in other cases.
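As a pointer to the bootstrap approach of Candille et al. (2007), the following is a minimal Python sketch of a percentile bootstrap confidence interval for the average CRPS; the function name `bootstrap_ci` and the toy scores are illustrative assumptions:

```python
import numpy as np

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the average CRPS
    (after Efron and Tibshirani 1994): resample forecast-observation
    pairs with replacement and take quantiles of the resampled means."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_boot)])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# Toy per-pair CRPS values; real ones would come from the verification sample
scores = np.random.default_rng(1).exponential(scale=0.4, size=300)
lo, hi = bootstrap_ci(scores)
```

Applied per stratum, the width of such intervals gives a direct, quantitative feel for when a stratum has become too small for a robust CRPS estimate.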

## 8. Conclusions

In this article, a general framework for stratification was described for the verification of ensemble forecasts of continuous scalar variables, in the pursuit of a better understanding of forecast behavior. Distinctions were made, on the one hand, between *observation*-, *forecast*-, and *external*-based approaches, depending on where the stratification criterion comes from, and, on the other hand, between *statistic*- and *meteorology*-oriented strategies, according to whether the criterion is a function of quantitative outcomes or of meteorological covariates related to physical processes.

The stratification formalism was applied to two widely used verification tools for continuous scalar variables: the CRPS and the rank histogram. For the CRPS, a technique that enables one to easily assess the contribution of each stratum to the overall CRPS has been proposed, which can potentially be applied with any of the abovementioned stratification approaches. However, simply restricting the computation of the average CRPS to a specific subset of the verification sample is problematic in the case of an observation-based stratification, as the CRPS is then rendered improper. For the rank histogram, past related studies have been extended to the observation-based stratification case, where a mathematical and graphical demonstration showed that a flat histogram over each stratum is not expected even with perfectly calibrated forecasts. Therefore, the authors strongly advise against any observation-based stratification when assessing forecast calibration using the rank histogram. Instead, a forecast-based stratification should be preferred, as it tends to approach the “true” assessment of forecast calibration. Past studies have brought to light the risk of a statistical artifact affecting the interpretation of forecast-based stratified rank histograms; we proposed a graphical test, based on the perfect-model assumption, to detect whether the user’s targeted stratification can overcome such an artifact.

The numerical example enabled us to expose the insights that can potentially be gained about forecast behavior. In particular, the assessment of calibration was conducted under a forecast-based stratification using a clustering technique. For the 42–48-h lead time studied, mean areal precipitation forecasts from the ECMWF ensemble prediction system over the 2010–14 period were found to be generally underdispersive, which is a well-known feature of this system at short lead times. Forecasts generated using an analog method were found to be much better calibrated, although some bias compensations were observed.

This article is a contribution to the issue of sample stratification, which we believe should be considered more often in the verification process, as a way to limit the risk of missing key aspects of forecast behavior that would otherwise average out. For future work, we encourage the study of verification tools other than the CRPS and the rank histogram under the stratification framework. Moreover, practical and quantitative guidance about the issue of sample size under stratification is required.

## Acknowledgments

This work has been supported by a grant from Labex OSUG@2020 (Investissements d’avenir—ANR10 LABX56) and Compagnie Nationale du Rhône. Ensemble forecasts from TIGGE were supplied by ECMWF’s TIGGE data portal. The authors thank Michael Scheuerer for helpful discussion about the CRPS. They also thank Stefan Siegert and an anonymous reviewer for their meticulous reviews that greatly improved the quality of the article.

## REFERENCES

Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. *J. Climate*, **9**, 1518–1530, doi:10.1175/1520-0442(1996)009<1518:AMFPAE>2.0.CO;2.

Bellier, J., I. Zin, S. Siblot, and G. Bontron, 2016: Probabilistic flood forecasting on the Rhone River: Evaluation with ensemble and analogue-based precipitation forecasts. *E3S Web Conf.* (*FLOODrisk 2016*), **7**, 18011, doi:10.1051/e3sconf/20160718011.

Ben Daoud, A., E. Sauquet, M. Lang, G. Bontron, and C. Obled, 2011: Precipitation forecasting through an analog sorting technique: A comparative study. *Adv. Geosci.*, **29**, 103–107, doi:10.5194/adgeo-29-103-2011.

Ben Daoud, A., E. Sauquet, G. Bontron, C. Obled, and M. Lang, 2016: Daily quantitative precipitation forecasts based on the analogue method: Improvements and application to a French large river basin. *Atmos. Res.*, **169**, 147–159, doi:10.1016/j.atmosres.2015.09.015.

Bontron, G., 2004: Prévision quantitative des précipitations: Adaptation probabiliste par recherche d’analogues. Utilisation des réanalyses NCEP/NCAR et application aux précipitations du sud-est de la France (Quantitative precipitation forecasts: Probabilistic adaptation by analogues sorting. Use of the NCEP/NCAR reanalyses and application to the south-eastern France precipitations). Ph.D. thesis, Institut National Polytechnique Grenoble (INPG), 276 pp. [Available online at https://tel.archives-ouvertes.fr/tel-01090969/document.]

Bröcker, J., 2008: On reliability analysis of multi-categorical forecasts. *Nonlinear Processes Geophys.*, **15**, 661–673, doi:10.5194/npg-15-661-2008.

Buizza, R., M. Miller, and T. Palmer, 1999: Stochastic representation of model uncertainties in the ECMWF ensemble prediction system. *Quart. J. Roy. Meteor. Soc.*, **125**, 2887–2908, doi:10.1002/qj.49712556006.

Candille, G., and O. Talagrand, 2005: Evaluation of probabilistic prediction systems for a scalar variable. *Quart. J. Roy. Meteor. Soc.*, **131**, 2131–2150, doi:10.1256/qj.04.71.

Candille, G., C. Côté, P. Houtekamer, and G. Pellerin, 2007: Verification of an ensemble prediction system against observations. *Mon. Wea. Rev.*, **135**, 2688–2699, doi:10.1175/MWR3414.1.

Casati, B., and Coauthors, 2008: Forecast verification: Current status and future directions. *Meteor. Appl.*, **15**, 3–18, doi:10.1002/met.52.

Efron, B., and R. J. Tibshirani, 1994: *An Introduction to the Bootstrap*. Chapman and Hall/CRC Press, 456 pp.

Elmore, K. L., 2005: Alternatives to the chi-square test for evaluating rank histograms from ensemble forecasts. *Wea. Forecasting*, **20**, 789–795, doi:10.1175/WAF884.1.

Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. *J. Amer. Stat. Assoc.*, **102**, 359–378, doi:10.1198/016214506000001437.

Gneiting, T., and R. Ranjan, 2011: Comparing density forecasts using threshold- and quantile-weighted scoring rules. *J. Bus. Econ. Stat.*, **29**, 411–422, doi:10.1198/jbes.2010.08110.

Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. *J. Roy. Stat. Soc.*, **69B**, 243–268, doi:10.1111/j.1467-9868.2007.00587.x.

Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. *Mon. Wea. Rev.*, **129**, 550–560, doi:10.1175/1520-0493(2001)129<0550:IORHFV>2.0.CO;2.

Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta-RSM short-range ensemble forecasts. *Mon. Wea. Rev.*, **125**, 1312–1327, doi:10.1175/1520-0493(1997)125<1312:VOERSR>2.0.CO;2.

Hamill, T. M., and S. J. Colucci, 1998: Evaluation of Eta-RSM ensemble probabilistic precipitation forecasts. *Mon. Wea. Rev.*, **126**, 711–724, doi:10.1175/1520-0493(1998)126<0711:EOEREP>2.0.CO;2.

Hamill, T. M., and J. Juras, 2006: Measuring forecast skill: Is it real skill or is it the varying climatology? *Quart. J. Roy. Meteor. Soc.*, **132**, 2905–2923, doi:10.1256/qj.06.25.

Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. *Mon. Wea. Rev.*, **134**, 3209–3229, doi:10.1175/MWR3237.1.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. *Wea. Forecasting*, **15**, 559–570, doi:10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.

Jolliffe, I. T., and D. B. Stephenson, 2003: *Forecast Verification: A Practitioner’s Guide in Atmospheric Science*. John Wiley and Sons, 254 pp.

Jolliffe, I. T., and C. Primo, 2008: Evaluating rank histograms using decompositions of the chi-square test statistic. *Mon. Wea. Rev.*, **136**, 2133–2139, doi:10.1175/2007MWR2219.1.

Lerch, S., T. L. Thorarinsdottir, F. Ravazzolo, and T. Gneiting, 2017: Forecaster’s dilemma: Extreme events and forecast evaluation. *Stat. Sci.*, **32**, 106–127, doi:10.1214/16-STS588.

Marty, R., I. Zin, C. Obled, G. Bontron, and A. Djerboua, 2012: Toward real-time daily PQPF by an analog sorting approach: Application to flash-flood catchments. *J. Appl. Meteor. Climatol.*, **51**, 505–520, doi:10.1175/JAMC-D-11-011.1.

Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. *Manage. Sci.*, **22**, 1087–1096, doi:10.1287/mnsc.22.10.1087.

Michelangeli, P.-A., R. Vautard, and B. Legras, 1995: Weather regimes: Recurrence and quasi stationarity. *J. Atmos. Sci.*, **52**, 1237–1256, doi:10.1175/1520-0469(1995)052<1237:WRRAQS>2.0.CO;2.

Mullen, S. L., and R. Buizza, 2002: The impact of horizontal resolution and ensemble size on probabilistic forecasts of precipitation by the ECMWF ensemble prediction system. *Wea. Forecasting*, **17**, 173–191, doi:10.1175/1520-0434(2002)017<0173:TIOHRA>2.0.CO;2.

Murphy, A. H., 1973: A new vector partition of the probability score. *J. Appl. Meteor.*, **12**, 595–600, doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.

Murphy, A. H., 1995: A coherent method of stratification within a general framework for forecast verification. *Mon. Wea. Rev.*, **123**, 1582–1588, doi:10.1175/1520-0493(1995)123<1582:ACMOSW>2.0.CO;2.

Murphy, A. H., and E. S. Epstein, 1967: Verification of probabilistic predictions: A brief review. *J. Appl. Meteor.*, **6**, 748–755, doi:10.1175/1520-0450(1967)006<0748:VOPPAB>2.0.CO;2.

Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. *Mon. Wea. Rev.*, **115**, 1330–1338, doi:10.1175/1520-0493(1987)115<1330:AGFFFV>2.0.CO;2.

Murtagh, F., and P. Legendre, 2014: Ward’s hierarchical agglomerative clustering method: Which algorithms implement Ward’s criterion? *J. Classif.*, **31**, 274–295, doi:10.1007/s00357-014-9161-z.

Obled, C., G. Bontron, and R. Garçon, 2002: Quantitative precipitation forecasts: A statistical adaptation of model outputs through an analogues sorting approach. *Atmos. Res.*, **63**, 303–324, doi:10.1016/S0169-8095(02)00038-8.

Park, Y.-Y., R. Buizza, and M. Leutbecher, 2008: TIGGE: Preliminary results on comparing and combining ensembles. *Quart. J. Roy. Meteor. Soc.*, **134**, 2029–2050, doi:10.1002/qj.334.

R Development Core Team, 2014: *R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. [Available online at http://www.R-project.org/.]

Schaake, J., and Coauthors, 2007: Precipitation and temperature ensemble forecasts from single-value forecasts. *Hydrol. Earth Syst. Sci. Discuss.*, **4**, 655–717, doi:10.5194/hessd-4-655-2007.

Siegert, S., J. Bröcker, and H. Kantz, 2012: Rank histograms of stratified Monte Carlo ensembles. *Mon. Wea. Rev.*, **140**, 1558–1571, doi:10.1175/MWR-D-11-00302.1.

Tabios, G. Q., and J. D. Salas, 1985: A comparative analysis of techniques for spatial interpolation of precipitation. *J. Amer. Water Resour. Assoc.*, **21**, 365–380, doi:10.1111/j.1752-1688.1985.tb00147.x.

Talagrand, O., R. Vautard, and B. Strauss, 1997: Evaluation of probabilistic prediction systems. *Proc. ECMWF Workshop on Predictability*, Reading, United Kingdom, ECMWF, 1–26.

Teweles, S., and H. Wobus, 1954: Verification of prognostic charts. *Bull. Amer. Meteor. Soc.*, **35**, 455–463.

Thorarinsdottir, T. L., T. Gneiting, and N. Gissibl, 2013: Using proper divergence functions to evaluate climate models. *SIAM/ASA J. Uncertainty Quantif.*, **1**, 522–534, doi:10.1137/130907550.

Vrac, M., and P. Yiou, 2010: Weather regimes designed for local precipitation modeling: Application to the Mediterranean basin. *J. Geophys. Res.*, **115**, D12103, doi:10.1029/2009JD012871.

Yates, J. F., 1982: External correspondence: Decompositions of the mean probability score. *Organ. Behav. Hum. Perform.*, **30**, 132–156, doi:10.1016/0030-5073(82)90237-9.