• Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. J. Climate, 9, 1518–1530, doi:10.1175/1520-0442(1996)009<1518:AMFPAE>2.0.CO;2.
• Bellier, J., I. Zin, S. Siblot, and G. Bontron, 2016: Probabilistic flood forecasting on the Rhone River: Evaluation with ensemble and analogue-based precipitation forecasts. E3S Web Conf. (FLOODrisk 2016), 7, 18011, doi:10.1051/e3sconf/20160718011.
• Ben Daoud, A., E. Sauquet, M. Lang, G. Bontron, and C. Obled, 2011: Precipitation forecasting through an analog sorting technique: A comparative study. Adv. Geosci., 29, 103–107, doi:10.5194/adgeo-29-103-2011.
• Ben Daoud, A., E. Sauquet, G. Bontron, C. Obled, and M. Lang, 2016: Daily quantitative precipitation forecasts based on the analogue method: Improvements and application to a French large river basin. Atmos. Res., 169, 147–159, doi:10.1016/j.atmosres.2015.09.015.
• Bontron, G., 2004: Prévision quantitative des précipitations: Adaptation probabiliste par recherche d’analogues. Utilisation des réanalyses NCEP/NCAR et application aux précipitations du sud-est de la France (Quantitative precipitation forecasts: Probabilistic adaptation by analogues sorting. Use of the NCEP/NCAR reanalyses and application to the south-eastern France precipitations). Ph.D. thesis, Institut National Polytechnique Grenoble (INPG), 276 pp. [Available online at https://tel.archives-ouvertes.fr/tel-01090969/document.]
• Bröcker, J., 2008: On reliability analysis of multi-categorical forecasts. Nonlinear Processes Geophys., 15, 661–673, doi:10.5194/npg-15-661-2008.
• Buizza, R., M. Miller, and T. Palmer, 1999: Stochastic representation of model uncertainties in the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 125, 2887–2908, doi:10.1002/qj.49712556006.
• Candille, G., and O. Talagrand, 2005: Evaluation of probabilistic prediction systems for a scalar variable. Quart. J. Roy. Meteor. Soc., 131, 2131–2150, doi:10.1256/qj.04.71.
• Candille, G., C. Côté, P. Houtekamer, and G. Pellerin, 2007: Verification of an ensemble prediction system against observations. Mon. Wea. Rev., 135, 2688–2699, doi:10.1175/MWR3414.1.
• Casati, B., and Coauthors, 2008: Forecast verification: Current status and future directions. Meteor. Appl., 15, 3–18, doi:10.1002/met.52.
• Efron, B., and R. J. Tibshirani, 1994: An Introduction to the Bootstrap. Chapman and Hall/CRC Press, 456 pp.
• Elmore, K. L., 2005: Alternatives to the chi-square test for evaluating rank histograms from ensemble forecasts. Wea. Forecasting, 20, 789–795, doi:10.1175/WAF884.1.
• Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, doi:10.1198/016214506000001437.
• Gneiting, T., and R. Ranjan, 2011: Comparing density forecasts using threshold- and quantile-weighted scoring rules. J. Bus. Econ. Stat., 29, 411–422, doi:10.1198/jbes.2010.08110.
• Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. J. Roy. Stat. Soc., 69B, 243–268, doi:10.1111/j.1467-9868.2007.00587.x.
• Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560, doi:10.1175/1520-0493(2001)129<0550:IORHFV>2.0.CO;2.
• Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta-RSM short-range ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327, doi:10.1175/1520-0493(1997)125<1312:VOERSR>2.0.CO;2.
• Hamill, T. M., and S. J. Colucci, 1998: Evaluation of Eta-RSM ensemble probabilistic precipitation forecasts. Mon. Wea. Rev., 126, 711–724, doi:10.1175/1520-0493(1998)126<0711:EOEREP>2.0.CO;2.
• Hamill, T. M., and J. Juras, 2006: Measuring forecast skill: Is it real skill or is it the varying climatology? Quart. J. Roy. Meteor. Soc., 132, 2905–2923, doi:10.1256/qj.06.25.
• Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. Mon. Wea. Rev., 134, 3209–3229, doi:10.1175/MWR3237.1.
• Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559–570, doi:10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.
• Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 254 pp.
• Jolliffe, I. T., and C. Primo, 2008: Evaluating rank histograms using decompositions of the chi-square test statistic. Mon. Wea. Rev., 136, 2133–2139, doi:10.1175/2007MWR2219.1.
• Lerch, S., T. L. Thorarinsdottir, F. Ravazzolo, and T. Gneiting, 2017: Forecaster’s dilemma: Extreme events and forecast evaluation. Stat. Sci., 32, 106–127, doi:10.1214/16-STS588.
• Marty, R., I. Zin, C. Obled, G. Bontron, and A. Djerboua, 2012: Toward real-time daily PQPF by an analog sorting approach: Application to flash-flood catchments. J. Appl. Meteor. Climatol., 51, 505–520, doi:10.1175/JAMC-D-11-011.1.
• Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. Manage. Sci., 22, 1087–1096, doi:10.1287/mnsc.22.10.1087.
• Michelangeli, P.-A., R. Vautard, and B. Legras, 1995: Weather regimes: Recurrence and quasi stationarity. J. Atmos. Sci., 52, 1237–1256, doi:10.1175/1520-0469(1995)052<1237:WRRAQS>2.0.CO;2.
• Mullen, S. L., and R. Buizza, 2002: The impact of horizontal resolution and ensemble size on probabilistic forecasts of precipitation by the ECMWF ensemble prediction system. Wea. Forecasting, 17, 173–191, doi:10.1175/1520-0434(2002)017<0173:TIOHRA>2.0.CO;2.
• Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600, doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
• Murphy, A. H., 1995: A coherent method of stratification within a general framework for forecast verification. Mon. Wea. Rev., 123, 1582–1588, doi:10.1175/1520-0493(1995)123<1582:ACMOSW>2.0.CO;2.
• Murphy, A. H., and E. S. Epstein, 1967: Verification of probabilistic predictions: A brief review. J. Appl. Meteor., 6, 748–755, doi:10.1175/1520-0450(1967)006<0748:VOPPAB>2.0.CO;2.
• Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338, doi:10.1175/1520-0493(1987)115<1330:AGFFFV>2.0.CO;2.
• Murtagh, F., and P. Legendre, 2014: Ward’s hierarchical agglomerative clustering method: Which algorithms implement Ward’s criterion? J. Classif., 31, 274–295, doi:10.1007/s00357-014-9161-z.
• Obled, C., G. Bontron, and R. Garçon, 2002: Quantitative precipitation forecasts: A statistical adaptation of model outputs through an analogues sorting approach. Atmos. Res., 63, 303–324, doi:10.1016/S0169-8095(02)00038-8.
• Park, Y.-Y., R. Buizza, and M. Leutbecher, 2008: TIGGE: Preliminary results on comparing and combining ensembles. Quart. J. Roy. Meteor. Soc., 134, 2029–2050, doi:10.1002/qj.334.
• R Development Core Team, 2014: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [Available online at http://www.R-project.org/.]
• Schaake, J., and Coauthors, 2007: Precipitation and temperature ensemble forecasts from single-value forecasts. Hydrol. Earth Syst. Sci. Discuss., 4, 655–717, doi:10.5194/hessd-4-655-2007.
• Siegert, S., J. Bröcker, and H. Kantz, 2012: Rank histograms of stratified Monte Carlo ensembles. Mon. Wea. Rev., 140, 1558–1571, doi:10.1175/MWR-D-11-00302.1.
• Tabios, G. Q., and J. D. Salas, 1985: A comparative analysis of techniques for spatial interpolation of precipitation. J. Amer. Water Resour. Assoc., 21, 365–380, doi:10.1111/j.1752-1688.1985.tb00147.x.
• Talagrand, O., R. Vautard, and B. Strauss, 1997: Evaluation of probabilistic prediction systems. Proc. ECMWF Workshop on Predictability, Reading, United Kingdom, ECMWF, 1–26.
• Teweles, S., and H. Wobus, 1954: Verification of prognostic charts. Bull. Amer. Meteor. Soc., 35, 455–463.
• Thorarinsdottir, T. L., T. Gneiting, and N. Gissibl, 2013: Using proper divergence functions to evaluate climate models. SIAM/ASA J. Uncertainty Quantif., 1, 522–534, doi:10.1137/130907550.
• Vrac, M., and P. Yiou, 2010: Weather regimes designed for local precipitation modeling: Application to the Mediterranean basin. J. Geophys. Res., 115, D12103, doi:10.1029/2009JD012871.
• Yates, J. F., 1982: External correspondence: Decompositions of the mean probability score. Organ. Behav. Hum. Perform., 30, 132–156, doi:10.1016/0030-5073(82)90237-9.

Sample Stratification in Verification of Ensemble Forecasts of Continuous Scalar Variables: Potential Benefits and Pitfalls

  • 1 Université Grenoble Alpes, Grenoble INP, CNRS, IRD, IGE, Grenoble, France
  • 2 Compagnie Nationale du Rhône, Lyon, France

Abstract

In the verification field, stratification is the process of dividing the sample of forecast–observation pairs into quasi-homogeneous subsets, in order to learn more on how forecasts behave under specific conditions. A general framework for stratification is presented for the case of ensemble forecasts of continuous scalar variables. Distinction is made between forecast-based, observation-based, and external-based stratification, depending on the criterion on which the sample is stratified. The formalism is applied to two widely used verification measures: the continuous ranked probability score (CRPS) and the rank histogram. For both, new graphical representations that synthesize the added information are proposed. Based on the definition of calibration, it is shown that the rank histogram should be used within a forecast-based stratification, while an observation-based stratification leads to significantly nonflat histograms for calibrated forecasts. Nevertheless, as previous studies have warned, statistical artifacts created by a forecast-based stratification may still occur, thus a graphical test to detect them is suggested. To illustrate potential insights about forecast behavior that can be gained from stratification, a numerical example with two different datasets of mean areal precipitation forecasts is presented.

© 2017 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Isabella Zin, isabella.zin@univ-grenoble-alpes.fr


1. Introduction

Probabilistic forecasts are nowadays widely used in the meteorological community, since they provide a useful estimate of the predictive uncertainty. In an operational context, these forecasts generally take the form of ensembles representing possible scenarios. Despite progress in the verification field since their emergence, the complexity of their behavior still represents a great challenge for verification practitioners (Casati et al. 2008). In short, verification is the action of assessing the quality of forecasts by comparing them to their corresponding observations (Jolliffe and Stephenson 2003). Since a complete picture of forecast quality cannot be obtained from a single measure, different verification measures have been proposed, which evaluate different attributes (i.e., aspects) of forecast quality (Murphy 1973). All measures, though, have in common the fact that they require a large number of forecast–observation pairs in the verification sample to be statistically robust. To help increase the sample size, various forecasts may be pooled together in the same sample, for example, for different locations, for various ranges of predictands, or from different model versions. However, computing a verification measure over an inhomogeneous sample runs the risk of different forecast behaviors averaging out. Stratification, as the process of partitioning the verification sample into different subsets, aims at conditioning the verification measure on specific conditions, so as to minimize this risk and lead to more insightful verification case studies.

It is difficult to trace back the origin of the term stratification, since the concept probably emerged soon after the first meteorological forecasts were verified. Indeed, authors very often present performance measures for different locations or seasons, which is an implicit way of stratifying the complete verification sample. Such an approach aims at making measures of forecast skill independent of the climatological frequency of the events being verified, which varies both in space and time (Hamill and Juras 2006). Moreover, modelers are accustomed to conditioning verification case studies on specific meteorological conditions when improving numerical weather prediction models. However, it appears that the term stratification has mostly been used in the literature with the purpose of assessing the significance of different subsets in terms of their contribution to the overall verification measure. Murphy (1995), in the first paper devoted to the topic, extended his general framework for forecast verification (Murphy and Winkler 1987) to stratification along different meteorological conditions, in the case of probability forecasts of dichotomous events.

This article concentrates on the field of ensemble forecasts of continuous scalar variables. Hereafter, we consider an ensemble as a discrete approximation of a full forecast distribution. This definition encompasses forecasts issued by meteorological ensemble forecasting techniques (Buizza et al. 1999) but also probabilistic forecasts issued by other forecasting techniques such as statistical adaptations, like the analog method (Obled et al. 2002; Hamill and Whitaker 2006), or single-value forecast dressings (Schaake et al. 2007). Two widely used verification measures for ensemble forecasts are the continuous ranked probability score (CRPS) (Matheson and Winkler 1976; Hersbach 2000; Gneiting and Raftery 2007) and the rank histogram (Anderson 1996; Hamill and Colucci 1997; Talagrand et al. 1997). As a measure of forecast calibration, the rank histogram has been subject to stratification in past studies, as advocated by Hamill (2001). He suggested stratification along a statistic of the ensemble in order to detect conditional biases that would be hidden when computing the rank histogram over the whole sample. As a stratification criterion, authors have used the mean and the standard deviation of the ensemble (Hamill and Colucci 1997), or well-correlated quantities (Hamill and Colucci 1998; Bröcker 2008). A substantial contribution to the underlying theory has been made by Siegert et al. (2012), who described the risk of statistical artifacts that may affect the interpretation of rank histograms when stratifying along a statistic of a finite-size ensemble. Alternatively, Mullen and Buizza (2002) and, indirectly, Bellier et al. (2016) have stratified rank histograms along the observation. Although Siegert et al. (2012) have mentioned the risk of similar artifacts, theoretical aspects related to the latter approach have, to the knowledge of the authors, not been studied yet.

The rank histogram does not, however, evaluate how accurate a forecast is. Verification reports of ensemble forecasts very often include the average CRPS as a summary measure of the overall forecast accuracy. Previous contributions have mostly focused on its decomposition into different parts corresponding to specific attributes of the forecast (Hersbach 2000; Bontron 2004; Candille and Talagrand 2005). Only a few studies (Gneiting and Ranjan 2011; Lerch et al. 2017) have tackled the CRPS under the stratification approach, by studying the properties of the score when it is averaged over a restricted subset of the verification sample.

In this article, we propose a general stratification framework for ensemble forecasts of continuous scalar variables, and detail different ways of stratifying: along a function of the observation, of the forecast or of an external criterion. Within this framework, a new formulation of the average CRPS is derived. Concerning the rank histogram, the work done by Bröcker (2008) and Siegert et al. (2012) is extended to the problematic case in which stratification is made along a function of the observation, where it is shown that calibrated forecasts do not lead to flat histograms over each stratum. For both the CRPS and the rank histogram, new graphical representations are proposed that synthesize the information coming from stratification into low-complexity charts. To evaluate their potential benefits, two real datasets of probabilistic precipitation forecasts, having similar skills but different behaviors, are verified: ensemble forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF) and analog-derived forecasts, statistically adapted from the ECMWF control forecast.

The article is organized as follows. Observation and forecast datasets are presented in section 2. Section 3 describes the stratification formalism, while sections 4 and 5 detail the application on the CRPS and the rank histogram, respectively. Section 6 presents results from a numerical example. Key points related to stratification are discussed in section 7. Section 8 concludes.

2. Observation and forecast data

Some aspects of the stratification framework benefit from illustrations based on real data. For ease of understanding, these data are presented first. The weather variable of interest (i.e., the predictand) is mean areal precipitation (MAP) accumulated at a 6-h time step over hydrological catchments, in the perspective of hydrological forecasting. Please note that the formalism in sections 3, 4, and 5 applies to any other continuous scalar weather variable.

Figure 1 shows the 10 considered catchments located in France just downstream from Lake Geneva, with areas ranging from 290 to 3760 km2. MAP observations used for verification were processed by Météo-France by kriging hourly and daily rain gauge data from the Météo-France network. The considered period is from 1 January 2010 to 31 December 2014.

Fig. 1. Location of the 10 considered catchments in France (including the inner catchment that encloses the Rhone River section).

Two forecast datasets are examined, both coming from the 0000 UTC cycle. The first dataset (labeled ECMWF-Ens) contains 50-member ensemble forecasts produced by the ECMWF Ensemble Prediction System (EPS) (Buizza et al. 1999) and downloaded from the TIGGE database (Park et al. 2008). Only the perturbed members are considered here. Thiessen-based averaging (Tabios and Salas 1985) has been used to transform grid-based forecasts into MAP forecasts. The second dataset (labeled ECMWF-Ana) contains 40-member forecasts produced by statistical downscaling of the ECMWF control forecast using an analog method developed successively by Obled et al. (2002), Bontron (2004), Ben Daoud et al. (2011), Marty et al. (2012), and Ben Daoud et al. (2016). In a nutshell, the synoptic forecast situation, characterized by means of large-scale predictors (geopotential height, temperature, and humidity), is taken from the control member. Then, the most analogous synoptic situations are selected from an archive of reanalyses. Finally, the MAP observations recorded on those dates are collected and constitute the forecast, in the form of an ensemble. Although generally associated with EPSs, the terms ensemble and member are here used to describe analog-based forecasts as well. More information about this dataset can be found in Bellier et al. (2016).

3. General stratification framework

a. Overview

Stratification, within the verification context, is defined as the action of dividing the sample of historical forecast–observation pairs into different subsets, according to a stratification criterion. The different subsets are called strata (singular: stratum). Hereafter, it is considered as implicit that this is done with the intent of computing performance measures over each of these subsets. The underlying objective is to partition the complete sample into strata that show different behaviors, in order to better understand strengths and weaknesses of the studied forecasting system. In this article, verification concerns forecasts of weather variables that are continuous, such as precipitation or temperature, and scalar (as opposed to multivariate), which presupposes a given location and time. It does not exclude, however, the possibility of pooling in the same sample forecasts from different locations and/or times if it is operationally justified. In previous dedicated studies (Bröcker 2008; Siegert et al. 2012), the formalism has been developed for stratification along a function of the forecast only. What follows is an extension where the criterion may depend on any characteristics of the forecast–observation pair.

Consider a continuous and scalar predictand that has a forecast probability density function (PDF) f or cumulative distribution function (CDF) F. Generally, an operational probabilistic forecast is available in the form of an ensemble of values instead of a full distribution. Thus, consider a finite-size ensemble x = (x_1, ..., x_M) drawn independently from f and sorted in ascending order:

x_1 \le x_2 \le \cdots \le x_M,

where M is the size of the ensemble, assumed to be constant. An ensemble constructed this way is called a Monte Carlo ensemble (Siegert et al. 2012). We suppose that operational forecasts behave as such. Finally, let y denote the verifying observation.
Consider that y is a random variable, x is a random vector, and f is a random distribution. Let us denote by y_n, x_n, and f_n their different outcomes. Therefore, one can constitute a verification sample

T = \{ (x_n, y_n), \; n = 1, \ldots, N \},

which contains N forecast–observation pairs, with forecasts in the form of ensembles supposed to be drawn from the latent (and unavailable) distributions f_n.
Consider further that for each forecast–observation pair within T a stratification criterion θ ∈ Ω_θ can be defined, where Ω_θ is the domain of θ. For example, Ω_θ may correspond to a finite set of categories if θ is categorical, or to the set of scalars, vectors of size k, or fields of size k × l if θ is a scalar, a vector, or a field, respectively. The different outcomes of θ are denoted by θ_n. Following Siegert et al. (2012)’s definition, the stratification function

g : \Omega_\theta \to \{1, \ldots, S\}, \quad \theta \mapsto g(\theta),

is the function that maps the criterion θ to one of the S discrete indices corresponding to the different strata. After stratification, the sth stratum contains all forecast–observation pairs satisfying g(θ_n) = s. The S strata are mutually exclusive but collectively exhaustive (i.e., every pair belongs to one and only one stratum), meaning that the stratum sizes N_s sum to N, where N_s is the number of elements of the sth stratum. In this section, the stratification function is described as being either observation, forecast, or external based, depending on the origin of the data the criterion θ is taken as a function of. Possible reasons justifying each of the three approaches are suggested. For the first two, a distinction is made between statistic- and meteorology-oriented strategies: in the former, θ is a direct function of (x_n, y_n), while in the latter θ represents meteorological covariates that are not strictly contained in (x_n, y_n). Consequently, the meteorology-oriented strategy theoretically permits identical pairs (x_n, y_n) that would occur on two different days (which is unlikely) to belong to two different strata. We are unaware of any other attempts to classify the different stratification approaches. The following one is a suggestion, which we have found appropriate to support the conclusions we provide about stratified CRPS and rank histograms.
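
To fix ideas, a minimal sketch of a stratification function for a scalar criterion is given below, written in R (the environment used for the data handling in section 6); the helper name and the thresholds are purely illustrative and not part of the paper.

    ## Illustrative stratification function g() for a scalar criterion theta:
    ## strata are left-open, right-closed intervals delimited by 'breaks'.
    make_stratification <- function(breaks) {
      function(theta) findInterval(theta, breaks, left.open = TRUE) + 1L
    }
    g <- make_stratification(breaks = c(0, 8))   # three strata of a positive predictand
    g(c(0, 2.5, 12))                             # returns 1 2 3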

b. Observation-based stratification

A forecaster may wonder: how did the forecasts behave when specific events occurred? Such a question calls for an observation-based stratification. Within the statistic-oriented strategy, the criterion is taken as the verifying observation, that is, θ = y. The stratification function is then defined as g(θ_n) = s if and only if y_n ∈ I_s, where I_s defines, for each stratum, an interval of the real line (the positive half line for precipitation). For example, suppose that one wants to better understand the forecast behavior when heavy rain events (say, above 30 mm day−1) have occurred. The sample T will then be divided into two strata delimited by that threshold, and the second stratum, containing the elements satisfying y_n ≥ 30 mm day−1, will be carefully examined. Such a stratification approach is applied on the CRPS in the numerical example in section 6.
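
As an illustration, such an observation-based stratification can be sketched in R with the make_stratification() helper from section 3a above; the observation vector y and the 30 mm day−1 threshold are used only to mirror the example and are not taken from the paper’s data.

    ## Observation-based stratification of the verification sample along y.
    ## 'y' is assumed to be the vector of the N observed MAPs.
    g <- make_stratification(breaks = c(30))
    s <- g(y)               # stratum index of each forecast-observation pair
    table(s)                # N_s, the number of pairs per stratum
    heavy <- which(s == 2)  # pairs for which heavy rain (y above 30 mm/day) occurred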

Within the meteorology-oriented strategy, θ is taken as one or several meteorological covariate(s) of the observation. A typical example is the weather regime, defined as a large-scale spatial atmospheric pattern that has been identified among a finite set of possible ones (Michelangeli et al. 1995; Vrac and Yiou 2010). The information about the observed weather regime is not strictly contained in y, but close links may exist between both. For precipitation, for instance, an observed anticyclonic weather regime is strongly associated with outcomes where y_n = 0. In the easiest case, observed weather regimes have, for the N elements of T, already been identified and classified into one of the possible weather regimes. Then θ_n refers to a given weather regime and the stratification function can easily be defined. Otherwise, θ contains information about the weather regime, in the form of, for example, a vector of different meteorological variables or a spatial field of a given variable (e.g., geopotential height). In this case, stratification requires the definition of a distance metric, like the Euclidean distance if θ is a vector, or the S1 score (Teweles and Wobus 1954) if θ is a field. Based on the computation of this distance over all pairs of outcomes θ_n, one can classify each observed synoptic situation into a discrete number of classes using a clustering method.

c. Forecast-based stratification

If the question is now: when given forecasts are issued, how do they behave?, a forecast-based stratification is justified. Considering first the statistic-oriented strategy, one can take the criterion as a numerical statistic κ of the forecast PDF f from which the ensemble is supposed to be drawn. However, since the latent forecast PDF f is unknown, an estimation κ̂ computed from the finite-size ensemble x has to be used instead. Siegert et al. (2012) propose for κ̂ the mean, the median, the spread (as the standard deviation), the interquartile range, or the total range between the smallest and the largest ensemble members. The stratification function can then be defined as follows: g(θ_n) = s if and only if κ̂_n ∈ I_s, where the I_s define intervals for each stratum.

Taking the criterion θ as a single statistic of x does not ensure, though, that forecasts gathered into the same stratum are similar from the statistical perspective. For example, two ensemble forecasts can have a similar spread but a very different mean. Clustering techniques therefore constitute an alternative approach to gather into the same stratum ensemble forecasts that have similar distributions according to a given distance metric. To measure how similar two CDFs F_i and F_j are, we propose the use of the integrated quadratic distance (Thorarinsdottir et al. 2013), defined as

d(F_i, F_j) = \int_{-\infty}^{+\infty} [F_i(x) - F_j(x)]^2 \, dx,    (1)

which satisfies all axioms of a metric. Its formulation can be seen as an extension of the CRPS as defined later in Eq. (2), where the second distribution is no longer a Heaviside function. Discretization is necessary for computation since the forecasts F_i and F_j are in the form of ensembles x_i and x_j. Practically, d(F_i, F_j) is computed over all couples of forecasts in T, and a clustering method is used to divide the sample T into the different strata. Such a stratification approach is applied on the rank histogram in the numerical example in section 6.
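
As a rough illustration, the integrated quadratic distance between two ensemble forecasts can be approximated from their empirical CDFs. The R sketch below uses a simple Riemann sum on a regular grid; the grid resolution is an arbitrary choice for illustration, not the authors’ implementation.

    ## Approximate integrated quadratic distance of Eq. (1) between two ensembles
    ## x1 and x2, using their empirical CDFs on a common grid (Riemann sum).
    iqd <- function(x1, x2, n_grid = 1000) {
      grid <- seq(min(x1, x2), max(x1, x2), length.out = n_grid)
      dx   <- grid[2] - grid[1]
      sum((ecdf(x1)(grid) - ecdf(x2)(grid))^2) * dx
    }
    # iqd(rnorm(50), rnorm(50, mean = 1))   # example call on two synthetic ensembles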

Within the meteorology-oriented strategy, forecasts generated by similar meteorological situations are gathered into the same stratum. This is, however, more complicated than in the observation-based case, where a single meteorological situation is associated with each element of T. Here, if forecasts come from a meteorological EPS, each member is associated with a given meteorological situation. In other words, considering for example that θ corresponds to the weather regime, there are possibly M different weather regimes associated with a given forecast x_n. Different methods should be studied, such as taking as the stratification criterion the most likely weather regime over the M members, or that of the control member. Nevertheless, none of these methods seems entirely satisfactory, and this aspect is left for future studies.

d. External-based stratification

Finally, one may consider the following: in a given forecasting environment, how do the forecasts behave? Here, the forecasting environment refers to information that is external to the forecasts and observations themselves (either predictand values or meteorological covariates). As a consequence, this approach can be combined with either of the two previously presented. For example, θ can be taken as the location, if spatial disparities are suspected among forecasts for different locations, or as the month of the year, in order to detect seasonal biases in the forecasting model. Note that stratifying along the season is different from stratifying along the weather regime (either observed or predicted): the former considers only the forecast date, while the latter is a flow-dependent approach. Furthermore, samples of operational forecasts can cover a period that includes one or several model upgrades. To assess their impact on verification measures, one can therefore take θ as the model version. For any of these criteria, Ω_θ can be defined so as to easily construct the stratification function g.

As a concluding remark of this section, the classification of stratification approaches we propose can also be viewed from the perspective of the time at which the criterion θ is available. In an observation-based stratification, θ is unknown at the forecast time. In a forecast-based stratification, θ is known at the forecast time, since it directly depends on the forecast (either the ensemble itself or some meteorological covariates). Finally, in an external-based stratification the criterion θ is known before the forecast time, as it does not depend on the forecast but only on the forecasting environment discussed above. This perspective is essential for an appropriate usage of stratification in the verification process, as will be discussed in section 7.

4. Application on the CRPS

The CRPS is a verification measure that evaluates the overall accuracy of a probabilistic forecast by estimating the quadratic distance between the CDF of the forecast and the observation (Matheson and Winkler 1976; Hersbach 2000; Gneiting and Raftery 2007). It is defined, for an element n of the verification sample, as

\mathrm{CRPS}(F_n, y_n) = \int_{-\infty}^{+\infty} [F_n(x) - H(x - y_n)]^2 \, dx,    (2)

where

H(u) = 0 \ \text{if} \ u < 0, \qquad H(u) = 1 \ \text{if} \ u \ge 0

is the Heaviside function. It is negatively oriented, meaning that smaller values are better. Since the forecast is in the form of an ensemble x_n, the formulation in Eq. (2) has to be discretized for computation, as proposed by Hersbach (2000). However, the way the individual CRPS(F_n, y_n) are computed does not influence the stratification process. In practice, the CRPS is averaged over a sufficiently large sample T, yielding what we refer to as the overall CRPS:

\overline{\mathrm{CRPS}} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{CRPS}(F_n, y_n).    (3)
This quantity can be subject to stratification from two different perspectives.
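
For an ensemble forecast, one common discretization of Eq. (2) is the kernel (or energy) form of the CRPS, which amounts to evaluating Eq. (2) with the empirical CDF of the members. The R sketch below is only one possible implementation, not the specific algorithm of Hersbach (2000); x denotes a vector of M members and y the verifying observation.

    ## Ensemble CRPS in kernel form: mean|x_i - y| - 0.5 * mean|x_i - x_j|,
    ## equivalent to Eq. (2) evaluated with the empirical CDF of the M members.
    crps_ens <- function(x, y) {
      mean(abs(x - y)) - 0.5 * mean(abs(outer(x, x, "-")))
    }
    # crps_ens(rgamma(50, shape = 0.5), 2.3)   # example with a synthetic 50-member ensemble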

a. Interpretation of the restricted CRPS

First, we review the interpretation of the stratified CRPS from the restriction perspective. As suggested by Lerch et al. (2017), we define the restricted CRPS, denoted by \overline{\mathrm{CRPS}}_s, as the CRPS averaged over the elements of the sth stratum:

\overline{\mathrm{CRPS}}_s = \frac{1}{N_s} \sum_{n \,:\, g(\theta_n) = s} \mathrm{CRPS}(F_n, y_n),    (4)

where N_s is the number of elements in this stratum. It is indeed an appealing approach, especially in the evaluation of the forecast accuracy for extreme events, to compute and interpret the restricted CRPS over small subsets of elements. However, Gneiting and Ranjan (2011) and Lerch et al. (2017) have shown that unwanted effects may appear. In particular, they have studied the propriety of the restricted CRPS. As a desirable property, a verification score is proper if it rewards forecasters who issue forecasts that correspond to their true belief, and if it does not suggest any explicit hedging strategy (Gneiting and Raftery 2007). Gneiting and Ranjan (2011) have shown that the restricted CRPS is improper under an observation-based stratification with θ = y. Being aware of the stratification function that restricts the observation to a specific stratum, forecasters are encouraged to issue forecasts that emphasize this stratum and therefore differ from their true belief. Numerical examples that evidence this effect can be found in Lerch et al. (2017). However, they note that the restricted CRPS under a forecast-based stratification remains proper. We refer to the above references for more details.

b. Decomposition of the overall CRPS using stratification

In this paper, we rather suggest an approach from a decomposition perspective. Using the stratification function g, with mutually exclusive but collectively exhaustive strata, one can derive the following expression:

\overline{\mathrm{CRPS}} = \sum_{s=1}^{S} \frac{N_s}{N} \, \overline{\mathrm{CRPS}}_s.    (5)

Equation (5) is the underlying equation of the graphical representation we propose in Fig. 2, where the relative contributions of each stratum are colored differently and sum up to the overall CRPS. We define this representation as the accumulated stratified CRPS, which enables one to easily assess the significance of each stratum in terms of contribution to the overall CRPS.
Fig. 2. Accumulated stratified CRPS for (left) ECMWF-Ens and (right) ECMWF-Ana forecasts, as a function of lead time. The stratification is done along the observation y.

Note that each stratification approach defined in section 3 can potentially be applied to this decomposition, since the equality in Eq. (5) remains true irrespective of g. This statement extends to any other verification score that is defined for each individual forecast–observation pair, since Eqs. (4) and (5) are independent of the definition of the CRPS in Eq. (2).
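
The decomposition in Eq. (5) is straightforward to check numerically. The R sketch below assumes a vector crps of per-pair CRPS values (e.g., obtained with the crps_ens() sketch above) and the stratum indices s from a stratification function; both names are illustrative.

    ## Check of Eq. (5): the overall CRPS equals the weighted sum of the
    ## restricted CRPS of each stratum, with weights N_s / N.
    overall    <- mean(crps)
    restricted <- tapply(crps, s, mean)                 # restricted CRPS per stratum, Eq. (4)
    weights    <- as.numeric(table(s)) / length(crps)   # N_s / N
    all.equal(overall, sum(weights * as.numeric(restricted)))  # TRUE for any stratification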

5. Application on the rank histogram

The rank histogram (RH) is a diagnostic tool aimed at assessing the calibration of the forecasts (Anderson 1996; Hamill and Colucci 1997; Talagrand et al. 1997). Note that, in the literature, other words for calibration are sometimes used: reliability or statistical consistency. Unlike the CRPS, the RH is constructed on a collective basis, meaning over a sufficiently large set of forecast–observation pairs.

a. Assessing calibration using the rank histogram

As defined by Jolliffe and Stephenson (2003), a forecasting system is calibrated if, and only if, the conditional probability distribution of the observation, given any chosen forecast distribution f_n, is itself equal to f_n:

P(y \le t \mid f_n) = F_n(t) \quad \text{for all } t,    (6)

for all possible f_n. Recall that n is the index of a possible outcome. Since forecasts are in the form of ensembles, Eq. (6) can be formulated as the ensemble members (x_1, ..., x_M) and the observation y being drawn from the same distribution for each outcome n. In what follows, we express mathematically how the RH verifies this property.
Consider a given forecast distribution f_n from which ensembles x are drawn. Let us extend the ensembles with fictional bounding members x_0 = -∞ and x_{M+1} = +∞. The key point here is that potentially different x can be drawn from the same f_n. Then, consider the observation y as a realization from a distribution denoted hereafter by h_n. Let p_i, for i = 1, ..., M + 1, be random variables corresponding to the conditional probability that the observation y falls between members x_{i-1} and x_i, given a specific x drawn from f_n:

p_i = P(x_{i-1} < y \le x_i \mid x, \, y \sim h_n)    (7)
    = \int_{x_{i-1}}^{x_i} h_n(t) \, dt.    (8)

If forecasts are calibrated, h_n is equal to f_n. The observation y is therefore just one more draw from f_n. Thus, over a large number of different x drawn from the same f_n, it is equally likely to fall within each interval (x_{i-1}, x_i], for i = 1, ..., M + 1. As a consequence, since there are M + 1 intervals, calibration implies

E[p_i] = \frac{1}{M+1}, \quad i = 1, \ldots, M+1,    (9)

where E[⋅] denotes the expectation over the different x drawn from the same f_n (Hamill 2001). A graphical interpretation of the p_i in case of calibration is proposed in the top panel of Fig. 3, with a given 40-member ensemble outcome. In this example, two of the probabilities p_i (represented as shaded areas) take different values, due to the fact that this given x is a Monte Carlo ensemble and not a vector of equally spaced quantiles. Nevertheless, their expectations are both equal to 1/(M + 1).
Fig. 3. (top) Graphical interpretation of the conditional probabilities p_i that the observation falls between members x_{i-1} and x_i, given the ensemble x drawn from f_n, in case of forecast calibration. (bottom) As in (top), but in case of an observation-based stratification (θ = y).

The RH aims at verifying whether the equality in Eq. (9) holds, by plotting

\frac{1}{N} \, \#\{\, n : x_{i-1,n} < y_n \le x_{i,n} \,\}    (10)

as a function of i = 1, ..., M + 1, where #{⋅} denotes the number of cases for which the condition ⋅ is true. The intervals (x_{i-1}, x_i] are called bins, which are said to be populated when the observation falls within them. When two or more members within x_n take the same value as y_n, the corresponding bins are populated randomly. If forecasts are calibrated [i.e., the equality in Eq. (9) holds], the so-obtained histogram should be flat, apart from fluctuations due to the finite size N of the sample. We express such an instance by

\frac{1}{N} \, \#\{\, n : x_{i-1,n} < y_n \le x_{i,n} \,\} \simeq \frac{1}{M+1}, \quad i = 1, \ldots, M+1,    (11)

which tends to the strict equality as N approaches infinity. A significant nonflatness indicates a miscalibration. The appealing feature of the RH is that one can graphically learn, from the shape of the histogram, where the deficiencies of the forecasting system lie: a ∪ shape and a ∩ shape indicate under- and overdispersion, respectively, while an upslope (/) shape and a downslope (\) shape indicate negative and positive bias, respectively (Hamill 2001).
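
A minimal R sketch of this construction is given below, assuming X is an N × M matrix of ensemble members and y the vector of N verifying observations (both names are illustrative); ties, which matter for precipitation because of the point mass at zero, are broken at random as described above.

    ## Rank of each observation within its ensemble (rank 1..M+1), ties broken randomly,
    ## then the bin frequencies of Eq. (10); the bar plot should be flat under calibration.
    obs_rank <- sapply(seq_len(nrow(X)), function(n)
      rank(c(y[n], X[n, ]), ties.method = "random")[1])
    bin_freq <- tabulate(obs_rank, nbins = ncol(X) + 1) / nrow(X)
    barplot(bin_freq, names.arg = seq_len(ncol(X) + 1))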

According to the definition of calibration in Eq. (6) from Jolliffe and Stephenson (2003), the proper way to assess calibration would be to construct the RH on a sample containing only forecasts drawn from the same distribution f_n, and to repeat the process for all possible f_n. Obviously, this requirement is impossible to fulfill in an operational context. Instead, RHs are generally constructed over samples of forecasts whose distributions differ from each other. Such an overall RH verifies whether the equality in Eq. (11) holds on average, while the strict definition of calibration would imply the equality holding for each different distribution. Murphy and Epstein (1967), Yates (1982), and Bontron (2004) have referred to the former definition as in-the-large calibration and to the latter as in-the-small calibration. It is important to highlight that in-the-small calibration implies in-the-large calibration, while the contrary is not true. Thus, flatness of the overall RH is a necessary but not sufficient condition for calibration, as first mentioned by Hamill (2001). Although the assessment of in-the-small calibration is in practice infeasible, since datasets hardly contain several ensemble forecasts drawn from the same distribution, an insight can be obtained with a forecast-based stratification, which gathers into the same stratum forecasts that are similar.

The present framework for forecast calibration differs from the theoretical framework proposed by Gneiting et al. (2007), although connections between both exist. Gneiting et al. (2007) defined several modes of calibration, namely probabilistic, exceedance, and marginal calibration, with strong calibration when all three hold. Their probabilistic calibration is equivalent to the above-defined in-the-large calibration, and assessed by checking the flatness of the RH constructed over a nonstratified sequence of forecast–observation pairs. Furthermore, they introduced the concept of completeness: complete calibration (regarding one or several modes) is verified if the calibration mode(s) holds for any possible subsequences of forecast–observation pairs. This concept, loosely defined though, shares with the in-the-small calibration definition the idea that the assessment of calibration over a set of forecasts in which distributions differ from each other faces the risk of having different behaviors that average out. The other modes defined by Gneiting et al. (2007), namely, exceedance and marginal calibration, are not considered in our present framework, but it seems reasonable to assume that in-the-small calibration should imply both exceedance and marginal calibration; at least we cannot think of a counterexample.

b. The concept of stratified rank histograms

After having defined the theoretical aspects of the RH, consider a stratification along the criterion θ. Equation (7) can then be rewritten as

p_i^{(s)} = P(x_{i-1} < y \le x_i \mid x, \, y \sim h_n, \, g(\theta) = s).    (12)

An RH constructed over the stratum s is then represented by

\frac{1}{N_s} \, \#\{\, n : x_{i-1,n} < y_n \le x_{i,n}, \; g(\theta_n) = s \,\}    (13)

as a function of i. For mutually exclusive but collectively exhaustive strata, one can write

\#\{\, n : x_{i-1,n} < y_n \le x_{i,n} \,\} = \sum_{s=1}^{S} \#\{\, n : x_{i-1,n} < y_n \le x_{i,n}, \; g(\theta_n) = s \,\}    (14)

for i = 1, ..., M + 1, which is the underlying equation of the graphical representation we propose in Fig. 4. The overall RH is represented as the sum of the S stratified RHs, colored differently. We define this representation as the accumulated stratified RH, which enables one to easily assess the contribution of each stratum to the overall RH.
Fig. 4. Accumulated stratified rank histograms considering different lead times and ensemble sizes, for ECMWF-Ens forecasts under the perfect-model assumption (one random member is used as the verifying observation), in order to detect deviations from flatness due to a forecast-based stratification. The stratification is done along the ensemble mean; the three strata are shown in light, medium, and dark blue [bounds in mm (6 h)−1].

Stratification of the RH is relevant provided that flatness is expected over each stratum for calibrated forecasts. If this condition is not satisfied, one could hardly infer from the shape of the histograms what comes from the stratification process and what is due to miscalibration of the forecasts. Flatness of stratified RHs requires the equality in Eq. (9) to hold conditionally on each stratum. This is obviously true for any external-based stratification, since θ is independent of the forecast–observation pair. However, several pitfalls affect the forecast- and observation-based stratification cases.

c. Forecast-based stratified rank histograms

Bröcker (2008) and Siegert et al. (2012) have shown that, under calibration, flatness is expected when the stratification criterion is the forecast distribution itself or a covariate of the forecast. However, Siegert et al. (2012) have demonstrated that deviations may occur in some cases where θ = κ̂, a statistic of f computed from the ensemble x. This short paragraph aims at summarizing their findings; more details and a mathematical demonstration can be found in their paper. The fact that κ̂ is only an estimation of the true statistic κ causes random sampling errors that are turned into a systematic error when stratifying. To understand this, let us imagine a dataset of calibrated ensemble forecasts all drawn from the same distribution f (e.g., climatological forecasts), and divided into two same-size strata according to their estimated mean. Because of the random sampling errors, the first stratum will contain more forecasts whose estimated mean underestimates the true mean, and the second stratum more forecasts whose estimated mean overestimates it. Since the verifying observation y is also drawn from f, the probability that it falls above (below) the ensemble members is higher than under calibration in the first (second) stratum. This will lead to an upslope (downslope) histogram, although the forecasts are perfectly calibrated. When statistics other than the mean are considered, different forms of deviations from flatness in stratified RHs are expected, but we refer to Siegert et al. (2012) for more details.

Two factors play a role in this undesirable artifact. The first one is the frequency at which forecasts overlap the bounds delineating the different strata. This is linked to their relative sharpness (i.e., their sharpness compared to their own climatology). Sharp forecasts are less likely to overlap the bounds of the strata. Therefore, the sharper the ensembles are, the weaker the artifact. In the above example, the artifact is maximized by the fact that all ensembles are drawn from the same distribution f. Hence, f represents the forecast climatology, the relative sharpness is null, and consequently all ensembles overlap the bounds delineating the different strata. The second factor is the ensemble size M. The more members the ensembles have, the smaller the random sampling errors are, and as a consequence the weaker the artifact. To completely get rid of such an artifact, Siegert et al. (2012) discuss the possibility of randomly splitting each ensemble into two subensembles: the first subensemble would be used to compute κ̂, while the second would be considered for the RH and subject to stratification. The disadvantage of this method is that verification is made on forecasts containing less information than the raw forecasts, since only half of the members are considered.

As an alternative to trying to eliminate the artifact, this article proposes a graphical test to evaluate its impact, so that it can be taken into account when interpreting histogram shapes. The objective is to construct stratified RHs for the forecasts to be verified, but under a perfect-model assumption. The procedure is as follows: for each element of T, one member is randomly withdrawn from the ensemble forecast and considered as the new verifying observation. The so-obtained forecasts are perfectly calibrated with respect to these pseudo-observations, since both the forecast members and the pseudo-observation are drawn each time from the same distribution. They also correctly reproduce the two characteristics of the original forecasts that matter for the undesirable artifact. Indeed, the ensemble size is barely changed (only one member fewer), and the relative forecast sharpness should remain equivalent. Note, though, that this assumption is reasonable for large ensembles like ECMWF-Ens but might not hold for much smaller ensembles. The second step consists in applying the forecast-based stratification to this dataset. If the stratified RHs, for each stratum, do not show any significant deviation from flatness, then no undesirable artifact is likely to occur with the same stratification applied to the original forecast–observation pairs. If discrepancies appear for some strata, they have to be taken into account when interpreting the stratified RHs of the original data. If necessary, one can even consider abandoning this stratification. Another interesting point of such a graphical test is that considering the same sample size enables us to graphically assess, a priori, how random fluctuations will affect the interpretation of RH shapes when considering the original data.
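
A rough R sketch of this graphical test is given below. It assumes X is an N × M matrix of ensemble members (a name used only for illustration); the thresholds on the ensemble mean are likewise hypothetical, as the exact values used in Fig. 4 are not reproduced here.

    ## Perfect-model graphical test: withdraw one random member per forecast as the
    ## pseudo-observation, stratify the reduced ensembles along their mean, and draw
    ## one rank histogram per stratum (any nonflatness is a stratification artifact).
    N <- nrow(X); M <- ncol(X)
    pick     <- sample(M, N, replace = TRUE)
    pseudo_y <- X[cbind(seq_len(N), pick)]                         # pseudo-observations
    X_red    <- t(sapply(seq_len(N), function(n) X[n, -pick[n]]))  # M - 1 remaining members
    strat    <- findInterval(rowMeans(X_red), c(0.1, 1), left.open = TRUE) + 1L  # hypothetical bounds
    r <- sapply(seq_len(N), function(n)
      rank(c(pseudo_y[n], X_red[n, ]), ties.method = "random")[1])
    par(mfrow = c(1, length(unique(strat))))
    for (k in sort(unique(strat))) barplot(tabulate(r[strat == k], nbins = M))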

Results of an experiment with such a graphical test are given in Fig. 4. The original forecasts are the 50-member ECMWF-Ens MAP forecasts described in section 2. Stratification is done along the ensemble mean, with three strata defined by thresholds on the mean [unit: mm (6 h)−1]. Vertically, the effect of the ensemble size is tested, with 49 and 5 (randomly selected) members. Horizontally, the effect of the relative sharpness is tested, by considering different lead times of the forecasts: 18–24 and 114–120 h. Indeed, ensemble forecasts become less sharp as lead time increases, because of the limited predictability of the atmosphere. As expected, all overall RHs are flat, as a consequence of the perfect-model assumption. The top-left histogram does not exhibit any visible deviation from flatness for any of the strata, meaning that the stratification applied to this forecast dataset is relevant with regard to the artifact described above. However, one can detect slight slope compensations between the strata when reducing the ensemble size from 49 to 5, an effect that is amplified for the 114–120-h lead time. As a consequence, care must be taken in the interpretation of stratified RHs when going back to original data with such characteristics.

d. Observation-based stratified rank histograms

Although forecast-based stratified RHs are justified for the assessment of in-the-small calibration, observation-based stratified RHs look attractive to answer the question: how did the forecasts behave when specific events occurred? In the following, we extend the work of Bröcker (2008) and Siegert et al. (2012) to demonstrate, however, that calibrated forecasts do not lead to flat stratified RHs under an observation-based stratification.

Considering θ = y, the stratification function is defined as g(θ_n) = s if and only if y_n ∈ I_s, where the I_s are the intervals defining the S strata. Let us denote by l_s and u_s the lower and upper bounds of I_s, for s = 1, ..., S. Then, using the definition of truncated distributions, Eq. (12) becomes

p_i^{(s)} = P(x_{i-1} < y \le x_i \mid x, \, y \sim h_n, \, y \in I_s)    (15)
          = \int_{x_{i-1}}^{x_i} \tilde{h}_n^{(s)}(t) \, dt,    (16)

where \tilde{h}_n^{(s)} is the truncated PDF defined by

\tilde{h}_n^{(s)}(t) = \frac{h_n(t)}{\int_{l_s}^{u_s} h_n(u) \, du} \ \text{if} \ t \in I_s, \qquad \tilde{h}_n^{(s)}(t) = 0 \ \text{otherwise}.    (17)

If forecasts are calibrated, h_n is equal to f_n. However, with the truncated PDF defined as such, p_i^{(s)} strongly depends on i, as evidenced by the graphical interpretation in the bottom panel of Fig. 3. In this specific case, the probabilities associated with intervals (x_{i-1}, x_i] lying outside I_s vanish, while the others are higher than in the top panel because of the renormalization by the probability mass within I_s. One can then easily figure out that the p_i^{(s)} are not equal for all i, due to the fact that, in this case, the ensemble x overlaps the bounds l_s and u_s delineating the stratum s.

As a consequence, flat RHs over the different strata are not expected with calibrated forecasts when stratifying along the observation. The sharper the ensembles are compared to the climatology of the observation, the weaker the deviations from flatness will be. The artifact vanishes as the forecast probability below l_s goes to zero and the forecast probability below u_s goes to one (i.e., when the ensemble forecast has no chance to overlap l_s and u_s). Otherwise, stratified RHs will be impacted. Figure 5 shows RHs stratified along the observation for the same perfect-model forecast datasets as in Fig. 4, with 49 members (keep in mind that one random member is used as the verifying observation). To illustrate the effect of the relative sharpness of the forecasts, the 18–24- and 114–120-h lead times are considered. We observe that deviations from flatness are well pronounced for both. As will be discussed in section 7, we therefore strongly advise against observation-based stratification when constructing RHs.

Fig. 5. Accumulated stratified rank histograms considering different lead times, for ECMWF-Ens forecasts under the same perfect-model assumption as in Fig. 4, in order to detect deviations from flatness due to an observation-based stratification (θ = y). The three strata are defined by thresholds on the observation and shown in light, medium, and dark blue [bounds in mm (6 h)−1].

6. Numerical example

In this section, we illustrate the potential benefits of stratification through the verification of the two forecast datasets presented in section 2. An observation-based stratification with θ = y is first conducted for the analysis of the CRPS, with the objective of characterizing the contribution of different ranges of the predictand to the overall CRPS. Figure 2 shows the accumulated stratified CRPS of ECMWF-Ens and ECMWF-Ana forecasts, as a function of lead time. Forecast–observation pairs for the 10 catchments are pooled together. We observe that the two forecast datasets show a similar overall CRPS, both in terms of amplitude and diurnal cycle. Further insights can, however, be obtained from stratification. One can graphically assess from Fig. 2 the relative contribution of each precipitation range to the overall CRPS. For example, zero observed precipitation occurs very frequently (frequency not shown in the paper), but its contribution (lower stratum) to the overall CRPS is small. On the contrary, occurrences of more than 8 mm (6 h)−1 are rare (not shown), but they contribute about one-third (upper stratum) of the overall CRPS. Moreover, one can obtain information about the origin of the diurnal cycle. Observed precipitation time series exhibit a small diurnal cycle, which has been found to be exacerbated by ECMWF-Ens members, especially when zero or low precipitation has occurred (not shown). As a consequence, most of the CRPS diurnal cycle in the left chart comes from the two lower strata. Alternatively, ECMWF-Ana forecasts do not amplify the diurnal cycle, since they bypass the thermodynamic processes related to precipitation generation in the atmospheric model. The CRPS cycle in the right chart is therefore mostly explained by the observation cycle, which is stronger for the two higher precipitation strata.

Then, a forecast-based stratification is carried out for the assessment of calibration using RHs. The 42–48-h lead time is considered here. As a preliminary step for both datasets, forecast–observation pairs for the 10 catchments were pooled together since they were found to behave similarly (according to stratified RHs within an external-based stratification along the catchments, not shown). This enables us to enlarge the size of the sample T. In this example, N = 18 200 pairs. Then, T is stratified using a clustering technique, with six strata, with the objective of gathering into the same stratum forecasts that are similar with regard to their entire distribution, not only their mean, spread, or any other statistic. For the clustering, a hierarchical cluster analysis has been conducted, using the integrated quadratic distance [cf. Eq. (1)] as the metric for the dissimilarity between two distributions and the Ward distance (Murtagh and Legendre 2014) as the distance between two clusters. Like all other data handling, this process has been done within the R environment (R Development Core Team 2014); the hclust function from the R package stats has been used. The bottom panels of Figs. 6 and 7 show the distributions populating each stratum of ECMWF-Ens and ECMWF-Ana forecasts, respectively. To detect whether this stratification is subject to a statistical artifact affecting the interpretation of the forecast-based stratified RHs, the graphical test proposed in section 5c has been applied (not shown), and no significant deviations from flatness due to stratification are expected.
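The following R sketch outlines one possible implementation of this kind of forecast-based stratification. It is a simplified illustration under stated assumptions (equal-size ensembles stored row-wise in a matrix, a rectangle-rule approximation of the integrated quadratic distance on a common grid, and the ward.D2 linkage of hclust), not the authors' exact code; for a sample as large as N = 18 200, the full pairwise dissimilarity matrix becomes prohibitive and a subsample or a more efficient strategy would be needed.

```r
# Sketch of a forecast-based stratification by hierarchical clustering.
# Assumptions: 'fcst' is an N x M matrix of ensemble members (one forecast per row);
# the integrated quadratic distance is approximated on a common grid.
iq_dist <- function(x, y, grid) {
  Fx <- ecdf(x)(grid)
  Fy <- ecdf(y)(grid)
  sum((Fx - Fy)^2) * diff(grid)[1]  # rectangle-rule approximation of the integral
}

forecast_strata <- function(fcst, k = 6, n_grid = 200) {
  grid <- seq(min(fcst), max(fcst), length.out = n_grid)
  n <- nrow(fcst)
  d <- matrix(0, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      d[i, j] <- d[j, i] <- iq_dist(fcst[i, ], fcst[j, ], grid)
    }
  }
  tree <- hclust(as.dist(d), method = "ward.D2")  # Ward linkage on the dissimilarities
  cutree(tree, k = k)                             # stratum label for each forecast
}

# stratum <- forecast_strata(fcst)   # one label (1..6) per forecast-observation pair
```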

Fig. 6.

(top) Accumulated stratified rank histogram for ECMWF-Ens forecasts, for the 42–48-h lead time, under a forecast-based stratification using clustering (cf. section 3c). (middle) Individual stratified rank histograms. (bottom) Plots of all forecast distributions populating each stratum. The x axis is in mm (6 h)−1. (a)–(f) The different strata are shown along with their percentage of the total sample size.


Fig. 7.

As in Fig. 6, but for the ECMWF-Ana forecasts.


The accumulated stratified RH is represented in the top panels of Figs. 6 and 7. To ease the interpretation of the stratified RH shapes, individual RHs for each stratum are also plotted in the middle panels. Several insights about forecast behavior can be obtained from this stratification. First, recall that when several members take the same value as the verifying observation, the corresponding bins are populated randomly. This occurs very frequently for variables such as precipitation, which have a point mass at zero. Stratum (a) represents forecasts with all members equal to zero. Nonzero observations can therefore populate only the highest bin. This stratum collects 29.1% of the forecasts for ECMWF-Ens, but only 4.9% for ECMWF-Ana, which very often has at least a few members different from zero. As a significant strength of this analog model, though, ECMWF-Ana forecasts are calibrated over (a), while ECMWF-Ens forecasts exhibit a bar higher than it should be. Moreover, studying stratum (a) enables us to appreciate the potentially large fraction of the RH, as for ECMWF-Ens forecasts, that comes from the random assignment of ranks due to zero precipitation. From strata (b) to (f) in Fig. 6, one can conclude that ECMWF-Ens forecasts are generally underdispersive for the 42–48-h lead time in our example. Note that the overall RH already showed this shape, but only the forecast-based stratification ensures that it really indicates underdispersion and not a combination of oppositely biased subsamples. In addition, strata (b) and (c) also exhibit a positive bias that corresponds to an overestimation of low precipitation, while (d), (e), and (f) do not. Concerning ECMWF-Ana forecasts in Fig. 7, there are significant differences between the overall RH shape and the shapes of the different strata, which can only be observed after stratification. On the one hand, strata (b) and (c) display a positive bias in the right side of the RH, which corresponds to the part of the distributions with nonzero members. In other words, the observation is equal to zero more often than it should be when forecasts from (b) and (c) are issued. This illustrates a deficiency of this analog model, which has difficulty generating forecasts with a probability of precipitation equal to zero (a few nonzero members always tend to be kept). On the other hand, strata (e) and (f) show a tendency toward overdispersion, coupled with a negative bias. This illustrates the fact that, for high precipitation, this analog method still tends to retain a few low-precipitation members.

7. Discussion

a. Should we consider observation-based stratification?

As discussed in section 4 dedicated to the CRPS, previous studies have shown that observation-based stratification is problematic if forecasters need to compute restricted CRPSs over specific strata and to interpret them individually from one stratum to another. This would consist, for instance, of using the restricted CRPS for ranking different forecasting systems, or as the objective function of an optimization process within the forecast postprocessing step. For such needs, forecast-based stratification is recommended instead, as the restricted CRPS then remains proper. Note also that weighted versions of the CRPS (Gneiting and Ranjan 2011), which emphasize a specific region of the predictand's range inside the integral of Eq. (2), are an alternative possibility.
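As a rough illustration of that alternative, the sketch below approximates a threshold-weighted CRPS for an ensemble forecast on a discretized grid. The weight function (all weight above 8 mm) and the grid are assumptions chosen for illustration.

```r
# Threshold-weighted CRPS (Gneiting and Ranjan 2011), approximated on a grid.
# Assumptions: 'members' is one ensemble forecast, 'y' the verifying observation,
# and the weight function emphasizes precipitation above 8 mm (6 h)^-1.
twcrps_ens <- function(members, y, grid, w = function(z) as.numeric(z >= 8)) {
  Fz <- ecdf(members)(grid)      # empirical forecast CDF evaluated on the grid
  Hz <- as.numeric(y <= grid)    # step function of the observation
  sum(w(grid) * (Fz - Hz)^2) * diff(grid)[1]
}

# grid <- seq(0, 100, by = 0.1)
# twcrps_ens(fcst[1, ], obs[1], grid)
```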

The second approach discussed in section 4, which decomposes the overall CRPS into the contributions coming from the different strata, is nonetheless free of theoretical barriers for any stratification strategy. We remind the reader that it is a way to better understand the sensitivity of the overall CRPS to specific subsets of the verification sample, not to evaluate forecast accuracy over each subset individually. Possible reasons for advocating an observation-based rather than a forecast-based approach would be, for instance, the desire to learn more about the CRPS behavior for climatological forecasts (widely used as reference forecasts in skill scores), or to ensure the same sample sizes in strata when comparing different forecast datasets.

The case of the RH is intrinsically different because it is, unlike the CRPS, constructed and interpreted on a collective basis. It was shown in section 5 that the stratification process can affect this interpretation, as evidenced by artifacts yielding nonflat RHs constructed with forecasts under the perfect-model assumption. Both observation- and forecast-based stratification approaches are affected, although to different extents. In the former case, the artifact comes from a misuse of the RH as a way to assess calibration. Calibration (or miscalibration) is indeed a forecast property that one wants to be aware of before observations occur. This is the underlying principle of postprocessing, where forecast biases can be identified and conditioned on forecast or external characteristics so as to be corrected at forecast time. The assessment of calibration therefore has no reason to be conditioned on the future, which would risk drawing erroneous conclusions about forecast behavior. In a past study within a hydrological forecasting context, Bellier et al. (2016) constructed RHs over a sample containing high-flow events selected according to observed peak flow values. A strong deviation from flatness was found, but it was misinterpreted as a symptom of an underestimation bias of the forecasting system, while it was mainly caused by the observation-based stratification, which retained only the high-flow (observed) events. We therefore strongly advise against any observation-based stratification, whether statistic- or meteorology-oriented, when assessing forecast calibration using the RH.

Instead, a forecast-based stratification is perfectly justified, as it tends to approach the "true" assessment of forecast calibration by gathering into the same stratum forecasts that are similar. The potential artifact in forecast-based stratified RHs is purely statistical and results from the fact that ensemble forecasts have a finite number of members. Its strength therefore decreases strongly for large ensembles. Moreover, the graphical test we propose, based on the perfect-model assumption, enables one to assess a priori whether or not the stratification is reasonable. We therefore advocate forecast-based stratification when computing RHs, as long as care is taken.

b. Connection between CRPS and rank histogram

Hersbach (2000) has proposed a decomposition of the CRPS into a reliability part and a resolution/uncertainty part. The reliability part is closely connected to the RH. For each bin i, the squared difference between the average frequency with which the observation falls below the middle of the bin and the corresponding forecast probability is quantified. The sum of these components yields the reliability part, which should, for calibrated forecasts, tend to zero as the size N of the sample approaches infinity. It is essential to note, however, that this decomposition does not apply to individual CRPS values but to the average CRPS, as defined respectively in Eqs. (2) and (4). Hence, applying such a decomposition to a stratified dataset risks drawing erroneous conclusions about forecast calibration. In particular, a reasoning similar to that of section 5d easily shows that significantly positive values of the CRPS reliability part are expected with observation-stratified forecasts under the perfect-model assumption. As for the RH, we recommend avoiding observation-based stratification when applying Hersbach's (2000) CRPS decomposition. Concerning forecast-based stratification, the pattern discussed in section 5c is likely to play a role in the case of small forecast ensembles, but we leave the quantification of this potential impact for future studies.

c. The issue of sample size

As mentioned earlier, the verification of ensemble forecasts requires a sufficiently large sample of forecast–observation pairs. Otherwise, the average CRPS will fluctuate with each pair added to or withdrawn from the sample, and the RH bins will not be populated enough for their shape to be interpreted correctly. Quantifying what "sufficiently large" means is beyond the scope of this paper, yet it has been tackled by several authors. Among others, Candille et al. (2007) propose, for the CRPS, accounting for sample size with bootstrap methods (Efron and Tibshirani 1994). Goodness-of-fit tests for RH flatness exist (Elmore 2005; Jolliffe and Primo 2008), and Bröcker (2008) also suggests plotting each RH on probability paper in order to give quantitative information as to whether deviations from flatness are due to sample size or indicate a systematic bias. Nevertheless, it is important to highlight that sample size is the major constraint on the stratification process. This is especially true for the assessment of calibration using forecast-based stratified RHs, which would in theory require a large number of ensemble forecasts drawn from the same distribution. Therefore, a compromise has to be found between the need for strata large enough for a robust verification and the desire to learn more about how forecasts behave. For example, the forecast-based stratification in the numerical example was constrained to six strata because of sample size. With such a restricted number of strata in the case of precipitation, it hardly differs from a stratification along the ensemble mean. Nevertheless, the authors found it interesting to present this more sophisticated method, which can potentially be more worthwhile in other cases.
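As an example of the kind of sample-size diagnostic mentioned above, the following sketch computes a simple percentile-bootstrap interval for the mean CRPS of a stratum. The number of resamples and the 90% level are illustrative assumptions.

```r
# Percentile bootstrap (Efron and Tibshirani 1994) for the sampling uncertainty
# of a stratum's mean CRPS, in the spirit of Candille et al. (2007).
bootstrap_mean_crps <- function(crps, n_boot = 1000, probs = c(0.05, 0.95)) {
  means <- replicate(n_boot, mean(sample(crps, replace = TRUE)))
  quantile(means, probs)
}

# bootstrap_mean_crps(crps[stratum == 3])   # 90% interval for a (hypothetical) stratum 3
```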

8. Conclusions

In this article, a general framework for stratification was described for the verification of ensemble forecasts of continuous scalar variables, in the pursuit of a better understanding of forecast behavior. Distinctions were made, on the one hand, between observation-, forecast-, and external-based approaches, depending on where the stratification criterion comes from, and, on the other hand, between statistic- and meteorology-oriented strategies, depending on whether the criterion is a function of quantitative outcomes or of meteorological covariates related to physical processes.

The stratification formalism was applied to two widely used verification tools for continuous scalar variables: the CRPS and the rank histogram. For the CRPS, a technique has been proposed that enables the contribution of each stratum to the overall CRPS to be assessed easily, and it can potentially be applied with any of the abovementioned stratification approaches. However, simply restricting the computation of the average CRPS to a specific subset of the verification sample is problematic in the case of an observation-based stratification, as the CRPS is rendered improper. For the rank histogram, past related studies have been extended to the observation-based stratification case, where a mathematical and graphical demonstration showed that a flat histogram over each stratum is not expected with perfectly calibrated forecasts. Therefore, the authors strongly advise avoiding any observation-based stratification when assessing forecast calibration using the rank histogram. Instead, a forecast-based stratification should be preferred, as it tends to approach the "true" assessment of forecast calibration. Past studies have brought to light the risk of a statistical artifact affecting the interpretation of forecast-based stratified rank histograms. We proposed a graphical test, based on the perfect-model assumption, to detect whether the user's targeted stratification is free of such an artifact.

The numerical example illustrated the insights that can potentially be gained about forecast behavior. In particular, the assessment of calibration was conducted under a forecast-based stratification using a clustering technique. For the 42–48-h lead time studied, mean areal precipitation forecasts from the ECMWF ensemble prediction system over the 2010–14 period were found to be generally underdispersive, which is a well-known feature of this system for short lead times. Forecasts generated using an analog method were found to be much better calibrated, although some bias compensations were observed.

This article is a contribution to the issue of sample stratification, which we believe should be considered more often in the verification process, as a way to limit the risk of missing key aspects of forecast behavior that would otherwise average out. For future work, we encourage the study of verification tools other than the CRPS and the rank histogram within the stratification framework. Moreover, practical and quantitative guidance on the issue of sample size under stratification is required.

Acknowledgments

This work has been supported by a Grant from Labex OSUG@2020 (Investissements d’avenir—ANR10 LABX56) and Compagnie Nationale du Rhône. Ensemble forecasts from TIGGE were supplied from ECMWF’s TIGGE data portal. The authors thank Michael Scheuerer for helpful discussion about the CRPS. They also thank Stefan Siegert and an anonymous reviewer for their meticulous reviews that greatly improved the quality of the article.

REFERENCES

  • Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. J. Climate, 9, 1518–1530, doi:10.1175/1520-0442(1996)009<1518:AMFPAE>2.0.CO;2.
  • Bellier, J., I. Zin, S. Siblot, and G. Bontron, 2016: Probabilistic flood forecasting on the Rhone River: Evaluation with ensemble and analogue-based precipitation forecasts. E3S Web Conf. (FLOODrisk 2016), 7, 18011, doi:10.1051/e3sconf/20160718011.
  • Ben Daoud, A., E. Sauquet, M. Lang, G. Bontron, and C. Obled, 2011: Precipitation forecasting through an analog sorting technique: A comparative study. Adv. Geosci., 29, 103–107, doi:10.5194/adgeo-29-103-2011.
  • Ben Daoud, A., E. Sauquet, G. Bontron, C. Obled, and M. Lang, 2016: Daily quantitative precipitation forecasts based on the analogue method: Improvements and application to a French large river basin. Atmos. Res., 169, 147–159, doi:10.1016/j.atmosres.2015.09.015.
  • Bontron, G., 2004: Prévision quantitative des précipitations: Adaptation probabiliste par recherche d’analogues. Utilisation des réanalyses NCEP/NCAR et application aux précipitations du sud-est de la France (Quantitative precipitation forecasts: Probabilistic adaptation by analogues sorting. Use of the NCEP/NCAR reanalyses and application to the south-eastern France precipitations). Ph.D. thesis, Institut National Polytechnique Grenoble (INPG), 276 pp. [Available online at https://tel.archives-ouvertes.fr/tel-01090969/document.]
  • Bröcker, J., 2008: On reliability analysis of multi-categorical forecasts. Nonlinear Processes Geophys., 15, 661–673, doi:10.5194/npg-15-661-2008.
  • Buizza, R., M. Miller, and T. Palmer, 1999: Stochastic representation of model uncertainties in the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 125, 2887–2908, doi:10.1002/qj.49712556006.
  • Candille, G., and O. Talagrand, 2005: Evaluation of probabilistic prediction systems for a scalar variable. Quart. J. Roy. Meteor. Soc., 131, 2131–2150, doi:10.1256/qj.04.71.
  • Candille, G., C. Côté, P. Houtekamer, and G. Pellerin, 2007: Verification of an ensemble prediction system against observations. Mon. Wea. Rev., 135, 2688–2699, doi:10.1175/MWR3414.1.
  • Casati, B., and Coauthors, 2008: Forecast verification: Current status and future directions. Meteor. Appl., 15, 3–18, doi:10.1002/met.52.
  • Efron, B., and R. J. Tibshirani, 1994: An Introduction to the Bootstrap. Chapman and Hall/CRC Press, 456 pp.
  • Elmore, K. L., 2005: Alternatives to the chi-square test for evaluating rank histograms from ensemble forecasts. Wea. Forecasting, 20, 789–795, doi:10.1175/WAF884.1.
  • Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. J. Amer. Stat. Assoc., 102, 359–378, doi:10.1198/016214506000001437.
  • Gneiting, T., and R. Ranjan, 2011: Comparing density forecasts using threshold- and quantile-weighted scoring rules. J. Bus. Econ. Stat., 29, 411–422, doi:10.1198/jbes.2010.08110.
  • Gneiting, T., F. Balabdaoui, and A. E. Raftery, 2007: Probabilistic forecasts, calibration and sharpness. J. Roy. Stat. Soc., 69B, 243–268, doi:10.1111/j.1467-9868.2007.00587.x.
  • Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560, doi:10.1175/1520-0493(2001)129<0550:IORHFV>2.0.CO;2.
  • Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta-RSM short-range ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327, doi:10.1175/1520-0493(1997)125<1312:VOERSR>2.0.CO;2.
  • Hamill, T. M., and S. J. Colucci, 1998: Evaluation of Eta-RSM ensemble probabilistic precipitation forecasts. Mon. Wea. Rev., 126, 711–724, doi:10.1175/1520-0493(1998)126<0711:EOEREP>2.0.CO;2.
  • Hamill, T. M., and J. Juras, 2006: Measuring forecast skill: Is it real skill or is it the varying climatology? Quart. J. Roy. Meteor. Soc., 132, 2905–2923, doi:10.1256/qj.06.25.
  • Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. Mon. Wea. Rev., 134, 3209–3229, doi:10.1175/MWR3237.1.
  • Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559–570, doi:10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.
  • Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 254 pp.
  • Jolliffe, I. T., and C. Primo, 2008: Evaluating rank histograms using decompositions of the chi-square test statistic. Mon. Wea. Rev., 136, 2133–2139, doi:10.1175/2007MWR2219.1.
  • Lerch, S., T. L. Thorarinsdottir, F. Ravazzolo, and T. Gneiting, 2017: Forecaster’s dilemma: Extreme events and forecast evaluation. Stat. Sci., 32, 106–127, doi:10.1214/16-STS588.
  • Marty, R., I. Zin, C. Obled, G. Bontron, and A. Djerboua, 2012: Toward real-time daily PQPF by an analog sorting approach: Application to flash-flood catchments. J. Appl. Meteor. Climatol., 51, 505–520, doi:10.1175/JAMC-D-11-011.1.
  • Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. Manage. Sci., 22, 1087–1096, doi:10.1287/mnsc.22.10.1087.
  • Michelangeli, P.-A., R. Vautard, and B. Legras, 1995: Weather regimes: Recurrence and quasi stationarity. J. Atmos. Sci., 52, 1237–1256, doi:10.1175/1520-0469(1995)052<1237:WRRAQS>2.0.CO;2.
  • Mullen, S. L., and R. Buizza, 2002: The impact of horizontal resolution and ensemble size on probabilistic forecasts of precipitation by the ECMWF ensemble prediction system. Wea. Forecasting, 17, 173–191, doi:10.1175/1520-0434(2002)017<0173:TIOHRA>2.0.CO;2.
  • Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600, doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
  • Murphy, A. H., 1995: A coherent method of stratification within a general framework for forecast verification. Mon. Wea. Rev., 123, 1582–1588, doi:10.1175/1520-0493(1995)123<1582:ACMOSW>2.0.CO;2.
  • Murphy, A. H., and E. S. Epstein, 1967: Verification of probabilistic predictions: A brief review. J. Appl. Meteor., 6, 748–755, doi:10.1175/1520-0450(1967)006<0748:VOPPAB>2.0.CO;2.
  • Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338, doi:10.1175/1520-0493(1987)115<1330:AGFFFV>2.0.CO;2.
  • Murtagh, F., and P. Legendre, 2014: Ward’s hierarchical agglomerative clustering method: Which algorithms implement Ward’s criterion? J. Classif., 31, 274–295, doi:10.1007/s00357-014-9161-z.
  • Obled, C., G. Bontron, and R. Garçon, 2002: Quantitative precipitation forecasts: A statistical adaptation of model outputs through an analogues sorting approach. Atmos. Res., 63, 303–324, doi:10.1016/S0169-8095(02)00038-8.
  • Park, Y.-Y., R. Buizza, and M. Leutbecher, 2008: TIGGE: Preliminary results on comparing and combining ensembles. Quart. J. Roy. Meteor. Soc., 134, 2029–2050, doi:10.1002/qj.334.
  • R Development Core Team, 2014: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [Available online at http://www.R-project.org/.]
  • Schaake, J., and Coauthors, 2007: Precipitation and temperature ensemble forecasts from single-value forecasts. Hydrol. Earth Syst. Sci. Discuss., 4, 655–717, doi:10.5194/hessd-4-655-2007.
  • Siegert, S., J. Bröcker, and H. Kantz, 2012: Rank histograms of stratified Monte Carlo ensembles. Mon. Wea. Rev., 140, 1558–1571, doi:10.1175/MWR-D-11-00302.1.
  • Tabios, G. Q., and J. D. Salas, 1985: A comparative analysis of techniques for spatial interpolation of precipitation. J. Amer. Water Resour. Assoc., 21, 365–380, doi:10.1111/j.1752-1688.1985.tb00147.x.
  • Talagrand, O., R. Vautard, and B. Strauss, 1997: Evaluation of probabilistic prediction systems. Proc. ECMWF Workshop on Predictability, Reading, United Kingdom, ECMWF, 1–26.
  • Teweles, S., and H. Wobus, 1954: Verification of prognostic charts. Bull. Amer. Meteor. Soc., 35, 455–463.
  • Thorarinsdottir, T. L., T. Gneiting, and N. Gissibl, 2013: Using proper divergence functions to evaluate climate models. SIAM/ASA J. Uncertainty Quantif., 1, 522–534, doi:10.1137/130907550.
  • Vrac, M., and P. Yiou, 2010: Weather regimes designed for local precipitation modeling: Application to the Mediterranean basin. J. Geophys. Res., 115, D12103, doi:10.1029/2009JD012871.
  • Yates, J. F., 1982: External correspondence: Decompositions of the mean probability score. Organ. Behav. Hum. Perform., 30, 132–156, doi:10.1016/0030-5073(82)90237-9.