
Probabilistic Forecasts Using Analogs in the Idealized Lorenz96 Setting

Jakob W. Messner and Georg J. Mayr
Institute of Meteorology and Geophysics, University of Innsbruck, Innsbruck, Austria

Monthly Weather Review, 139 (6); doi:10.1175/2010MWR3542.1

Abstract

Three methods to make probabilistic weather forecasts by using analogs are presented and tested. The basic idea of these methods is that finding NWP model forecasts similar to the current one in an archive of past forecasts and taking the corresponding analyses as the prediction should remove all systematic errors of the model. Furthermore, this statistical postprocessing can convert NWP forecasts into forecasts for point locations and easily turn deterministic forecasts into probabilistic ones. These methods are tested in the idealized Lorenz96 system and compared to a benchmark bracket formed by ensemble relative frequencies from direct model output and logistic regression. The analog methods excel at longer lead times.

Corresponding author address: Jakob Messner, Institute of Meteorology and Geophysics, University of Innsbruck, Innrain 52, Innsbruck, A-6020, Austria. E-mail: jakob.messner@uibk.ac.at


1. Introduction

The common method to predict future weather is to run a numerical weather prediction (NWP) model. Because of uncertainties in the initial conditions and imperfections of the model equations (parameterization, approximate numerics, and inadequately understood atmospheric processes), forecasts are always uncertain. Therefore, for many users, a probabilistic forecast (i.e., predicting a probability distribution) that quantifies these uncertainties is superior to a deterministic or "best guess" forecast (e.g., Smith 2001; Richardson 2000).

The most common way to obtain an estimate of these uncertainties is to run an ensemble prediction system. One or several deterministic numerical forecast models are integrated several times with slightly perturbed initial conditions. If the perturbed initial conditions represent the uncertainty of the analysis (i.e., the approximation of the state of the atmosphere), it is assumed that after integrating the model forward in time, the different model states represent the uncertainty of the forecast. Furthermore, the use of different models (multimodel ensemble) or model equations can take model errors into account (e.g., Ziehmann 2000).

To obtain a predictive distribution from an ensemble, the simplest approach is to take the ensemble relative frequencies as a probability density function. However, it is a well-known shortcoming of many ensemble systems that they are underdispersive (e.g., Hamill 2001; Wang and Bishop 2005; Bishop 2008). In this case, probabilistic forecasts made with ensemble relative frequencies are overconfident: the forecast probability density function is sharper than the density function of the truth given the forecast.

Besides improving the ensemble forecasts themselves, statistical postprocessing [model output statistics (MOS)] is another way to achieve better probabilistic forecasts. For this, an archive of past ensemble forecasts and observations is needed. By utilizing past prediction errors, current forecasts can be corrected.

Operational NWP models are improved frequently. For statistical postprocessing, however, the past forecasts should ideally all be made with the same model version. This can be achieved by recomputing forecasts for past dates with the current model version, initialized with archived analyses. This "reforecast" approach was spearheaded by Hamill et al. (2004) and has since been implemented operationally at the European Centre for Medium-Range Weather Forecasts (ECMWF) and the National Centers for Environmental Prediction (NCEP; Hagedorn et al. 2008).

Several ensemble MOS approaches have been developed and tested in the past few years. Research has been carried out on ensemble dressing (e.g., Roulston and Smith 2003; Wang and Bishop 2005; Fortin et al. 2006), regression (e.g., Hamill et al. 2004; Gneiting et al. 2005), and Bayesian methods (e.g., Raftery et al. 2005; Bishop 2008). Wilks (2006a) and Wilks and Hamill (2007) compared some of these methods theoretically in the simple Lorenz96 model framework (Lorenz 1996) and for a real atmospheric variable (temperature).

In contrast, very simple but quite promising ensemble MOS approaches that use analogs (Hamill et al. 2006; Hamill and Whitaker 2006) have not received as much attention. The basic idea of the analog methods of Hamill and Whitaker (2006) is that if a forecast similar to the current one can be found in the archive, the corresponding observed state of the atmosphere is assumed to be similar to the state to be forecasted.

The aim of this study is to test different configurations of these analog methods and to show that, although they are quite simple, they can keep up with the other ensemble MOS methods. To this end, the analog methods are compared to the most straightforward method using ensemble relative frequencies [i.e., direct model output (DMO)] and to one of the best methods (i.e., logistic regression) in the same simple model setting (Lorenz96) as used in Wilks (2006a). Because of the simplicity of the system, different configurations of the analog methods and their impact on the performance can be tested easily.

The Lorenz96 system is described briefly in section 2. In section 3 the compared methods are described. The results and conclusions are provided in sections 4 and 5, respectively.

2. Dataset

In this section, the dataset used for this study is described. As a relatively simple and inexpensive way to test statistical concepts, the Lorenz96 model (Lorenz 1996) is used in lieu of running an NWP model. The system exhibits chaotic behavior (sensitivity to initial conditions) and has been used as an idealization of weather and weather forecasts in several previous studies (e.g., Lorenz 1996; Orrell 2003; Wilks 2005, 2006a).

a. Model equations

The Lorenz96 model describes a system that consists of two types of variables: a set of large-scale variables X is coupled to faster, smaller-scale variables Y. The "true" state of the system is described by the following two equations:
$$\frac{dX_k}{dt} = X_{k-1}\left(X_{k+1} - X_{k-2}\right) - X_k + F - \frac{hc}{b}\sum_{j=J(k-1)+1}^{kJ} Y_j, \qquad (1a)$$

$$\frac{dY_j}{dt} = c\,b\,Y_{j+1}\left(Y_{j-1} - Y_{j+2}\right) - c\,Y_j + \frac{hc}{b}\,X_{\mathrm{int}[(j-1)/J]+1}, \qquad (1b)$$

where int[z] rounds z downward to the next integer.

The variables X_k, k ∈ {1, …, K}, and Y_j, j ∈ {1, …, JK}, are assumed to have cyclic boundary conditions (X_K = X_0, Y_JK = Y_0) and can thus be interpreted as grid points (in a one-dimensional grid) arranged in a latitudinal circle (Lorenz 2005). Each X_k is coupled to J variables Y_j, j ∈ {J(k−1)+1, …, kJ}, with smaller amplitude and higher frequency. The smaller scale can be interpreted as a quantity unresolved in NWP (e.g., convection), while the larger scale represents a parameter that forces the unresolved mechanism (e.g., static instability; Lorenz 1996). The coefficients c and b specify that the smaller-scale variables evolve c times faster, while their amplitude is b times smaller. Furthermore, the parameter h describes the strength of the coupling and F is a forcing of the system (Lorenz 1996). The specific parameter values used in the present study are K = 8, J = 32, h = 1, b = 10, c = 10, and F = 20, as used by Wilks (2005, 2006a). The equations are advanced by using a fourth-order Runge–Kutta integration scheme with a time step of 0.0001 time units.
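For reference, the following is a minimal sketch of how the "truth" integration might be coded, assuming the standard two-scale Lorenz96 tendencies written in Eqs. (1a) and (1b) above; the function names (tendencies, rk4_step) are illustrative, not taken from the study.

```python
import numpy as np

# Parameter values used in the study (section 2a)
K, J, h, b, c, F = 8, 32, 1.0, 10.0, 10.0, 20.0

def tendencies(X, Y):
    """Right-hand sides of Eqs. (1a)-(1b) for cyclic X (length K) and Y (length J*K)."""
    dX = (np.roll(X, 1) * (np.roll(X, -1) - np.roll(X, 2)) - X + F
          - (h * c / b) * Y.reshape(K, J).sum(axis=1))
    dY = (c * b * np.roll(Y, -1) * (np.roll(Y, 1) - np.roll(Y, -2)) - c * Y
          + (h * c / b) * np.repeat(X, J))
    return dX, dY

def rk4_step(X, Y, dt=1e-4):
    """One fourth-order Runge-Kutta step of the coupled 'truth' system."""
    k1x, k1y = tendencies(X, Y)
    k2x, k2y = tendencies(X + 0.5 * dt * k1x, Y + 0.5 * dt * k1y)
    k3x, k3y = tendencies(X + 0.5 * dt * k2x, Y + 0.5 * dt * k2y)
    k4x, k4y = tendencies(X + dt * k3x, Y + dt * k3y)
    return (X + dt / 6.0 * (k1x + 2 * k2x + 2 * k3x + k4x),
            Y + dt / 6.0 * (k1y + 2 * k2y + 2 * k3y + k4y))
```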

While Eqs. (1a) and (1b) are used to compute the true state of the system, the equations

$$\frac{dX_k^*}{dt} = X_{k-1}^*\left(X_{k+1}^* - X_{k-2}^*\right) - X_k^* + F - g\!\left(X_k^*\right), \qquad (2a)$$

$$g\!\left(X_k^*\right) \approx \frac{hc}{b}\sum_{j=J(k-1)+1}^{kJ} Y_j, \qquad (2b)$$

simulate a forecast (Wilks 2005), where g is a deterministic parameterization that replaces the unresolved Y variables. In addition to the parameterization of the Y variables [Eq. (2b)], the numerical integration accuracy of Eq. (2a) is degraded by using a lower-order (second order) scheme with a coarser integration time step of 0.005 time units. Furthermore, perturbations of the "truth" are used as analyses (more information about this perturbation can be found in the appendix). Thus, the main causes of forecast error in operational NWP (model and initial-state inaccuracies) are simulated. Wilks (2006a) also found that ensemble forecasts made with Eq. (2a) exhibit underdispersion similar to many operational NWP ensemble forecasts.

b. Training and test data

To test the various methods a training dataset and a test dataset are needed. Both include analyses and ensemble forecasts.

As the training dataset, a "historical database" of analyses was created by integrating Eqs. (1a) and (1b) and sampling every 0.15 time units. The 0.15 time units correspond to a time spacing with a lag-1 autocorrelation of 0.5 for each of the X variables, roughly equivalent to 1 day in the real atmosphere (Wilks 2006a). Thus, a sequence of daily analyses is simulated. The integration was performed over 10 000 "days." To test the influence of the archive size, samples of this dataset with lengths n = 50, 100, 200, 500, 1500, and 3500 are also used.

Additionally, a test dataset of size n = 10 000 was created, again by integrating the model Eqs. (1a) and (1b) and sampling 10 000 sets of 6 points, with the points within each set spaced 1 time unit apart. Consecutive sets are separated by 50 time units to ensure that they are independent.

Ensembles of sizes nens = 5, 10, 25, 51, and 100 were initialized for each of the 20 000 initial points (10 000 for the training data and 10 000 for the test data). More information about the ensemble initialization is provided in the appendix. Ensemble forecasts were simulated by integrating Eqs. (2a) and (2b) sampling every time unit (T = 1, 2, … , 5). This is equivalent to sampling approximately once a week in the real atmosphere. The ensemble mean is used in lieu of the deterministic or best-guess forecast. Probabilistic forecasts are made for six categories separated by five quantiles of the climatological distribution of the predictand Xk∈{1,…,K}. The climate is defined by the 10 000 points of the training dataset and the specific values of the quantiles are: q1/10 = −2.867, q1/3 = 1.2886, q1/2 = 3.5338, q2/3 = 6.0279, and q9/10 = 10.9403.
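As an illustration of how the climatological categories can be derived from such an archive, the sketch below computes the five quantiles from a training array of daily analyses; it assumes the X variables can be pooled across grid points (they are statistically indistinguishable by construction), and the array name x_train is hypothetical.

```python
import numpy as np

def climatological_quantiles(x_train):
    """Climatological quantiles of the predictand from the training archive.

    x_train : (n_days, K) array of daily analyses of the X variables
    """
    probs = [1/10, 1/3, 1/2, 2/3, 9/10]
    # pool all grid points; by symmetry each X_k has the same climatology
    return np.quantile(np.asarray(x_train).ravel(), probs)

# for the 10 000-day archive the result should be close to the values quoted
# in the text, e.g., q_1/10 = -2.867 and q_9/10 = 10.9403
```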

3. Forecast methods

In this section, three simple forecast methods using analogs are presented. Because a similarity measure is needed for all of them, a discussion about the analogy criterion is also included here. Additionally, the two “traditional” ensemble MOS methods, namely, direct model output and logistic regression are described.

a. Analog methods

For the analog methods it is assumed that, in addition to a current NWP forecast, an archive of reforecasts is available. An archive of past operational forecasts could also be used, provided that model changes over the archived period do not alter the statistical characteristics of the model.

1) Analogy criterion

One of the main problems of the analog methods is to find an appropriate analogy criterion. If an infinite set of reforecasts were available, a nearly identical reforecast to the current one could be found. Here, identical means that the same values of all meteorological parameters are forecast at all grid points. For real weather forecasts, because of the high resolution, the high dimensionality, and a limited database, there is a low chance of finding similar states, especially globally. However, some approximations can be made in order to obtain enough meaningful analogs. For instance, smaller regions or fewer variables can be considered (Hamill and Whitaker 2006). In contrast, only one resolved variable exists in the Lorenz96 model. Even in this simple model, however, there are infinitely many possible ways of measuring the analogy (e.g., different weightings of the grid points). To retain simplicity and generality, only two simple analogy criteria were tested (a code sketch of both follows this list):

  1. A simple root-mean-square (rms) of the differences between the forecast X* and the reforecast X̃_i to be compared,

$$\Psi_{\mathrm{rms}}(\mathbf{X}^{*}|\tilde{\mathbf{X}}_{i}) = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(X_{k}^{*} - \tilde{X}_{i,k}\right)^{2}}, \qquad (3)$$

     where X* and X̃_i are K-dimensional vectors with elements X*_k and X̃_{i,k}. Smaller values of Ψ signify greater similarity.
  2. Hamill and Whitaker (2006) found that in their data (precipitation forecasts) the rms criterion led to an underforecasting bias. A similar problem is also present in the Lorenz96 model. Because of the approximately Gaussian distribution of the predictand X, it can be assumed that generally more analogs are found that are closer to the climatological mean. Thus, forecasts using these analogs are shifted toward the climatological mean. As a solution, Hamill and Whitaker (2006) proposed an analogy criterion that operates with rank differences. For each grid point k, the rank of the current value X*_k is derived when it is pooled with the N (training data size) historical values X̃_{i,k}. This rank of the current value is then compared with the ranks of the historical values in the whole archive. The smaller the sum of absolute rank differences, the more similar are X* and X̃_i:

$$\Psi_{\mathrm{rank}}(\mathbf{X}^{*}|\tilde{\mathbf{X}}_{i}) = \sum_{k=1}^{K}\left|\mathrm{Rank}\left(X_{k}^{*}\right) - \mathrm{Rank}\left(\tilde{X}_{i,k}\right)\right|. \qquad (4)$$

     Both analogy criteria can also be applied locally to the nr neighbor grid points of the grid point to be forecasted. Because the effects of using fewer grid points are similar for both criteria, results are only shown for the locally applied rank difference criterion,

$$\Psi_{\mathrm{rank},n_r}(\mathbf{X}^{*}|\tilde{\mathbf{X}}_{i}) = \sum_{k\in\mathcal{N}_r}\left|\mathrm{Rank}\left(X_{k}^{*}\right) - \mathrm{Rank}\left(\tilde{X}_{i,k}\right)\right|, \qquad (5)$$

     where the set N_r contains the grid point to be forecasted and its nr nearest neighbors. If, for example, a forecast for X3 has to be made and two neighbors are used, the sum of rank differences between (X*_2, X*_3, X*_4) and (X̃_{i,2}, X̃_{i,3}, X̃_{i,4}) serves as the analogy criterion. Thus, the effect of different "regions" is tested. Region sizes (number of neighbors) of nr = 0, 2, 4 were used. The rank difference analogy criterion with "region size" 4 is termed rankdiff4 in subsequent figures.
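The following sketch illustrates both criteria for a single current forecast against an archive of reforecasts; the function names and the exact tie handling of the ranks are assumptions of this sketch, not details taken from the study.

```python
import numpy as np

def psi_rms(x_fcst, x_archive):
    """RMS analogy criterion, Eq. (3); smaller values signify greater similarity.

    x_fcst    : (K,)   current forecast
    x_archive : (N, K) archived reforecasts at the same lead time
    """
    return np.sqrt(np.mean((x_archive - x_fcst) ** 2, axis=1))

def psi_rankdiff(x_fcst, x_archive, neighbors=None):
    """Rank-difference analogy criterion, Eqs. (4)-(5).

    If `neighbors` is given, only those grid-point indices (the forecast point
    and its neighbors) enter the sum, as in the local criterion rankdiff4.
    """
    N, K = x_archive.shape
    cols = np.arange(K) if neighbors is None else np.asarray(neighbors)
    psi = np.zeros(N)
    for k in cols:
        pooled = np.append(x_archive[:, k], x_fcst[k])   # pool current value with the archive
        ranks = pooled.argsort().argsort() + 1           # ranks 1 .. N+1
        psi += np.abs(ranks[:N] - ranks[-1])             # |rank(historical) - rank(current)|
    return psi
```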

2) Analogs of a deterministic forecast

For this method, which is similar to the one proposed by Hamill et al. (2006), a current deterministic NWP forecast for lead time t is first compared with forecasts for the same lead time in the reforecast archive. Then the nanalogs historical dates with the most similar reforecasts are extracted together with their corresponding recorded analyses. These analyses form an nanalogs-member ensemble.1 Figure 1 illustrates this method schematically. Probabilistic forecasts are then made by using ensemble relative frequencies,

$$\Pr(V \le q) = \frac{\mathrm{Rank}(q) - 1/3}{n_{\mathrm{analogs}} + 4/3}, \qquad (6)$$

where Pr(V ≤ q) is the probability of the verification V being smaller than or equal to the threshold q, and Rank(q) specifies the rank of q in the ensemble [Rank(q) = 1 if all ensemble members are greater than q and Rank(q) = nanalogs + 1 if all ensemble members are smaller than q]. The adjustments +⅓ and −⅓ (Tukey plotting position) are used to avoid extreme probabilities of 0 and 1 (Wilks 2006b). Thus, the cumulative probability Pr(V ≤ q) can take on values between 2/(3nanalogs + 4) and (3nanalogs + 2)/(3nanalogs + 4).
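A minimal sketch of the ADF probability computation is given below, assuming the RMS criterion of Eq. (3) (any of the criteria above could be substituted); the variable names, such as archive_anal, are illustrative.

```python
import numpy as np

def adf_probability(x_fcst, archive_fcst, archive_anal, threshold, n_analogs=51):
    """Analogs of a deterministic forecast (ADF): Pr(V <= threshold) via Eq. (6).

    x_fcst       : (K,)   current deterministic forecast at lead time t
    archive_fcst : (N, K) reforecasts for the same lead time
    archive_anal : (N,)   verifying analyses of the predictand grid point
    """
    # Eq. (3): RMS analogy criterion (smaller = more similar)
    psi = np.sqrt(np.mean((archive_fcst - x_fcst) ** 2, axis=1))
    # analyses of the n_analogs most similar reforecasts form the statistical ensemble
    members = archive_anal[np.argsort(psi)[:n_analogs]]
    # Rank(q) = 1 + number of ensemble members smaller than or equal to q
    rank_q = 1 + np.sum(members <= threshold)
    # Eq. (6): Tukey plotting position keeps probabilities away from 0 and 1
    return (rank_q - 1.0 / 3.0) / (n_analogs + 4.0 / 3.0)
```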
Fig. 1. Schematic illustration of the analogs of a deterministic forecast method.

Different analogs are used for each lead time. It is therefore not guaranteed that forecasts are temporally consistent. Furthermore, for this approach it is assumed that the archived analyses represent the true historical state of the system. However, observations and analyses are also subject to errors, and their accuracy affects the accuracy of the forecasts made with this method.

3) Kernel estimate

In the analogs of a deterministic forecast (ADF) method, the analyses of the nanalogs most similar forecasts are all assumed to have the same probability of occurrence. For the kernel estimate (e.g., Hastie et al. 2001), it is assumed instead that the probability of an archived event recurring diminishes with the distance (analogy criterion) between its reforecast and the current forecast. These probabilities can be estimated by a kernel function K[Ψ(X*|X̃_i), σ], where σ (kernel standard deviation) defines the width of the kernel function and Ψ(X*|X̃_i) is the analogy criterion of the forecast X* compared to the reforecast X̃_i corresponding to the ith event in the archive. Using Bayes's theorem, a probabilistic forecast can be made with

$$\Pr(V \le q) = \frac{\sum_{i=1}^{N} I\!\left(V_i \le q\right)\, K\!\left[\Psi(\mathbf{X}^{*}|\tilde{\mathbf{X}}_{i}), \sigma\right]}{\sum_{i=1}^{N} K\!\left[\Psi(\mathbf{X}^{*}|\tilde{\mathbf{X}}_{i}), \sigma\right]}, \qquad (7)$$

where I(·) is the indicator function (1 if the argument is true, and 0 if it is not), V_i is the archived analysis of the ith event, and N is the number of events in the training dataset. Here, the Gaussian density function was chosen as kernel, K[Ψ(·|·), σ] = ϕ[Ψ(·|·), 0, σ]. Several different kernel standard deviations σ were tested and the one with the best ranked probability score (see section 4) averaged over all lead times was chosen.
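A sketch of this estimate, assuming the rank-difference criterion of Eq. (4) and a Gaussian kernel as stated in the text; the function name and array names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def kernel_probability(x_fcst, archive_fcst, archive_anal, threshold, sigma=800.0):
    """Kernel estimate of Pr(V <= threshold), Eq. (7)."""
    N, K = archive_fcst.shape
    # rank-difference analogy criterion, Eq. (4)
    psi = np.zeros(N)
    for k in range(K):
        pooled = np.append(archive_fcst[:, k], x_fcst[k])
        ranks = pooled.argsort().argsort() + 1
        psi += np.abs(ranks[:N] - ranks[-1])
    # Gaussian kernel weights: events with more similar reforecasts count more
    weights = norm.pdf(psi, loc=0.0, scale=sigma)
    # Eq. (7): weighted relative frequency of archived events below the threshold
    return np.sum(weights * (archive_anal <= threshold)) / np.sum(weights)
```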

4) Analog dressing

As suggested by Roulston and Smith (2003), a combination of a dynamical ensemble (produced with an NWP model) and a statistical ensemble unites the advantages of both. Whereas they used a best-member approach, using the analogs of a deterministic forecast is an alternative way to form statistical daughter ensembles with which each member of the dynamical ensemble is "dressed." Hamill and Whitaker (2006) already tested this method for precipitation forecasts over the United States.

For a current ensemble forecast with nens members at forecast time step +t, the archive is scanned for analog +t forecasts to each member. Each member of the ensemble forecast is then dressed with a daughter ensemble formed by the corresponding analyses of the ndens most similar forecasts in the training data. The resulting ensemble with nens × ndens members should then be drawn from a distribution similar to that of the truth. For the Lorenz96 model, where all members are statistically indistinguishable (Wilks 2006a),2 forecasts of all members can be compared among each other. If the members are initialized or computed differently (e.g., multimodel ensembles), only historical forecasts of the member itself can be used.
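A sketch of the dressing step, under the assumption that all members can be compared against the same archive (as in the Lorenz96 setting) and that the RMS criterion is used; the names are illustrative.

```python
import numpy as np

def dressed_ensemble(ens_fcst, archive_fcst, archive_anal, n_dens=15):
    """Dress each dynamical member with the analyses of its n_dens closest analogs.

    ens_fcst     : (n_ens, K) current dynamical ensemble forecast at lead time t
    archive_fcst : (N, K)     archived (re)forecasts at the same lead time
    archive_anal : (N,)       corresponding analyses of the predictand grid point

    Returns an (n_ens * n_dens,) statistical ensemble, which can be converted to
    probabilities with the same plotting-position rule as in Eq. (6).
    """
    daughters = []
    for member in ens_fcst:
        # RMS analogy criterion between this member and every archived forecast
        psi = np.sqrt(np.mean((archive_fcst - member) ** 2, axis=1))
        daughters.append(archive_anal[np.argsort(psi)[:n_dens]])
    return np.concatenate(daughters)
```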

Like the approaches of Wang and Bishop (2005) and Roulston and Smith (2003), this method is only appropriate for underdispersive ensembles. However, sorting the members according to their forecasted value and weighting them, as proposed by Fortin et al. (2006), could make the method also usable for overdispersive ensembles.

b. Reference approaches

As reference for the approaches using analogs, DMO (ensemble relative frequencies) and logistic regression are used. DMO is commonly used in operational weather forecasting and is utilized here to define the lower skill boundary of ensemble MOS methods (Wilks 2006a). Any reasonably efficient MOS method should be superior to this method. Logistic regression, on the other hand, was identified to be one of the best methods for probabilistic forecasting in several studies (e.g., Wilks 2006a, 2009; Wilks and Hamill 2007) and is thus used as upper skill boundary of the traditional ensemble MOS methods.

1) Direct Model Output

It is assumed that the distribution of the DMO ensemble reflects the probability distribution of the predictand. As with the ADF method, the Tukey plotting position [Eq. (6) with nanalogs replaced by nens] is used here to compute the probabilities.

2) Logistic regression

The probability that a binary predictand (two categories) falls in one of the categories can be estimated by using logistic regression (Wilks 2006a). Hamill et al. (2004) used only the ensemble mean as predictor, because they found the correlation between ensemble mean error and ensemble spread to be low in their data. However, for the Lorenz96 system this spread–skill relationship is sufficiently strong to increase the forecast accuracy of this method when the ensemble standard deviation is added as a second predictor (Wilks 2006a). The distribution function is given by (Wilks 2006a,b)

$$\Pr(V \le q) = \frac{\exp\!\left(b_0 + b_1\,\bar{x}_{\mathrm{ens}} + b_2\,s_{\mathrm{ens}}\right)}{1 + \exp\!\left(b_0 + b_1\,\bar{x}_{\mathrm{ens}} + b_2\,s_{\mathrm{ens}}\right)}, \qquad (8)$$

where x̄_ens is the ensemble mean and s_ens is the ensemble standard deviation. The parameters b0, b1, and b2 are computed by maximizing the log-likelihood function

$$L\!\left(b_0, b_1, b_2\right) = \sum_{i=1}^{n}\left\{o_i \ln p_i + \left(1 - o_i\right)\ln\left(1 - p_i\right)\right\}, \qquad (9)$$

where o_i is 1 if the ith verification in the training data falls below the threshold (and 0 otherwise) and p_i is the corresponding probability from Eq. (8). The regression parameters have to be fitted separately for each lead time and quantile for which forecasts are to be made. A possible extension of this method was presented by Wilks (2009), where the forecast threshold is used as an additional predictor, so that the regression parameters only have to be fitted once.
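A sketch of the fit by direct maximization of Eq. (9), using a generic numerical optimizer; the function names and the use of scipy are choices of this sketch, not of the study.

```python
import numpy as np
from scipy.optimize import minimize

def fit_logistic(ens_mean, ens_sd, below):
    """Fit b0, b1, b2 of Eq. (8) by maximizing the log likelihood, Eq. (9).

    ens_mean, ens_sd : (n,) ensemble mean and standard deviation (training data)
    below            : (n,) 1 if the verification fell below the threshold, else 0
    """
    ens_mean = np.asarray(ens_mean, dtype=float)
    ens_sd = np.asarray(ens_sd, dtype=float)
    below = np.asarray(below, dtype=float)
    X = np.column_stack([np.ones_like(ens_mean), ens_mean, ens_sd])

    def neg_log_lik(b):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)      # guard against log(0)
        return -np.sum(below * np.log(p) + (1.0 - below) * np.log(1.0 - p))

    return minimize(neg_log_lik, x0=np.zeros(3)).x

def logistic_probability(b, ens_mean, ens_sd):
    """Evaluate Eq. (8) for new forecasts."""
    eta = b[0] + b[1] * ens_mean + b[2] * ens_sd
    return 1.0 / (1.0 + np.exp(-eta))
```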

4. Results

In this section, the results of the different methods are presented and discussed. Because all methods provide probabilistic forecasts, verification measures that can quantify their prediction quality in a probabilistic sense are used.

To give an overview of the performance of the different methods, the ranked probability score (RPS; Wilks 2006b) is used. The RPS is computed for the disjoint categories −∞ to q1/10, q1/10 to q1/3, q1/3 to q1/2, q1/2 to q2/3, q2/3 to q9/10, and q9/10 to ∞. Table 1 shows the RPS of selected analog methods, DMO, and logistic regression as skill scores relative to the climatological probabilities. The ranked probability skill scores are accompanied by estimates of their error, computed as σ/√N, where σ is the standard deviation of the ranked probability skill score and N is the number of events in the test dataset. These error estimates can be used to estimate the significance of the differences between two methods [see, e.g., section 5.2.2 in Wilks (2006b)]. Because the comparison groups are not independent, this is only a rough estimate.
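For concreteness, a sketch of the RPS and skill-score computation for the six-category forecasts; it assumes the common unscaled definition of the RPS as the sum of squared differences of cumulative probabilities (Wilks 2006b), and the array names are illustrative.

```python
import numpy as np

def rps(prob_cum, obs_cat, n_cat=6):
    """Mean ranked probability score from cumulative category probabilities.

    prob_cum : (n_cases, n_cat - 1) forecast Pr(V <= q) at the five thresholds
    obs_cat  : (n_cases,) index 0..n_cat-1 of the category that verified
    """
    prob_cum = np.asarray(prob_cum, dtype=float)
    # append the trivial cumulative probability of 1 for the open upper category
    F = np.concatenate([prob_cum, np.ones((prob_cum.shape[0], 1))], axis=1)
    # cumulative "observation": 1 for the observed category and all above it
    O = (np.arange(n_cat)[None, :] >= np.asarray(obs_cat)[:, None]).astype(float)
    return np.mean(np.sum((F - O) ** 2, axis=1))

def rpss(rps_forecast, rps_climatology):
    """Ranked probability skill score relative to climatology."""
    return 1.0 - rps_forecast / rps_climatology
```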

Table 1. Ranked probability skill scores relative to the climatology for the six-category probabilistic forecasts defined by the climatological quantiles q1/10, q1/3, q1/2, q2/3, and q9/10. Ranked probability skill scores are shown for DMO, logistic regression, ADF, analog dressing, and the kernel estimate. Terms in parentheses denote whether rank difference (rankdiff) or RMS were used as analogy criteria. The best method for each lead time is set in boldface. The subscripts are estimates for the error of the ranked probability skill scores. The ensemble size is nens = nanalogs = 51, the daughter ensemble size is ndens = 15, and the kernel standard deviation is σ = 15 for RMS, σ = 800 for rankdiff, and σ = 600 for rankdiff4. The training data length is n = 1500.

Therefore a t test for paired samples (see, e.g., section 5.2.3 in Wilks 2006b) was performed to show that the RPS differences between the best and second-best method are significant at a 0.05 level for all lead times.

Overall, the analog methods score relatively poorly at short lead times (T = 1). However, for longer forecast periods (T = 3, 4, 5), they all improve significantly over DMO and even have scores comparable to or better than logistic regression. Like logistic regression they outperform climatology over the whole forecast period. For shorter forecast leads, analog dressing performs best among the analog methods, while for longer lead times, the kernel estimate achieves the best RPS of all tested approaches. It should be mentioned here that better scores than DMO at T = 1 can be achieved for the analog methods when adjusting the number of analogs (size of statistical ensemble), training data length, and region size for the analogy criterion.

Regarding the analogy criteria, the rank difference criterion is considerably superior to the RMS similarity measure for ADF and the kernel estimate, while the differences are small for analog dressing. With the smaller region (rankdiff4), ADF and the kernel estimate can be improved further, except at T = 1 for the kernel estimate. For the analog dressing method, the use of fewer grid points also improves the scores at shorter lead times (T = 1, 2, 3), whereas it slightly worsens the skill scores for longer lead times (T = 4, 5).

The Brier score (BS; Wilks 2006b) provides more detailed information on the performance of the different methods. Figure 2 shows the Brier skill scores of selected methods relative to DMO for (Fig. 2a) Pr{Vq1/10}, (Fig. 2b) Pr{Vq1/3}, and (Fig. 2c) Pr{Vq1/2}. As with the RPS measure, the analog approaches perform quite well at longer lead times and relatively poorly for shorter forecast ranges.

Fig. 2. Brier skill scores, relative to DMO, for the probabilities of the predictand not to exceed the climatological quantiles (a) q1/10, (b) q1/3, and (c) q1/2. The training data size is n = 1500, the ensemble size is nens = nanalogs = 51, the daughter ensemble size for analog dressing is ndens = 15, and the kernel standard deviation for the kernel estimate is σ = 800.

In comparison to DMO and logistic regression, all analog methods are slightly less suitable for rarer (Vq1/10) than for common (Vq1/2) events. This might be because in the training data these events appear less frequently. Therefore, the analogs found for extreme (rare) events are generally less similar to each other than for common events. This leads to an overdispersive ensemble and hence to underconfident forecasts for these events.

Differing from single-number scores such as the BS or the RPS, reliability diagrams can show the full joint distribution of forecasts and observations (Wilks 2006b). In Figs. 3 and 4, the reliability diagrams of selected methods are plotted for different lead times and forecast thresholds. The 5%–95% confidence intervals, which are estimated from a 1000-member bootstrap sample (Hamill et al. 2008; Bröcker and Smith 2007), are placed on the calibration functions (thick lines). Furthermore, the reliability (REL) and resolution (RES) terms of the algebraic decomposition of the Brier score (Wilks 2006b; Murphy 1973) are displayed in the diagrams. The Brier score can be computed as the sum of REL, −RES, and an uncertainty term. Hence, small REL and large RES characterize good probabilistic forecasts. Geometrically, small values of reliability signify a small weighted (by the refinement distribution: thin line) average of squared vertical distances between the calibration function (heavy line) and the 1:1 reference line (dashed line). The resolution term is the weighted average squared difference between the calibration function and the overall climatological relative frequency (=0.10 for q1/10 and =0.33 for q1/3; Wilks 2006b).
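A sketch of the Murphy (1973) decomposition that yields the REL and RES values shown in the diagrams; the binning into forecast-probability classes and the function name are assumptions of this sketch.

```python
import numpy as np

def brier_decomposition(p_fcst, occurred, n_bins=11):
    """Murphy (1973) decomposition: BS = REL - RES + UNC for a binary event.

    p_fcst   : (n,) forecast probabilities in [0, 1]
    occurred : (n,) 1 if the event occurred, else 0
    """
    p_fcst = np.asarray(p_fcst, dtype=float)
    occurred = np.asarray(occurred, dtype=float)
    clim = occurred.mean()                       # climatological relative frequency
    edges = np.linspace(0.0, 1.0, n_bins + 1)    # forecast-probability bins
    idx = np.clip(np.digitize(p_fcst, edges) - 1, 0, n_bins - 1)
    rel = res = 0.0
    for i in range(n_bins):
        sel = idx == i
        if not sel.any():
            continue
        p_i = p_fcst[sel].mean()                 # mean forecast probability in the bin
        o_i = occurred[sel].mean()               # conditional event frequency (calibration)
        rel += sel.sum() * (p_i - o_i) ** 2
        res += sel.sum() * (o_i - clim) ** 2
    n = occurred.size
    return rel / n, res / n, clim * (1.0 - clim)
```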

Fig. 3. Reliability diagrams of DMO, logistic regression, and the analog methods with different analogy criteria and training data lengths for q1/10. Reliability diagrams for (a)–(h) T = 1 and (i)–(p) T = 4 are plotted. The solid black lines are the calibration functions with p(o|y) as the ordinate. The calibration functions of a perfect forecast are plotted as gray dashed lines. The dashed black lines are the refinement distributions with ordinate p(y). The error bars illustrate the 90% confidence intervals of the calibration functions. Except for (e) and (m) the training data length is n = 1500. Ensemble size for the dynamical and statistical ensembles is nens = nanalogs = 51, the size of daughter ensembles in (g) and (o) is ndens = 15, and the kernel variance is σ = 800 in (h) and (p).

Fig. 4. As Fig. 3, but for q1/3.

The calibration function (heavy line) of DMO is less steep than the 1:1 reference line for both quantiles and lead times (Figs. 3a,i and 4a,i). This overconfidence (Wilks 2006a,b) results from the underdispersion of the ensemble and is also reflected in the higher values of reliability (REL). The opposite is true for the analog methods at threshold q1/10 and T = 1. The calibration functions steeper than the 1:1 line show the problem of overdispersion of these methods as mentioned above. Using a larger training dataset reduces this problem (Fig. 3e), because more close analogs can be found. The reliability for q1/10 can be improved by using fewer analogs [smaller ensemble; e.g., for the ADF (RMS) method rel = 0.0001 with nanalogs = 10, not shown] because the predictive distribution becomes narrower. However, for q1/3 taking fewer analogs worsens the reliability, especially for longer lead times [e.g., rel = 0.0072 for ADF (RMS) at T = 4 with nanalogs = 10, not shown] because then the forecasts become overconfident (underdispersion of the ensemble).

Regarding the ADF method, the differences of the Brier skill scores between the two types of analogy criteria (RMS and rank difference) are considerably smaller for q1/10 than for the other quantiles. This implies that the rank difference analogy criterion aggravates overdispersion somewhat (Figs. 3c,d,k,l). When the rank difference criterion is used, generally the same number of analogs has to lie above and below the raw forecast. This means that the rank difference criterion partially forces the analog historical forecasts to be more extreme than the current one. Because fewer extreme events exist in the training data, the analogs found are less similar.

Comparing the reliability diagrams of the analog methods, DMO, and logistic regression at T = 4 (Figs. 3 and 4), one can see that for the analog methods the reliability is overall comparable to or better (smaller) than that of logistic regression, while the resolution is generally better (larger) for logistic regression and DMO. Poor resolution signifies that the forecasts discern poorly between different events (Wilks 2006b). As an extreme example, the climatological relative frequencies are the same for all forecast occasions and achieve a reliability of 0 (perfect) but only a resolution of 0 (worst). Better values of resolution can be achieved if the rank difference or the four-neighbor rank difference analogy criterion (rankdiff4) is used. With the RMS criterion, the forecasts are typically shifted toward the climatological mean. They thereby converge slightly toward the climatological relative frequency, which can explain the worse resolution. With the four-neighbor rank difference criterion, the analogs found are locally closer, which improves the resolution.

With analog dressing, the statistical ensemble is dressed around the individual members of the dynamical ensemble. Since the raw dynamical ensemble performs quite well at T = 1, 2, this method is the best among the analog methods for these lead times, although it is still worse than DMO at T = 1. A better RPS than DMO at T = 1 can be achieved by using a larger dataset (n = 3500, 10 000) or smaller regions for the analogy criterion (0- and 2-neighbor rank difference analogy criterion, not shown). For longer forecast periods, the ensemble mean (ADF, kernel estimate) contains sufficient information, and using the entire ensemble (analog dressing) even worsens the performance.

Figure 5 shows the effects of different training sample and ensemble (of analogs) sizes for the ADF method. As expected, there is a strong dependency on the training data length at T = 1, whereas the influence of the ensemble size is weak. However, at longer lead times, the performance of this method is affected strongly by the ensemble size, while the dependency on the training sample size is not clear. If the training dataset is short, an increase improves the RPS. However, if the dataset is already large enough (N > 500), a further enlargement even worsens the score. The reliability stays constant while the resolution shrinks (Figs. 3k,m and 4k,m). A t test, however, shows that this worsening is not statistically significant at a level of 0.05.

Fig. 5. RPS (smaller values are better) of the ADF method for different ensemble and training sample sizes for lead times (a) T = 1 and (b) T = 4. The labels in (b) are also valid for (a).

Finally, Fig. 6 depicts the effects of different region sizes to which the rank difference analogy criterion is applied. The optimal standard deviation of the kernel function depends strongly on the region size. This is obvious, because if fewer grid points are considered, the absolute values of the analogy criteria shrink. Therefore the optimal kernel standard deviation was used for each region size. These are 75, 200, 400, and 800 for 0, 2, 4 neighbors, and all grid points, respectively.

Fig. 6. RPS (smaller values are better) of the three basic analog methods as a function of the "region" size for the rank difference analogy criterion: (a) T = 1 and (b) T = 4. The values on the abscissa specify the number of used neighbors.

At T = 1 the 0-neighbor rank difference analogy criterion works best for all three analog methods. This is expected, because then the forecasts made with these methods display the highest similarity to DMO. Since DMO generally performs better than these approaches at T = 1, this leads to improved scores. However, at T = 4, the results achieved with the 0-neighbor criterion are worse, and the best results are achieved by applying the analogy criterion to all grid points.

5. Summary and conclusions

In this study, three methods of probabilistic weather forecasting were tested and compared in the idealized Lorenz96 model (Lorenz 1996). Two of these, here called analogs of a deterministic forecast (ADF) and analog dressing, were already proposed and applied to precipitation forecasts over the United States by Hamill and Whitaker (2006). Additionally, the ADF approach was expanded by weighting the analogs depending on the closeness of their analogy. Because a kernel function is used for the weighting, this method was called the kernel estimate.

The analog methods were benchmarked against two approaches commonly used to provide probabilistic forecasts through statistically postprocessing ensemble forecasts. Logistic regression was chosen, because Wilks (2006a) showed that it generally performs best in the Lorenz96 system. Since direct model output (DMO) is the method often used in operational weather prediction it was compared as well. An overview of the performance of all methods is given in Table 1.

The promising results of using ADF and analog dressing for real-world precipitation forecasts (Hamill and Whitaker 2006) carry over to the idealized Lorenz96 model. These methods perform very well for long forecast ranges. In contrast, they perform worse than direct model output and logistic regression for short lead times. Also similar to Hamill and Whitaker (2006), analog dressing is slightly better than ADF at shorter lead times, while the ADF approach gives better forecasts for longer prediction periods.

For longer forecast times, the kernel estimate overall performs best. At short lead times analog dressing, DMO, and logistic regression are better than this method. To improve the skill for early time steps, an expansion of this method similar to the expansion of ADF to analog dressing is feasible. Instead of the ADF method, the kernel estimate could be applied separately to each member.

A crucial step of analog forecast methods is the identification of past weather events with forecasts similar to the current one. All three methods showed better results when the rank difference analogy criterion (Hamill and Whitaker 2006) was used instead of a simple root-mean-square measure. The differences are especially large for ADF and the kernel estimate. The region over which similarity is tested should increase with forecast time: for all analog methods, fewer grid points for the criteria are better at short lead times, while for longer forecast periods a medium number of grid points showed the best skill for the ADF and kernel estimate approaches and the use of all grid points worked best for analog dressing. The partly large performance differences between the analogy criteria highlight the importance of choosing an appropriate criterion for real weather forecasting.

Training data sizes equivalent to 50 to 10 000 days were tested. Using a larger training dataset improves the analog methods at short lead times. For longer lead times, however, the approaches did not show better results when a larger training dataset was used.

A positive feature of all tested methods except DMO, although one that cannot be exploited in the Lorenz96 model, is that they can be used to convert NWP forecasts into calibrated forecasts for the point locations of weather stations. An advantage of the analog methods over other ensemble MOS methods such as logistic regression is that they provide a discrete sample from a calibrated distribution. Discrete samples are needed to drive a hydrological ensemble (Schaake et al. 2007).

The Lorenz96 system is a very simple idealization of the atmosphere, and results obtained here cannot be transferred directly to real NWP forecasts. However, two facts suggest that the results presented above might also have relevance for the true atmosphere. First, Wilks (2006a) and Wilks and Hamill (2007) tested ensemble MOS methods in the same model setting as used here and for a real atmospheric variable (temperature), respectively, and both papers show results that are very similar to each other. Second, the results of Hamill and Whitaker (2006), who tested some analog methods for precipitation forecasts, agree with the results of the present study.

Acknowledgments

This research was supported by the Austrian Science Fund (FWF) L615-N10. We thank Steve Mullen, Tom Hamill, and two anonymous reviewers for their very constructive comments.

APPENDIX

Initialization of Ensembles

This appendix provides a detailed explanation of the ensemble initialization in the Lorenz96 model as proposed by Wilks (2005).

Following Anderson (1996), in a good initialization, ensemble members need to lie on the attractor of the system. Investigation of the structure of the attractor can be very difficult. However, an extremely long integration of the model can reveal the nature of the attractor (Anderson 1996). Therefore, Eqs. (1a) and (1b) were integrated long enough to find at least 100 analogs for all of the 20 000 initial points (10 000 for the training data and 10 000 for the test data). In this context an analog is found when each X_k,analog lies within an interval of 5% of the climatological range centered on the X_k of the initial point (Wilks 2005). The covariance matrices of the initial ensemble distribution are computed through

$$\boldsymbol{\Sigma}_{\mathrm{ens}} = \frac{K\left(0.05\,\sigma_{\mathrm{clim}}\right)^{2}}{\sum_{k=1}^{K}\lambda_{k}}\;\boldsymbol{\Sigma}_{\mathrm{analog}}, \qquad (\mathrm{A1})$$

where Σ_analog is the (K × K) covariance matrix of the analogs, computed separately for each initial point, and λ_k are the eigenvalues of Σ_analog. The matrices Σ_ens have the same correlations and eigenvectors as Σ_analog, but are scaled such that the standard deviation in each of the K directions is 5% of the climatological standard deviation (σ_clim = 5.07; Wilks 2005).
To simulate the imperfection of the analysis, the "true" values of X_k are perturbed with

$$\mathbf{X}^{a} = \mathbf{X}^{\mathrm{true}} + \mathbf{L}\,\mathbf{z} \qquad (\mathrm{A2})$$

for all initial points. Here L denotes the Cholesky factor of Σ_ens (i.e., L Lᵀ = Σ_ens) and z is a K = 8 dimensional vector of independent Gaussian random draws that is different for each initial point.
Centered on this analysis, ensembles are initialized through

$$\mathbf{X}^{\mathrm{ens}}_{j} = \mathbf{X}^{a} + \mathbf{L}\,\mathbf{z}_{j}, \qquad j = 1, \ldots, n_{\mathrm{ens}}, \qquad (\mathrm{A3})$$

where again the z_j are vectors of independent Gaussian random draws.
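A sketch of this initialization procedure; the trace-based rescaling in Eq. (A1), the use of numpy's Cholesky routine, and the function name are assumptions of this sketch rather than details confirmed by the study.

```python
import numpy as np

def init_ensemble(x_true, analogs, n_ens=51, frac=0.05, sigma_clim=5.07, rng=None):
    """Initialize an ensemble around a perturbed analysis, Eqs. (A1)-(A3).

    x_true  : (K,)   true state at the initial time
    analogs : (M, K) at least 100 attractor analogs of the initial point
    """
    rng = np.random.default_rng() if rng is None else rng
    x_true = np.asarray(x_true, dtype=float)
    K = x_true.size
    cov = np.cov(np.asarray(analogs, dtype=float), rowvar=False)  # analog covariance matrix
    # rescale so that the total variance equals K * (frac * sigma_clim)^2,
    # keeping correlations and eigenvectors (one reading of Eq. (A1))
    cov *= K * (frac * sigma_clim) ** 2 / np.trace(cov)
    L = np.linalg.cholesky(cov)
    # Eq. (A2): perturb the truth to obtain the analysis
    x_analysis = x_true + L @ rng.standard_normal(K)
    # Eq. (A3): draw n_ens members centered on the analysis
    return x_analysis + (L @ rng.standard_normal((K, n_ens))).T
```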

REFERENCES

• Anderson, J. L., 1996: Selection of initial conditions for ensemble forecasts in a simple perfect model framework. J. Atmos. Sci., 53, 22–36.
• Bishop, C. H., 2008: Bayesian model averaging's problematic treatment of extreme weather and a paradigm shift that fixes it. Mon. Wea. Rev., 136, 4641–4652.
• Bröcker, J., and L. A. Smith, 2007: Increasing the reliability of reliability diagrams. Wea. Forecasting, 22, 651–661.
• Fortin, V., A.-C. Favre, and M. Said, 2006: Probabilistic forecasting from ensemble prediction systems: Improving upon the best-member method by using a different weight and dressing kernel for each member. Quart. J. Roy. Meteor. Soc., 132, 1349–1369.
• Gneiting, T., A. E. Raftery, A. H. Westveld, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118.
• Hagedorn, R., T. Hamill, and J. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part I: Two-meter temperatures. Mon. Wea. Rev., 136, 2608–2619.
• Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560.
• Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. Mon. Wea. Rev., 134, 3209–3229.
• Hamill, T. M., J. S. Whitaker, and X. Wei, 2004: Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. Mon. Wea. Rev., 132, 1434–1447.
• Hamill, T. M., J. S. Whitaker, and S. L. Mullen, 2006: Reforecasts: An important dataset for improving weather predictions. Bull. Amer. Meteor. Soc., 87, 33–46.
• Hamill, T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. Mon. Wea. Rev., 136, 2620–2632.
• Hastie, T., R. Tibshirani, and J. Friedman, 2001: The Elements of Statistical Learning. 1st ed. Springer, 524 pp.
• Lorenz, E. N., 1996: Predictability—A problem partly solved. Proc. ECMWF Seminar on Predictability, Vol. 1, Reading, United Kingdom, ECMWF, 1–18.
• Lorenz, E. N., 2005: Designing chaotic models. J. Atmos. Sci., 62, 1574–1587.
• Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.
• Orrell, D., 2003: Model error and predictability over different timescales in the Lorenz96 systems. J. Atmos. Sci., 60, 2219–2228.
• Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133, 1155–1174.
• Richardson, D., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 126, 649–667.
• Roulston, M. S., and L. A. Smith, 2003: Combining dynamical and statistical ensembles. Tellus, 55A, 16–30.
• Schaake, J., T. Hamill, R. Buizza, and M. Clark, 2007: HEPEX: The Hydrological Ensemble Prediction Experiment. Bull. Amer. Meteor. Soc., 88, 1541–1547.
• Smith, L., 2001: Disentangling uncertainty and error: On the predictability of nonlinear systems. Nonlinear Dynamics and Statistics, A. I. Mees, Ed., Birkhauser, 31–64.
• Wang, X., and C. H. Bishop, 2005: Improvement of ensemble reliability with a new dressing kernel. Quart. J. Roy. Meteor. Soc., 131, 965–986.
• Wilks, D. S., 2005: Effects of stochastic parametrizations in the Lorenz '96 system. Quart. J. Roy. Meteor. Soc., 131, 389–407.
• Wilks, D. S., 2006a: Comparison of ensemble-MOS methods in the Lorenz '96 setting. Meteor. Appl., 13, 243–256.
• Wilks, D. S., 2006b: Statistical Methods in the Atmospheric Sciences. 2nd ed. International Geophysics Series, Vol. 91, Academic Press, 648 pp.
• Wilks, D. S., 2009: Extending logistic regression to provide full-probability-distribution MOS forecasts. Meteor. Appl., 16, 361–368.
• Wilks, D. S., and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. Mon. Wea. Rev., 135, 2379–2390.
• Ziehmann, C., 2000: Comparison of a single-model EPS with a multi-model ensemble consisting of a few operational models. Tellus, 52A, 280–299.
1 Here dynamical ensembles have to be distinguished from statistical ensembles. A dynamical ensemble is a set of NWP model forecasts started from slightly different initial conditions and (possibly) run with different model physics while statistical ensembles are samples of historical data. In the following, subscript ens is used exclusively for the dynamical ensemble while subscripts analogs and dens are used for statistical ensembles of analogs.

2 Except for the first member (initialized with the analysis). However, the difference is marginal and therefore disregarded.
