## 1. Introduction

In general, verification of weather forecasts is said to serve two purposes: to ensure the quality of an existing weather forecasting system and to compare two or more forecasting systems objectively (Murphy 1991; Jolliffe and Stephenson 2011). Two further purposes are often derived from these: to estimate the lead time at which the forecasts lose their information content (level of predictability; Murphy and Winkler 1987) and to provide the developers of the forecasting system with information about the spatiotemporal structure and amplitude of systematic and random errors for further improvements of the forecasting system.

The statistical verification methods applied are to a very large extent based on measures between forecasted and observed values for a single station, single variable, and single forecasting lead time. Classical univariate statistics are used for continuous, Gaussian distributed, and positively correlated observations and forecasted values (Jolliffe and Stephenson 2011). In the case of dichotomous variables, discrete bivariate statistics based on the joint probabilities estimated by contingency tables are used (Stephenson and Doblas-Reyes 2000).

Over the past two decades, weather forecasting has been extended to predict not only a single value but also an estimate of its uncertainty through elaborate methods in ensemble forecasting. The predictive element is no longer a single number but a probability density function (PDF) or cumulative distribution function (CDF). This requires an extended approach to verification because a function (the predictive PDF) has to be compared to a single observation in order to evaluate the information content of the forecast. The review by Gneiting and Raftery (2007) provides an in-depth discussion of the problem based on the so-called proper scores, which form the basis for an adequate verification of predictive PDFs/CDFs not only in meteorology and weather forecasting but also in econometrics (Tsyplakov 2013).

The use of single station, single variable, and single lead time verification neglects the spatiotemporal dependency structures present in forecasts and observations. These dependencies arise from the physical laws connecting different variables at neighboring points in space and time. Including these dependencies can have considerable influence on the estimated level of predictability (Röpnack et al. 2013). Ignoring them is synonymous with ignoring the physical background of atmospheric dynamics in verification. The necessary statistics for this verification are called multivariate (Anderson 1958), irrespective of whether they regard different variables, different spatiotemporal grid points, or both. The unpredictable variability of a point forecast in the univariate analysis can well disguise the predictable part. Nevertheless, the predictable signal can be enhanced considerably by using appropriate spatiotemporal smoothing or averaging, based on the inherent correlation structure of the unpredictable component as an interdependency measure. In climate research, this is known as optimal fingerprinting (Hasselmann 1993) or predictable component analysis (DelSole and Tippett 2007). The construction of climate indices (North Atlantic Oscillation and Southern Oscillation) is another way to deal with predictability on seasonal to decadal time scales (Stenseth et al. 2003). Fingerprinting can furthermore be used to test for physical processes (Jonko et al. 2010; Paeth et al. 2006) responsible for predictability. Thereby it provides insights into possible mechanisms of predictability or the processes behind systematic errors. Although more complex, the extension of verification to spatiotemporal fields represented by grid points is an exercise that can provide useful information, especially for predictability studies and the inclusion of physical process studies into verification. Moreover, it always includes the classical univariate approach as a special case.

Combining this with the uncertainty estimation via ensemble prediction leads to the study of multivariate predictive PDFs and their verification by a given field of observations. As a first step, it requires the postprocessing of the ensemble realizations in order to derive the predictive PDF. As a second step, the selection of appropriate score functions and scores for the verification part is necessary. We will apply both steps to existing ensemble predictions from The Observing System Research and Predictability Experiment (THORPEX) Interactive Grand Global Ensemble (TIGGE; Bougeault et al. 2010) database to assess the quality of the individual ensembles and to compare the selected ensemble prediction systems (EPS) objectively.

The multivariate methods proposed are extensions of known postprocessing methods from climate model analysis. Schölzel and Hense (2011) introduced the multivariate Gaussian ensemble kernel dressing (EKD) as a multivariate extension to the usual Gaussian ensemble kernel dressing (Wilks 2002) in order to account for ensemble uncertainty. As the estimated variance-covariance matrix is based on a well-defined noise model, this method allows for a prewhitening operation in order to improve the estimation of the unpredictable noise component. The ensemble kernel dressing, analogous to its univariate version, is able to resolve the set of ensemble realizations into subset events of different probabilities. Schölzel and Hense (2011) applied this method to climate projections. In this study, the multivariate ensemble kernel dressing method is applied to weather forecasts in a manner similar to Röpnack et al. (2013) and compared to a simple multivariate Gaussian distribution fit (GDF). The univariate PDFs can easily be obtained through marginalization (integrating out from the predictive PDF all other variables except the chosen one).

Recently, a few multivariate approaches have been applied to weather forecasts but mostly refer to the aspect of treating different models or parameters instead of focusing on spatiotemporal dependencies; for example, Möller et al. (2013) use a univariate Bayesian model averaging approach for different variables and combine them using a Gaussian copula model to obtain a multivariate predictive PDF. Gneiting et al. (2008) and Schuhen et al. (2012) used a bivariate PDF with respect to wind directions as a basis for postprocessing of two-dimensional vector wind forecasts.

As an additional extension to these multivariate ensemble postprocessing methods, the application of the Graphical Least Absolute Shrinkage and Selection Operators (GLASSO) algorithm introduced by Friedman et al. (2008) is proposed. This is a parameterization of the inverse covariance (or precision) matrix that starts from the sample covariance matrix, which is usually singular and therefore noninvertible. The precision matrix is the relevant matrix that determines the predictive distribution. It has sometimes been argued (Chervin 1981) that it is not possible to perform Gaussian multivariate statistics on typical climate and weather forecasting models due to the singular character of the sample covariance matrix arising from the large dimension of the phase state vector. The GLASSO algorithm provides an attractive alternative by estimating a sparse and nonsingular precision matrix. The inverse of this matrix is, in terms of an appropriate matrix norm, as close as possible to the original maximum likelihood (ML) sample covariance matrix. Sparseness is controlled by a regularization parameter, which allows the sensitivity of the results to the GLASSO parameterization of the precision matrix to be studied. GLASSO has not been used frequently in meteorological statistics. It is more often applied in the fields of network analysis and systems biology (e.g., Menéndez et al. 2010).

As mentioned above, verification methods determine the quality of individual EPS or evaluate differences between EPS. Murphy and Winkler (1987) noted that verification should be done in terms of probabilistic approaches that use the respective PDFs of model and observation. The basis of probabilistic forecast verification is the concept of strictly proper scores and their scoring rules (Gneiting and Raftery 2007), which allow the above two tasks to be evaluated as objectively as possible. Important aspects of assessing the quality of probabilistic forecasts are reliability/calibration, resolution/sharpness, and uncertainty, as explained, for example, by Bröcker (2009). Resolution or sharpness is a measure for the systematic deviation between the probabilistic information of all observations (climatology) and that subset of observations that are selected on the basis of a given specific probabilistic forecast. Reliability is a measure of the similarity between the actual probabilistic forecast and that same subset of observations, both averaged over all forecasts. Uncertainty characterizes the overall probabilistic information content of observations when using a climatological forecast that is by definition perfectly reliable. For a negatively oriented score (“the smaller the better”), Bröcker (2009) has shown that any strictly proper score can always be decomposed into uncertainty plus reliability minus resolution, with all three components being positive numbers by the propriety property. A skillful forecasting system is characterized by large resolution reducing the unavoidable uncertainty without being affected by a large (un)reliability component. The estimation of these components from a joint forecast/observation dataset has been described for dichotomous and univariate variables (Murphy 1973; Hersbach 2000), but is as yet unsolved for multivariate predictive PDFs. Therefore, we cannot present a decomposition into uncertainty, reliability, and resolution.

However, the actual computation of the score from its score function often provides two different aspects: a general spread information of the predictive PDF and a similarity measure between the predictive PDF and the observations. The average of the first can be identified as a generalized entropy, whereas the average of the second is a so-called divergence between two PDFs. We will present this decomposition as additional information about the quality of the forecast and its verification.

Scores such as the Brier score (BS) and the continuous ranked probability score (CRPS) are univariate methods that verify the ensemble forecast at the single point, single variable, and single lead time level. A score can be interpreted relative to the score of a reference forecast such as a climatological forecast or persistence, thereby defining a skill score. Additionally, the shape of the analysis rank histogram (ARH) or the related probability integral transform (PIT) histogram is a necessary but not sufficient indicator for reliability of the given ensemble forecast on a univariate basis (Hamill 2001). Keller and Hense (2011) introduced the *β* score, which determines the dispersion of the ensemble at a given lead time. However, Murphy and Epstein (1989) as well as Whitaker and Loughe (1998) showed that the spread of the ensemble may not be related to the predictive skill of the forecast. Moreover, these verification methods are of univariate dimension and do not evaluate multivariate fields. We employ the energy score by Gneiting et al. (2008) as a generalization of the CRPS in order to verify multivariate quantities. Regarding station forecasts, we will consider the complete forecast time series between initial and final lead time as a multidimensional vector including its temporal autocorrelation (cf. Fig. 1). Such a construction is known from dynamical systems theory as a delay vector (Packard et al. 1980; Takens 1981). Additionally, the evaluation of the probabilistic forecasts will be characterized in terms of a single value (i.e., the energy score). Besides the amplitudes at each lead day, the method evaluates the temporal dependence structures in forecasts over the lead time with respect to observations. These dependence structures are explicitly estimated for each forecast and also for the climatological reference forecast consisting of observations.

Accounting for temporal dependence structures allows us to answer questions such as the following: How accurate and successful is a single, complete forecast up to a selected lead time? Given a set of different forecasts, can periods of more or less successful forecasts be identified, thus indicating variations in predictability?

First, we present two multivariate extensions of known postprocessing methods in order to derive the predictive PDFs from the raw ensemble predictions: EKD and multivariate GDF. We apply these methods to station forecasts using the delay vector construction for a single station and a single variable but for multiple lead times. Second, the probabilistic verification is performed using the energy score and its skill score relative to a climatological forecast. The climatological PDF is constructed in the same way as the predictive one. Both EKD and GDF need the precision matrix as an uncertainty parameter. The precision matrix is estimated using the GLASSO algorithm. These methods are applied to four EPS from the TIGGE database for the period July 2010–June 2011.

In summary, the main objectives of this paper are the following:

Demonstrate multivariate postprocessing and verification methods for ensemble weather forecasts with focus on temporal developments.

Introduce new multivariate postprocessing techniques for ensemble weather forecasts (i.e., combinations of well-known methods: GDF and EKD) with the GLASSO parameterization in order to obtain better estimates of the error covariance in time.

Define the additional benefit of the multivariate assessment compared to univariate methods.

The examined data are described in section 2, whereas the multivariate postprocessing and verification methods are described in section 3. The study can be divided into two parts. First, the results and attributes of the two multivariate postprocessing methods, the original EPS forecasts, and the application of GLASSO to the inverse covariances are determined and analyzed. Second, the uncertainties of station forecasts are evaluated in terms of univariate and multivariate verification methods and the differences between the examined EPS are outlined. These results are the subject of section 4. Last, the additional benefit of the multivariate assessment is outlined in section 5.

## 2. Forecast and verification data

### a. Forecast data

Within this study, four global EPS with medium-range weather forecasts are examined:

The National Centers for Environmental Prediction (NCEP) ensemble prediction system, Global Ensemble Forecast System (GEFS; http://www.ncep.noaa.gov).

The Canadian Meteorological Centre (CMC) Global Ensemble Prediction System (GEPS; http://www.weatheroffice.gc.ca/canada_e.html).

The Met Office Global and Regional Ensemble Prediction System (MOGREPS, later on referred to as MO-MOGREPS; http://www.metoffice.gov.uk/).

The European Centre for Medium-Range Weather Forecasts (ECMWF) EPS (http://www.ecmwf.int/).

Main features of the TIGGE ensembles used in this study (for July 2010–June 2011).

The sample period contains 365 model runs from 0000 UTC over one year from 1 July 2010 to 30 June 2011. The lead time is 10–16 days with a temporal resolution of 6 h, depending on the EPS. The forecasts are reduced to daily averages. The methods proposed in section 3 are applied to two near-surface parameters: the 2-m temperature and the anomalies of surface pressure relative to the respective local mean, which avoids biases due to differing station heights in forecasts and observations.

### b. Observations and climatological data

For verification, daily means of observational data from the Deutscher Wetterdienst (DWD) synoptic stations (retrieved from the WebWerdis data portal online at http://www.dwd.de/webwerdis) for the described sample period are used. Additionally, 30 years of daily mean observations from 1979 to 2009 are taken as a reference for climate. The years 1979–2009 are chosen as the most recent years in comparison to the sample period of model runs. To increase the size of the reference climatological ensemble forecasts to 150 realizations, a window of ±2 days around the examined date for each year is chosen.

Because of model errors and insufficient model resolution, the ensemble forecasts are biased, implying the need for a bias correction. The mean bias of the 2-m temperature over the observed period shows a linear trend over the lead time. A model with mean and linear trend was fitted for each station over the whole period. This is used to remove the bias from the predictive information. As the ECMWF-EPS pressure anomaly forecast shows an unusual bias at a lead time of 10 days, probably due to the change in horizontal resolution in the forecast model, we restrict the ECMWF-EPS to 10-day forecasts.
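As an illustration of this bias model, the following Python sketch fits a mean plus linear trend in lead time by ordinary least squares and subtracts it from a forecast. It is not the code used in this study; the function names and the toy numbers are hypothetical, and it assumes that the mean bias per lead day (forecast minus observation, averaged over all model runs at one station) has already been computed.

```python
def fit_linear_bias(lead_days, mean_bias):
    """Least-squares fit of bias = intercept + slope * lead_day."""
    n = len(lead_days)
    t_mean = sum(lead_days) / n
    b_mean = sum(mean_bias) / n
    sxx = sum((t - t_mean) ** 2 for t in lead_days)
    sxy = sum((t - t_mean) * (b - b_mean) for t, b in zip(lead_days, mean_bias))
    slope = sxy / sxx
    intercept = b_mean - slope * t_mean
    return intercept, slope


def remove_bias(forecast, lead_days, intercept, slope):
    """Subtract the fitted bias from one forecast series (one value per lead day)."""
    return [f - (intercept + slope * t) for f, t in zip(forecast, lead_days)]


# Toy data (hypothetical): a bias of 0.5 K that grows by 0.1 K per lead day.
lead_days = list(range(1, 11))
mean_bias = [0.5 + 0.1 * t for t in lead_days]
intercept, slope = fit_linear_bias(lead_days, mean_bias)

# Apply the correction to every lead day of a constant 280 K forecast.
corrected = remove_bias([280.0] * 10, lead_days, intercept, slope)
```

The same correction would be applied to every ensemble member, so that the ensemble spread is left unchanged and only the location of the ensemble is shifted.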

Note that we do not distinguish between a training and a verification dataset as the proposed postprocessing methods are independent of the observations. Thus, the full sample period from July 2010 to June 2011 is used for verification.

## 3. Methods

The method section can be subdivided into two parts. The first part describes the proposed multivariate postprocessing and the second part describes the verification methods. The procedure is as follows: in a first step, we fit a multivariate distribution to each forecast separately (i.e., to each EPS and each forecast in the sample period). This multivariate distribution can be either a Gaussian distribution or an ensemble kernel dressing. Furthermore, as the estimation of the inverse covariance matrix for any multivariate distribution can be biased, we apply a GLASSO regularization to this inverse covariance matrix and estimate the predictive distribution. In a second step, the resulting multivariate distributions are verified with observations.

### a. Multivariate ensemble postprocessing

For calculating a multivariate predictive PDF based on a Gaussian approach, a full rank covariance matrix has to be estimated. Two multivariate approaches for estimating a multivariate PDF and full covariance matrices are presented. The approaches are analogous to univariate methods, but involve temporal or spatial dependencies.

The probabilistic information to be derived will be the probability density of the random vector with dimension *q* written in column form **X**: *f*_{X}(**x**, *θ*), where *θ* describes the set of parameters characterizing the density *f* and **x** is a realization of **X**.

#### 1) Gaussian distribution fit

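The first postprocessing method is a multivariate Gaussian distribution fit [**X** ∼ *N*(***μ***, **Σ**)]. The probability density for the multivariate Gaussian distribution is

$$
f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{q/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right].
$$

The parameters ***μ***_{GDF} and **Σ**_{GDF} are estimated from the ensemble realizations of size *m* with dimension *q* using maximum likelihood estimation (von Storch and Zwiers 1999, p. 83). Assume that **x**_{i} is a column vector of real numbers and contains the prediction of ensemble member *i* of one sample with dimension *q*. Let the *m* samples be independently and identically distributed (iid). Then the ML estimator for the expectation value ***μ*** reads

$$
\hat{\boldsymbol{\mu}}_{\mathrm{GDF}} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{x}_i ,
$$

and the ML estimator for the covariance matrix is

$$
\hat{\boldsymbol{\Sigma}}_{\mathrm{GDF}} = \frac{1}{m} \sum_{i=1}^{m} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_{\mathrm{GDF}}) (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_{\mathrm{GDF}})^{\mathrm{T}}
$$

with dimension *q* × *q*. The full derivation of a multivariate Gaussian distribution fit and the estimation of the covariance can be found in any multivariate statistical reference, such as von Storch and Zwiers (1999) or Anderson (1958). In the present case, we fit the GDF to each forecast and for each EPS separately and obtain 365 *q*-dimensional PDFs that include the temporal autocorrelation of each forecast. Thus, each PDF describes the development of the respective ensemble forecast over lead time. The GDF can be related to the nonhomogeneous Gaussian regression (NGR; Gneiting et al. 2005) that is, in the case of temporal forecasts, univariate in time and is characterized by a variable variance. In contrast, the GDF considers a fixed temporal covariance matrix but the variances can change at any of the *q* time steps up to the lead time.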
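The ML estimation of the GDF parameters can be sketched in a few lines of Python. This is an illustration only, not the R code used in this study; the function name and the toy ensemble are hypothetical. Each ensemble member is treated as one *q*-dimensional vector (one value per lead day), and the covariance is normalized by *m*, as appropriate for the ML estimator.

```python
def fit_gaussian(ensemble):
    """ML estimates of the mean vector and the covariance matrix.

    ensemble: list of m members, each a list of q values (one per lead day).
    """
    m = len(ensemble)
    q = len(ensemble[0])
    mean = [sum(member[k] for member in ensemble) / m for k in range(q)]
    cov = [
        [
            sum((member[k] - mean[k]) * (member[l] - mean[l]) for member in ensemble) / m
            for l in range(q)
        ]
        for k in range(q)
    ]
    return mean, cov


# Toy ensemble (hypothetical): m = 3 members, q = 2 lead days.
ensemble = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]
mean, cov = fit_gaussian(ensemble)
```

Note that the estimated covariance matrix has rank at most *m* − 1, so that it is singular whenever the ensemble size does not exceed the number of lead days.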

#### 2) Ensemble kernel dressing

Similar to the univariate kernel dressing and closely related to kernel density estimation (Silverman 1986) but extended to a multivariate approach, the Gaussian ensemble kernel dressing is used as a second postprocessing method. In contrast to the multivariate Gaussian distribution fit the final predictive PDF will be a non-Gaussian density in general. The approach has been presented by Schölzel and Hense (2011) and applied by Röpnack et al. (2013). In this section only the main idea is depicted.

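For each ensemble realization **x**_{i}, *i* = 1, …, *m*, the internal noise component *ϵ*_{i}, which is assumed to be additive to the “true” model state **x**, has to be considered:

$$
\mathbf{x}_i = \mathbf{x} + \boldsymbol{\epsilon}_i ,
$$

where *ϵ*_{i} is a realization of a random vector ***ε*** with expectation zero and a density function *f*_{ε}(*ϵ*). Then the density function for **X** can be defined as the sum of the simulations dressed with the noise density *f*_{ε}(*ϵ*):

$$
f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{m} \sum_{i=1}^{m} f_{\varepsilon}(\mathbf{x} - \mathbf{x}_i).
$$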

The EKD estimate of the predictive PDF is a mixture model of the noise model where the individual components are centered on the ensemble realizations. The predictive PDF can be multimodal if the phase-space distance between the individual simulations is larger than the spread of the noise.

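To estimate the noise covariance matrix **Σ**_{ϵ} with dimension *q* × *q*, a prewhitening operation is applied to all individual simulations by considering all possible differences **x**_{i} − **x**_{j}. This should remove the true model state **x** and leave estimates of *ε*_{i} − *ε*_{j} (Schölzel and Hense 2011):

$$
\hat{\boldsymbol{\Sigma}}_{\epsilon} = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j=1}^{i-1} (\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^{\mathrm{T}} .
$$

Assuming a Gaussian noise density, the covariance matrix of each kernel is scaled as **Σ**_{EKD} = *h*_{opt} × **Σ**_{ϵ}. Based on the theory of Silverman (1986), the factor *h*_{opt} is calculated as

$$
h_{\mathrm{opt}} = \left[\frac{4}{m(q+2)}\right]^{2/(q+4)} .
$$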

Again, we obtain a *q*-dimensional, multivariate PDF for each forecast and each EPS.
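The two ingredients of the kernel covariance can be illustrated with the following Python sketch. It is not the implementation used in this study. It assumes that the noise covariance is estimated from all pairwise differences **x**_{i} − **x**_{j} with the normalization 1/[*m*(*m* − 1)], and that the Silverman factor takes the form [4/(*m*(*q* + 2))]^{2/(*q*+4)}, so that it scales the covariance matrix directly; both forms are assumptions about the exact expressions of Schölzel and Hense (2011). The function name and the toy ensemble are hypothetical.

```python
def ekd_kernel_covariance(ensemble):
    """Noise covariance from pairwise differences and its Silverman scaling.

    ensemble: list of m members, each a list of q values (one per lead day).
    Returns (sigma_eps, h_opt, sigma_ekd).
    """
    m = len(ensemble)
    q = len(ensemble[0])
    acc = [[0.0] * q for _ in range(q)]
    # Prewhitening: each difference x_i - x_j removes the common model state.
    for i in range(m):
        for j in range(i):
            d = [ensemble[i][k] - ensemble[j][k] for k in range(q)]
            for k in range(q):
                for l in range(q):
                    acc[k][l] += d[k] * d[l]
    # Each difference has covariance 2 * Sigma_eps and there are m(m-1)/2 pairs
    # (assumed normalization).
    sigma_eps = [[v / (m * (m - 1)) for v in row] for row in acc]
    # Assumed form of the Silverman factor, applied to the covariance matrix.
    h_opt = (4.0 / (m * (q + 2))) ** (2.0 / (q + 4))
    sigma_ekd = [[h_opt * v for v in row] for row in sigma_eps]
    return sigma_eps, h_opt, sigma_ekd


# Toy ensemble (hypothetical): m = 3 members, q = 2 lead days.
ensemble = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]
sigma_eps, h_opt, sigma_ekd = ekd_kernel_covariance(ensemble)
```

With this normalization, the pairwise-difference estimate coincides with the usual sample covariance normalized by *m* − 1. Because *h*_{opt} is smaller than one for typical ensemble sizes, each kernel is narrower than the ensemble as a whole, which is what allows the resulting mixture to be multimodal.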

#### 3) Estimation of sparse inverse covariance matrices

Neither GDF nor EKD explicitly requires the estimation of the covariance matrix. Rather, it is necessary to know its inverse, the precision matrix. The seemingly direct way of calculating the inverse of the ordinary ML covariance matrix estimate leads to an asymptotically unbiased estimate of the precision matrix only if *q* ≪ *m* (i.e., the number of lead days *q* is much smaller than the number of ensemble realizations *m*). However, for many real-world applications, the number of lead days is not significantly smaller than the number of ensemble realizations. In this case, the estimated precision matrix is strongly biased (Anderson 1958; Mazumder and Hastie 2012). To overcome this problem, we apply GLASSO to the estimated precision matrix. GLASSO estimates a sparse precision matrix by an *L*_{1} (lasso) regularization. The inverse of this precision matrix is as close as possible to the ML covariance matrix, measured in an appropriate matrix norm.

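The sparse precision matrix **C** is obtained as the solution of the *L*_{1} regularized negative log-likelihood minimization problem (Friedman et al. 2008):

$$
\hat{\mathbf{C}} = \arg\min_{\mathbf{C} \succ 0} \left\{ \mathrm{tr}(\hat{\boldsymbol{\Sigma}} \mathbf{C}) - \log\det \mathbf{C} + \rho\, \|\mathbf{C}\|_{1} \right\},
$$

where *ρ* > 0 is the regularization parameter, **Σ̂** is the ML sample covariance matrix, and ‖**C**‖_{1} is the sum of the absolute values of the entries of **C**. The strength of the *L*_{1} penalty term is determined by the regularization parameter *ρ*. With *ρ* = 0 there is no regularization. Increasing *ρ* increases the number of entries in the precision matrix being exactly zero (the sparsity) due to the *L*_{1} penalty. For detailed derivations, see Friedman et al. (2008), Meinshausen and Bühlmann (2006), and Mazumder and Hastie (2012).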

We apply GLASSO to both methods (i.e., GDF and EKD) using systematically varied regularization parameters *ρ*.
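The role of the regularization parameter can be illustrated by evaluating the penalized negative log-likelihood of Friedman et al. (2008), tr(**Σ̂C**) − log det **C** + *ρ*‖**C**‖_{1}, for two candidate precision matrices. The following Python sketch is restricted to the 2 × 2 case and is not a GLASSO solver; the function name and the numbers are hypothetical.

```python
import math


def glasso_objective(precision, sample_cov, rho):
    """Penalized negative log-likelihood for a 2 x 2 precision matrix."""
    (a, b), (c, d) = precision
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("precision matrix must be positive definite")
    # tr(S C) = sum over i, k of S[i][k] * C[k][i]
    trace = sum(sample_cov[i][k] * precision[k][i] for i in range(2) for k in range(2))
    # L1 penalty: sum of the absolute values of all entries of C.
    penalty = rho * sum(abs(v) for row in precision for v in row)
    return trace - math.log(det) + penalty


# Hypothetical sample covariance with correlation 0.5 between two lead days.
sample_cov = [[1.0, 0.5], [0.5, 1.0]]
dense = [[4.0 / 3.0, -2.0 / 3.0], [-2.0 / 3.0, 4.0 / 3.0]]  # exact inverse of sample_cov
sparse = [[1.0, 0.0], [0.0, 1.0]]                            # off-diagonal entry set to zero

unpenalized = (glasso_objective(dense, sample_cov, 0.0),
               glasso_objective(sparse, sample_cov, 0.0))
penalized = (glasso_objective(dense, sample_cov, 1.0),
             glasso_objective(sparse, sample_cov, 1.0))
```

Without a penalty, the exact inverse of the sample covariance has the lower objective value; with *ρ* = 1, the sparse candidate is preferred. An actual application would rely on a dedicated solver such as the glasso package in R rather than on a comparison of hand-picked candidates.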

The results of the estimated precision matrices are examined using partial correlations. Partial correlations are defined as the entries of the appropriately normalized precision matrix. In contrast to marginal correlations (which are the entries of the normalized covariance matrix), the partial correlation between two variables describes their linear dependency after the linear dependencies on the remaining *q* − 2 variables have been removed from those two. For two variables *x*_{i} and *x*_{j} (*i* ≠ *j*), the partial correlation is the negative off-diagonal entry *c*_{ij} of the precision matrix scaled with the diagonal entries *c*_{ii} and *c*_{jj}, which are the inverse partial variances:

$$
r_{x_i x_j \mid \mathbf{x}_{\setminus i,j}} = -\frac{c_{ij}}{\sqrt{c_{ii}\, c_{jj}}} ,
$$

where **x**_{\i} denotes all *x*_{k} except for *x*_{i}, so that the conditioning on **x**_{\i,j} indicates that the effect of all other variables on the correlation of *x*_{i} and *x*_{j} has been excluded.

Figure 2 compares the partial correlations for one specific forecast of the NCEP GEFS with different regularization parameters. A regularization parameter *ρ* = 0 corresponds to the partial correlation matrix derived from the ML sample estimate of the covariance matrix by simply inverting it. The figure clearly shows the intended effect of GLASSO, namely an increase in the sparsity of the precision matrix with increasing *ρ*. We will discuss this further in section 4.
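The normalization of the precision matrix to partial correlations, that is, minus the off-diagonal entry *c*_{ij} divided by the square root of *c*_{ii}*c*_{jj}, can be sketched as follows. The Python function and the tridiagonal toy matrix are hypothetical and serve only to illustrate the difference between partial and marginal correlations.

```python
import math


def partial_correlations(precision):
    """Partial correlation matrix from a precision matrix with entries c_ij."""
    q = len(precision)
    return [
        [
            1.0 if i == j
            else -precision[i][j] / math.sqrt(precision[i][i] * precision[j][j])
            for j in range(q)
        ]
        for i in range(q)
    ]


# Hypothetical sparse (tridiagonal) precision matrix for q = 3 lead days.
precision = [[2.0, -1.0, 0.0],
             [-1.0, 2.0, -1.0],
             [0.0, -1.0, 2.0]]
pc = partial_correlations(precision)
```

In this toy case, the partial correlation between lead days 1 and 3 is zero, although the corresponding covariance matrix (the inverse of the precision matrix) implies a marginal correlation of 1/3 between them: the two days are related only through the intermediate day. This is the kind of structure that a sparse precision matrix expresses.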

Fig. 2. Estimated covariances and partial correlations for an NCEP GEFS temperature forecast (model run 29 Nov 2010) over 16 lead days for the Hamburg station. The first matrix depicts (a) the estimated covariance for the MLE and (d) its inverse illustrated as partial correlations. Applying GLASSO with a varying regularization parameter to this precision matrix yields the partial correlations for (e) *ρ* = 0.1 and (f) *ρ* = 0.7. Inverting the corresponding precision matrices leads to the covariance matrices for (b) *ρ* = 0.1 and (c) *ρ* = 0.7. Note the increase in variance in (b) and (c) on the diagonal for lead days 1–4 and the reduction in covariance between lead days 1–4 and the remaining days.

Citation: Monthly Weather Review 142, 11; 10.1175/MWR-D-14-00015.1


### b. Probabilistic verification methods

The presented ensemble postprocessing methods are verified with observations, and the skill of the forecasts with respect to a climatological forecast can be defined. We verify the original ensemble predictions in comparison to the multivariate probabilistic approaches and compare the different EPS. It is recalled that the proposed postprocessing methods are independent of the observations used for verification and that the resulting predictive densities are not calibrated using observations.

The quality of an ensemble forecast in terms of accuracy defines the pairwise agreement of forecast and observation and is often described by scores. The use of strictly proper scoring rules is recommended (Gneiting and Raftery 2007). Attributes like reliability, discrimination, resolution, and sharpness are more often used to describe forecast quality and are defined according to the analyzed problem (e.g., Murphy 1993).

To check the overall quality of each EPS, we split verification into a necessary and a sufficient part. Potential reliability and sharpness are necessary conditions for a reliable forecast with sufficient resolution. These estimates are obtained using the analysis rank histogram and the predicted ensemble variance, which provide information about the spread of the ensemble relative to that of the observations. A flat ARH and a predicted ensemble variance smaller than that of a forecast based on climatic information are necessary, but not sufficient, conditions for obtaining reliable and sharp predictions given individual observations.

Especially for short lead times, the dispersion of ensemble forecasts is known to be commonly too small (Hamill and Colucci 1997). One question concerns whether or not the two postprocessing methods are able to improve this uncertainty estimate.

Note that the observation is assumed to be perfect although it is uncertain due to representation or measuring errors and differences in time and space. A comparative analysis with observational uncertainty included was studied by Röpnack et al. (2013).

#### 1) Analysis rank histogram and *β* score

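The analysis rank histogram counts the rank of the verifying observation within the ordered ensemble members. For a continuous predictive distribution with CDF *F*_{X}, its analog is the histogram of the probability integral transform (PIT), that is, the predictive CDF evaluated at the observation *y* (Berkowitz 2001):

$$
p = F_{X}(y) = \int_{-\infty}^{y} f_{X}(x)\, dx .
$$

The *β* score refers to the *β* distribution with parameters *α* and *β* fitted to the histogram by MLE or method of moments (Keller and Hense 2011):

$$
\beta_{S} = 1 - \sqrt{\frac{1}{\alpha \beta}} .
$$

A flat histogram (*α* = *β* = 1) yields *β*_{S} = 0, whereas a negative (positive) *β* score indicates an underdispersive (overdispersive) ensemble. Note that an inner bimodal structure of the ARH is not reproduced by the *β* score. The ML estimation of the two parameters *α* and *β* provides confidence intervals, which can be used to calculate confidence intervals for *β*_{S}.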

This method for verifying potential reliability can be used for the unprocessed as well as for the postprocessed ensemble forecasts using the PIT representation.
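A method-of-moments version of the *β* score can be sketched in Python as follows. The sketch assumes the form *β*_{S} = 1 − (*αβ*)^{−1/2} for the score of Keller and Hense (2011) and fits the *β* distribution to a sample of PIT values (or normalized ranks) in [0, 1]; the function name and the toy samples are hypothetical.

```python
import math


def beta_score(pit_values):
    """Method-of-moments fit of a beta distribution and the resulting beta score.

    Returns (alpha, beta, score); the score formula is an assumed form.
    """
    n = len(pit_values)
    mean = sum(pit_values) / n
    var = sum((p - mean) ** 2 for p in pit_values) / n
    # Standard moment estimators of the two beta distribution parameters.
    common = mean * (1.0 - mean) / var - 1.0
    alpha = mean * common
    beta = (1.0 - mean) * common
    return alpha, beta, 1.0 - math.sqrt(1.0 / (alpha * beta))


# Hypothetical PIT samples: observations mostly near the ensemble extremes
# (U-shaped histogram) and mostly near the ensemble center (peaked histogram).
u_shaped = [0.05, 0.05, 0.1, 0.9, 0.95, 0.95]
peaked = [0.4, 0.45, 0.5, 0.5, 0.55, 0.6]

score_under = beta_score(u_shaped)[2]
score_over = beta_score(peaked)[2]
```

The U-shaped sample gives a negative score (underdispersive ensemble) and the peaked sample a positive one (overdispersive ensemble), in line with the sign convention described above.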

#### 2) Continuous ranked probability score and energy score

As a standard, univariate verification method for continuous and real-valued variables, the CRPS is a proper scoring rule. It is the integral of the Brier score for probabilistic forecasts and thus can be decomposed into reliability, resolution and uncertainty (e.g., Grimit et al. 2006; Hersbach 2000). The following description refers to the mathematical explanation of Gneiting and Raftery (2007) and Gneiting et al. (2008).

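The score function *s* of the CRPS is a distance measure between the probabilistic forecast with the CDF *F*_{X}(*x*) and the observation *y*:

$$
s_{\mathrm{crps}}(F_{X}, y) = \int_{-\infty}^{\infty} \left[F_{X}(x) - H(x - y)\right]^{2} dx , \tag{13}
$$

where *y* is the verifying observation and *H*(*x* − *y*) is the Heaviside function that is 1 for *x* − *y* ≥ 0 and 0 otherwise. The actual CRPS is obtained as an average over all observations:

$$
\mathrm{CRPS}(F_{X}, F_{Y}) = \int_{-\infty}^{\infty} s_{\mathrm{crps}}(F_{X}, y)\, f_{Y}(y)\, dy \approx \frac{1}{K} \sum_{k=1}^{K} s_{\mathrm{crps}}(F_{X,k}, y_{k}) , \tag{14}
$$

where *F*_{Y}(*y*) is the CDF of the density function of the observations *f*_{Y}(*y*) and the last identity shows that the CRPS is estimated as the average over all score function values derived from the *K* available observations *y*_{k} and predictive CDFs *F*_{X,k}. The score function *s*_{crps} can also be expressed as (Baringhaus and Franz 2004)

$$
s_{\mathrm{crps}}(F_{X}, y) = E_{X}\{|X - y|\} - \frac{1}{2} E_{X}\{|X - X'|\} , \tag{15}
$$

where *E*_{X}{ } denotes the expectation value with respect to the PDF of the forecasts *f*_{X}(*x*), and *X* and *X*′ are two independent random variables sharing the same density *f*_{X}(*x*). The first term describes a divergence between forecast and observation, whereas the second term describes a forecast entropy that determines the information content of the ensemble (cf. Hersbach 2000). Both the CRPS score function and the CRPS itself yield a summary verification measure with the unit of the investigated forecast variable, and smaller values indicate better forecasts. Analytical solutions of the integral in Eq. (13) are given in Grimit et al. (2006).

The continuous ranked probability skill score (CRPSS) relates the CRPS of the forecast to that of the climatological reference forecast, where CRPS(*F*_{X}, *F*_{Y}) and CRPS(*F*_{clim}, *F*_{Y}) are calculated according to the last identity in Eq. (14):

$$
\mathrm{CRPSS} = 1 - \frac{\mathrm{CRPS}(F_{X}, F_{Y})}{\mathrm{CRPS}(F_{\mathrm{clim}}, F_{Y})} . \tag{16}
$$

The energy score generalizes the CRPS to multivariate quantities. Its score function for a multivariate forecast **X** and observation **y** is

$$
s_{\mathrm{es}}(f_{\mathbf{X}}, \mathbf{y}) = E_{\mathbf{X}}\{\|\mathbf{X} - \mathbf{y}\|\} - \frac{1}{2} E_{\mathbf{X}}\{\|\mathbf{X} - \mathbf{X}'\|\} , \tag{17}
$$

where ‖·‖ denotes the Euclidean norm and **X** and **X**′ are independent random vectors with distribution *f*_{X}(**x**). Drawing *M* Monte Carlo realizations **x**_{j} from the predictive PDF *f*_{X}(**x**), the energy score function can be estimated as (Gneiting et al. 2008)

$$
\hat{s}_{\mathrm{es}}(f_{\mathbf{X}}, \mathbf{y}) = \frac{1}{M} \sum_{j=1}^{M} \|\mathbf{x}_{j} - \mathbf{y}\| - \frac{1}{2M^{2}} \sum_{i=1}^{M} \sum_{j=1}^{M} \|\mathbf{x}_{i} - \mathbf{x}_{j}\| . \tag{18}
$$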

The energy score is obtained by averaging the energy score function over all *K* available observations **y**_{k} and predictive PDFs *f*_{X,k}(**x**), analogous to Eq. (14).
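The Monte Carlo estimate and the averaging step can be sketched as follows. This assumes the double-sum form of the sample estimator with the Euclidean norm; the function names and the returned tuple layout are hypothetical choices made for illustration.

```python
import math

def energy_score(samples, y):
    """Monte Carlo estimate of E||X - y|| - 0.5 * E||X - X'||.

    samples: list of M realizations, each a list of d values.
    y: observed vector of length d.
    Returns (score, divergence, entropy) so both terms can be inspected.
    """
    m = len(samples)
    divergence = sum(math.dist(x, y) for x in samples) / m
    entropy = sum(math.dist(a, b) for a in samples for b in samples) / (2.0 * m * m)
    return divergence - entropy, divergence, entropy

def mean_energy_score(sample_sets, observations):
    """Average of the energy score function over K forecast-observation pairs."""
    scores = [energy_score(s, y)[0] for s, y in zip(sample_sets, observations)]
    return sum(scores) / len(scores)
```

In one dimension the estimate coincides with the sample form of the CRPS score function, which reflects the energy score being its multivariate generalization. The double sum costs on the order of M² distance evaluations per forecast, which is manageable for sample sizes of about 1000.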

An energy skill score (ESS) can be defined in complete analogy to the CRPSS. Furthermore, an equivalent partitioning into a divergence between forecast and observations (first term) and a forecast entropy (second term) of Eq. (17) is possible. The interpretation of the energy score is similar: the smaller the energy score, the better the forecast. The value of the energy score illustrates the total accuracy of the multivariate forecast. Furthermore, the mean divergence between forecast and observation can be interpreted with respect to the dimensionality of the multivariate forecast. For instance, consider temperature forecasts over 10 days of lead time that result in an energy score of 6°C with a divergence of 10°C and a forecast entropy of 4°C. The forecast entropy can be interpreted as a 10-day mean ensemble spread of 0.4°C (which is usually not the case for weather forecasts where we observe heteroscedasticity, cf. Fig. 1) and the divergence as a root-mean-square error (RMSE) of 1°C over the lead time. That implies that the deviation of the forecast from the observation is on average 2.5 times higher than the uncertainty of the ensemble.
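The arithmetic of this example can be written out explicitly. Here $D$ and $S$ are shorthand introduced only for this illustration, denoting the divergence term and the forecast entropy term (including its factor of one half), and $d = 10$ is the number of lead days. Following the reading used in the text, each term is divided by $d$ to obtain a per-day value:

$$
\begin{aligned}
s_{\mathrm{es}} &= D - S = 10 - 4 = 6\ {}^{\circ}\mathrm{C}, \\
D / d &= 10 / 10 = 1\ {}^{\circ}\mathrm{C}, \qquad S / d = 4 / 10 = 0.4\ {}^{\circ}\mathrm{C}, \\
D / S &= 10 / 4 = 2.5 .
\end{aligned}
$$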

## 4. Results

In this section we study the results of the ensemble postprocessing used to derive predictive PDFs, as well as their verification when applied to the data described in section 2. All computations were performed using the R statistical software system (R Core Team 2013) and the GLASSO package (Friedman et al. 2011). The focus of this study is on the application and the advancement of multivariate postprocessing and verification methods. We do not consider the results a final assessment of the participating forecasting systems because of the limited sample period 1 July 2010–30 June 2011. Nevertheless, we assume 365 model runs to be a representative sample size to present the advantages of the multivariate approach. Additionally, we emphasize that the forecasts are not calibrated with respect to observations.

### a. Ensemble postprocessing

The estimated covariance matrices for the Gaussian distribution fit and the ensemble kernel dressing show similar patterns. Silverman’s factor of 0.79 for 20 ensemble members and 16 lead days produces a variance of the EKD of the same order of magnitude as that of the GDF. The structures of the estimated covariances are similar, but the prewhitening effect of the EKD approach has to be emphasized. Neither type of covariance matrix exhibits a Toeplitz form, as is the case for climate simulations and stationary forecasts (Schölzel and Hense 2011).
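The order of magnitude of the quoted factor can be checked with a short sketch. It assumes the common multivariate rule-of-thumb form of Silverman's bandwidth factor for a Gaussian kernel; the exact expression used in the study is defined in its methods section and may differ in detail, and the function name is hypothetical.

```python
def silverman_factor(n_members, n_dims):
    """Rule-of-thumb bandwidth factor (4 / ((d + 2) * n)) ** (1 / (d + 4))."""
    return (4.0 / ((n_dims + 2.0) * n_members)) ** (1.0 / (n_dims + 4.0))

# 20 ensemble members, 16 lead days
h = silverman_factor(20, 16)
```

Under this assumption the factor comes out at roughly 0.8 for 20 members and 16 dimensions, in line with the value of 0.79 quoted above. Because the exponent 1/(d + 4) is small for d = 16, the factor depends only weakly on the ensemble size.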

By construction, both types of covariance matrices are of full rank. The precision matrices can therefore be estimated directly and compared to the matrices parameterized with GLASSO. One example of the ML-estimated covariance and its inverse, as well as two applications of GLASSO to the precision matrix, is shown in Fig. 2. The figure illustrates the estimated covariance matrix based on the original 20 ensemble realizations from one temperature forecast (29 November 2010) over 16 lead days of the NCEP GEFS for the Hamburg-Fuhlsbuettel station. Figure 2a depicts the ML-estimated covariance and Fig. 2d its partial correlations. It is evident that the lead-day forecasts are in part highly (positively and negatively) correlated, which could indicate a biased estimator of the precision matrix based on the inverse of the ML estimator. The GLASSO algorithm is then applied to the precision matrix related to the partial correlations depicted in Fig. 2d. The result is shown in Figs. 2e,f: with increasing regularization parameter *ρ*, the partial correlations decrease. The application of GLASSO apparently selects relevant partial covariances and removes the irrelevant ones, thereby increasing the variances. This results in an increased sparseness with increasing regularization parameter. By inspection of all results, we found that the size of the regularization parameter at which the sparseness of the precision matrix reaches a specific level depends on the forecast run, the forecast station, and the predicted parameter, without obvious patterns. Inverting the parameterized precision matrices results in covariance matrices comparable to the original ML estimation.
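The link between a precision matrix and the partial correlations shown in such figures can be sketched as follows. This uses the standard relation between precision entries and partial correlations; it does not implement GLASSO itself, and the function names and the sparseness measure are hypothetical choices for illustration.

```python
import math

def partial_correlations(precision):
    """Partial correlations -P[i][j] / sqrt(P[i][i] * P[j][j]) from a precision matrix P."""
    d = len(precision)
    pcor = [[1.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            if i != j:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                pcor[i][j] = -precision[i][j] / scale
    return pcor

def sparseness(precision, tol=1e-12):
    """Fraction of off-diagonal precision entries that are numerically zero."""
    d = len(precision)
    zeros = sum(
        1 for i in range(d) for j in range(d)
        if i != j and abs(precision[i][j]) <= tol
    )
    return zeros / (d * (d - 1))
```

A zero off-diagonal entry in the precision matrix corresponds to a vanishing partial correlation between the two lead days, so the sparseness measure counts the lead-day pairs that a regularized estimate treats as conditionally independent.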

With a full-rank precision matrix available, the fitted predictive densities are completely defined. This offers the possibility of using Monte Carlo sampling to generate additional ensemble realizations. This is used for estimating the energy score function [cf. Eq. (17)] in both the EKD and GDF cases based on 1000 samples from the respective multivariate PDF.
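For the Gaussian case, the sampling step can be sketched with a Cholesky factorization of the covariance matrix (the inverse of the precision matrix). This is a minimal illustration under the assumption of a symmetric positive definite covariance; the function names, the fixed seed, and the 2 × 2 example values are hypothetical and not taken from the study.

```python
import math
import random

def cholesky(cov):
    """Lower-triangular factor L with L * L^T = cov (cov must be positive definite)."""
    d = len(cov)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(cov[i][i] - s)
            else:
                low[i][j] = (cov[i][j] - s) / low[j][j]
    return low

def sample_gaussian(mean, cov, n_samples, seed=0):
    """Draw n_samples realizations from N(mean, cov) as mean + L * z, z ~ N(0, I)."""
    rng = random.Random(seed)
    low = cholesky(cov)
    d = len(mean)
    samples = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        samples.append(
            [mean[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
        )
    return samples

# Illustrative two-dimensional example with 1000 realizations
draws = sample_gaussian([0.0, 0.0], [[4.0, 2.0], [2.0, 3.0]], 1000)
```

The square root in the factorization fails if the matrix is not positive definite, which is one practical reason a full-rank estimate is needed before sampling.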

### b. Verification

In this section we present the results of the verification methods. First, the dispersion of the ensemble prediction systems and their calibration/reliability are illustrated by the ARH and the respective *β* score. Second, we compare the univariate verification for each lead day using the standard univariate CRPS with the multivariate verification over the sample period using the energy score. We find a large variability of the individual energy score function values between July 2010 and June 2011 for all four EPS.

#### 1) Analysis rank histogram and *β* score

The ARH is based on the empirical distribution function and illustrates the dispersion of the predictive PDF relative to the observations. The *β* score condenses this dispersion into a single value. Figure 3 shows the resulting *β* scores for the original ensemble forecasts, which are not calibrated with respect to dispersion, from all examined EPS for temperature forecasts in Hamburg, Germany. Especially at the beginning of the lead time, the CMC GEPS exhibits the highest dispersion.
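The counting step behind a rank histogram can be sketched as follows. This is a generic illustration with hypothetical function names; it ignores ties between members and the verifying value, and it does not compute the *β* score, whose definition is given in the methods section and is not reproduced here.

```python
def observation_rank(members, obs):
    """Rank of the verifying value among the M ensemble members (1 .. M + 1)."""
    return 1 + sum(1 for x in members if x < obs)

def rank_histogram(ensembles, observations):
    """Counts of ranks over all forecast cases; a flat histogram indicates good dispersion."""
    m = len(ensembles[0])
    counts = [0] * (m + 1)
    for members, obs in zip(ensembles, observations):
        counts[observation_rank(members, obs) - 1] += 1
    return counts
```

An underdispersive ensemble places the verifying value too often below the lowest or above the highest member, which shows up as a U-shaped histogram.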

The *β* score characterizing the analysis rank histogram for the original ensemble forecasts (NCEP GEFS, Met Office MOGREPS, ECMWF EPS, and CMC GEPS) for 2-m temperature forecasts in Hamburg. The gray shaded contours frame the 90% confidence interval estimated from the MLE of the *β*-distribution parameters.

Citation: Monthly Weather Review 142, 11; 10.1175/MWR-D-14-00015.1

The influence of the postprocessing methods on the dispersion characteristics of the examined EPS and the resulting representation of uncertainty is shown in Figs. 4 and 5. The resulting PIT histograms based on the distribution functions of the fitted GDF predictive density and their *β* scores for the 2-m temperature forecasts with GLASSO applied are shown for Hamburg station at lead days 1, 4, and 9 of the NCEP GEFS forecasts. GLASSO results are shown for systematically varied regularization parameters: *ρ* = 0.1, *ρ* = 0.4, *ρ* = 0.6, and *ρ* = 1. Figure 4 furthermore depicts the *β* score for the EKD-fitted forecasts (without GLASSO applied). It is noted that Figs. 4 and 5 show the result of fitting multivariate distributions to the ensemble for each forecast separately and additionally applying GLASSO to the precision matrix and, subsequently, verifying these predictive densities with observations. Therefore, the resulting densities are not calibrated using observations.
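For a Gaussian predictive density, the PIT value of one lead day is the marginal CDF evaluated at the observation. The sketch below assumes the marginal mean and standard deviation (the square root of the corresponding diagonal element of the covariance matrix) are available; the function names and the choice of ten bins are hypothetical.

```python
import math

def gaussian_pit(mu, sigma, obs):
    """Probability integral transform F(obs) for a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((obs - mu) / (sigma * math.sqrt(2.0))))

def pit_histogram(pit_values, n_bins=10):
    """Bin PIT values in [0, 1]; a value of exactly 1 is placed in the last bin."""
    counts = [0] * n_bins
    for u in pit_values:
        counts[min(int(u * n_bins), n_bins - 1)] += 1
    return counts
```

Because an increased variance pulls PIT values away from 0 and 1, the effect of a larger regularization parameter on the dispersion should be directly visible in such a histogram.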

As in Fig. 3, but after postprocessing with EKD, GDF, and GDF (GLASSO) and different regularization parameters *ρ* for temperature forecasts of NCEP GEFS.

PIT histograms for GDF-estimated NCEP GEFS forecasts of 2-m temperature at the Hamburg station (from top to bottom) for different regularization parameters from GLASSO and (from left to right) for different lead days.

Instead, the GLASSO method leads to increased variance especially at the beginning of the forecast. This is the result of the reduced partial correlations with increased sparseness in the precision matrix. It is clearly visible in the PIT histograms and the respective *β* score for the GDF postprocessed ensembles and has already been demonstrated in Fig. 2.

Comparing the *β* scores for different postprocessing methods with different regularization parameters, it emerges that an increasing regularization parameter leads to increased dispersion. This is evident in Fig. 5 and the respective *β* scores in Fig. 4. It is mostly connected to the highest underdispersion for the first lead days. This gain in calibration by enhanced dispersion is accompanied by an additional bias through the application of GLASSO (cf. Fig. 5, e.g., during the first lead day and with regularization parameter of *ρ* ~ 0.4).

In this study, the EKD and GDF postprocessing methods do not differ considerably, as their resulting scores are not significantly different. Nevertheless, we expect the EKD technique to perform better for forecasts with higher temporal resolution (e.g., hourly) or for spatial fields of adequate resolution.

#### 2) CRPSS and energy score

The CRPS is a univariate verification method that we use to identify the mean accuracy of the probabilistic forecasts for each lead day. With reference to the climatological forecast, the CRPSS in Fig. 6 shows the expected loss of predictability over the lead time. The slow decay of the CRPSS implies predictive skill up to 10 days of lead time for single stations. Only minor differences between the EPS can be identified.
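The skill score relative to climatology follows the usual construction for negatively oriented scores whose perfect value is zero. A minimal sketch, assuming the mean CRPS of the forecast and of the climatological reference are already available per lead day (function names are hypothetical):

```python
def skill_score(score, reference_score):
    """1 - score / reference: 1 is perfect, 0 matches the reference, < 0 is worse."""
    return 1.0 - score / reference_score

def crpss_by_lead_day(crps_forecast, crps_climatology):
    """Skill score per lead day from two equally long lists of mean CRPS values."""
    return [skill_score(f, c) for f, c in zip(crps_forecast, crps_climatology)]
```

The lead day at which the skill score approaches zero marks the point where the forecast no longer improves on the climatological reference.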

Mean continuous ranked probability skill score (CRPSS) for all examined models and both analyzed parameters (gray: pressure anomalies, black: temperature) for postprocessed forecasts with EKD at the Hamburg station.

The energy score is the multivariate generalization of the CRPS and we use it to evaluate the complete forecast run over 10 or 15 days. Thus, it includes the temporal autocorrelation over the lead time. Therefore, we define predictability as a low energy score from multivariate forecasts in time. In Fig. 7 the energy score over 10 lead days is depicted for Frankfurt-Airport station for pressure anomaly forecasts over the sample period and for all examined EPS. The resulting energy score functions for the available forecasts show similar characteristics over the sample period, except for some outliers. Consequently, more predictable and less predictable periods can be identified. For instance, we can find the energy score function of all postprocessed ensemble predictions to be low for three weeks at the end of July and the beginning of August. On the contrary, high values of the divergence between forecast and observation at the end of October 2010 or the end of January 2011 lead to high energy scores and thus indicate periods with lower predictability. A mean divergence between forecast and observation of 18.65 hPa for the ECMWF EPS forecasts over the sample period is comparable to a 10-day RMSE of 1.87 hPa. Comparing the divergence between forecast and observation to the forecast entropy of the ensemble, it is revealed that the divergence term is the dominating term for predictability. In general, the time series of the energy score function indicates periods with lower predictability during autumn and winter until the end of March as both parts of the energy score function, divergence and entropy, are increased. Lower values of the energy score function and thus periods of increased predictability can be identified in summer 2010 and spring 2011.

Energy score function for GDF postprocessed pressure anomaly forecasts for Frankfurt station over the observed period. The energy score function is furthermore separated into the two parts forecast entropy and divergence between forecast and observation [cf. Eq. (17)]. As a reference, the climatological energy score is depicted (gray).

The differences between the multivariate methods applied and the original EPS are negligible. As both terms (divergence and forecast entropy) are increased by the application of the present postprocessing methods, the resulting energy score function does not change significantly. Similar results are obtained by applying GLASSO to the precision matrix. Therefore, the three depicted postprocessing methods do not show large differences for the given parameters and temporal structures, even for a relatively high regularization parameter of *ρ* = 1 within GLASSO. It should be noted that the results could be very different for spatial or intervariable dependencies and especially non-Gaussian variables (e.g., bimodal or skewed distributions).

Similar to the CRPSS, an ESS can be calculated with reference to a climatological forecast. This allows for an objective assessment of predictability. The climatological forecast is a highly reliable forecast with low resolution. As we do not distinguish between reliability and resolution but between divergence and forecast entropy, this is not apparent in Fig. 7. This underlines the difference between divergence and reliability and furthermore between entropy and resolution. Therefore, we focus on predictability as the skill of the energy score over the reference climatological forecast as depicted in Fig. 8. The box-whisker plots and the mean values of the energy skill score confirm the highest predictability of the ECMWF EPS pressure anomaly forecasts with about 50% more skill than the reference climatological forecast although the comparison of all EPS shows only minor differences. Regarding pressure anomaly forecasts, only the ECMWF EPS forecasts exhibit significant skill compared to climate.

Box-whisker plots of energy skill scores for GDF postprocessed (a) temperature and (b) pressure anomaly forecasts for all examined EPS for Stuttgart station over the sample period. The skill score is calculated relative to a climatological forecast. The crosses denote the mean energy skill score.

Comparing the different EPS, only minor differences can be found. The main differences between the EPS can be traced back to the resolution–reliability or divergence–entropy relation. While the CMC GEPS exhibits the highest forecast entropy, lower values of these measures contribute to the resulting energy skill score of ECMWF EPS and Met Office MOGREPS forecasts. Nevertheless, it has to be noted that the significance of the results may be increased using a more sensitive multivariate score with increased discrimination ability. In conclusion, the investigated mean and median energy skill scores exhibit some differences between the examined forecasts. In particular, the predictive distribution estimated from the CMC GEPS is found to contain less information, and the Met Office MOGREPS forecasts exhibit the lowest divergence and entropy. Finally, the energy skill score for the ECMWF EPS forecasts indicates the highest predictability.

## 5. Additional benefit of the multivariate assessment

The additional benefit of the multivariate postprocessing and verification techniques can be defined using (i) the univariate CRPSS and (ii) its multivariate analog, the energy score.

### a. CRPSS

A univariate Gaussian distribution *N*(*μ*, *σ*) with mean *μ* and standard deviation *σ* is fitted to each ensemble forecast for each lead day separately and we draw 1000 realizations from this distribution. The results are related to the multivariate GDF but ignore the temporal correlations in the forecast. The two sets of samples (univariate and multivariate) are compared using the CRPSS. In this case, the CRPS of the multivariate sample is related to the CRPS of the univariate sample. In other words, a positive (negative) CRPSS indicates more (less) skill of the multivariate postprocessing compared to the univariate postprocessing. Furthermore, we compare the multivariate postprocessing procedures including GLASSO to the univariate analogs. Figure 9 shows the respective CRPSS for the postprocessing of two ensemble forecasts (NCEP GEFS and Met Office MOGREPS) and thus indicates the additional benefit of the multivariate postprocessing methods compared to univariate methods for each lead day separately. The CRPSS exhibits positive skill for the multivariate postprocessing especially during the first few lead days. Additionally, the CRPSS shows more skill using higher regularization parameters in GLASSO (cf. *β* score in Fig. 4). Nevertheless, the multivariate postprocessing methods can have less skill than the respective univariate methods, especially at the end of the lead time.
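The univariate reference described here can be sketched as a per-lead-day fit followed by independent sampling. This is a minimal illustration with hypothetical function names; it uses the sample standard deviation (divisor M − 1), whereas a maximum likelihood fit would use the divisor M.

```python
import random
import statistics

def fit_marginals(ensemble):
    """Fit (mean, standard deviation) per lead day.

    ensemble: list of M members, each a list with one value per lead day.
    """
    n_days = len(ensemble[0])
    params = []
    for day in range(n_days):
        values = [member[day] for member in ensemble]
        params.append((statistics.mean(values), statistics.stdev(values)))
    return params

def sample_independent(params, n_samples=1000, seed=0):
    """Draw realizations lead day by lead day, ignoring temporal correlations."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n_samples)]
```

Because each lead day is sampled independently, the resulting realizations share the marginal means and spreads of the ensemble but carry no temporal correlation, which is exactly the information the comparison is designed to isolate.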

CRPSS of multivariate postprocessed (GDF) temperature forecasts related to univariate postprocessed forecasts (univariate GDF) of NCEP GEFS and Met Office MOGREPS at Dresden station.

### b. Energy score

By construction, the sum of the energy scores of all investigated stations is comparable to a spatiotemporal combination of forecasts. The spatiotemporal combination of the forecasts is an 80-dimensional forecast (8 stations, each with 10 days of lead time) that considers autocorrelation not only in time but also between the stations. The same multivariate postprocessing methods as in section 4 have been applied, although it has to be mentioned that the application of GLASSO is necessary because the resulting 80-dimensional matrix is not invertible. Having no more than 20 realizations (e.g., for NCEP GEFS forecasts), we applied GLASSO with a small regularization parameter of *ρ* = 0.1. The resulting probability density is verified with observations using the energy score. This energy score is compared to the sum of the energy scores of all stations (postprocessed with a regularization parameter of *ρ* = 0.1). The difference between the two energy scores is due to the spatial autocorrelation between the stations. Figure 10 shows box-whisker plots of the respective energy skill scores of the energy score including spatial and temporal autocorrelation relative to the energy score that is obtained from eight stations with only temporal autocorrelation included. The ESS is positive at about 0.6 for both variables, temperature and pressure anomalies. These results show two aspects. First, they underline the additional benefit of considering physical autocorrelation. At least for the present setup, the verification of the probabilistic field forecasts including all correlations yields better verification scores than the one that excludes the correlation. Second, the example shows that the application of the proposed postprocessing and verification methods to spatiotemporal forecasts is readily possible. This encourages further applications of multivariate postprocessing and verification methods.
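The reason the 80-dimensional matrix cannot be inverted follows from a rank argument for sample covariance matrices. As a brief sketch, with $\hat{\boldsymbol{\Sigma}}$ denoting the covariance estimated from $M$ members in $d$ dimensions after subtracting the ensemble mean (notation introduced here for illustration):

$$
\operatorname{rank}(\hat{\boldsymbol{\Sigma}}) \le \min(d, M - 1) = \min(80, 19) = 19 < 80 .
$$

With $M = 20$ members the estimate therefore has at least 61 zero eigenvalues, and a regularized estimate of the precision matrix such as GLASSO is needed in place of a direct inverse.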

Box-whisker plots of the energy skill score of the spatiotemporal combination compared to the sum of the (a) temporal postprocessed temperature and (b) pressure anomaly forecasts.

## 6. Conclusions

The accuracy of probabilistic weather forecasts can be addressed through the use of proper scores and their scoring rules. Although a number of studies use ensemble forecasts for probabilistic forecasting, the available methods are usually applied on single grid point, single lead time, and single variable levels using univariate statistics. This does not account for temporal and spatial correlations that are a result of the physical interactions between grid points, lead times, or variables.

In this study we applied and evaluated two multivariate approaches that account for temporal correlations and that are readily applicable to spatial as well as intervariable correlation structures. Our approaches combine multivariate postprocessing methods, which extract the probability information from the original ensemble realizations, with a multivariate probabilistic verification method based on the energy score. The aim is to demonstrate that multivariate ensemble postprocessing and verification methods and their application to different EPS are able to reveal different forecast characteristics such as changes in predictability or similarities among different EPS. Analogous to previously known univariate methods, multivariate versions of a Gaussian distribution fit and a Gaussian kernel dressing are used in the postprocessing step, which is independent of any observations. The application of GLASSO as a parameterization of the inverse covariance matrices is a relatively new extension of these methods based on the estimation of sparse covariances. The results are noncalibrated multidimensional predictive probability density functions. The probabilistic verification applied includes univariate measures like the CRPS and the *β* score for analysis rank histograms. The multivariate aspect is assessed using the energy score, which is the natural multivariate extension of the CRPS. The forecast exhibits predictability if its energy score is lower than the respective energy score of the reference climatological forecast.

The proposed multivariate ensemble postprocessing methods applied to EPS from the TIGGE database for the one-year period from July 2010 to June 2011 yield reasonable results, although the respective multivariate distributions are not calibrated using observations. In the present case, the results of the Gaussian distribution fit (GDF) and the ensemble kernel dressing (EKD) do not differ considerably, although we expect the EKD to be more flexible if applied to different temporal and spatial scales and regarding observational uncertainty. Both methods exhibit high partial correlations between lead days. In cases where GLASSO is applied to the estimated precision matrices, an increasing regularization parameter enforces decreasing amplitudes of the partial correlations, which in turn leads to increased variances. The selection of an adequate regularization parameter for GLASSO offers the possibility to increase the variance and thus the spread of the ensemble forecast. This effect is evident through an increased *β* score for the respective PIT histograms. Although divergence (and reliability) is increased, it is accompanied by an increased forecast entropy (and resolution), as obtained by the decomposition of the energy score. In general, we regard GLASSO as a valuable estimate of sparse covariance matrices. Additionally, GLASSO might be used to calibrate the predictive density by optimizing the regularization parameter with respect to observations.

The application of the presented methods to four major global EPS, with a focus on station forecasts over the lead time, reveals an overall large similarity between the four EPS. In general, calculating the CRPS of forecasts and the climatological reference for both examined parameters (i.e., near-surface temperature and pressure anomaly) for eight synoptic stations in Germany shows probabilistic skill over at least 9 lead days, and for the temperature forecasts even longer, over almost 16 days. The multivariate verification in terms of an energy skill score relative to climate indicates a slight advantage in predictability for the ECMWF EPS forecasts. Nevertheless, given the day-to-day variability of the energy score function, no statistically valid statement can be made at present except that most changes of predictability are visible through the divergence between forecast and observation. The forecast entropy, which is related to forecast sharpness, fluctuates less over the sample period.

Furthermore, the additional benefit of multivariate approaches can be demonstrated by the energy score under inclusion/exclusion of certain dependencies such as spatial correlations. For the datasets used in this study, the verification of probabilistic forecasts of spatiotemporal fields including all correlations provides better scores than the verification that excludes these correlations. For future work, other multivariate verification methods or scores are needed in order to evaluate the multivariate aspect in more detail (e.g., in terms of known physical processes). Additionally, the discrimination ability of the energy score has to be addressed.

For further work we encourage the application and development of multivariate probabilistic postprocessing methods, since the multivariate PDF contains spatiotemporal dependencies, which are an important element of the physical basis of weather forecasts. In conclusion, temporal multivariate probabilistic postprocessing of medium-range weather forecasts provides new insights into the behavior of ensemble forecasts and allows for an assessment of the integral predictability of time sequences of forecasts.

## Acknowledgments

This research was carried out in the Hans-Ertel Centre for Weather Research. This research network of universities, research institutes, and the Deutscher Wetterdienst is funded by the BMVI (Federal Ministry of Transport and Digital Infrastructure). We thank Deutscher Wetterdienst for the provision of the observational data used in this study. The TIGGE data have been accessed through the European Centre for Medium-Range Weather Forecasts (ECMWF) data portal. The authors would also like to credit the contributors of the R Project.

## REFERENCES

Anderson, J., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations.

,*J. Climate***9**, 1518–1530, doi:10.1175/1520-0442(1996)009<1518:AMFPAE>2.0.CO;2.Anderson, T. W., 1958:

*An Introduction to Multivariate Statistical Analysis.*Vol. 2. Wiley, 374 pp.Baringhaus, L., and C. Franz, 2004: On a new multivariate two-sample test.

,*J. Multivar. Anal.***88**, 190–206, doi:10.1016/S0047-259X(03)00079-4.Berkowitz, J., 2001: Testing density forecasts, with applications to risk management.

,*J. Bus. Econ. Stat.***19**, 465–474, doi:10.1198/07350010152596718.Bougeault, P., and Coauthors, 2010: The THORPEX Interactive Grand Global Ensemble.

,*Bull. Amer. Meteor. Soc.***91**, 1059–1072, doi:10.1175/2010BAMS2853.1.Bröcker, J., 2009: Reliability, sufficiency, and the decomposition of proper scores.

,*Quart. J. Roy. Meteor. Soc.***135**, 1512–1519, doi:10.1002/qj.456.Bröcker, J., and L. A. Smith, 2008: From ensemble forecasts to predictive distribution functions.

,*Tellus***60A**, 663–678, doi:10.1111/j.1600-0870.2008.00333.x.Candille, G., and O. Talagrand, 2005: Evaluation of probabilistic prediction systems for a scalar variable.

,*Quart. J. Roy. Meteor. Soc.***131**, 2131–2150, doi:10.1256/qj.04.71.Chervin, R. M., 1981: On the comparison of observed and GCM simulated climate ensembles.

,*J. Atmos. Sci.***38**, 885–901, doi:10.1175/1520-0469(1981)038<0885:OTCOOA>2.0.CO;2.DelSole, T., and M. K. Tippett, 2007: Predictability: Recent insights from information theory.

,*Rev. Geophys.***45**, RG4002, doi:10.1029/2006RG000202.Friedman, J., T. Hastie, and R. Tibshirani, 2008: Sparse inverse covariance estimation with the lasso.

,*Biostatistics***9**, 432–441, doi:10.1093/biostatistics/kxm045.Friedman, J., T. Hastie, and R. Tibshirani, 2011: glasso: Graphical lasso-estimation of Gaussian graphical models, version 1.7. R package. [Available online at http://CRAN.R-project.org/package=glasso.]

Gneiting, T., and A. E. Raftery, 2007: Strictly proper scoring rules, prediction, and estimation. *J. Amer. Stat. Assoc.*, **102**, 359–378, doi:10.1198/016214506000001437.

Gneiting, T., A. E. Raftery, A. H. Westveld III, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. *Mon. Wea. Rev.*, **133**, 1098–1118, doi:10.1175/MWR2904.1.

Gneiting, T., L. I. Stanberry, E. P. Grimit, L. Held, and N. A. Johnson, 2008: Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. *Test*, **17**, 211–235, doi:10.1007/s11749-008-0114-x.

Grimit, E. P., T. Gneiting, V. J. Berrocal, and N. A. Johnson, 2006: The continuous ranked probability score for circular variables and its application to mesoscale forecast ensemble verification. *Quart. J. Roy. Meteor. Soc.*, **132**, 2925–2942, doi:10.1256/qj.05.235.

Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. *Mon. Wea. Rev.*, **129**, 550–560, doi:10.1175/1520-0493(2001)129<0550:IORHFV>2.0.CO;2.

Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta-RSM short-range ensemble forecasts. *Mon. Wea. Rev.*, **125**, 1312–1327, doi:10.1175/1520-0493(1997)125<1312:VOERSR>2.0.CO;2.

Hasselmann, K., 1993: Optimal fingerprints for the detection of time-dependent climate change. *J. Climate*, **6**, 1957–1971, doi:10.1175/1520-0442(1993)006<1957:OFFTDO>2.0.CO;2.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. *Wea. Forecasting*, **15**, 559–570, doi:10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.

Jolliffe, I. T., and D. B. Stephenson, 2011: *Forecast Verification: A Practitioner’s Guide in Atmospheric Science*. 2nd ed. John Wiley & Sons, 292 pp.

Jonko, A. K., A. Hense, and J. J. Feddema, 2010: Effects of land cover change on the tropical circulation in a GCM. *Climate Dyn.*, **35**, 635–649, doi:10.1007/s00382-009-0684-7.

Keller, J. D., and A. Hense, 2011: A new non-Gaussian evaluation method for ensemble forecasts based on analysis rank histograms. *Meteor. Z.*, **20**, 107–117, doi:10.1127/0941-2948/2011/0217.

Mazumder, R., and T. Hastie, 2012: The graphical lasso: New insights and alternatives. *Electron. J. Stat.*, **6**, 2125–2149, doi:10.1214/12-EJS740.

Meinshausen, N., and P. Bühlmann, 2006: High-dimensional graphs and variable selection with the lasso. *Ann. Stat.*, **34**, 1436–1462, doi:10.1214/009053606000000281.

Menéndez, P., Y. A. I. Kourmpetis, C. J. F. ter Braak, and F. A. van Eeuwijk, 2010: Gene regulatory networks from multifactorial perturbations using Graphical Lasso: Application to the DREAM4 challenge. *PLoS ONE*, **5**, e14147, doi:10.1371/journal.pone.0014147.

Möller, A., A. Lenkoski, and T. L. Thorarinsdottir, 2013: Multivariate probabilistic forecasting using ensemble Bayesian model averaging and copulas. *Quart. J. Roy. Meteor. Soc.*, **139**, 982–991, doi:10.1002/qj.2009.

Murphy, A. H., 1973: A new vector partition of the probability score. *J. Appl. Meteor.*, **12**, 595–600, doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.

Murphy, A. H., 1991: Forecast verification: Its complexity and dimensionality. *Mon. Wea. Rev.*, **119**, 1590–1601, doi:10.1175/1520-0493(1991)119<1590:FVICAD>2.0.CO;2.

Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. *Wea. Forecasting*, **8**, 281–293, doi:10.1175/1520-0434(1993)008<0281:WIAGFA>2.0.CO;2.

Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. *Mon. Wea. Rev.*, **115**, 1330–1338, doi:10.1175/1520-0493(1987)115<1330:AGFFFV>2.0.CO;2.

Murphy, A. H., and E. S. Epstein, 1989: Skill scores and correlation coefficients in model verification. *Mon. Wea. Rev.*, **117**, 572–582, doi:10.1175/1520-0493(1989)117<0572:SSACCI>2.0.CO;2.

Packard, N. H., J. P. Crutchfield, J. D. Farmer, and R. S. Shaw, 1980: Geometry from a time series. *Phys. Rev. Lett.*, **45**, 712–716, doi:10.1103/PhysRevLett.45.712.

Paeth, H., R. Girmes, G. Menz, and A. Hense, 2006: Improving seasonal forecasting in the low latitudes. *Mon. Wea. Rev.*, **134**, 1859–1879, doi:10.1175/MWR3149.1.

R Core Team, 2013: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [Available online at http://www.R-project.org/.]

Röpnack, A., A. Hense, C. Gebhardt, and D. Majewski, 2013: Bayesian model verification of NWP ensemble forecasts. *Mon. Wea. Rev.*, **141**, 375–387, doi:10.1175/MWR-D-11-00350.1.

Schölzel, C., and A. Hense, 2011: Probabilistic assessment of regional climate change in Southwest Germany by ensemble dressing. *Climate Dyn.*, **36**, 2003–2014, doi:10.1007/s00382-010-0815-1.

Schuhen, N., T. L. Thorarinsdottir, and T. Gneiting, 2012: Ensemble model output statistics for wind vectors. *Mon. Wea. Rev.*, **140**, 3204–3219, doi:10.1175/MWR-D-12-00028.1.

Silverman, B. W., 1986: *Density Estimation for Statistics and Data Analysis.* Chapman and Hall, 176 pp.

Stenseth, N. C., G. Ottersen, J. W. Hurrell, A. Mysterud, M. Lima, K. S. Chan, N. G. Yoccoz, and B. Adlandsvik, 2003: Studying climate effects on ecology through the use of climate indices: The North Atlantic Oscillation, El Niño Southern Oscillation and beyond. *Proc. Roy. Soc. London*, **270B**, 2087–2096, doi:10.1098/rspb.2003.2415.

Stephenson, D. B., and F. J. Doblas-Reyes, 2000: Statistical methods for interpreting Monte Carlo ensemble forecasts. *Tellus*, **52A**, 300–322, doi:10.1034/j.1600-0870.2000.d01-5.x.

Takens, F., 1981: Detecting strange attractors in turbulence. *Dynamical Systems and Turbulence, Warwick 1980*, D. Rand and L.-S. Young, Eds., Lecture Notes in Mathematics, Vol. 898, Springer, 366–381.

Tsyplakov, A., 2013: Evaluation of probabilistic forecasts: Proper scoring rules and moments. Novosibirsk State University, Dept. of Economics Working Papers, 21 pp. [Available online at http://ssrn.com/abstract=2236605.]

von Storch, H., and F. W. Zwiers, 1999: *Statistical Analysis in Climate Research.* Cambridge University Press, 484 pp.

Whitaker, J. S., and A. F. Loughe, 1998: The relationship between ensemble spread and ensemble mean skill. *Mon. Wea. Rev.*, **126**, 3292–3302, doi:10.1175/1520-0493(1998)126<3292:TRBESA>2.0.CO;2.

Wilks, D. S., 2002: Smoothing forecast ensembles with fitted probability distributions. *Quart. J. Roy. Meteor. Soc.*, **128**, 2821–2836, doi:10.1256/qj.01.215.