## Abstract

Proper scoring rules provide a theoretically principled framework for the quantitative assessment of the predictive performance of probabilistic forecasts. While a wide selection of such scoring rules for univariate quantities exists, there are only a few scoring rules for multivariate quantities, and many of them require that forecasts are given in the form of a probability density function. The energy score, a multivariate generalization of the continuous ranked probability score, is the only commonly used score that is applicable in the important case of ensemble forecasts, where the multivariate predictive distribution is represented by a finite sample. Unfortunately, its ability to detect incorrectly specified correlations between the components of the multivariate quantity is somewhat limited. In this paper the authors present an alternative class of proper scoring rules based on the geostatistical concept of variograms. The sensitivity of these variogram-based scoring rules to incorrectly predicted means, variances, and correlations is studied in a number of examples with simulated observations and forecasts; they are shown to be distinctly more discriminative with respect to the correlation structure. This conclusion is confirmed in a case study with postprocessed wind speed forecasts at five wind park locations in Colorado.

## 1. Introduction

During the last two decades a paradigm shift has occurred in the practice of numerical weather prediction (NWP). To account for the various sources of uncertainty in the NWP model output, ensemble prediction systems were developed and have now become state of the art in meteorological forecasting (Buizza et al. 2005; Lewis 2005; Leutbecher and Palmer 2008). Those ensemble forecasts aim to represent the range of possible outcomes, and probabilistic statements like the probability of exceeding a certain amount of precipitation can be derived from them and help in making informed decisions.

Along with the availability of probabilistic forecasts comes the need for both diagnostic and quantitative methods to assess the quality of those forecasts and to compare the performance of competing forecasters. A probabilistic forecast should be calibrated (i.e., statistically consistent with the values that materialize) and sharp (i.e., very specific about the anticipated weather; Gneiting et al. 2007). Sharpness can be assessed via numerical and graphical summaries of the width of the prediction intervals that come with a predictive probability distribution. The notion of calibration is more complex, and different types of calibration have been established. Marginal calibration measures the similarity of the aggregated predictive distribution and the climatological distribution of the predictand, and can be checked by comparing the average predictive cumulative distribution function (CDF) with the empirical CDF of the observations (Gneiting et al. 2007). Probabilistic calibration concerns the dynamical aspects of probabilistic forecasts and can be assessed by studying verification rank histograms (Anderson 1996; Hamill and Colucci 1997; Hamill 2001).
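The verification rank histogram mentioned above is straightforward to compute for an ensemble: for each forecast case, one records the rank of the observation among the pooled observation-plus-members values; a flat histogram indicates probabilistic calibration. The following sketch (function name and the random tie-breaking convention are our own choices, not from the paper) illustrates the computation:

```python
import numpy as np

def rank_histogram(obs, ens, seed=0):
    """Verification rank histogram for univariate ensemble forecasts.

    obs: shape (n,) observed values; ens: shape (n, m) ensemble members.
    Returns counts of the observation's rank (0..m) among the m + 1 pooled
    values in each case; a flat histogram suggests probabilistic calibration.
    """
    n, m = ens.shape
    below = np.sum(ens < obs[:, None], axis=1)   # members strictly below the observation
    ties = np.sum(ens == obs[:, None], axis=1)   # ties are broken at random
    rng = np.random.default_rng(seed)
    ranks = below + rng.integers(0, ties + 1)    # random offset within the tie group
    return np.bincount(ranks, minlength=m + 1)
```

For a calibrated forecast system the observation is statistically indistinguishable from the members, so each of the m + 1 ranks is equally likely and the histogram is approximately flat.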

To make a quantitative comparison of different forecast methods, summary measures of their predictive performance are required. Those measures should take both calibration and sharpness into account. To this end, scoring rules have been proposed which assign a numerical score $S(F, y)$ to each pair $(F, y)$, where *F* is the CDF of the predictive distribution and *y* is the realized value. If we take scoring rules to be negatively oriented, $S(F, y)$ can be viewed as a penalty that the forecasters wish to minimize. A crucial property that one should always require from a scoring rule is that it is *proper*, which is formally defined by the requirement
$$\mathbb{E}_G\, S(G, Y) \;\le\; \mathbb{E}_G\, S(F, Y) \qquad \forall\, F, G, \tag{1}$$
where $\mathbb{E}_G\, S(F, Y)$ denotes the expected score of the forecast CDF *F* when the verifying observations *y* are realizations of a random variable *Y* with CDF *G*, and $\forall$ means "for all." The score is *strictly proper* if the equality holds only if $F = G$ (Gneiting and Raftery 2007). Using only proper scoring rules is important in practice because the above inequality implies that a forecaster who knows the true distribution *G* has no incentive to predict any $F \neq G$, and is encouraged to quote her true belief. It has been demonstrated that the use of improper scores can lead to misguided inferences about predictive performance (Gneiting 2011).

The notions and methods mentioned above refer to probabilistic forecasts of univariate quantities. In some applications, however, multivariate quantities are of interest, where "multivariate" can refer either to several different weather variables, or to a single variable considered at different locations in space or points in time simultaneously. River basin streamflow forecasts, for example, rely heavily on the meteorological inputs, and the runoff of mountain streams in the spring season depends on both temperature (because of its impact on the amount of meltwater) and precipitation amounts. It is therefore important to know if an observed temperature above the predictive mean is likely to be associated with observed precipitation amounts above the predictive mean. If there is a positive or negative association between those two variables it should be reflected by the joint probabilistic forecast. Moreover, simultaneous consideration of all locations in the river basin and several lead times may be required. A recent article by Wilks (2014) considers probabilistic forecasting of heat waves, which requires the simultaneous study of minimum temperature and dewpoint temperature on two consecutive days, and Feldmann et al. (2015) study statistical postprocessing methods that yield calibrated temperature forecasts simultaneously at several locations. A number of multivariate generalizations of the verification rank histogram have been proposed (Smith and Hansen 2004; Wilks 2004; Gneiting et al. 2008; Thorarinsdottir et al. 2015; Ziegel and Gneiting 2014) that are sensitive to misrepresentations of both univariate characteristics and correlations between the different components of the multivariate quantity under consideration.

As far as proper scoring rules are concerned, the forecast verification toolbox is still rather limited. On the one hand there is the energy score (ES) and generalizations of it (Gneiting and Raftery 2007):
$$\mathrm{ES}(F, \mathbf{y}) = \mathbb{E}_F \|\mathbf{X} - \mathbf{y}\| \;-\; \tfrac{1}{2}\, \mathbb{E}_F \|\mathbf{X} - \mathbf{X}'\|,$$
where $\mathbf{X}$ and $\mathbf{X}'$ are independent random vectors that are distributed according to the multivariate CDF *F* and $\|\cdot\|$ is the Euclidean norm. The energy score has the appealing property that it generalizes the univariate continuous ranked probability score (CRPS; Hersbach 2000) and is readily applicable also to ensemble forecasts. It has been pointed out, however, that this score is often not sufficiently sensitive to misspecifications of the correlations between the different components (Pinson and Girard 2012; Pinson and Tastu 2013). This is a serious drawback since, unlike the means and variances, those correlations cannot be studied by applying univariate scores to the individual components. On the other hand, there are scoring rules [e.g., the logarithmic score by Roulston and Smith (2002), applied to a multivariate probability density function] that are more sensitive to misspecified correlations, but require that the forecast is given in terms of a predictive density, and are thus not applicable in the important case of ensemble forecasts. Dawid and Sebastiani (1999) proposed some multivariate scoring rules that depend only on the mean vector and the covariance matrix of the predictive distribution *F*. A particularly appealing example is the scoring rule [hereafter referred to as the Dawid–Sebastiani score (DSS)]:
$$\mathrm{DSS}(F, \mathbf{y}) = \log \det \boldsymbol{\Sigma}_F \;+\; (\mathbf{y} - \boldsymbol{\mu}_F)^{\top} \boldsymbol{\Sigma}_F^{-1} (\mathbf{y} - \boldsymbol{\mu}_F),$$
It is equivalent to the logarithmic score for multivariate Gaussian predictive distributions and remains a proper (though not strictly proper) score relative to the larger class of probability distributions for which the second moments of all components are finite (Gneiting and Raftery 2007). In principle this score could be applied to empirical versions of the mean vector $\boldsymbol{\mu}_F$ and covariance matrix $\boldsymbol{\Sigma}_F$ that were estimated from an ensemble, but unless the sample size is much larger than the dimension of the multivariate quantity, sampling errors can have disastrous effects on the calculation of $\det \boldsymbol{\Sigma}_F$ and $\boldsymbol{\Sigma}_F^{-1}$, and render this score useless in the context of ensemble forecasting (see e.g., Table 2 in Feldmann et al. 2015). Accordingly, in section 2 we propose a new, proper, multivariate score that is based on pairwise differences between all components of the multivariate quantity and that we hypothesize is more readily usable for ensemble forecast diagnosis. Some simulation examples are presented in section 3. These will demonstrate that this new score is sensitive to misspecified correlations between the different components, and that it is useful for ensemble forecast diagnosis even when the number of ensemble members is moderate. An application of the new score in the context of probabilistic wind speed forecasting at several locations in Colorado simultaneously is presented in section 4, before we conclude with a short discussion in section 5.
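Both scores above have standard plug-in estimators for ensemble forecasts: the ES expectations are replaced by ensemble averages, and the DSS uses the empirical ensemble mean and covariance. The following sketch (function names are ours; the estimators are the usual plug-in forms, not code from the paper) illustrates both:

```python
import numpy as np

def energy_score(ens, y):
    """Ensemble estimate of the energy score.

    ens: (m, d) array of ensemble members; y: (d,) observation vector.
    ES = E||X - y|| - 0.5 E||X - X'||, with both expectations replaced
    by averages over the m ensemble members.
    """
    m = ens.shape[0]
    term1 = np.mean(np.linalg.norm(ens - y, axis=1))
    diffs = ens[:, None, :] - ens[None, :, :]          # all member pairs
    term2 = np.sum(np.linalg.norm(diffs, axis=2)) / (2 * m * m)
    return term1 - term2

def dawid_sebastiani_score(ens, y):
    """Plug-in DSS using the empirical ensemble mean and covariance.
    Only stable when m is much larger than the dimension d."""
    mu = ens.mean(axis=0)
    cov = np.cov(ens, rowvar=False)
    dev = y - mu
    _, logdet = np.linalg.slogdet(cov)                 # log det of the covariance
    return logdet + dev @ np.linalg.solve(cov, dev)    # + Mahalanobis term
```

The DSS sketch makes the sampling-error issue discussed above concrete: the empirical covariance of a small ensemble is a noisy (or even singular) estimate, so its log-determinant and inverse can be badly behaved.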

## 2. A scoring rule based on pairwise differences

The basic idea of the class of multivariate scoring rules proposed in the following is to consider pairwise differences of the components of the multivariate quantity of interest. This has already been suggested in the context of rank histograms (e.g., Fig. 5 in Hamill 2001) and recently been utilized by Feldmann et al. (2015) in a diagnostic plot to check the adequacy of a statistical model for spatial correlations. Denote by $\mathbf{y} = (y_1, \ldots, y_d)^{\top}$ the vector of observations, by $y_i$ its *i*th component, and assume that $\mathbf{y}$ is a realization of the random vector $\mathbf{Y} = (Y_1, \ldots, Y_d)^{\top}$. Adopting the concept of a *variogram* (also referred to as *structure function*) from geostatistics we study the quantity
$$\gamma(i, j) = \mathbb{E}\,(Y_i - Y_j)^2,$$
where $\mathbb{E}$ denotes the expectation under the (multivariate) distribution of $\mathbf{Y}$, which is assumed to have finite second moments. Denoting $\mu_i = \mathbb{E}\,Y_i$, $\sigma_i^2 = \operatorname{var}(Y_i)$, and $\rho_{ij} = \operatorname{cor}(Y_i, Y_j)$, we have
$$\gamma(i, j) = (\mu_i - \mu_j)^2 + \sigma_i^2 + \sigma_j^2 - 2\,\sigma_i \sigma_j \rho_{ij}, \tag{2}$$
which shows that $\gamma(i, j)$ depends not only on the first and second moments of the individual components, but also on their correlations. More generally, one can consider variograms of order $p > 0$:
$$\gamma_p(i, j) = \mathbb{E}\,|Y_i - Y_j|^p.$$
The special cases $p = 1$ and $p = 0.5$ are known as *madogram* and *rodogram*, respectively (Bruno and Raspa 1989; Emery 2005). Variograms of order *p* can be defined for any multivariate distribution for which the *p*th absolute moments exist. For $p \neq 2$ and non-Gaussian distributions they can usually not be expressed as simple functions of the means, variances, and correlations of $Y_i$ and $Y_j$, but they still depend on all of those quantities, and are therefore potentially useful for comparing the multivariate dependence structure of forecasts and observations. While condensing the information about the dependence of $Y_i$ and $Y_j$ into a single number implies a certain loss of information, we shall see that utilizing these quantities in the framework of scoring rules results in a performance measure that is sensitive to various types of miscalibration of multivariate forecasts. For a given *d*-variate observation vector $\mathbf{y}$ and forecast distribution *F* we define the *variogram score of order p* (VS-*p*):
$$\mathrm{VS}\text{-}p(F, \mathbf{y}) = \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} \Big( |y_i - y_j|^p - \mathbb{E}_F\, |X_i - X_j|^p \Big)^{2}, \tag{3}$$
where $X_i$ and $X_j$ are the *i*th and the *j*th component of a random vector $\mathbf{X}$ that is distributed according to *F*, and $w_{ij}$ are nonnegative weights. The score measures the dissimilarity between approximations of the variograms of order *p* of observations and forecasts over all pairs of components of the quantity of interest. For the observations, our best guess of $\gamma_p(i, j)$ is simply the powered absolute difference of $y_i$ and $y_j$. When the forecast distribution is given in the form of an ensemble $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(m)}$, the forecast variogram can be approximated by
$$\mathbb{E}_F\, |X_i - X_j|^p \;\approx\; \frac{1}{m} \sum_{k=1}^{m} \big| x_i^{(k)} - x_j^{(k)} \big|^{p}, \tag{4}$$

where $x_i^{(k)}$ denotes the *i*th component of the *k*th ensemble member.
Pairs of squared variogram differences can be emphasized or downweighted through the choice of the weights. This might be motivated by a subjective decision of an expert to put focus on certain component combinations. In a spatial context, for example, the possibility of emphasizing differences corresponding to pairs of locations that are either close by or a certain distance apart is related to the idea of scale-dependent verification (e.g., Jung and Leutbecher 2008). Downweighting certain pairs can also help mitigate the effects of sampling error. To see this, assume for simplicity that the random vector $\mathbf{Y}$ follows a multivariate Gaussian distribution with identical mean in all components. Defining $\sigma_{ij}^2 = \operatorname{var}(Y_i - Y_j) = \sigma_i^2 + \sigma_j^2 - 2\,\sigma_i \sigma_j \rho_{ij}$, we then have
$$\mathbb{E}\,|Y_i - Y_j|^p = c_p\, \sigma_{ij}^{\,p} \qquad \text{and} \qquad \operatorname{var}\big(|Y_i - Y_j|^p\big) = \big(c_{2p} - c_p^2\big)\, \sigma_{ij}^{\,2p},$$

with the constant $c_p = 2^{p/2}\, \Gamma\big(\tfrac{p+1}{2}\big) \big/ \sqrt{\pi}$.
This shows that both the magnitude and the variability of the variogram contributions from pairs of weakly correlated components are higher than those from strongly correlated components. The former would therefore dominate the VS-*p* on the one hand, and introduce more variability on the other hand, which implies that downweighting pairs that are expected to have relatively weak correlations can benefit the signal-to-noise ratio. In situations where there is some notion of distance between the *i*th and *j*th component (e.g., time lag as in the examples in section 3 or spatial distance as in section 4), correlations at short distances are typically stronger than those at longer distances. As a pragmatic ad hoc choice of the weights we then suggest letting them be proportional to the inverse distances between the corresponding components. This idea of downweighting certain pairs of components is conceptually similar to covariance localization in data assimilation (Houtekamer and Mitchell 2001; Hamill et al. 2001), where elements in the empirical covariance matrix that correspond to presumably weakly correlated or uncorrelated components are tapered down toward zero to reduce the effects of sampling error. When the multivariate quantity consists of variables of different type (e.g., temperature, pressure, and relative humidity), there is no obvious notion of distance and even the definition of $\gamma_p(i, j)$ seems doubtful as we would be subtracting quantities with potentially different units. In that situation, one could apply the VS-*p* to standardized components:
$$\tilde{y}_i = \frac{y_i - \mu_i}{\sigma_i}, \qquad i = 1, \ldots, d,$$
where $\mu_i$ and $\sigma_i^2$ are the climatological mean and variance of the respective variable. This approach has been suggested in multivariate geostatistics in the context of variance-based cross variograms, which are the equivalent of our score in the situation where components can correspond to different variables. In the geostatistical context it can be justified by the fact that predictors derived from variance-based cross variograms do not depend on the particular unit, and so the user should work with standardized variables in order to minimize the effects of sampling error (Cressie and Wikle 1998). In some applications there might be better, more problem-specific meteorological concepts to transform weather variables of different type in a way that brings them all to a scale in which they can be compared, one example being the total-energy norm (e.g., Hamill et al. 2003).

We now show that the VS-*p* is proper relative to the class of probability distributions for which the $2p$th absolute moments of all components are finite. To see this, consider first a single pair $(i, j)$. For any such pair, the mean of the random variable $Z = |Y_i - Y_j|^p$ minimizes the expected squared deviation of *Z* from any fixed number $c$, that is,
$$\mathbb{E}\,(Z - c)^2 \;\ge\; \mathbb{E}\,\big(Z - \mathbb{E}\,Z\big)^2 \qquad \forall\, c \in \mathbb{R}.$$
This means that the inequality in (1) holds separately for any pair $(i, j)$, but then it also holds for the weighted sum over all pairs, for any choice of nonnegative weights. Note, however, that the VS-*p* is not strictly proper because it only depends on the *p*th absolute moment of the distribution of component differences, and can therefore not distinguish between distributions of *Z* that have the same *p*th absolute moment but different higher moments. Moreover, large-scale random errors that are the same for all components cancel out when differences are considered; likewise, a bias that is the same for all components will go undetected. The simulation study in section 3 shows, however, that for suitable choices of *p* the VS-*p* is quite sensitive to misspecifications of the correlation structure of $\mathbf{Y}$. More importantly, this is still true when $\mathbb{E}_F\,|X_i - X_j|^p$ has to be estimated as in (4) from an ensemble that represents the predictive distribution *F*. This approximation introduces quite a bit of additional sampling error, but the effects on the score’s propriety and discrimination ability will be shown to be much less severe than for the Dawid–Sebastiani score. This makes the VS-*p* a favorable score in the context of ensemble forecasting, on which we focus in the rest of this paper.
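The propriety argument above can be illustrated empirically: when observations come from a known multivariate distribution, the forecaster who uses the true correlation structure should attain the lower average VS-*p*. The following Monte Carlo sketch is our own illustration (the dimension, correlation values, and sample sizes are arbitrary choices, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(42)

def vs(ens, y, p=0.5):
    # unweighted variogram score of order p with the ensemble approximation (4)
    v_obs = np.abs(y[:, None] - y[None, :]) ** p
    v_fc = np.mean(np.abs(ens[:, :, None] - ens[:, None, :]) ** p, axis=0)
    return np.sum((v_obs - v_fc) ** 2)

def ar1_sample(rho, size, d=5):
    # zero-mean, unit-variance Gaussian vectors with cor(Y_i, Y_j) = rho^|i-j|
    lags = np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    return rng.multivariate_normal(np.zeros(d), rho ** lags, size=size)

# observations always come from rho = 0.7; a fresh 50-member ensemble is drawn
# per case, either with the true correlation (0.7) or a too-weak one (0.3)
obs = ar1_sample(0.7, 500)
mean_score = {rho: np.mean([vs(ar1_sample(rho, 50), y) for y in obs])
              for rho in (0.7, 0.3)}
print(mean_score)
```

Averaged over many cases, the true correlation model attains the smaller mean score, consistent with propriety; for a single case the ranking can of course go either way.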

Before comparing it with the ES and DSS in simulations, we shall mention that the VS-*p* can be viewed as a special case of a much larger class of scoring rules. Consider the mapping $T_{\mathbf{w}}: \mathbb{R}^d \to \mathbb{R}^{d^2}$ defined by
$$T_{\mathbf{w}}(\mathbf{y}) = \Big( w_{ij}^{1/2}\, |y_i - y_j|^p \Big)_{i, j = 1, \ldots, d},$$
where $\mathbf{w} = (w_{ij})_{i, j = 1, \ldots, d}$ is the weight vector of the transformation $T_{\mathbf{w}}$. Choosing the weights as in (3), we can rewrite the VS-*p* from (3) as
$$\mathrm{VS}\text{-}p(F, \mathbf{y}) = \sum_{i=1}^{d} \sum_{j=1}^{d} \Big( T_{\mathbf{w}}(\mathbf{y})_{ij} - \mathbb{E}_F\, T_{\mathbf{w}}(\mathbf{X})_{ij} \Big)^{2},$$
which shows that the VS-*p* of a single, multivariate forecast is (up to the factor $d^2$) the same as the mean squared error (MSE) over the components of the transformed forecast vector. The generalization of the VS-*p* is now obvious: instead of the MSE, we can apply any other univariate scoring rule to the components of $T_{\mathbf{w}}(\mathbf{X})$ and $T_{\mathbf{w}}(\mathbf{y})$, and take the mean over the resulting values as an alternative score for our multivariate quantity. Or, we can apply the ES to the $d^2$-variate vectors $T_{\mathbf{w}}(\mathbf{X})$ and $T_{\mathbf{w}}(\mathbf{y})$, rather than to $\mathbf{X}$ and $\mathbf{y}$ directly. These generalizations will also be studied in the subsequent section.
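The equivalence between the direct definition and the transformation view can be verified numerically. The sketch below (variable names are ours) computes the score both ways for one forecast case, using inverse-distance weights:

```python
import numpy as np

p, d = 1.0, 4
rng = np.random.default_rng(0)
y = rng.normal(size=d)
ens = rng.normal(size=(30, d))

# inverse-distance weights, zero on the diagonal (diagonal pairs contribute nothing)
w = 1.0 / np.maximum(np.abs(np.subtract.outer(np.arange(d), np.arange(d))), 1)
np.fill_diagonal(w, 0.0)

def transform(v):
    # T_w maps a d-vector to the d x d array of weighted pairwise differences
    return np.sqrt(w) * np.abs(v[:, None] - v[None, :]) ** p

# direct definition: weighted squared variogram differences
v_obs = np.abs(y[:, None] - y[None, :]) ** p
v_fc = np.mean(np.abs(ens[:, :, None] - ens[:, None, :]) ** p, axis=0)
vs_direct = np.sum(w * (v_obs - v_fc) ** 2)

# transformation view: squared error between T_w(y) and the ensemble mean of T_w(X)
vs_via_T = np.sum((transform(y) - np.mean([transform(x) for x in ens], axis=0)) ** 2)
```

Because the square root of the weight is absorbed into the transformation, summing the squared componentwise errors of the transformed vectors reproduces the weighted sum in the direct definition exactly.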

## 3. Simulation study

We compare the energy score, the Dawid–Sebastiani score, and the variogram score of order $p = 0.5$, 1, and 2, using inverse distance weights as described above. In all experiments we generate observation vectors of dimension *d*, and an *m*-member ensemble of forecast vectors of the same dimension with either correct or misspecified means, variances, or correlations. To understand the impact of representing the predictive distribution by an ensemble on the different scores, we consider both small ($m = 20$) and medium-sized ($m = 100$) ensembles. While a formal definition of being *proper* exists and allows one to check this property mathematically, there does not seem to be a commonly accepted measure of a scoring rule’s ability to discriminate between calibrated and uncalibrated forecasts. This is an important characteristic, though, as it determines the score’s utility for forecast verification in practice. In this simulation study, we try to get some sense of the discrimination ability of the various scores by repeating each experiment 10 times and visualizing the respective outcomes by boxplots. Even though the scores are averaged over 5000 cases, they still vary from one experiment to another. If the group of average scores obtained with calibrated forecasts is clearly separated from the one obtained with uncalibrated forecasts, we will interpret this as good discrimination ability of the scoring rule that was utilized. Conversely, if there is a strong overlap between the ranges of outcomes obtained with calibrated and uncalibrated forecasts, we will conclude that the scoring rule that produced these outcomes cannot reliably detect this particular type of miscalibration.

### a. Miscalibrated marginal distributions

Although we contend that multivariate verification should focus on the correlations between the different components (predictive means and variances can be compared in a first step with univariate verification techniques), we shall start with a first experiment that compares the different scores with respect to their ability to detect biases and over- or underdispersion of the forecasts. We already noted that the VS-*p* is unable to detect a bias that is the same for all components, but we can consider a situation where this simple type of bias has been removed while an erroneous trend is present in the forecast means. Specifically, let the observation vectors be realizations of a Gaussian random vector $\mathbf{Y}$ of dimension *d* with zero mean, unit variance, and correlation function
$$\operatorname{cor}(Y_i, Y_j) = \exp\!\left( -\,\frac{|i - j|}{r} \right), \tag{5}$$

where $r > 0$ is a range parameter.
In this experiment the dimension *d* and the range parameter *r* are held fixed. If we associate each component with a time point, $\mathbf{Y}$ can be viewed as a short, stationary autoregressive AR(1) process. Note that the definitions of all scores studied here neither exploit nor rely on this property of stationarity. Moreover, since the scores are calculated separately for each of the 5000 cases and averaged only afterward, they can also be applied in situations where the distribution of the observation vector differs from one case to another. The possibility of exploiting preliminary knowledge about the multivariate dependence structure is further discussed in the second example below. To compare the sensitivity of the different scores to misspecifications of means and variances, we generate forecasts with the same exponential correlation function as above and

1) correct variances but biased means (an erroneous linear trend across the components);

2) correct means and variances;

3) correct means but too large variances; and

4) correct means but too small variances.
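Simulating Gaussian vectors with the exponential correlation function (5) requires only a Cholesky factor of the implied covariance matrix. The sketch below shows one way to set up such an experiment; the specific values of $d$, $r$, and the miscalibration magnitudes are our own illustrative assumptions, since the paper's settings are not reproduced here:

```python
import numpy as np

def exp_corr_gaussian(rng, d, r, mean=0.0, sd=1.0, size=1):
    """Gaussian vectors with cor(Y_i, Y_j) = exp(-|i - j| / r), cf. Eq. (5)."""
    lags = np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    cov = sd ** 2 * np.exp(-lags / r)
    L = np.linalg.cholesky(cov)                 # cov = L @ L.T
    z = rng.standard_normal((size, d))
    return mean + z @ L.T

rng = np.random.default_rng(7)
d, r = 10, 2.0                                  # assumed values for illustration
obs = exp_corr_gaussian(rng, d, r, size=5000)   # 5000 "observation" cases

# one forecast case each for settings 2), 1), and 3) above (magnitudes assumed):
ens_ok = exp_corr_gaussian(rng, d, r, size=20)                          # calibrated
ens_trend = exp_corr_gaussian(rng, d, r, mean=0.1 * np.arange(d), size=20)  # trend bias
ens_over = exp_corr_gaussian(rng, d, r, sd=1.5, size=20)                # overdispersive
```

Scoring each of the 5000 cases against a freshly drawn ensemble and averaging, as described above, then yields one experiment outcome per score and forecast type.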

The corresponding boxplots are shown in Fig. 1. We note first of all that the influence of ensemble size is rather different from one score to another. For the ES, there is hardly any difference between $m = 20$ and $m = 100$. This can be an advantage if only an ensemble of very small size is available, but it also suggests that the ES cannot distinguish a very good representation of the predictive distribution *F* from a very sparse one. This is different for the VS-*p* values, which consistently improve with increasing ensemble size, thus showing that the finite sample representation of *F* does have a noticeable effect on the score. This sampling effect, however, does not change the qualitative conclusions about the predictive performance of the different forecasts (this is also true for the examples considered below). A really substantial change of the scores due to the different finite representations of the predictive distribution can be observed with the DSS (note the different scales for $m = 20$ and $m = 100$). The approximation of $\boldsymbol{\mu}_F$ and $\boldsymbol{\Sigma}_F$ by empirical means and covariances estimated from the small ensemble is so poor that the resulting scores lead to false conclusions about predictive performance, favoring the overdispersive ensemble over the calibrated one. For the larger ensemble, this score bias due to insufficient representation of *F* plays a smaller role, and the DSS discriminates well between the correct and uncalibrated forecasts. The ES is very effective in detecting the erroneous linear trend corresponding to the forecasts simulated according to 1), but the separation between the calibrated and over-/underdispersive forecasts is less distinct. Among the different orders studied here, the VS-0.5 clearly has the best discrimination ability. It identifies the miscalibration of the mean less clearly than the ES, but is more effective in detecting over- and underdispersion. The VS-1 still detects all types of miscalibration reasonably well.
It is noticeable, however, that with increasing *p* the random variations between scores obtained with identical setups become larger and larger and blur the systematic differences between calibrated and uncalibrated forecasts. Before we turn to the genuinely multivariate aspects we would like to recall that the VS-*p* is not *strictly* proper. In the present situation, for example, the effects of an erroneous trend and underdispersion can cancel out [for $p = 2$ this can be seen directly from (2)]. We therefore emphasize again that an analysis of the marginal distributions by means of univariate scores should precede the study of multivariate properties.

### b. Misspecified correlation strength

In our second experiment we focus on the correlation structure of the multivariate quantity under consideration. We study the ability of the different scores to detect whether the correlations between the different components of the forecast vectors are too weak, adequate, or too strong compared to the corresponding correlations of the observation vectors. Moreover, we study the effect of increasing the dimension *d* of the multivariate quantity on the different scores. In both cases, we consider again a zero mean, unit variance AR(1) process with correlation function given in (5). For the observation vectors, we choose the same range parameter *r* as before and compare ensemble forecasts simulated with the same correlation model but with a range parameter that is too small, correct, or too large. The boxplots in Fig. 2 for the ES confirm the conclusion of Pinson and Tastu (2013) that the ES can hardly discriminate between multivariate forecasts that differ only with respect to the correlations between their individual components. For the DSS the conclusion is the same as in the first experiment: it discriminates well between calibrated and uncalibrated forecasts if the ensemble that represents the predictive distribution is sufficiently large. A small ensemble, however, results in an inaccurate approximation of $\boldsymbol{\mu}_F$ and $\boldsymbol{\Sigma}_F$, and the corresponding DSS leads to misguided inference. This representation issue is much less severe for the VS-*p*, which for $p = 0.5$ and $p = 1$ discriminates well between correct and incorrect correlation strengths. For $p = 2$ the discrimination ability is still better than that of the ES but overall not very satisfactory, with random differences between identical setups having the same magnitude as the systematic score differences due to miscalibration. Increasing the dimension *d* has a slightly negative effect on the discrimination ability of the VS-*p*. This may be somewhat surprising since a larger dimension entails more data that are used for the calculation of the score.
However, since our definition of the VS-*p* in (3) does not make any assumption (e.g., stationarity in a time series or spatial context) about the correlation structure of forecasts and observations, increasing the number of summands in (3) does *not* lead to an averaging out of sampling error. If one were absolutely sure that some additional structural assumption is justified [i.e., that the set of all pairs $(i, j)$ can be represented as a union of disjoint subsets such that the component differences corresponding to the pairs in each subset have the same *p*th absolute moment], one could replace definition (3) by
$$\mathrm{VS}\text{-}p^{*}(F, \mathbf{y}) = \sum_{l=1}^{L} \tilde{w}_l \left( \frac{1}{|\mathcal{P}_l|} \sum_{(i, j) \in \mathcal{P}_l} |y_i - y_j|^p \;-\; \frac{1}{|\mathcal{P}_l|} \sum_{(i, j) \in \mathcal{P}_l} \mathbb{E}_F\, |X_i - X_j|^p \right)^{2},$$

where $\mathcal{P}_1, \ldots, \mathcal{P}_L$ denote the disjoint subsets of pairs and $\tilde{w}_1, \ldots, \tilde{w}_L$ are nonnegative weights.
This way, additional structural information could be exploited and an increase of *d* would then likely reduce sampling error and improve the discrimination ability of the score. In the present example, the simulated AR(1) process is stationary, and proceeding as described above with subsets defined by the time lag $|i - j|$ would be justified. In general, however, such information is not available, and while simplifying assumptions are common and appropriate in statistical modeling, we contend that verification methods should avoid unwarranted preliminary assumptions about forecasts and observations as far as possible. We therefore recommend retaining the definition in (3), even though it is less favorable with respect to the VS-*p*’s discrimination ability. The fact that the discrimination ability in the present example even gets slightly worse as *d* increases can probably be explained by the fact that the fraction of pairs of components with rather weak correlations increases, and thus more variability is introduced into the calculation of the score.
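The lag-pooled variant just described can be sketched as follows; the function below (our own illustration, with subsets defined by the time lag and unit subset weights as a simplifying assumption) averages the empirical variograms within each lag class before taking squared differences:

```python
import numpy as np

def vs_pooled_by_lag(ens, y, p=0.5):
    """Pooled variogram score that groups all pairs with the same lag |i - j|.
    Justified only under an assumed stationarity of the components; the
    per-lag weights are set to 1 here for simplicity.

    ens: (m, d) ensemble; y: (d,) observation vector.
    """
    d = y.size
    score = 0.0
    for lag in range(1, d):
        # empirical lag-variogram of the observation, pooled over all pairs
        v_obs = np.mean(np.abs(y[lag:] - y[:-lag]) ** p)
        # ensemble estimate of the same pooled quantity
        v_fc = np.mean(np.abs(ens[:, lag:] - ens[:, :-lag]) ** p)
        score += (v_obs - v_fc) ** 2
    return score
```

Pooling within each lag class averages sampling error over $d - \text{lag}$ pairs, which is exactly why an increase of *d* helps this variant but not the pairwise definition in (3).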

### c. Misspecified correlation model

In the third experiment, we vary the entire correlation model rather than just the correlation strength. We now consider a single choice of *d* and *m*, and simulate observations with zero mean, unit variance, and one of two alternative correlation functions, referred to as models (i) and (ii) in the following.

Both of them yield correlations at lag 1 that are very similar to those of the exponential model in (5). Model (i), however, has much stronger correlations at larger lags, and model (ii) has a periodic component that makes it oscillate around this exponential reference model. Can the VS-*p* detect those differences between the model in (5) and models (i) and (ii), respectively, even though our proposed weighting scheme downweights larger lags? Figure 3 confirms many of the conclusions from the preceding experiment. The ES again lacks sensitivity to misspecifications of the correlation structure, while the VS-*p* distinguishes much better between the correct and the incorrect correlation model. Again, however, the discrimination ability depends on *p*, with smaller values yielding significantly better results. The DSS has similar issues in this example as in those discussed above. Their magnitude drops dramatically when passing from 20 to 100 ensemble members, although the underlying multivariate distribution is the same. In the case where the observations have long-range dependence, both ensemble sizes are insufficient to reduce this score’s representation bias enough to yield the proper ranking between correct and incorrect forecasts. In the example with the oscillating correlation model, the DSS yields the correct ranking and separates the two cases very well. However, it may well be that this is simply an example where the bias due to the finite representation of the predictive distribution favors the correct ranking by chance.

### d. Misspecified generating process

When we introduced the VS-*p* in section 2, we emphasized that this family of scoring rules is proper, but not strictly proper: it is based only on the *p*th absolute moment of differences between all pairs of components. It is clear that biases that are the same for all components cancel out. It is also clear that certain combinations of misspecifications (e.g., overestimation of correlation strength and overestimation of marginal variances) can partially or fully cancel out. But even if it has been assured that the marginal distributions are calibrated, the *p*th absolute moment of component differences does in general not fully characterize the multivariate dependence. How good is the VS-*p* at distinguishing forecasts that are entirely correct (i.e., have been generated by the same process as the observations) from forecasts that have correct means, variances, *and* correlations, but have been generated by a completely different mechanism? It can be expected that the answer depends on the particular generating process, and we are careful not to make general claims about this issue. Yet it is instructive to study at least one such example. We simulate observations as follows:

1) Draw a random number $v$ from a Poisson distribution.

2) Draw $v$ locations from a uniform distribution on a fixed interval.

3) Denoting by $(\,\cdot\,)_{+}$ the maximum of 0 and the expression in brackets, define each component of the observation vector as a sum of influence terms centered at these locations.

One can think of the sampled locations as storm centers that have an influence on all locations within a radius of one unit, expressed by an influence function of the form $(\,\cdot\,)_{+}$. The different local storms are then added up to the final outcome. This process is a special case of a so-called *shot noise process*. Using results from Matérn (1986, chapter 3.3), one can show that with the specific choices made above the resulting vector is a sample of a stationary time series with constant mean, variance 1, and correlation function:

We now compare forecasts that were generated in the same way as this shot noise observation process with forecasts that have the same means, variances, and correlations, but were simulated from a multivariate Gaussian distribution. An illustration of one sample path of each process on the full interval is provided in the supplemental material to this paper. The results of this comparison are depicted in Fig. 4. A few conclusions are very consistent with what we already observed before. The discrimination ability of the ES is rather poor, and the DSS favors the incorrect model as a result of insufficient approximation of $\boldsymbol{\mu}_F$ and $\boldsymbol{\Sigma}_F$, even in the case where $m = 100$. Recall that the DSS depends on the predictive distribution only through its component means and variances, and intercomponent correlations, so for a perfect approximation of $\boldsymbol{\mu}_F$ and $\boldsymbol{\Sigma}_F$ we would expect the DSS to be indifferent toward the particular forecast generation process. The same is true for the VS-2, while the effect of the generation process on the VS-1 and VS-0.5 is not quite as obvious. For the first time, we observe problems related to the finite sample representation of the predictive distribution also with the VS-2 and VS-1. The good discrimination ability of the VS-0.5 may be based on several factors. On the one hand, the 0.5th absolute moment of differences seems to be very informative about the generating process. It is not clear, though, whether this is specific to the present example or whether this is true in general. On the other hand, we have already observed that the choice $p = 0.5$ entails less sampling variability compared to larger values, and this likely contributes to the favorable performance of the VS-0.5 in the present example as well.
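A shot noise process of the kind described above is easy to simulate. The sketch below is our own illustration: the triangular shape of the influence function, the Poisson rate, the sampling interval, and the scaling are all illustrative assumptions, since the specific choices of the paper are not reproduced here.

```python
import numpy as np

def shot_noise_series(rng, d, rate, scale=1.0):
    """Sketch of a shot noise process: Poisson-many 'storm centers' on an
    interval, each contributing a triangular influence within radius 1 of
    its center. All parameter choices here are illustrative assumptions."""
    t = np.arange(1, d + 1, dtype=float)          # component "locations"
    v = rng.poisson(rate)                         # number of storm centers
    centers = rng.uniform(0.0, d + 1.0, size=v)   # centers on a covering interval
    # (.)_+ denotes max(0, .): triangular bump of radius 1 around each center
    bumps = np.maximum(0.0, 1.0 - np.abs(t[:, None] - centers[None, :]))
    return scale * bumps.sum(axis=1)              # add up the local storms
```

Because each draw sums nonnegative bumps, the marginal distributions of such a process are skewed and clearly non-Gaussian even when the first two moments are matched, which is what makes it a useful test case here.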

### e. Sensitivity of the variogram score of order p to the choice of weights

So far, we have always chosen the weights in (3) proportional to the inverse distance between the components. We argued in section 2 that such a choice is reasonable whenever there is some natural notion of distance and correlations between components are expected to decrease with this distance. Yet this choice is quite ad hoc, and it is natural to ask how sensitive the discrimination ability of the VS-*p* is to the choice of weights, and whether other choices yield similar or even better performance. To answer this question, we repeat the first two experiments, this time considering only a single combination of ensemble size and dimension. We restrict our attention to the VS-0.5 but study two alternative weighting schemes: no weighting at all (i.e., all weights equal) and a kind of localization scheme in which pairs of components more than three units apart receive zero weight and are thus not considered at all. The results in Fig. 5 are as one might have expected. Misspecifying the range parameter in our exponential correlation model in (5) affects correlations between all pairs of components. As pointed out in section 2, close-by, strongly correlated components have a more favorable signal-to-noise ratio, so it is not surprising that the localization weighting scheme has the best, and the unweighted VS-0.5 the worst, discrimination ability. The same conclusion holds in the experiment where the correlation function of the observations has a periodic component. Even at short lags, this correlation function differs quite strongly from the simple exponential model, and focusing on close-by component pairs therefore benefits the score's discrimination ability. Differences between the long-range correlation model and the exponential model, on the contrary, are more noticeable for pairs of components that are farther apart, and hence the unweighted VS-0.5 performs best.
Overall, we conclude that if prior knowledge about the correlations is available, some sort of localization scheme with an appropriately chosen cutoff radius should be used. In the absence of such knowledge, the inverse distance weighting scheme seems to be a good compromise. We finally note that even the unweighted score permits better identification of misspecified dependence structures than the ES.

### f. Generalizations of the variogram score of order p

At the end of section 2, we pointed out that the VS-*p* defined in (3) can be viewed as a special case of a larger class of scoring rules that transform both forecast and observation vectors into vectors of weighted, powered absolute differences between all pairs of components of the original vectors. Here, we fix p = 0.5 and define the weights of the transformation as the square roots of the inverse distance weights. With these choices, the VS-0.5 with inverse distance weights is (up to a constant factor) the same as the MSE of the componentwise means of the transformed forecasts with respect to the transformed observations. As alternative scores, we consider the mean absolute error (MAE) of the componentwise medians of the transformed forecasts, the mean continuous ranked probability score (MCRPS) over all components of the transformed forecasts, and the ES of the vector of transformed forecasts. Figure 6 shows results for the setting of our correlation model experiment in section 3c, where the observations are generated according to models (i) and (ii) of that section, respectively, and the scores are used to distinguish correct forecasts from those that erroneously use an exponential correlation model. The main point to note is that all scores are able to distinguish the correct from the incorrect correlation model, showing that it is really the transformation itself, rather than the particular score applied to the transformed vectors, that is crucial for detecting misspecified dependence structures. With the MAE and MCRPS being particularly discriminative in the example with long-range dependence and the ES faring best in the example with a periodic component, there is no clear ranking among the different scores. The MSE, the score that corresponds to the VS-0.5, demonstrates good discrimination ability in both examples. Its preference over the other options is by no means imperative, but it seems to be a good compromise and thus a reasonable standard choice.

## 4. Evaluating multisite wind speed forecasts

We finally apply our score in a data example to evaluate and compare statistically calibrated, probabilistic forecasts of wind speeds at five major wind park locations in the state of Colorado. Specifically, we consider the period from 1 January to 31 December 2013, use 80-m wind speed forecasts from the second-generation GEFS reforecast dataset (Hamill et al. 2013) and the corresponding reanalyses for both calibration and verification. The reforecast ensemble has 11 members and was initialized once daily at 0000 UTC. We study 80-m wind speed predictions with lead times of 24, 48, and 72 h at the grid points that are closest to

- Cedar Point Wind Farm (250 MW, operational since 2011);
- Cedar Creek Wind Farms I and II (550 MW, operational since 2007/10);
- Peetz Table Wind Energy Center (430 MW, operational since 2001/07);
- Colorado Green Wind Farm (162 MW, operational since 2003); and
- Cheyenne Ridge Wind Project (under development, project size 300–600 MW).

As explained above, the ensemble forecasts at the five wind park locations can be interpreted as a sample from the multivariate distribution that describes the simultaneous predictions. The raw model output, however, often suffers from systematic biases and typically fails to fully represent prediction uncertainty (Hamill and Colucci 1997). To calibrate the marginal predictive distributions, we follow Thorarinsdottir and Gneiting (2010) and fit a heteroscedastic regression model to past forecast–observation pairs that turns the ensemble mean $\bar{x}_s$ and the ensemble variance $S_s^2$ at location *s* into a predictive normal distribution, truncated at zero from below (denoted $\mathcal{N}^0$),

$$ Y_s \sim \mathcal{N}^{0}\!\left(a_s + b_s\,\bar{x}_s,\; c_s + d_s\,S_s^2\right) \qquad (7) $$
for the observed wind speed at *s*. A separate model is fitted for each location, each forecast lead time, and each month of the verification period from 1 January to 31 December 2013. For each month, forecasts and observations from the same, the preceding, and the subsequent month in the years 2010, 2011, and 2012 are used as training data for the model fitting procedure [for details about that procedure we refer to Thorarinsdottir and Gneiting (2010)]. Once the parameters for each month, location, and lead time are determined, a predictive distribution for the day under consideration can be obtained by plugging the corresponding ensemble mean and variance into (7). Diagnostic plots (not shown here) confirm that the univariate probabilistic forecasts obtained in this way are calibrated (i.e., they are unbiased and represent the prediction uncertainty adequately).
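The plug-in step can be sketched with scipy's truncated normal. The affine link between ensemble statistics and the distribution's location and scale follows the Thorarinsdottir and Gneiting (2010) construction, but the coefficient values below are illustrative placeholders, not fitted parameters:

```python
import numpy as np
from scipy.stats import truncnorm

def predictive_truncnorm(ens_mean, ens_var, a=0.5, b=1.0, c=0.5, d=1.0):
    """Predictive wind speed distribution: a normal with location
    a + b * ens_mean and variance c + d * ens_var, truncated at zero.
    The coefficients a, b, c, d are hypothetical stand-ins for values
    fitted to training data."""
    mu = a + b * ens_mean
    sigma = np.sqrt(c + d * ens_var)
    # scipy parameterizes the truncation bounds on the standardized scale
    return truncnorm(-mu / sigma, np.inf, loc=mu, scale=sigma)

dist = predictive_truncnorm(ens_mean=8.0, ens_var=4.0)
median_speed = dist.ppf(0.5)   # predictive median wind speed
```

The frozen distribution object then provides the CDF and quantile function needed for the multivariate reordering step described next.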

The postprocessing scheme just described only addresses the marginal distributions. In our particular example, however, power network operators might be interested in whether low wind speeds (and hence low wind power production) at one wind park will be compensated by higher wind speeds at the other wind parks, or whether wind speeds will be low at all wind parks simultaneously. To account for this multivariate aspect of our prediction problem and address correlations between the forecasts at the different locations, we use the ensemble copula coupling (ECC) technique (Schefzik et al. 2013), which turns the five marginal predictive distributions back into an ensemble that has the same rank correlation structure as the original ensemble but calibrated margins. Specifically, if $F_s$ denotes the predictive, truncated normal CDF at location *s*, calibrated ensemble forecasts are obtained via

$$ \tilde{x}_s^{(k)} = F_s^{-1}\!\left(\frac{\pi_s(k)}{m+1}\right), \qquad k = 1, \ldots, m, $$

where $F_s^{-1}$ is the predictive quantile function at *s* and $\pi_s(k)$ is the rank of the *k*th member of the raw ensemble at *s*. In other words, the original forecasts are replaced by quantiles of the calibrated marginal distributions (this particular way of sampling is referred to as ECC-Q) in such a way that the ordering of the ensemble member forecasts remains unchanged. In this way, the (flow dependent) rank correlation information of the raw GEFS ensemble is preserved.
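The ECC-Q reordering might be sketched as follows. The equidistant quantile levels k/(m + 1) are one common convention and an assumption here, as is the function name:

```python
import numpy as np

def ecc_q(raw_ens, quantile_fns):
    """ECC-Q sketch: replace each location's raw ensemble values by
    equidistant quantiles of the calibrated marginal, reordered to match
    the ranks of the raw members (preserving rank correlations).

    raw_ens      : (m, d) array, m members at d locations
    quantile_fns : list of d vectorized marginal quantile functions
    """
    m, d = raw_ens.shape
    levels = np.arange(1, m + 1) / (m + 1)          # equidistant levels (assumed)
    out = np.empty_like(raw_ens, dtype=float)
    for s in range(d):
        ranks = raw_ens[:, s].argsort().argsort()    # 0-based ranks of raw members
        out[:, s] = quantile_fns[s](levels)[ranks]   # kth member keeps its rank
    return out
```

The double `argsort` yields ranks under the assumption of no ties among the raw members; the sorted quantiles are then permuted so that each calibrated member inherits the rank of the corresponding raw member.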

Does this preservation of rank correlations really yield noticeably better multivariate forecasts than a sampling scheme in which the ordering at each location is a random permutation of the set {1, …, *m*} (i.e., no spatial correlations) or one in which it is the identity (i.e., maximal spatial dependence)? We compute those alternative, marginally calibrated ensembles ("random-Q," "ordered-Q") as well and use the ES, the VS-0.5, and the VS-1 to evaluate and compare the corresponding multivariate wind speed forecasts with those of the raw and ECC-Q ensembles. Again, we use inverse distance weights for the VS-*p*, where distance is now the geographical distance (in kilometers) between the wind farm locations. Since the empirical DSS turned out in section 3 to be unreliable for small ensemble sizes and the VS-2 was always less discriminative than the VS-0.5 and VS-1, only the latter two are considered here as alternatives to the ES. To facilitate the comparison between the three different scores, we turn them into *skill* scores with respect to the raw ensemble. That is, instead of the energy score ES_{*} for method "*," we state the energy skill score ESS_{*} = 1 − ES_{*}/ES_{ens}, which measures the increase in predictive performance compared to the raw ensemble (likewise for the variogram scores). All skill scores in Table 1 agree that ECC-Q yields the most skillful multivariate probabilistic forecasts. The ordered-Q ensemble, for which wind speeds are simultaneously low or high at all locations, is less skillful than the uncalibrated ensemble; the corresponding multivariate structure is clearly inappropriate. The comparison between ECC-Q and random-Q is more interesting and confirms the above findings about the respective sensitivity of the ES and the VS-*p* to miscalibration. The ESS yields a somewhat clearer distinction between the raw and the ECC-Q ensembles, which differ in their marginal distributions but have the same rank correlations.
The random-Q ensemble, however, scores almost as well as the ECC-Q ensemble under the ES, despite its doubtful assumption of spatial independence. Under the VS-0.5 and VS-1, on the contrary, the random-Q ensemble fares distinctly worse than the ECC-Q ensemble and even has negative skill for lead times larger than 24 h. Those two ensembles yield identical forecasts at each location individually, but their components have different rank correlations; again, the VS-*p* detects those differences more clearly.
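The skill-score construction used in Table 1 is straightforward to reproduce. A minimal sketch of the empirical energy score and the generic skill-score formula (function names are ours):

```python
import numpy as np

def energy_score(obs, ens):
    """Empirical energy score of an (m x d) ensemble against a d-vector obs:
    mean Euclidean distance to the observation minus half the mean
    pairwise distance between ensemble members."""
    m = ens.shape[0]
    term1 = np.mean(np.linalg.norm(ens - obs, axis=1))
    pair = np.linalg.norm(ens[:, None, :] - ens[None, :, :], axis=2)
    return float(term1 - pair.sum() / (2.0 * m * m))

def skill_score(score, ref_score):
    """Skill relative to a reference forecast (here the raw ensemble):
    1 is perfect, 0 matches the reference, negative is worse."""
    return 1.0 - score / ref_score
```

Averaging `energy_score` over all verification days for each method and inserting the averages into `skill_score` with the raw ensemble as reference reproduces the ESS; the variogram skill scores are obtained in the same way.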

## 5. Discussion

In their recent review on probabilistic forecasting, Gneiting and Katzfuss (2014, p. 146) note as one out of eight key issues for future research that

“There is a pressing need for the development of decision-theoretically principled methods for the evaluation of probabilistic forecasts of multivariate variables.”

When the focus is on the correlation structure and the mean and covariance matrix of the predictive distribution are given in closed form, the DSS is an excellent choice. The examples in section 3 show, however, that the usage of this score can be problematic when the probabilistic forecasts are represented by an ensemble of limited size, and empirical versions of the predictive mean vector and covariance matrix have to be used. In spite of being proper, the DSS can then lead to entirely wrong conclusions about predictive performance, which suggests that this scoring rule is far from being *fair* in the sense of Fricker et al. (2013). In this paper, we have presented a new class of multivariate scores based on powered differences between pairs of components of the multivariate quantity, denoted as variogram scores of order *p* (VS-*p*). In our simulation studies the VS-*p* was also negatively affected by the sampling error due to representing the predictive distribution by a (possibly small) ensemble. In the majority of cases, however, it led to the correct conclusions about predictive performance, which suggests that it is much closer to being *fair* than the DSS. Moreover, it is more successful than the ES in distinguishing forecasts with different correlation structures. Three different choices of the power *p* were studied for the VS-*p*, and it was found that the best results are obtained with p = 0.5, while p = 2 was clearly suboptimal. Would a VS-*p* with p < 0.5 have even better properties? At least for Gaussian predictive distributions, a square root transformation is likely already the best choice, since the distribution of the square-root-transformed absolute differences is almost perfectly symmetric and thus has much better sampling properties than the strongly skewed distribution that comes with the choice p = 2 (Cressie and Hawkins 1980). If the predictive distribution itself is already skewed, however, then smaller powers may indeed be favorable to obtain a near-symmetric distribution of the powered differences.
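The Cressie and Hawkins (1980) argument can be checked numerically: for Gaussian differences, the square-root transform of the absolute differences is nearly symmetric, while squaring yields a strongly right-skewed, chi-squared-like distribution. A quick sketch:

```python
import numpy as np

def skewness(x):
    """Sample skewness (third standardized central moment)."""
    x = x - x.mean()
    return float(np.mean(x ** 3) / np.mean(x ** 2) ** 1.5)

rng = np.random.default_rng(0)
diff = rng.normal(size=200_000)   # differences of Gaussian components are Gaussian

skew_half = skewness(np.abs(diff) ** 0.5)   # p = 0.5: nearly symmetric
skew_two = skewness(np.abs(diff) ** 2.0)    # p = 2: strongly right skewed
```

The theoretical skewness of the square-root-transformed absolute differences is below 0.1, whereas the squared differences follow a chi-squared distribution with one degree of freedom, whose skewness is about 2.8.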

In section 4, we considered a data example with statistically postprocessed wind speed forecasts. Scoring rules in general, and the VS-*p* in particular, may, however, also be useful diagnostic tools in the development process of ensemble prediction systems. In the context of data assimilation, for example, it is important that the ensemble adequately represent the variances and covariances between different variables at different locations. Comparing different ensembles via scoring rules rather than empirical covariances (averaged over a certain time period) has the advantage that the former evaluate every time point separately and average the scores rather than the covariances, which is more adequate when those covariances are flow dependent. Moreover, if the scores are normalized in a reasonable way (for the VS-*p* this could be done by requiring that the weights sum to one on each day), even the space dimension may change over time, and averaging the corresponding scores would still make sense. If the distribution of the observation errors is known, those errors can be taken into account by simulating a sample of them and adding them to the ensemble forecasts. The empirical version of the VS-*p* then becomes

$$ \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij}\left( \left|y_i - y_j\right|^{p} - \frac{1}{mM}\sum_{k=1}^{m}\sum_{l=1}^{M} \left|x_i^{(k)} + \varepsilon_i^{(l)} - x_j^{(k)} - \varepsilon_j^{(l)}\right|^{p} \right)^{2} $$
and by choosing *M*, the number of simulated observation error vectors, large enough, one can reduce at least part of the additional variability that is introduced into the score. It remains to be seen whether the signal-to-noise ratio in those applications is large enough for this score to remain sufficiently discriminative.

We think that the class of VS-*p* proposed here is a useful contribution toward addressing the above-mentioned research issue of decision-theoretically principled methods for multivariate forecast evaluation. It has certain limitations, resulting from the fact that it is not *strictly* proper, as discussed in section 2. Given the strong increase in the number of degrees of freedom with the dimension of the quantity to be forecast, however, it is unlikely that there exists a single multivariate score that serves all purposes. We strongly recommend that several different scores always be considered before drawing conclusions. Some of the limitations of the VS-*p* can be addressed by studying the ES (which is more sensitive to misspecifications of the predictive mean and less affected by the finite representation of the predictive distribution) or univariate scores for the marginal distributions alongside our VS-*p*. Focusing on differences between components is probably the most natural, but by no means the only, transformation of the multivariate quantity that leads to a multivariate score that is sensitive to correlations between components. In some applications, studying composite quantities like minima, maxima, or averages over several locations or lead times (Berrocal et al. 2007; Feldmann et al. 2015), or indexes that involve multiple quantities (Wilks 2014), is a natural way to turn multivariate quantities into univariate ones that can be evaluated by standard univariate scores. In this way, specific (and practically relevant) aspects of the multivariate predictive distribution can be evaluated, and this sort of verification is another recommended supplement to general-purpose multivariate scores like the ES or the VS-*p* presented here.

## Acknowledgments

The authors thank Tilmann Gneiting, Martin Leutbecher, and two anonymous reviewers for useful discussions and comments on the manuscript. This research was performed while the first author held a National Research Council Research Associateship Award at NOAA's Earth System Research Laboratory. The publication was partially supported by a NOAA/Office of Weather and Air Quality (OWAQ) USWRP grant.

## REFERENCES

*Geostatistics,* M. Armstrong, Ed., Quantitative Geology and Geostatistics, Vol. 4, Springer, 77–89.

Matérn, B., 1986: *Spatial Variation.* 2nd ed. Lecture Notes in Statistics, Vol. 36, Springer-Verlag, 151 pp.

*J. Comput. Graph. Stat.,* in press.

*Electron. J. Stat.,* **8,** 2619–2638.

## Footnotes

Supplemental information related to this paper is available at the Journals Online website: http://dx.doi.org/10.1175/MWR-D-14-00269.s1.