## 1. Introduction

Weather, climate, and hydrologic forecasters and other users need to be knowledgeable about the capabilities and limitations of different numerical weather prediction (NWP) models. Model developers must decide whether recent updates to a model have resulted in improved performance. Administrators often wish to know which forecast system is best. For all these reasons, common statistical measures have been used to summarize forecast performance for over a century. The focus of the current paper is on continuous forecast variables and continuous verification measures, rather than on categorical verification, such as is presented in Hamill (1999).

Often, competing NWP models are compared by examining individual error statistics for each and declaring the one with the best value of the statistic as the best model (e.g., Liu et al. 2014). This approach does not incorporate statistical uncertainty: it relies on a point estimate formed from one specific sample when, in reality, a distribution of values is possible over all possible samples. To make more accurate inferences about the true performance of an NWP model, the uncertainty in the estimate should be included. More recently, some model comparisons have included estimates of sampling uncertainty through confidence intervals on verification statistics (e.g., Im et al. 2006; Wolff et al. 2014). Thus, users can more accurately assess the statistical significance of any differences they see in standard verification measures. The intervals also explicitly communicate the uncertainty in verification statistics and reinforce that each is an estimate, rather than a measurement.

Though many verification statistics can be compared via parametric confidence intervals, other statistics benefit from a nonparametric interval estimate. In particular, bootstrap resampling methods have been investigated and recommended for many different verification statistics (Wilks 1997). The bootstrap is often seen as a panacea for hypothesis testing that is not burdened with assumptions like those necessary for the parametric tests. However, bootstrap resampling carries its own set of assumptions that need to be checked. For example, the usual procedure assumes that the data are independent and identically distributed (IID), an assumption that is violated in most forecast verification analyses. When the sample has temporal dependence, a circular block bootstrap is recommended (Paparoditis and Politis 2001; see section 2b for more details). Once a specific resampling procedure is executed, a sample of the statistic of interest is obtained, and it is from this sample that confidence intervals are derived. Various methods have been proposed for calculating these intervals, and each has its own set of properties.

Model comparisons are best completed using an identical set of cases. This matched comparative verification reduces the dimensionality of the problem (Murphy 1991) and increases the statistical power of the test when the matches are positively correlated. In particular, it allows for paired tests of the significance of the difference between two statistics while eliminating intracase variability. Compared to using confidence intervals on separate statistics for two models, the single interval around the difference provides more power to detect differences and will often give a different result than the use of two intervals (Jolliffe 2007).

Samples of forecasts are rarely independent of one another, particularly when nearby in space or time. In this case, standard confidence intervals are too narrow, as the observations are effectively partial repeats of other observations, resulting in a smaller effective sample size. To account for some of the dependence, the variance inflation factor (VIF; Kutner et al. 2004) has been applied to the variance estimate that is incorporated into the confidence interval calculation. The result is an increase in the interval width to reflect a more appropriate level of uncertainty.
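The widening described above can be sketched in a few lines. The example assumes the common AR(1) form of the inflation factor, (1 + r1)/(1 - r1), where r1 is the lag-1 sample autocorrelation; that form, and the function names `lag1_autocorr` and `mean_ci_with_vif`, are assumptions made for illustration and may differ from the estimator used in the cited work.

```python
import math
from statistics import NormalDist, fmean

def lag1_autocorr(x):
    """Lag-1 sample autocorrelation of a sequence."""
    m = fmean(x)
    num = sum((a - m) * (b - m) for a, b in zip(x[:-1], x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den

def mean_ci_with_vif(x, alpha=0.05):
    """Normal-approximation CI for the mean, with the variance inflated
    by an assumed AR(1) factor (1 + r1) / (1 - r1)."""
    n = len(x)
    m = fmean(x)
    var = sum((a - m) ** 2 for a in x) / (n - 1)
    r1 = lag1_autocorr(x)
    vif = (1 + r1) / (1 - r1)          # assumed AR(1) form
    se = math.sqrt(vif * var / n)      # inflated standard error
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return m - z * se, m + z * se, vif
```

For a positively autocorrelated series the factor exceeds 1, so the interval is wider than the one obtained under the IID assumption.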

Diebold and Mariano (1995) first introduced a test (denoted DM) to compare the accuracy of two competing sets of forecasts, whereby a weighted average of the empirical autocovariance function is used to obtain an estimate of the variance that accounts for the temporal dependence. This test was not intended to be the only method used to compare statistical models because there are optimal finite-sample tests that can be used to compare model fits, such as *F* tests for comparing nested Gaussian linear models or model selection criteria such as the Bayesian information criterion (BIC; Schwarz 1978). Indeed, Hansen et al. (2011) introduce a set of tests wherein many pairwise comparisons among the BICs of models fit to the full sample can be made. However, the DM test is still appropriate in many situations. When forecasts are not produced from a statistical model such that the parameters controlling the model fit cannot be identified, then tests that do not involve a likelihood, such as the DM test, are the only options available to compare forecasts. Examples of such forecasts include NWP models, expert forecasts, or cases when the model is proprietary and unknown. In addition, most full-sample, model-comparison procedures rely heavily on one-step-ahead quadratic loss; however, other asymmetric loss functions are oftentimes of primary interest (Hering and Genton 2010, 2011; Gilleland 2013). Diebold (2015) gives a nice review of instances in which the DM test is appropriate and when it is not.

The alternative to full-sample model comparisons is to do out-of-sample forecasting, wherein the data are split into a training and testing set: a statistical model is fit on the training set, and forecasts are produced for the testing set. While full-sample model comparisons are more powerful than out-of-sample testing procedures (Hansen and Timmermann 2015), for a fixed number of parameters, model selection criteria will always select the most heavily mined model. On the other hand, by performing out-of-sample forecasting, it becomes more difficult to overfit the model on the training data, as long as “strategic data mining” in the choice of the split between training and testing sets can be avoided. Thus, even if the models are known a priori and if quadratic loss is used to summarize forecasts, DM-type forecast comparison tests can still be appropriate. In this vein, West (2006) and Clark and McCracken (2013) make assumptions about the models themselves and propose more accurate tests for finite samples, but ultimately, the asymptotic normality assumption of the DM test is still reliable even in this setting (Diebold 2015).

Given the suite of testing procedures, it is of interest to know which tests are the easiest to implement and which are the most accurate. While substantial work has already been carried out in various fields to determine these characteristics, there are special circumstances in weather forecast verification and climate model evaluation that have received less attention. Of particular interest in the present work is the testing of competing forecast models against the same observation series. In this setting, contemporaneous correlation (denoted by *ρ* herein), where the two forecast models are correlated with each other, is an important attribute to consider (cf. Hering and Genton 2011; DelSole and Tippett 2014). To see why contemporaneous correlation is an issue in the comparative forecast verification setting, see the appendix, where it is shown that this correlation affects the variance of the loss differential series. Additionally, the strength and type of dependence may affect results for many of the approaches, even those that account for dependence. For example, the VIF assumes a simple autoregressive structure, and the estimate is made based on the order of the autoregressive process. In many cases, an order-1 process is assumed for simplicity, so it is of interest to see how such a test may be impacted if the order is higher.
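For simple loss, the role of contemporaneous correlation follows from a standard variance identity. As a sketch, write $e_{1t}$ and $e_{2t}$ for the two error series, with variances $\sigma_1^2$ and $\sigma_2^2$ (this notation is assumed here for illustration and is not taken from the appendix):

$$\operatorname{Var}(e_{1t} - e_{2t}) = \sigma_1^2 + \sigma_2^2 - 2\rho\,\sigma_1\sigma_2 .$$

A procedure that treats the two series as independent omits the last term, so when *ρ* > 0 it overstates the variance of the difference.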

Hering and Genton (2011) propose a modification to the test introduced by Diebold and Mariano (1995), and they test their approach against the latter under varying sample sizes, strengths of dependence, and strengths of contemporaneous correlation. This work follows the same path using a more comprehensive set of simulation scenarios and also applies several more commonly used tests in weather forecast verification studies. Each test requires a different set of assumptions that may not be met for a particular scenario, and it is important to understand how the tests may be affected in these cases. For example, one test accounts for temporal dependence but assumes a particular model for that dependence. Furthermore, in testing for power, Hering and Genton (2011) consider the case of different means of two error series. Often, weather forecasts are calibrated to have the same means in weather forecast verification. Therefore, we test for power when the means are both zero, but the variability of one error series is higher than the other. It is hoped that the results of this work will help advise users about which methods are most robust to the types of series that are often tested in the forecast verification setting.

## 2. Review of hypothesis testing and confidence intervals

Wilks (1997, 2011), Hamill (1999), Jolliffe and Stephenson (2011), Jolliffe (2007), and Gilleland (2010) all give considerable background on hypothesis testing and confidence intervals. A brief review is given here for completeness and to establish the notation and terminology that will be used.

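Consider a null hypothesis $H_0$: *ζ* = 0, where, for example, *ζ* may relate to the differences in mean error (ME) between two forecasts (simple loss). One might obtain an estimate of *ζ* that is close to zero, but to judge whether it is close enough to retain $H_0$, a test statistic *S* must be calculated that incorporates the uncertainty information along with a direct estimate of the parameter. For example, a typical test statistic is

$$S = \frac{\hat{\zeta} - \zeta_0}{\widehat{\mathrm{se}}(\hat{\zeta})}, \qquad (1)$$

where $\hat{\zeta}$ is an estimate of *ζ*, $\zeta_0$ is the value of *ζ* under $H_0$, and $\widehat{\mathrm{se}}(\hat{\zeta})$ is the estimated standard error of $\hat{\zeta}$. The normal approximation test (*z* test) assumes that *S* follows a standard normal distribution under $H_0$.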

Note that a statistical hypothesis test does not answer the question of whether or not the value

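There are two ways in which the decision based on the test result can be wrong. The decision could be to reject $H_0$ when it is true (a type I error) or to fail to reject $H_0$ when it is false (a type II error). The probability of the first type of error is controlled by keeping *α* low, where *α* is referred to as the significance level for the test. The second type of error is typically handled by trying to maximize the power of the test, that is, the probability of rejecting $H_0$ when it is false.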

Considering the test statistic *S* in Eq. (1) again, the probability of observing a test statistic at least as large in absolute value as the one observed, assuming that $H_0$ is true, can be calculated from the assumed distribution of *S*. This probability is called the *p* value of the test. If the *p* value is less than the size or significance level of the test, then $H_0$ is rejected. If the *p* value is nearly 0 (1), one can be comfortable in rejecting (failing to reject) $H_0$. If the *p* value is modest, say, 0.17, then the situation is less clear, but it allows for a conversation about the likelihood of a difference that can be used as evidence, even if no concrete decision can be made. The choice of significance level is one that must be made by the experimenter a priori, and it should take into consideration not just the size of the test, but also the power. In many situations relevant to testing for a slightly modified new weather forecast model, the size of the test should be increased in order to increase the power, although if the sample size is large enough, then smaller test sizes may be reasonable. The usual choices, such as 0.01 and 0.05, were originally made before the advent of computers in order to reduce the number of probability distribution tables that needed to be made (Lehmann 1986, p. 69). For a recent, detailed description of the *p* value, and for the American Statistical Association’s statement on *p* values, see Wasserstein and Lazar (2016).
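As a minimal sketch of the calculation for a statistic that is assumed to be standard normal under the null hypothesis (the *z* test case), a two-sided *p* value and the resulting decision can be computed as follows. The function names are illustrative only.

```python
from statistics import NormalDist

def two_sided_p_value(s):
    """Probability that |Z| >= |s| for a standard normal Z."""
    return 2 * (1 - NormalDist().cdf(abs(s)))

def reject_null(s, alpha=0.10):
    """Reject the null hypothesis when the p value falls below alpha."""
    return two_sided_p_value(s) < alpha
```

A statistic near 1.96 gives a *p* value near 0.05, so it leads to rejection at the 10% level but sits at the boundary for the 5% level.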

A confidence interval (CI) is considerably more useful than a hypothesis test. A separate hypothesis test must be made for each hypothesized value of the parameter. For example, after testing whether or not *ζ* = 0, testing a different value requires a new test. A (1 − *α*) × 100% CI, by contrast, corresponds to an *α*-level significance test, and to test for a different value, the same CI can be used. Moreover, the length of the interval gives a good impression of the amount of uncertainty in the data, which a hypothesis test does not as readily convey.

One drawback of the hypothesis testing and CIs described above is that the true parameter of interest is assumed to be fixed but unknown. This paradigm leads to the somewhat awkward interpretation of the *p* value and CI. That is, a *p* value is not a probability that the true value *ζ* is equal to the value of interest, but rather a probability statement about the statistic *S*. Similarly, a CI is not an interval that has a (1 − *α*) × 100% probability that *ζ* lies in the interval. Instead, the interpretation of a CI is that if the experiment were repeated 100 times, then one can expect that the true parameter lies within about (1 − *α*) × 100 of the resulting intervals.

The approach described here is known as the frequentist approach, and it is possible to obtain similar tests that have a more satisfying interpretation at the expense of further assumptions. In particular, the Bayesian approach assumes that the parameter of interest itself is a random variable whose distribution must be assumed a priori. Information from observed data is used to update knowledge about the parameter’s distribution using Bayes’s formula to arrive at what is known as the posterior distribution. From this posterior distribution, a wealth of information concerning the unknown parameter can be gleaned. The approach can be very useful, provided the prior distribution is reasonable or that the posterior distribution is insensitive to the prior choice. Fiducial inference, which was introduced by Fisher (1934), is a frequentist approach that uses an invariance argument to establish a result that is operationally equivalent to Bayesian inference, but justified by what we would now call a conditional inference argument. However, that argument applied only to location-scale families, and Fisher spent much of the rest of his career trying to show that the same reasoning would apply in more general parametric families (R. L. Smith 2008, personal communication). More recently, interest in the fiducial framework has been reignited, and rigorous frequentist properties have been established for wide classes of fiducial procedures that put them on a sound mathematical basis from a frequentist perspective (e.g., Hannig 2009; Lidong et al. 2008; Hannig et al. 2006, 2007; Wandler and Hannig 2012). The focus of this paper, however, is on the more commonly used frequentist tests only.

### a. Loss functions

Here, three loss functions are considered for summarizing forecast accuracy: simple-, absolute- (AE), and square-error (SE) loss; the result amounts to testing the differences in mean error, mean absolute error (MAE), and mean square error (MSE) between the forecast and observed series. The series of simple, AE, or SE differences are called the loss differential series.

With simple loss, it is possible for a series with, for example, larger errors to test out as being better than a series with smaller errors because of the potential for positive errors to cancel out with negative ones. In this regard, AE and SE loss generally provide more useful tests, but ultimately the summary of interest will depend on the users’ needs.
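The three loss differential series can be sketched as below. The example assumes that the error is defined as forecast minus observation and that the differential is the loss of the first forecast minus the loss of the second; the function name `loss_differential` is hypothetical.

```python
def loss_differential(obs, fcst1, fcst2, loss="absolute"):
    """Point-by-point loss differential between two forecasts of the same
    observations: g(error 1) - g(error 2), with error = forecast - observation."""
    losses = {
        "simple": lambda e: e,       # mean of the series is the ME difference
        "absolute": abs,             # mean of the series is the MAE difference
        "square": lambda e: e * e,   # mean of the series is the MSE difference
    }
    g = losses[loss]
    return [g(f1 - o) - g(f2 - o) for o, f1, f2 in zip(obs, fcst1, fcst2)]
```

With observations `[0, 0]`, a first forecast of `[3, -3]`, and a second forecast of `[1, 1]`, simple loss gives `[2, -4]` (mean −1, which favors the first forecast because its errors cancel), whereas absolute loss gives `[2, 2]` (which favors the second forecast).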

### b. Review of some specific testing procedures

Considering Eq. (1) again, recall that a distribution for *S* must be assumed in applying standard parametric tests. The different tests considered are tantamount to making different assumptions about the distribution of *S*, and some differ in how the standard error of *ζ* is estimated. An exception to this rule is that two types of test statistics might be considered in the context of comparing two different forecast models to the same observed series. One is known as a two-sample test, and the other is a paired test. The two-sample test is a test on the difference in the sample means of the two error series, whereas a paired test is a test on whether or not the point-by-point sample differences, on average, are different from zero.

The test procedures considered here include (i) paired and two-sample *t* tests; (ii) paired and two-sample *z* tests; (iii) paired and two-sample *z* tests with AR(1) VIF applied; (iv) paired bootstrap procedures under varying assumptions; and (v) the Hering–Genton (HG) test (Hering and Genton 2011), which are both paired tests.
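The distinction between the two-sample and paired statistics can be made concrete with a short sketch. Both functions below use the normal-approximation (*z*) form with sample variances and assume IID data; the function names are illustrative.

```python
import math
from statistics import fmean, variance

def two_sample_z(a, b):
    """Difference in sample means, with the two series treated as independent."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (fmean(a) - fmean(b)) / se

def paired_z(a, b):
    """Mean of the point-by-point differences, scaled by their own standard error."""
    d = [x - y for x, y in zip(a, b)]
    return fmean(d) / math.sqrt(variance(d) / len(d))
```

For two loss series that move together, for example `a = [1.0, 2.0, 3.0, 4.0, 5.0]` and `b = [0.8, 1.9, 2.7, 3.9, 4.6]`, the paired statistic is far larger than the two-sample statistic, because differencing removes the variability the two series share.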

#### 1) Normal and Student’s *t* tests

The rationale for the normal distribution comes from the central limit theorem, which provides justification for assuming a normal distribution for means of random variables. For smaller sample sizes, scaling issues with the variance often lead to violations in the normal assumption, and the Student’s *t* test has been shown to be a better approximation. When the sample size reaches about 30, the quantiles of the Student’s *t* distribution are approximately equal to those of the normal distribution so that the *z* test can generally be applied for sample sizes of 30 or larger. Without VIF, there is an assumption that the sample is realized from an IID population. A violation of this assumption will generally lead to tests with standard error terms that are too small when observations are positively correlated, subsequently leading to larger values of *S*, which are in turn found to be less likely under $H_0$.

#### 2) Bootstrapping

Bootstrap tests do not assume a specific distribution for *S* and instead let the data speak for themselves. The primary assumption is that the relationship between the sample distribution and that of the population is consistent with the relationship between the bootstrap resampled distribution and the sample distribution. Assumptions about *S* are nevertheless implicit in the bootstrap procedures, and most of the different techniques are aimed at relaxing those assumptions.

Many good reviews of the IID bootstrapping procedures are available (e.g., Davison and Hinkley 1997; Efron and Tibshirani 1998; DiCiccio and Efron 1996). Lahiri (2003) gives an excellent overview of the bootstrap procedure for dependent data, along with a concise, but thorough, review of the IID case. Gilleland (2010) provides an accessible review in the context of forecast verification. While one can calculate a *p* value using this paradigm, more often, confidence intervals are desired, and the following discussion uses that terminology.

In perhaps its simplest form, the IID bootstrap procedure is carried out as follows. Given a series of observations **x** = ($x_1, \ldots, x_n$):

1. Draw a sample of size *n* with replacement from **x**, say, $\mathbf{x}^*$.
2. Estimate the parameter(s) of interest from the resampled data, denoted $\hat{\zeta}^*$, using the sample $\mathbf{x}^*$.
3. Repeat steps 1 and 2 *B* times to obtain a sample of the estimate(s).
4. Calculate confidence limits from the sample obtained in step 3.
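The four steps above can be sketched as follows, with the confidence limits in step 4 taken as simple order-statistic quantiles of the resampled estimates (a percentile-type interval). The function name, the default arguments, and the particular quantile rule are choices made for this illustration.

```python
import random
from statistics import fmean

def iid_bootstrap_ci(x, stat=fmean, B=1000, alpha=0.10, seed=None):
    """Percentile-type (1 - alpha) interval for stat(x) from an IID bootstrap."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(B):                    # step 3: repeat B times
        resample = rng.choices(x, k=n)    # step 1: draw n values with replacement
        estimates.append(stat(resample))  # step 2: re-estimate the parameter
    estimates.sort()
    # step 4: limits from the alpha/2 and 1 - alpha/2 order statistics
    lower = estimates[int((alpha / 2) * (B - 1))]
    upper = estimates[int((1 - alpha / 2) * (B - 1))]
    return lower, upper
```

Because each resampled mean is an average of observed values, the limits cannot fall outside the range of the data.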

The simplest, and perhaps most common, method for obtaining the confidence interval in step 4 is to use the percentile method, whereby the limits of a (1 − *α*) × 100% interval are the *q*th quantiles of the bootstrap sample of estimates for *q* = *α*/2 and *q* = 1 − *α*/2. This method is straightforward and works well if **X** are IID and the estimator of *ζ* is unbiased. In fact, it has been shown that for estimating the mean, the bootstrap interval is more accurate than the normal approximation method (Singh 1981). If the assumptions of unbiasedness and constant variance are not met, then the bias-corrected and adjusted (BCa) method will give highly accurate intervals. Because the percentile method intervals result as a special case of the BCa procedure when the assumptions are met, the BCa method is generally a better choice. However, because the BCa requires an additional set of resampling, it can be computationally expensive to calculate. Therefore, the percentile method is sometimes still preferred. Either interval is range preserving (i.e., it is impossible to obtain limits for *ζ* that are not in the natural range of the parameter) and transformation respecting [i.e., if a monotonically increasing function *f* is applied to the limits, the resulting interval is appropriate for *f*(*ζ*)].

A further option is the bootstrap-*t*, or Studentized, method. It requires estimation of the standard error for each parameter of interest at each iteration of the bootstrap so that in step 2 of the algorithm, a standard error is estimated from each resample along with the parameter estimate.

Another interval, which will not be considered here, is the normal approximation bootstrap interval. It requires the normality assumption and is given by

Of course, if the series **X** is not independent, then the IID bootstrap procedure will not give valid confidence limits and, subsequently, will not yield an appropriately sized hypothesis test. In such a circumstance, a parametric bootstrap may be applied, but it is simpler and more customary to apply a block bootstrap procedure. In such a procedure, blocks of consecutive values, rather than individual values, are resampled with replacement so that the dependence within each block is retained.
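A circular block (CB) resample can be sketched as below: blocks of consecutive values are drawn with replacement, and a block that runs past the end of the series wraps around to its beginning. The function names are hypothetical, and the block length is left to the caller because no particular rule is assumed here.

```python
import random

def circular_block_resample(x, block_length, rng):
    """One circular block bootstrap resample with the same length as x."""
    n = len(x)
    out = []
    while len(out) < n:
        start = rng.randrange(n)  # blocks may start anywhere in the series
        out.extend(x[(start + j) % n] for j in range(block_length))  # wrap around
    return out[:n]                # trim the final block to length n

def cb_bootstrap_ci(x, stat, block_length, B=1000, alpha=0.10, seed=None):
    """Percentile-type interval computed from circular block resamples."""
    rng = random.Random(seed)
    est = sorted(stat(circular_block_resample(x, block_length, rng))
                 for _ in range(B))
    return est[int((alpha / 2) * (B - 1))], est[int((1 - alpha / 2) * (B - 1))]
```

With a block length of 1 the procedure reduces to the IID bootstrap, which is one way to check an implementation.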

The choice for the number of resamples *B* is made by attempting to find the smallest *B*, for computational efficiency, beyond which a higher value will not change the results substantially. To obtain all possible combinations of a data series of size *n* would require $n^n$ resamples, a number that grows very quickly with *n*.

#### 3) DM and HG tests

The DM test estimates the variance of the mean loss differential from the sample autocovariances of the loss differential series, with the number of lags included tied to the horizon of the *k*-step-ahead forecast. These sample autocovariances are computed for each lag *τ*, but in practice, their estimator sometimes results in a negative value, and the sum over all lags of the empirical ACF is identically zero. The HG estimator avoids these problems by fitting a parametric covariance model to the empirical ACF that is guaranteed to be positive definite. In practice, an exponential model fits many empirical ACFs very well, but other choices can be made depending upon the structure of the ACF. To obtain the most appropriate parametric model, the model is fit only to lag correlations that are shorter than about half the maximum lag for the observed series, while the final estimate of the standard error is summed over all lags from the modeled ACF.
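A DM-type statistic can be sketched as follows. The sketch assumes a rectangular window, that is, equal weights on the sample autocovariances up to a user-chosen `max_lag`, and a standard normal reference distribution; the function names and the choice of `max_lag` are assumptions of this illustration. It does not implement the HG parametric fit.

```python
import math
from statistics import NormalDist, fmean

def autocovariance(d, lag):
    """Sample autocovariance of d at the given lag (divisor n)."""
    n = len(d)
    m = fmean(d)
    return sum((d[t] - m) * (d[t - lag] - m) for t in range(lag, n)) / n

def dm_test(d, max_lag=0):
    """DM-type z test that the mean of the loss differential d is zero."""
    n = len(d)
    # variance estimate: lag-0 term plus twice the autocovariances up to max_lag
    lrv = autocovariance(d, 0) + 2 * sum(
        autocovariance(d, k) for k in range(1, max_lag + 1))
    if lrv <= 0:
        # the empirical estimate can fail to be positive
        raise ValueError("non-positive variance estimate")
    s = fmean(d) / math.sqrt(lrv / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(s)))
    return s, p_value
```

The `ValueError` branch marks the situation in which the empirical estimate is not positive, which is the problem that a fitted positive definite covariance model is designed to avoid.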

## 3. Simulations

To gauge the effect of contemporaneous correlation on the hypothesis testing procedures, as well as differing serial dependence structures, two types of simulations are considered. The first, described in section 3a, induces serial dependence by way of a simple moving average model that makes it easy to vary the strength of the dependence while also varying the contemporaneous correlation between two series. Beyond this exercise, differing dependence structures are examined between uncorrelated simulated error series in order to evaluate whether, and if so, to what extent, certain dependence structures affect each testing procedure. Section 3b describes simulations created for this purpose.

### a. MA(1) simulations

To compare results here with those from Hering and Genton (2011), the same first-order moving average [MA(1)] simulation study is conducted, which varies the sample size, range of temporal dependence, and contemporaneous correlation of two sets of time series. Hering and Genton (2011) found that the HG test was superior to the DM test in size and that the HG test is a powerful test. However, in their testing procedure for power, they varied the mean and not the variance of the second error series. Because weather forecasts are often calibrated prior to verification studies, it makes sense to study power in this setting by varying the variance of the second series. Therefore, power will be tested again, under this paradigm, for the HG test. Figure 1 (top row) shows an example of a simulated MA(1) series along with its ACF and PACF plots.
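One way to generate a pair of MA(1) series with a chosen contemporaneous correlation is sketched below. The construction (Gaussian innovations with correlation `rho`, passed through the same MA(1) filter with coefficient `theta`) is an assumption made for illustration and is not necessarily the exact design of Hering and Genton (2011).

```python
import math
import random

def simulate_ma1_pair(n, theta, rho, rng):
    """Two MA(1) series of length n whose innovations have correlation rho."""
    eps1, eps2 = [], []
    for _ in range(n + 1):  # one extra innovation for the lagged term
        u = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        eps1.append(u)
        eps2.append(rho * u + math.sqrt(1.0 - rho ** 2) * v)
    # e_t = eps_t + theta * eps_{t-1}
    e1 = [eps1[t] + theta * eps1[t - 1] for t in range(1, n + 1)]
    e2 = [eps2[t] + theta * eps2[t - 1] for t in range(1, n + 1)]
    return e1, e2
```

Under this construction both series share the same MA(1) dependence, and their contemporaneous correlation equals `rho` because the same filter is applied to both sets of innovations.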

### b. Other dependence structures

*z* test with VIF used herein, is given by

Obtaining the necessary multiplication factor for the ARMA(2, 2) process is considerably more complicated to spell out, but it is easily found using, for example, one of the methods described in Brockwell and Davis (2010, chapter 3), and for the specific values used in this study, it is about one-third.

The AR(2) and ARMA simulations are carried out using the arima.sim function from the R (R Core Team 2014) package stats. The parameters used for the simulated series are the following:

- AR(2):
, . - ARMA(1, 1):
, . - ARMA(2, 2):
, , , .

All simulations are performed using 1000 replications. For example, if the sample size under study is 100, then 1000 samples of size 100 are simulated. This number is chosen arbitrarily but should be sufficiently large as to yield robust results.

Because the paired tests are conducted on the loss differentials, one might think that perhaps the temporal dependence is removed through the differencing. However, such dependence is not removed and generally takes on a form that is very similar to the two series being differenced; when the two series have different dependence structures, the result is generally a series with a combination of the structures. For a discussion on properties of the loss differential series, including the resulting dependence structures, see the appendix. The *z* test with VIF applied assumes a particular AR(*p*) dependence model, and most often the simplest choice with *p* = 1 is made.

## 4. Results

In this section, the MA(1) model described in section 3a is simulated to have contemporaneous correlation of

Each simulation is performed 1000 times for each combination, and each of the seven tests (two-sample and paired *t* tests, normal approximation *z* test with and without VIF applied, the IID and CB bootstrap methods, and the HG test) with significance level *α* is applied, yielding a binary sample of 1000 giving 0 if $H_0$ is not rejected and 1 if it is. If a test is accurate, then about *α*% of the 1000 tests will be 1. The actual percentage of times a test results in 1 gives the empirical size for the testing procedure, which should be close to *α* if the test is accurate. Empirical size results are given as percentages, and the closer the value is to *α*, the better the test. For testing empirical power,
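The empirical size calculation can be sketched for a single test as below. For simplicity, the example draws an IID Gaussian loss differential with true mean zero directly, rather than building it from two simulated error series, and applies a paired *z* test; both simplifications, and the function names, are assumptions of this illustration.

```python
import math
import random
from statistics import NormalDist, fmean, variance

def paired_z_rejects(d, alpha):
    """True if a two-sided paired z test on d rejects a zero mean at level alpha."""
    s = fmean(d) / math.sqrt(variance(d) / len(d))
    return abs(s) > NormalDist().inv_cdf(1 - alpha / 2)

def empirical_size(n, alpha=0.10, reps=1000, seed=0):
    """Percentage of rejections over reps simulated series for which
    the null hypothesis holds."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        d = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true mean is zero
        rejections += paired_z_rejects(d, alpha)     # adds 1 for a rejection
    return 100 * rejections / reps
```

With `alpha=0.10`, a returned value near 10 indicates an accurately sized test; values well above or below 10 indicate an oversized or undersized test, respectively.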

In terms of size, the level of the test is generally found to not be an important factor, so only the 10% level results are displayed for brevity, except when a marked difference does occur. For tests that have appropriate size and are competitive with the HG test, power is also tested empirically. Otherwise, if a test does not have appropriate size, it should not be considered further. Furthermore, results are only shown for AE loss, as they are similar to those for SE loss. Comments are made when differences do occur.

### a. Size

#### 1) MA(1) simulations

To test for the effects of sample size, range of temporal dependence, and contemporaneous correlation, the MA(1) simulations conducted in Hering and Genton (2011) for the DM and HG tests are employed. Hering and Genton (2011) already determined that the HG test has good empirical size using these simulations, better than the DM test. Therefore, it remains to compare the other, more commonly used tests that were not compared by Hering and Genton (2011).

Figures 2–4 show the empirical size results for the MA(1) simulations, with the HG test results included for reference. The plots give the results using AE loss. Those for SE loss are found to be analogous, and the simple loss results mostly show a distinct lack of accuracy for tests, except for HG, where the results are analogous to those for AE and SE. Results for both IID and CB bootstrap methods are shown for the basic CI method. Results for the percentile method are analogous, while those for BCa are similar, but the accuracy is more variable. The closer a test’s empirical size is to the horizontal dotted line (through

When no temporal dependence and no contemporaneous correlation are present, all of the tests fare similarly in terms of accuracy, although the CB bootstrap is oversized by a larger amount than all other tests; its overage shrinks to about 2% for larger sample sizes, which suggests that its accuracy is diminished when temporal dependence is not present (cf. Fig. 2, top-left panel). With the introduction of moderate contemporaneous correlation, the two-sample *t* and both *z* tests become severely undersized, with an empirical size estimated at about half the value of *α*. The IID bootstrap, paired *t*, and HG tests do not appear to be greatly affected by the addition of contemporaneous correlation without temporal dependence, even when such correlation is strong. Indeed, both the paired *t* test and IID bootstrap are competitive with the HG test under contemporaneous correlation, provided no temporal dependence is present.

Although the paired *t* test fared well for accuracy in the face of contemporaneous correlation with no temporal dependence, the addition of moderate temporal dependence moves its empirical size away from *α*. The behavior of the *z* test without VIF applied is erratic in this case, as it is oversized by about 50% with no contemporaneous correlation (Fig. 3, top-left panel), has relatively good accuracy when contemporaneous correlation and temporal dependence are moderate (both at 0.5; Fig. 3, top-right panel), and has empirical size that is nearly zero in the case with strong contemporaneous correlation and moderate temporal dependence. Similarly, for simple loss (not shown), the only scenario where the *z* test and two-sample *t* test are accurate occurs when the contemporaneous correlation is set to

The *z* test with VIF applied is grossly inadequate for these MA(1) simulations, having nearly zero test size for all but the case of no correlation, where it is reasonable, and under moderate contemporaneous correlation without temporal dependence, where it is still undersized by about half of *α*.

Regardless of the scenario, none of the tests are very accurate for small sample sizes, except in the case where no type of dependence is introduced, where both *t* tests are the most accurate for the two smallest sample sizes (*t* test still performs well for the smallest sample size tested of

The poor results for the *z* test + VIF method can be explained by the fact that the simulations are from an MA(1) rather than an AR(1) process, upon which the VIF applied here is based. While both series have a constant mean of zero and constant variances, they have very different correlation structures, and the VIF utilizes the correlation structure of the AR(1) to inflate the variance. In particular, an AR(1) series has correlation *ρ*. This result leads to an estimated AR(1) VIF,

#### 2) Other dependence structures

##### (i) Parametric tests

For the AR(2) simulations, the parametric testing procedures that assume IID data do not fare very well in terms of empirical size (results not shown). They are substantially oversized, especially under simple loss. On the other hand, tests that apply the VIF perhaps overcompensate because they are very undersized, and it is found that their power is also very low. Results for empirical size using the ARMA(1, 1) and ARMA(2, 2) models are analogous to those for the AR(2).

##### (ii) Bootstrap tests

Empirical size results (not shown) for the AR(2), ARMA(1, 1) and ARMA(2, 2) simulations show that because of the strong dependence in each of the simulated series, the IID bootstrap tests are oversized by a larger magnitude than the other tests. The CB bootstrap method yields tests that are rather oversized for very small sample sizes, but only oversized by a small amount as the sample size increases beyond about 100, and it approaches the correct size with increasing sample sizes. The type of dependence structure does not appear to affect the CB method, which is both desired and expected. Moreover, it clearly outperforms the parametric test with VIF applied in terms of size. It is reassuring that the CB bootstrap approach performs well under these scenarios.

### b. Power

As described in section 3, when checking for power, the assumption is that any bias will have been already removed from the forecast via calibration so that only the error variances are of interest here. In particular, while checking size properties under simple loss made sense, it is not possible to test for power under this paradigm for this loss function because *X* with mean *μ* and variance

#### 1) MA(1) simulations

Empirical power results are carried out for the case of *t* test (not shown), as it is found to have accurate empirical test size in this setting. A higher percentage of rejected tests is better because under the power simulations, the mean loss differential does not follow

Similar to the paired *t* test, the IID bootstrap procedure is found to have accurate size, regardless of contemporaneous correlation when no temporal dependence is introduced. The CB bootstrap is found to have reasonable accuracy, regardless of contemporaneous correlation or temporal dependence, but is nevertheless oversized. Therefore, power is tested only for the IID approach. With

The HG test (Table 1) is powerful when *ρ* and is above 80% for sample sizes as small as 16 when

Empirical power (%) results under AE loss with *ρ* is the contemporaneous correlation, and *θ* is the strength of dependence. Values of *ρ*(*θ*) closer to 1 result in stronger contemporaneous correlation (temporal dependence) and closer to 0 result in weaker correlation (dependence). Values of at least 80% are presented in boldface font.

Despite the fact that the CB bootstrap’s empirical size is consistently too high, for larger sample sizes, the amount of its overage is relatively small: on the order of about 2% over *α*. Therefore, its power is also tested here (not shown), and for sample sizes over 30, it is a powerful test.

#### 2) Other dependence structures

##### Bootstrap tests

The CB bootstrap method requires larger sample sizes to have any power for the AR(2) simulations. For *α* increases, reaching about 65% at

## 5. Conclusions

It is often necessary to compare two competing forecasts in terms of the differences in their verification summaries. The main objective of this paper is to test common statistical inference procedures for comparing competing forecast models in the face of both temporal dependence and contemporaneous correlation. Several testing procedures for comparing competing forecasts against the same set of observations are examined to determine their accuracy in terms of size and, when appropriate, power. The tests are conducted on simulated series with known types of temporal dependence, as well as with contemporaneous correlation between the competing models. The latter is an often overlooked issue that can have a large impact on results (cf. DelSole and Tippett 2014; Hering and Genton 2011; Li et al. 2016).

The simulated series with contemporaneous correlation are from the same MA(1) simulations as those applied by Hering and Genton (2011), which were used to test their method, the HG test, against the one proposed by Diebold and Mariano (1995). Because Hering and Genton (2011) do not include other, more common tests in their work, such testing is conducted here. In the interest of brevity, only the most commonly applied testing procedures are considered along with the HG test. In particular, the usual two-sample and paired *t* and *z* tests are carried out both with and without the variance inflation factor (VIF) applied, where the VIF is based on an AR(1) model. Furthermore, the IID and CB bootstrap methods are carried out using various methods for constructing CIs from the resulting parameter resamples.
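To make the VIF adjustment concrete, the sketch below applies a paired *z* test to a loss differential series with the standard error inflated by an AR(1)-based factor. It assumes the commonly used form VIF = (1 + *r*₁)/(1 − *r*₁), where *r*₁ is the sample lag-1 autocorrelation of the loss differential; the function names are hypothetical, and the formula is stated here as an assumption rather than quoted from the paper's own definition.

```python
import numpy as np
from math import erfc, sqrt

def lag1_autocorr(d):
    """Sample lag-1 autocorrelation of a 1-D series."""
    d = np.asarray(d, dtype=float)
    dc = d - d.mean()
    return float(np.dot(dc[:-1], dc[1:]) / np.dot(dc, dc))

def paired_z_test_with_vif(loss1, loss2):
    """Paired z test on the loss differential with an AR(1)-based VIF.

    Assumes VIF = (1 + r1) / (1 - r1); this is an assumption of the sketch.
    Returns the z statistic, the two-sided p value, and the VIF.
    """
    d = np.asarray(loss1, dtype=float) - np.asarray(loss2, dtype=float)
    n = d.size
    r1 = lag1_autocorr(d)
    vif = (1.0 + r1) / (1.0 - r1)
    # Inflate the usual variance of the sample mean by the VIF.
    se = sqrt(d.var(ddof=1) * vif / n)
    z = d.mean() / se
    # Two-sided normal p value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = erfc(abs(z) / sqrt(2.0))
    return z, p_value, vif
```

When the loss differential is not approximately AR(1), *r*₁ alone does not capture the dependence, which is consistent with the inaccuracy reported for this test under other dependence structures.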

Additional simulations from other time series models are also employed in order to ascertain whether or not the specific type of dependence structure affects the outcomes. As shown in the appendix, the loss differential series do not lose any of the dependence from the individual series through the differencing process; rather, they retain a dependence structure that inherits from both series, resulting in a more complex dependence structure. Therefore, the additional simulations shed light on whether the results here are likely to be more general or not.

The appendix illustrates that the difference between two AR(1) series with identical autoregressive parameter *ϕ* is another AR(1) process with autoregressive parameter *ϕ*. However, in practice, it is unlikely that two series will have identical autoregressive parameters, and the difference between two such series is no longer an AR(1) process. In fact, an exercise similar to the MA(1) study described in section 3a was also carried out using AR(1) models, and the results did not improve. Therefore, one conclusion from this study is that none of the traditional tests, even the *z* test + VIF, should be used when testing which of two competing forecasts is better.
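The AR(1) statement above can be checked with a short derivation. The symbols *X*, *Z*, *D*, *ε*, and *η* are introduced only for this sketch, and *B* is taken to be the operator with $B X_t = X_{t-1}$ (an assumption about notation). For two AR(1) series with a common parameter,

$$
X_t = \phi X_{t-1} + \varepsilon_t, \qquad Z_t = \phi Z_{t-1} + \eta_t
\quad\Longrightarrow\quad
D_t = X_t - Z_t = \phi D_{t-1} + (\varepsilon_t - \eta_t),
$$

so $D_t$ is AR(1) with parameter $\phi$ and innovation variance $\sigma_\varepsilon^2 + \sigma_\eta^2 - 2\,\mathrm{Cov}(\varepsilon_t, \eta_t)$. With distinct parameters $\phi_1 \neq \phi_2$, applying both autoregressive operators to the difference gives

$$
(1 - \phi_1 B)(1 - \phi_2 B)\, D_t = (1 - \phi_2 B)\,\varepsilon_t - (1 - \phi_1 B)\,\eta_t ,
$$

whose left side is second-order autoregressive and whose right side has nonzero autocovariance only up to lag one. The difference is therefore, in general, an ARMA(2, 1) process rather than an AR(1) process.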

The HG test is included mainly for power comparisons, where power is determined by varying the variance of the second simulated series instead of the mean because, in weather forecast verification and climate model evaluation, forecasts are often calibrated to have the same mean as the observation series. Thus, we have shown that the HG test can also be used to test for differences in variances of two sets of calibrated forecasts. Results for empirical size for the HG test are also shown for comparison and because they were not reported for AE loss in Hering and Genton (2011). Methods that test for equality of variances, lagged autocorrelation, and/or means between two time series do exist, but they often ignore contemporaneous correlation (e.g., Lund et al. 2009; Lund and Li 2009).

It is verified that the paired testing procedures are considerably more accurate than the two-sample tests, which work well only when neither contemporaneous correlation nor temporal dependence exists and are otherwise fairly undersized. While the paired *t* test is reasonably accurate for size, regardless of contemporaneous correlation, the test is generally not very powerful, although it still has good power if the sample size is large and no temporal dependence is present. Generally, in the realm of forecast verification, where strong contemporaneous correlation and temporal dependence are likely to occur, the test is not recommended. The accuracy of the paired *z* test is greatly affected by both contemporaneous correlation and temporal dependence. The *z* test with VIF applied, where the VIF is calculated assuming an AR(1) model, is not accurate when the temporal dependence structure is not approximately AR(1).

All of the bootstrap procedures are found to be reasonably robust to contemporaneous correlation and structure of temporal dependence, but require fairly large sample sizes before they become accurate. One exception is for the IID bootstrap in the case of no temporal dependence, which is accurate for relatively small sample sizes. For the largest sample size simulated (

The HG test is the most accurate, powerful, and robust of all of the procedures compared. Its main drawback is the necessity of fitting a parametric model to the autocovariance function (ACF), which can add computational cost, and automated optimization numerics can sometimes fail. Li et al. (2016) take a functional approach to test for a difference in means and/or variances between two space–time fields wherein sensitivity to model misspecification is reduced, and future work could propose an analogous method for time series. Nevertheless, the HG test is otherwise a straightforward procedure that makes a fairly modest modification to the classical tests, which are the easiest to implement but are also the least appropriate. The CB method requires multiple resampling and, given the need for large samples, can also be computationally expensive, generally much more so than the HG test, which could be implemented on a smaller subset if need be.

It is reasonable to expect that some forecast models might capture the observed series behavior very well, but could have a timing issue so that the direct match might result in a much poorer measure. It is even possible that one forecast might have a small timing error but otherwise be perfect, but the test might prefer a worse model that does not have the timing error. Such issues are not handled by any of the testing procedures studied here, in part because the answer to the question of which model is better becomes more complex, making size and power comparisons difficult to conduct. Further, if timing errors are less important to a user, then the forecast could be calibrated for timing errors before performing the test. Alternatively, such a realignment can be performed within the testing procedure (cf. Gilleland and Roux 2015).

The tests investigated in this study pertain to univariate time series. Hering and Genton (2011) proposed and tested an extension of their testing procedure to the spatial domain and found it to also have appropriate size and good power, provided the spatial and contemporaneous correlations are not too strong. Gilleland (2013) introduced a loss function within their testing procedure that allows for spatial displacement errors to be taken into account. While the HG test in either its univariate or spatial form is relatively new, it is already beginning to be used in practice (e.g., Ekström and Gilleland 2017; Gilleland et al. 2016; Lenzi et al. 2017; Razafindrakoto et al. 2015; Zhang et al. 2015).

Support was provided by the National Science Foundation (NSF) through Earth System Modeling (EaSM) Grant AGS-1243030. The National Center for Atmospheric Research is sponsored by the National Science Foundation.

# APPENDIX

## Properties of the Loss Differential for the Simulations

The loss differential series is a combination of two time series models, which will result in a new series that inherits certain properties from the original series. The specific forms for the loss differential series are not crucial to this study because of the simulation focus. However, it can be instructive to understand how certain test assumptions may be violated in the domain of forecast verification. First, expected values and variances are derived for loss differential series under the much simpler case of IID series before investigating how dependence structures carry over from the original to the loss differential series.

### a. IID series

If it is supposed that the observed series, say, *Y*, then it can be written as *ε* normally distributed with mean zero and variance *Y*, but of the observed series so that they can be written *X* with mean

*ρ*is not an issue, but it does appear in the variances of the loss differential. The variance for the IID loss differential series under simple loss isThe variances for the loss differential series with AE and SE losses are considerably more complicated, but both involve the contemporaneous correlation.

### b. Dependent series

Under simple loss, it is very straightforward to calculate what sort of time series dependence is present for the loss differential series, but it is also possible to determine the nature of the loss differential under more complicated loss functions. In doing so, it is helpful to use the difference operator *B*, where

The properties below pertain to the simulated series considered in this paper and would generally require certain assumptions that are left out here for brevity. For ease of understanding, it is also shown how the loss differential series would behave as a result of combining two AR(1) processes.

*ϕ*yield a loss differential series that is also AR(1) with innovation variance

*π*and

*κ*, respectively, can be written in the formMultiplying the first equation in Eq. (A2) by

Figure A1 shows the ACF^{1} [Eq. (2)] and partial autocorrelation function (PACF) plots for one set of the AR(1) simulations of two (correlated) series.
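A simulation of the kind summarized in Fig. A1 can be sketched as follows. Two AR(1) series with a common parameter are driven by correlated Gaussian innovations, and the sample autocorrelation of their difference is computed. The function names, the unit innovation variances, and the parameter values in the usage lines are assumptions for illustration, not the settings used in the paper; the PACF is not computed here.

```python
import numpy as np

def simulate_correlated_ar1(n, phi, rho, seed=0, burn=200):
    """Two AR(1) series with common parameter phi and innovation correlation rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    innov = rng.multivariate_normal(np.zeros(2), cov, size=n + burn)
    series = np.zeros((n + burn, 2))
    for t in range(1, n + burn):
        series[t] = phi * series[t - 1] + innov[t]
    # Discard the burn-in so the start-up values do not affect the sample.
    return series[burn:, 0], series[burn:, 1]

def sample_acf(x, max_lag):
    """Sample autocorrelation: autocovariance divided by its lag-zero value."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    return np.array(
        [np.dot(xc[: x.size - k], xc[k:]) / denom for k in range(max_lag + 1)]
    )

x, z = simulate_correlated_ar1(n=2000, phi=0.6, rho=0.5, seed=1)
print(sample_acf(x - z, max_lag=5))  # should decay roughly like 0.6 ** k
```

Because both series share the same parameter, the autocorrelation of the difference is expected to decay geometrically at the rate of that parameter, in line with the AR(1) result stated in the text.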

^{2}Again, using the series described in Eq. (A1), the loss differential under SE loss becomeswhich becomeswhich is an AR(2) process with autocorrelation coefficients

*r*,

*m*) and

*s*,

*n*) model, each with different autoregressive and moving average parameters. Then,andwhere

*p*,

*q*) model, where

## REFERENCES

Brockwell, P. J., and R. A. Davis, 2010: *Introduction to Time Series and Forecasting.* 2nd ed. Springer, 437 pp.

Carpenter, J., 1999: Test inversion bootstrap confidence intervals. *J. Roy. Stat. Soc.*, **61B**, 159–172, https://doi.org/10.1111/1467-9868.00169.

Clark, T. E., and M. W. McCracken, 2013: Advances in forecast evaluation. *Handbook of Economic Forecasting*, G. Elliott and A. Timmermann, Eds., Vol. 2, Elsevier, 1107–1201.

Davison, A., and D. Hinkley, 1997: *Bootstrap Methods and Their Application.* Cambridge University Press, 596 pp.

DelSole, T., and M. K. Tippett, 2014: Comparing forecast skill. *Mon. Wea. Rev.*, **142**, 4658–4678, https://doi.org/10.1175/MWR-D-14-00045.1.

DiCiccio, T., and J. P. Romano, 1990: Nonparametric confidence limits by resampling methods and least favorable families. *Int. Stat. Rev.*, **58**, 59–76, https://doi.org/10.2307/1403474.

DiCiccio, T., and B. Efron, 1996: Bootstrap confidence intervals. *Stat. Sci.*, **11**, 189–228, https://doi.org/10.1214/ss/1032280214.

Diebold, F. X., 2015: Comparing predictive accuracy, twenty years later: A personal perspective on the use and abuse of Diebold–Mariano tests. *J. Bus. Econ. Stat.*, **33**, https://doi.org/10.1080/07350015.2014.983236.

Diebold, F. X., and R. S. Mariano, 1995: Comparing predictive accuracy. *J. Bus. Econ. Stat.*, **13**, 253–263, http://doi.org/10.1080/07350015.1995.10524599.

Efron, B., and R. Tibshirani, 1998: *An Introduction to the Bootstrap.* Chapman and Hall, 456 pp.

Ekström, M., and E. Gilleland, 2017: Assessing convection permitting resolutions of WRF for the purpose of water resource impact assessment and vulnerability work: A southeast Australian case study. *Water Resour. Res.*, **53**, 726–743, https://doi.org/10.1002/2016WR019545.

Fisher, R. A., 1934: Two new properties of mathematical likelihood. *Proc. Roy. Soc. London*, **144A**, 285–307, https://doi.org/10.1098/rspa.1934.0050.

Garthwaite, P. H., and S. T. Buckland, 1992: Generating Monte Carlo confidence intervals by the Robbins–Monro process. *Appl. Stat.*, **41**, 159–171, https://doi.org/10.2307/2347625.

Gilleland, E., 2010: Confidence intervals for forecast verification. NCAR Tech. Note NCAR/TN-479+STR, 71 pp.

Gilleland, E., 2013: Testing competing precipitation forecasts accurately and efficiently: The spatial prediction comparison test. *Mon. Wea. Rev.*, **141**, 340–355, https://doi.org/10.1175/MWR-D-12-00155.1.

Gilleland, E., and G. Roux, 2015: A new approach to testing forecast predictive accuracy. *Meteor. Appl.*, **22**, 534–543, https://doi.org/10.1002/met.1485.

Gilleland, E., M. Bukovsky, C. L. Williams, S. McGinnis, C. M. Ammann, B. G. Brown, and L. O. Mearns, 2016: Evaluating NARCCAP model performance for frequencies of severe-storm environments. *Adv. Stat. Climatol. Meteor. Oceanogr.*, **2**, 137–153, https://doi.org/10.5194/ascmo-2-137-2016.

Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. *Wea. Forecasting*, **14**, 155–167, https://doi.org/10.1175/1520-0434(1999)014<0155:HTFENP>2.0.CO;2.

Hamilton, J. D., 1994: *Time Series Analysis.* Princeton University Press, 799 pp.

Hannig, J., 2009: On generalized fiducial inference. *Stat. Sin.*, **19**, 491–544.

Hannig, J., H. K. Iyer, and P. Patterson, 2006: Fiducial generalized confidence intervals. *J. Amer. Stat. Assoc.*, **101**, 254–269, https://doi.org/10.1198/016214505000000736.

Hannig, J., H. K. Iyer, and C. M. Wang, 2007: Fiducial approach to uncertainty assessment accounting for error due to instrument resolution. *Metrologia*, **44**, 476–483, https://doi.org/10.1088/0026-1394/44/6/006.

Hansen, P. R., and A. Timmermann, 2015: Equivalence between out-of-sample forecast comparisons and Wald statistics. *Econometrica*, **83**, 2485–2505, https://doi.org/10.3982/ECTA10581.

Hansen, P. R., A. Lunde, and J. M. Nason, 2011: The model confidence set. *Econometrica*, **79**, 453–497, https://doi.org/10.3982/ECTA5771.

Hering, A. S., and M. G. Genton, 2010: Powering up with space-time wind forecasting. *J. Amer. Stat. Assoc.*, **105**, 92–104, https://doi.org/10.1198/jasa.2009.ap08117.

Hering, A. S., and M. G. Genton, 2011: Comparing spatial predictions. *Technometrics*, **53**, 414–425, https://doi.org/10.1198/TECH.2011.10136.

Im, J.-S., K. Brill, and E. Danaher, 2006: Confidence interval estimation for quantitative precipitation forecasts (QPF) using short-range ensemble forecasts (SREF). *Wea. Forecasting*, **21**, 24–41, https://doi.org/10.1175/WAF902.1.

Jolliffe, I. T., 2007: Uncertainty and inference for verification measures. *Wea. Forecasting*, **22**, 637–650, https://doi.org/10.1175/WAF989.1.

Jolliffe, I. T., and D. B. Stephenson, 2011: *Forecast Verification: A Practitioner’s Guide in Atmospheric Science.* 2nd ed. Wiley and Sons, 292 pp.

Kabaila, P., 1993: Some properties of profile bootstrap confidence intervals. *Aust. NZ J. Stat.*, **35**, 205–214, https://doi.org/10.1111/j.1467-842X.1993.tb01326.x.

Kutner, M. H., C. J. Nachtsheim, and J. Neter, 2004: *Applied Linear Regression Models.* 4th ed. McGraw-Hill, 701 pp.

Lahiri, S., 2003: *Resampling Methods for Dependent Data.* Springer, 382 pp.

Lehmann, E. L., 1986: *Testing Statistical Hypotheses.* 2nd ed. Springer, 600 pp.

Lenzi, A., P. Pinson, L. H. Clemmensen, and G. Guillot, 2017: Spatial models for probabilistic prediction of wind power with application to annual-average and high temporal resolution data. *Stochastic Environ. Res. Risk Assess.*, **31**, 1615–1631, https://doi.org/10.1007/s00477-016-1329-0.

Leone, F. C., L. S. Nelson, and R. B. Nottingham, 1961: The folded normal distribution. *Technometrics*, **3**, 543–550, https://doi.org/10.1080/00401706.1961.10489974.

Li, B., X. Zhang, and J. E. Smerdon, 2016: Comparison between spatio-temporal random processes and application to climate model data. *Environmetrics*, **27**, 267–279, https://doi.org/10.1002/env.2395.

Lidong, E., J. Hannig, and H. K. Iyer, 2008: Fiducial intervals for variance components in an unbalanced two-component normal mixed linear model. *J. Amer. Stat. Assoc.*, **103**, 854–865, https://doi.org/10.1198/016214508000000229.

Liu, C., R. P. Allan, M. Brooks, and S. Milton, 2014: Comparing tropical precipitation simulated by the Met Office NWP and climate models with satellite observations. *J. Appl. Meteor. Climatol.*, **53**, 200–214, https://doi.org/10.1175/JAMC-D-13-082.1.

Lund, R., and B. Li, 2009: Revisiting climate region definitions via clustering. *J. Climate*, **22**, 1787–1800, https://doi.org/10.1175/2008JCLI2455.1.

Lund, R., H. Bassily, and B. Vidakovic, 2009: Testing equality of stationary autocovariances. *J. Time Ser. Anal.*, **30**, 332–348, https://doi.org/10.1111/j.1467-9892.2009.00616.x.

Murphy, A. H., 1991: Forecast verification: Its complexity and dimensionality. *Mon. Wea. Rev.*, **119**, 1590–1601, https://doi.org/10.1175/1520-0493(1991)119<1590:FVICAD>2.0.CO;2.

Paparoditis, E., and D. N. Politis, 2001: Unit root testing via the continuous-path block bootstrap. UCSD Economics Discussion Paper 2001-06, 44 pp., https://escholarship.org/uc/item/9qb4r775.

Razafindrakoto, H. N. T., P. M. Mai, M. G. Genton, L. Zhang, and K. K. S. Thingbaijam, 2015: Quantifying variability in earthquake rupture models using multidimensional scaling: Application to the 2011 Tohoku earthquake. *Geophys. J. Int.*, **202**, 17–40, https://doi.org/10.1093/gji/ggv088.

R Core Team, 2014: R: A language and environment for statistical computing. R Foundation for Statistical Computing.

Schwarz, G., 1978: Estimating the dimension of a model. *Ann. Stat.*, **6**, 461–464, https://doi.org/10.1214/aos/1176344136.

Singh, K., 1981: On the asymptotic accuracy of Efron’s bootstrap. *Ann. Stat.*, **9**, 1187–1195, https://doi.org/10.1214/aos/1176345636.

Wandler, D., and J. Hannig, 2012: Generalized fiducial confidence intervals for extremes. *Extremes*, **15**, 67–87, https://doi.org/10.1007/s10687-011-0127-9.

Wasserstein, R. L., and N. A. Lazar, 2016: The ASA’s statement on *p*-values: Context, process, and purpose. *Amer. Stat.*, **70**, 129–133, https://doi.org/10.1080/00031305.2016.1154108.

West, K. D., 2006: Forecast evaluation. *Handbook of Economic Forecasting*, C. Granger and A. Timmermann, Eds., Vol. 1, Elsevier, 100–134.

Wilks, D. S., 1997: Resampling hypothesis tests for autocorrelated fields. *J. Climate*, **10**, 65–82, https://doi.org/10.1175/1520-0442(1997)010<0065:RHTFAF>2.0.CO;2.

Wilks, D. S., 2011: *Statistical Methods in the Atmospheric Sciences.* 3rd ed. International Geophysics Series, Vol. 100, Academic Press, 704 pp.

Wolff, J. K., M. M. Harrold, T. L. Fowler, J. Halley Gotway, L. Nance, and B. G. Brown, 2014: Beyond the basics: Evaluating model-based precipitation forecasts using traditional, spatial, and object-based methods. *Wea. Forecasting*, **29**, 1451–1472, https://doi.org/10.1175/WAF-D-13-00135.1.

Zhang, L., P. M. Mai, K. K. Thingbaijam, H. N. Razafindrakoto, and M. G. Genton, 2015: Analysing earthquake slip models with the spatial prediction comparison test. *Geophys. J. Int.*, **200**, 185–198, https://doi.org/10.1093/gji/ggu383.

^{1} Fig. A1 shows the autocorrelation function, labeled ACF in the figure, rather than the autocovariance function (which is what ACF denotes in the text). The autocorrelation function is the autocovariance function divided by its value at lag zero. The PACF is the conditional correlation of a time series with its own lags, controlling for shorter lags (see Brockwell and Davis 2010 for more details).

^{2} If the autocorrelation coefficients are different, then the loss differential series would be an ARMA(4, 2) process.