## 1. Introduction

One main point of the present paper concerns a qualitative comparison between two types of comparative forecast performance measures: one based on intensity and another based on the frequency with which one forecast is better than the other. A typical situation in model development involves comparing one model, say model A, to another, say model B, to see which one is better in some sense. For example, one might compare each model to an observation by calculating a *loss function*, such as model minus the observation, or model minus the observation squared, etc. at each time point. The resulting loss series might then be compared by looking at their differences; the resulting series is called a *loss differential* series (cf. Hering and Genton 2011). A statistical test, such as a paired *t* test, is often used to make inferences about the mean of this loss differential series, and whether or not the average loss differential differs from zero significantly enough to warrant concluding that model A or model B is better.

Often model B might represent a small upgrade to model A that does not result in a statistically significant difference according to the type of testing outlined above. In this paper, a complementary type of test is suggested that is concerned with how often one model is superior to another, according to the loss function, regardless of how large (or how small) the improvement is. In fact, two models could be found to be indistinguishable in terms of frequency-of-better but different in terms of intensity, and vice versa, just as one intensity-based measure can differ in its conclusions from another intensity-based measure.

Note that the testing of concern here is one of statistical significance, and not *practical* significance. In the above paradigm, an average loss differential equal to zero means that there is no difference between model A and B. However, the value is unlikely to be exactly zero. Statistical significance addresses whether the value of the average loss differential is different from zero purely by random chance or whether it is unlikely to achieve its observed value if the two models were the same. Practical significance concerns whether or not the differences between models A and B are important from an expert’s point-of-view.

The power-divergence statistic is a generalized (categorical) goodness-of-fit statistic that encompasses many familiar performance summaries such as the loglikelihood-ratio, Kullback–Leibler and Pearson’s *X*^{2} statistics. It also can be written as a function of several common verification measures such as the continuous ranked probability score (CRPS). The form of the statistic is determined by a single parameter *λ*, and regardless of the choice of *λ*, it asymptotically follows a *χ*^{2} distribution with *k* − 1 degrees of freedom, where *k* represents the number of categories. However, the theory that justifies this choice of distribution relies on having independent and identically distributed source data. Here, empirical results demonstrate that the reduction of the data into two categories measuring simply whether one of two competing forecasts is better than the other (based on some loss function) more frequently than 50% of the time, or not, results in a test that is reasonably robust to rather severe departures from the assumption of independence, as well as contemporaneous correlation between the two forecasts,^{1} for certain choices of *λ*.

The novelty of this paper is in demonstrating the utility of the power-divergence statistic in the face of dependent data, as well as the emphasis on testing for the “frequency-of-better” alongside more traditional measures.

## 2. The power-divergence statistic

The appendix gives the full definition of the power-divergence statistic for the general setting where there are an arbitrary number of categories, but for the present treatment, we are only concerned with two categories: 1) model A is better than model B, and 2) model B is better than model A. We ignore ties because, in our setting, they represent a discrete point in a continuous space so have probability zero of occurring. Therefore, in this section, the power-divergence statistic is given as a two-category measure.

As is described in detail in the appendix, the power-divergence statistic is a generalized statistic that includes several well-known goodness-of-fit statistics depending on a user-specified argument denoted *λ*. Asymptotically, however, the statistic is the same regardless of the choice of *λ* (Cressie and Read 1984). That is, for large enough sample sizes, there will be no difference in the test statistic’s value for any choice of *λ*, and generally, even for smaller sample sizes, the differences tend to be small. Nevertheless, there are good reasons for using *λ* = 2/3 for most purposes (see Read and Cressie 1988, for a thorough explanation of the benefits of this choice).

Because the interest is in two categories, the data follow a binomial distribution with some probability, *p*, of being in the first category, and 1 − *p* of being in the second category. Interest is in whether or not *p* = 1/2 (i.e., the null hypothesis is *p* = 1/2 even if its estimated value is not 1/2).

It is possible to perform what is called an “exact” test^{2} using the fact that the data are binomially distributed in order to test the hypothesis. However, this binomial test is known to be inaccurate when the underlying data are dependent. Binomial tests for dependent data exist (e.g., Woodbury 1949; Drezner and Farnum 1993; Singh and Kumar 2020) but are much more complicated than using the power-divergence test. It is also possible to perform a normal approximation test, but doing so requires estimating a standard error, which is not necessary with the power-divergence statistic. The main question we address in this treatment is whether or not the temporal dependence and/or contemporaneous correlation between models A and B affect the accuracy of the power-divergence test in this setting.
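For reference, the “exact” binomial test mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: it presumes independent observations (the very assumption that fails for dependent data), a null value of 1/2, and a two-sided alternative. The function name `exact_binomial_p_value` is hypothetical and does not come from the paper.

```python
from math import comb


def exact_binomial_p_value(wins_a, wins_b):
    """Two-sided exact binomial p-value for H0: p = 1/2.

    wins_a, wins_b: counts of how often each model was better (ties dropped).
    Assumes independent trials, which is exactly what dependent data violate.
    """
    n = wins_a + wins_b
    # Probability of every possible count for model A under H0.
    probabilities = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    observed = probabilities[wins_a]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one.
    tail = sum(p for p in probabilities if p <= observed * (1 + 1e-12))
    return min(1.0, tail)
```

With 10 wins against 0, the function returns 2/1024, the combined probability of the two most extreme outcomes.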

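An estimate of *p*, denoted $\hat{p}$, is made and compared against the null hypothesized value, say *q*, where for our purpose, *q* = 1 − *q* = 1/2. Then, the power-divergence statistic is given by

$$
2nI^{\lambda} = \frac{2n}{\lambda(\lambda+1)}\left\{\hat{p}\left[\left(\frac{\hat{p}}{q}\right)^{\lambda}-1\right]+(1-\hat{p})\left[\left(\frac{1-\hat{p}}{1-q}\right)^{\lambda}-1\right]\right\}, \tag{1}
$$

where *n* is the sample size and the cases *λ* = 0 and *λ* = −1 are defined by their limits. Regardless of the choice of *λ*, Eq. (1) asymptotically follows a *χ*^{2} distribution with one degree of freedom.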
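The two-category statistic described in this section can be sketched in code using the standard Cressie–Read form of the power-divergence family. This is a minimal illustration, not the authors' implementation: the function name, the argument names, and the restriction to positive counts are assumptions made here, and the *p*-value uses the asymptotic *χ*^{2} distribution with one degree of freedom.

```python
import math


def power_divergence_two_category(wins_a, wins_b, lam=2 / 3, q=0.5):
    """Power-divergence statistic and asymptotic p-value for two categories.

    wins_a, wins_b: observed counts in each category (assumed positive here).
    lam: the index parameter (2/3 gives the Cressie-Read statistic).
    q: null probability of the first category.
    """
    if wins_a <= 0 or wins_b <= 0:
        raise ValueError("this sketch assumes both counts are positive")
    n = wins_a + wins_b
    observed = (wins_a, wins_b)
    expected = (n * q, n * (1 - q))
    if lam == 0:
        # Limit as lam -> 0: the loglikelihood-ratio statistic.
        stat = 2 * sum(x * math.log(x / e) for x, e in zip(observed, expected))
    elif lam == -1:
        # Limit as lam -> -1: the modified loglikelihood-ratio statistic.
        stat = 2 * sum(e * math.log(e / x) for x, e in zip(observed, expected))
    else:
        total = sum(x * ((x / e) ** lam - 1) for x, e in zip(observed, expected))
        stat = 2 / (lam * (lam + 1)) * total
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    # Upper tail of chi-square with 1 degree of freedom: P(|Z| > sqrt(stat)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

With *λ* = 1 the function reduces to Pearson’s *X*^{2}; for counts of 60 and 40 it returns a statistic of 4. Where SciPy is available, `scipy.stats.power_divergence` offers the same family through its `lambda_` argument.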

To create categories of “better” it is necessary to define “better.” For a given choice of loss function [e.g., RMSE, absolute error (AE), mean error (ME), etc.], the model with the more optimal value of the loss function is defined to be better. For the examples of RMSE, AE and ME, values closer to zero are the more optimal values. In the subsequent discussion, “better” will be defined through the loss differential, which is the loss function for model A minus the loss function for model B. Therefore, a negative loss differential for positively valued loss functions, such as RMSE and AE, means that model A is better because model A then has the lower (i.e., more optimal) loss, and positive loss differential values mean model B is better. While the loss differential is not needed in this context, it is described here to facilitate an easier understanding of the more traditional, intensity-based comparisons such as the paired *t* test, which is a test on the average of the loss differential series. If, for example, AE loss is used to calculate the loss differential series, then the test is for the loss differential under AE loss, or the loss differential of AE loss.

## 3. Simulations and data

The test for frequency of “better” is defined through the difference in loss between the two forecast models. The loss can be any measure of performance, such as RMSE, mean error (ME), mean absolute error (MAE), contingency-table statistics such as CSI, etc. If *o* is the observation series, then the loss differential series, say *d*, is the difference between the loss of model A and the loss of model B, each calculated against *o*. The traditional null hypothesis is that *μ*_{d} = 0, where *μ*_{d} is the mean loss differential; that is, that the mean of the loss differential series is zero. Such a hypothesis is a test about the (average) intensity of the error difference between two forecast models. Here, the test is not concerned with the size of the errors but with how often *d* < 0 or *d* > 0.
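As a concrete illustration of the reduction to two categories, the following sketch builds a loss differential series under absolute-error loss and counts how often each model is better. The function names and the toy numbers are hypothetical; ties are dropped, as described in section 2.

```python
def absolute_error(forecast, observation):
    """Absolute-error (AE) loss for a single forecast/observation pair."""
    return abs(forecast - observation)


def loss_differential(forecast_a, forecast_b, observations, loss=absolute_error):
    """Loss of model A minus loss of model B at each time point."""
    return [
        loss(a, o) - loss(b, o)
        for a, b, o in zip(forecast_a, forecast_b, observations)
    ]


def count_better(differential):
    """Return (times A is better, times B is better), ignoring ties."""
    wins_a = sum(1 for d in differential if d < 0)  # A has the smaller loss
    wins_b = sum(1 for d in differential if d > 0)  # B has the smaller loss
    return wins_a, wins_b


# Toy data, purely illustrative.
obs = [10.0, 12.0, 11.0, 9.0]
model_a = [10.5, 11.0, 11.2, 9.1]
model_b = [11.0, 12.5, 10.0, 9.1]
counts = count_better(loss_differential(model_a, model_b, obs))
```

Here `counts` is `(2, 1)`: model A is better twice, model B once, and the last time point is a tie that is dropped. Only the signs of the differential enter the count, so the magnitudes of the improvements play no role.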

To analyze the accuracy of the power-divergence test, the same simulation procedure carried out by Hering and Genton (2011) to test their intensity-based competing forecast verification test, henceforth the HG test, is performed. Gilleland et al. (2018) used the same simulation strategy to compare the HG test with several other popular tests from the weather forecast verification domain and found the HG test to be the most accurate of all of them (see section 4a for an explanation of accuracy in this context), with the block bootstrap next in line (under dependence, or the usual bootstrap under independence). Gilleland (2020, section 4) provides code for performing these simulations, and presents appropriate bootstrap methods for obtaining confidence intervals on the mean (intensity) of the loss differential series.

The simulations provide two contemporaneously correlated loss series, each with temporal dependence. The contemporaneous correlation is controlled by a parameter *ρ* and the temporal dependence by a parameter *θ*. The case of *ρ* = 0 (*θ* = 0) implies the series are uncorrelated (each series is independent in time), and as *ρ* (*θ*) increases toward one, the contemporaneous correlation (temporal dependence) strengthens with *ρ* = 1 (*θ* = 1) implying perfect correlation (temporal dependence). The two series can be simulated to have standard deviation of *σ*_{A} and *σ*_{B}, respectively. For testing empirical test size, *σ*_{A} = *σ*_{B} = 1 is used, and for testing statistical power, series with a larger value of *σ*_{B} are simulated.
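To make the roles of *ρ* and *θ* concrete, the sketch below generates two series with contemporaneous correlation *ρ* and temporal dependence *θ*. It is a stand-in under a loud assumption: temporal dependence is induced with a first-order autoregressive filter, which is not necessarily the mechanism used by Hering and Genton (2011), and it requires *θ* < 1. All function and argument names are hypothetical.

```python
import math
import random


def simulate_error_pair(n, rho, theta, sigma_a=1.0, sigma_b=1.0, seed=None, burn_in=200):
    """Simulate two series with cross-correlation rho and AR(1) coefficient theta.

    Assumption: AR(1) dependence is used here for illustration only.
    """
    if not 0 <= theta < 1 or not -1 <= rho <= 1:
        raise ValueError("need 0 <= theta < 1 and -1 <= rho <= 1")
    rng = random.Random(seed)
    innovation_scale = math.sqrt(1 - theta ** 2)  # keeps marginal variance at 1
    x = y = 0.0
    series_a, series_b = [], []
    for t in range(n + burn_in):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        e1 = z1
        e2 = rho * z1 + math.sqrt(1 - rho ** 2) * z2  # correlated innovations
        x = theta * x + innovation_scale * e1
        y = theta * y + innovation_scale * e2
        if t >= burn_in:  # discard the start-up transient
            series_a.append(sigma_a * x)
            series_b.append(sigma_b * y)
    return series_a, series_b


def sample_correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Under this construction both series share the same *θ*, so their contemporaneous correlation equals *ρ* and each has lag-1 autocorrelation *θ*. Setting `sigma_b` larger than `sigma_a` gives the setup used for power testing.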

To demonstrate the testing procedure over a wide array of applications, three sets of applications have been chosen. The first is from 6-h turbulence forecasts called the Graphical Turbulence Guidance (GTG) algorithm (Sharman and Pearson 2017; Muñoz-Esparza and Sharman 2018; Muñoz-Esparza et al. 2020). These turbulence forecasts use version 3 of the High-Resolution Rapid Refresh (HRRR; Dowell et al. 2022; James et al. 2022) as the input NWP information for the 1 June 2018–30 September 2019 period, and correspond to two GTG algorithms: simpler regression (HGTG; Sharman and Pearson 2017) and a more complex machine-learning model based on regression trees (ML-GTG; Muñoz-Esparza et al. 2020). Here, 605 119 samples of paired eddy dissipation rate (EDR; m^{2/3} s^{−1}) from in situ aircraft observations are compared against two versions of the GTG forecasting algorithm, both based on the same underlying NWP model, the HRRR. The specific ML-GTG version employs the random forests technique with 100 trees of 30-layer maximum depth based on 32 input features, and is referred to as RFri60 in Muñoz-Esparza et al. (2020).

To diagnose whether or not the data are temporally dependent, autocorrelation function (ACF) and partial autocorrelation function (PACF) plots are useful. A brief description of these plots is given now for clarity. The ACF plots the correlation between all pairs of points separated by the same lag in time. The PACF is similar but conditions on previous lag terms (see Brockwell and Davis 2010; Shumway and Stoffer 2017; Gilleland et al. 2018, for more details). The abscissa of the ACF and PACF plots represents the length of temporal lag between pairs of data points. A less sophisticated, but still useful diagnostic is to plot values against their lag terms.^{3} Note that these plots do not reveal anything about contemporaneous correlation.
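The sample ACF described above is straightforward to compute directly. The sketch below uses the usual estimator (lagged products of the centered series divided by the lag-zero sum of squares) together with the common ±1.96/√*n* reference band for white noise; the function names are hypothetical, and the PACF is omitted for brevity.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [value - mean for value in series]
    denominator = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / denominator
        for lag in range(max_lag + 1)
    ]


def white_noise_band(n):
    """Approximate 95% band: values outside +/- this bound suggest dependence."""
    return 1.96 / math.sqrt(n)
```

The lag-zero value is always 1, matching the remark that data are perfectly correlated with themselves.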

Figure 1 displays the ACF and PACF plots for the loss differential under absolute-error loss for 6-h forecasts of eddy dissipation rate (top row) as well as scatterplots in the form of two-dimensional histograms (bottom row) of the loss differential against lagged terms using the two versions of the GTG forecasts. The plots show no indication of any temporal dependence in the series, as expected from the nature of the data, which are a collection of multiple aircraft at different heights and locations. The ACF plot will always have 1 for the 0-lag term because data are always perfectly correlated with themselves, and it is nearly zero for all subsequent lag terms. The PACF is not defined for the lag-zero term, and note that the values for all subsequent lag terms are also very close to zero. The 2D histogram scatterplots do not follow a straight line, so provide further evidence for the lack of temporal dependence in these data. However, there is nonnegligible contemporaneous correlation between the two loss series.

The next example concerns 12-h forecasts of 2-m temperature and 10-m wind speed. These data were extracted from the surface application of the Model Analysis Tool Suite (MATS; Turner et al. 2020). Matched observations are used here, and model forecast data are from 1 August 2019 to 1 December 2020, when version 3 of the HRRR was operational at the National Centers for Environmental Prediction (NCEP) and version 4 was frozen as part of the evaluation phase (Dowell et al. 2022). Figure 2 gives diagnostic plots for the loss differential series for the 2-m temperature mean-error loss between the HRRR versions 3 and 4 for 12-h forecasts. The ACF and PACF plots in Figs. 2a and 2b clearly show a strong temporal dependence as many values fall well outside the 95% confidence intervals (blue dashed lines); the ACF plot also clearly shows a cyclic trend in the dependence structure of the loss differential series. The two-dimensional histograms (scatterplots) of the loss differential series against its own first (Fig. 2c) and second (Fig. 2d) lag terms further corroborate that strong temporal dependence exists. In addition to the strong temporal dependence, and cyclic trends, in these data, the two loss series are strongly correlated with each other, having a correlation of about 0.71. In stark contrast to the EDR verification set, this set represents a common situation in weather forecast verification where severe departures from the assumption of temporal independence exist alongside strong contemporaneous correlation and a diurnal trend.

As in Fig. 1, but for HRRR 2-m temperature 12-h forecasts under mean-error loss.

Citation: Weather and Forecasting 38, 9; 10.1175/WAF-D-22-0201.1


Figure 3 shows the results for 12-h forecasts of 10-m wind speed data from the two versions of HRRR. Results are similar in that there is clear strong temporal dependence in these data. Again, there is strong contemporaneous correlation between the two HRRR versions’ AE loss series (≈0.84), as expected.

As in Fig. 2, but for 10-m wind speed data.


## 4. Results

### a. Empirical size and power testing

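In a hypothesis test, the size of the test refers to the probability of rejecting the null hypothesis when it is in fact true (a type-I error). A test is accurate in this context when its empirical false rejection rate is close to the a priori chosen test size *α*.

Similarly, the power of a test refers to the probability of correctly rejecting the null hypothesis when it is false.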

Recall from section 3 that the Hering and Genton (2011) simulation strategy is used in order to simulate data that mimic the temporal dependence behavior, and contemporaneous correlation, present in many meteorological verification sets. This same simulation strategy was also employed in Gilleland et al. (2018) and Gilleland (2020). The strength of temporal dependence is controlled through a tunable parameter *θ*, where independence is achieved when *θ* = 0 and the strength of dependence increases as *θ* approaches 1. Another tunable parameter *ρ* controls the strength of contemporaneous correlation between the two simulated error series, where again, *ρ* = 0 means they are independent and the strength of dependence increases with *ρ* where *ρ* = 1 is perfect dependence.

Figure 4 shows the main result of this paper in terms of empirical size for the power-divergence test of “better” more often. The tests are conducted at the 5% level, so a test with perfect accuracy would falsely reject the null hypothesis of equal frequency in the two cells (i.e., that neither forecast is better more often than the other) about 5% of the time and the bars would all line up at the horizontal dashed line in the barplots of the figure. As can be seen, the test tends to be slightly oversized, even in the case of independence where the theoretical justification for the test holds, and even for smaller sample sizes. The entire range for the empirical size tests does not exceed about 14%, so the test is reasonably accurate in terms of size for all choices of *λ* regardless of the departures of the underlying data from independence. Importantly, the results for the various types of dependence do not differ much from those for the IID case, suggesting that the underlying dependence in the data does not interfere with the accuracy of the test.

Empirical size results for the power divergence test at the 5% level. (a) The independence case (*θ* = 0, *ρ* = 0), (b) moderate temporal dependence (*θ* = 0.5), (c) moderate contemporaneous correlation (*ρ* = 0.5), (d) both moderate temporal dependence and contemporaneous correlation (*θ* = 0.5 and *ρ* = 0.5), and (e) strong dependence in both ways (*θ* = 0.9 and *ρ* = 0.9). For each choice of *λ* the bars show results for sample sizes of *n* = 10, 50, 100, 200, and 500, respectively, where the results for each choice of *λ* are separated by the dashed vertical lines.


The empirical tests carried out demonstrate that the reduction of the loss differential series into two categories of “better” or “worse” results in a “frequency-of-better” test that is robust to even severe departures from the underlying assumptions regarding independent data in terms of committing the undesirable type-I error of rejecting a null hypothesis that is true. It is also important to guard against the type-II error of not rejecting a null hypothesis that is false. If *β* represents the probability of a type-II error, then 1 − *β* is the power of the test. To test for power, simulations similar to those for size are made when the error of the competing forecast has a standard deviation that is twice as large as the other forecast errors; in particular, the standard deviation for forecast A’s error is *σ*_{A} = 1 and for the competing forecast is *σ*_{B} = 2. That is, on average, the errors of the competing forecast should be larger so that forecast A should be “better” more often than forecast B. The power of the test will vary depending on different values of this standard deviation term where the larger *σ*_{B} is relative to *σ*_{A}, the more powerful the test will be, and the choice of 2 is a relatively low choice that is still large enough to be important.
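The size and power experiments can be mimicked with a small Monte Carlo sketch. Several simplifications are assumptions made here and do not reproduce the paper's setup: the errors are independent Gaussian draws (no temporal dependence or contemporaneous correlation), the loss is absolute error, the test is the *λ* = 2/3 statistic with a two-sided *χ*^{2} rejection rule (the paper's tests are one-sided), and all names are hypothetical.

```python
import math
import random


def cressie_read_p_value(wins_a, wins_b):
    """Asymptotic p-value of the lambda = 2/3 statistic for H0: p = 1/2."""
    n = wins_a + wins_b
    lam = 2 / 3
    expected = n / 2
    total = sum(x * ((x / expected) ** lam - 1) for x in (wins_a, wins_b))
    stat = max(2 / (lam * (lam + 1)) * total, 0.0)
    return math.erfc(math.sqrt(stat / 2))  # chi-square, 1 degree of freedom


def rejection_rate(sigma_b, n=100, replications=1000, alpha=0.05, seed=1):
    """Fraction of simulated samples in which the test rejects at level alpha.

    sigma_b = 1 estimates empirical size; sigma_b > 1 estimates empirical power.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        wins_a = wins_b = 0
        for _ in range(n):
            loss_a = abs(rng.gauss(0, 1.0))      # forecast A, sigma_A = 1
            loss_b = abs(rng.gauss(0, sigma_b))  # competing forecast B
            if loss_a < loss_b:
                wins_a += 1
            elif loss_b < loss_a:
                wins_b += 1
        if cressie_read_p_value(wins_a, wins_b) < alpha:
            rejections += 1
    return rejections / replications


empirical_size = rejection_rate(sigma_b=1.0)
empirical_power = rejection_rate(sigma_b=2.0)
```

Under these independent Gaussian assumptions, forecast A has the smaller absolute error with probability (2/π) arctan(*σ*_{B}/*σ*_{A}), which is about 0.70 when *σ*_{B} = 2, so the sketch should reject far more often for `sigma_b=2.0` than for `sigma_b=1.0`.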

Figure 5 shows the results for the empirical power testing. Again, there is no clear distinction in results for the different types of dependence as each panel in the figure is similar to that for the case of independence (top-left panel). For power, the ideal is to have power larger than 80%, as is found for the Cressie–Read (CR) test (*λ* = 2/3). The test is otherwise underpowered, although *λ* = 0 is in the area of a respectable 60% empirical power.

As in Fig. 4, but for empirical power instead of size.


The power results in Fig. 5 are interesting because the empirical power is relatively good, but not great, for *λ* = 0 but dips for *λ* = 1/2. The power is best for *λ* = 2/3 and is too high for *λ* = 2 and, when sample sizes are small, for *λ* = 5. Values of *λ* below zero have little empirical power. Combined with the empirical size tests, these results demonstrate that the power-divergence test for comparing frequencies of one forecast’s being “better” than another is relatively accurate and high-powered for certain choices of *λ*, with *λ* = 2/3 being an optimal choice. The result provides further evidence that *λ* = 2/3 is a good compromise between some of the other goodness-of-fit measures in common use.

One final piece of technical information is important to understand regarding the above empirical testing procedure. Typically, a new forecast model, or modification of an existing one, is hypothesized to be “better” in some sense than the model it is intended to replace. Therefore, the interest is in what is called a one-sided test, so the testing performed here is for one-sided hypothesis testing. If it were only of interest to know whether or not one forecast model is better more often than the other, without regard for which one, then a two-sided test would be performed, which would require adjusting the size of the test in each direction from *α* to *α*/2 (Cortina and Dunlap 1997; Cox et al. 1977; Meehl 1967; Tukey 1991; Rubin 2020).

Clearly, from both the empirical size and power testing, the choice of *λ* = 2/3 is optimal for the situations discussed in this paper. Therefore, our recommendation to potential users of this statistic is to use *λ* = 2/3 when testing for the frequency-of-better. However, we continue to demonstrate the statistic using the nine chosen values (also used by Read and Cressie 1988) in this paper for those readers who are interested in how the test behaves for different choices.

### b. GTG model comparison example

The top half of Table 1 reports the numbers of times each GTG model outperforms the other using ME loss. Observed EDR is rarely above 0.1 m^{2/3} s^{−1} (less than 1% of the time for the current data series), and it is important to avoid issues with rare events. For example, Muñoz-Esparza et al. (2020) found that the two models had similar skill for values above 0.1 m^{2/3} s^{−1}, but that the HGTG model slightly outperformed the ML GTG for light-to-moderate turbulence associated with EDR within the range of (0.1, 0.3) m^{2/3} s^{−1}, whereas the ML GTG was found to be statistically significantly better for severe turbulence (EDR > 0.3 m^{2/3} s^{−1}), which constitutes only about 0.1% of the entire sample.

Results of “frequency-of-better” for the GTG EDR models under mean-error (ME) loss and absolute-error (AE) loss.

Following Muñoz-Esparza et al. (2020), the frequency-of-better test is performed for the categories shown in the table, rather than on the entire data, thereby avoiding the rare event issue. Fortunately, there remain sufficient data in each of these categories to conduct the test; that is, the asymptotic assumptions required for the *χ*^{2} distribution to hold are reasonable. Table 1 clearly shows that the ML GTG model is better more often under ME loss for null, very light, and light cases, whereas the HGTG model is better more often for moderate and severe cases. The question to be addressed by the power-divergence test is whether or not these results are likely to have been arrived at by chance or whether it is likely that one model truly outperforms the other more often.

The bottom half of Table 1 shows the results for the GTG models using AE loss. While it shows that the ML GTG is better more often for the null and very light stratifications, it suggests that the HGTG model is better more often for light cases. They also disagree for the severe case. The seeming contradiction is a caveat of ME loss, which can have errors on either side of zero cancel each other out. Therefore, the AE loss is generally more useful in this setting.

The power-divergence test results under ME loss for moderate turbulence conditions fail to reject the null hypothesis; for the remaining categories, the test rejects for every choice of *λ* with the estimated *p*-value at zero. For AE loss the only nonsignificant result is for the severe case, suggesting that while ML GTG is better more often, the result could be by chance. However, the *p*-value for this case (regardless of the choice of *λ*) is about 0.13, which is around the high end of the empirical size of the test found previously. Therefore, caution is recommended in drawing conclusions about significance here.

Muñoz-Esparza et al. (2020) compared skill for these models using intensity-based measures, which complements the frequency-of-better statistic suggested here. Based on the same mean error, they found that the ML GTG was better than HGTG for each category of null, very light, light, moderate and severe, while for MAE, they did find the HGTG to be better.

### c. HRRR 2-m temperature example

Figure 6 displays the main results for the HRRR version 3 versus HRRR version 4 2-m temperature data. In total, HRRR version 3 is the better model 1349 times, while version 4 is better 7673 times. On the other hand, the boxplots in Fig. 6a show that the distributions of differences in ME loss, while better for version 4, are not highly improved. The boxplots showing their loss differentials in Fig. 6b display a clear cyclic trend by hour of the day. It is, therefore, important to treat each hour separately in any analysis so as to avoid Simpson’s paradox whereby a confounding variable (in this case time of day) may obfuscate results if not taken into account. Related to the results in Fig. 6b are the results in the ACF plot of Fig. 2, which shows that the cyclic behavior also affects the dependence structure in the data.
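Treating each hour separately amounts to stratifying the loss differential by valid hour before counting. The sketch below shows one way this might be organized; the function name and the toy values are hypothetical and are not taken from the HRRR data.

```python
from collections import defaultdict


def wins_by_hour(hours, differential):
    """Count, for each valid hour, how often each model is better.

    hours: valid hour (e.g., 0-23 UTC) of each loss differential value.
    differential: loss of model A minus loss of model B (ties are ignored).
    Returns {hour: (times A better, times B better)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for hour, d in zip(hours, differential):
        if d < 0:
            counts[hour][0] += 1
        elif d > 0:
            counts[hour][1] += 1
    return {hour: tuple(pair) for hour, pair in sorted(counts.items())}


# Toy values only: pooled counts are (3, 2), but the hours disagree.
by_hour = wins_by_hour([0, 0, 12, 12, 12], [-1.0, -2.0, 1.0, 2.0, -0.5])
```

In this toy case the pooled counts slightly favor model A, yet model B is better more often at hour 12, which is the kind of masking that a confounding variable such as time of day can produce. A frequency-of-better test would then be applied to each hour's pair of counts.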

(a) Boxplots for the mean-error loss of 12-h forecasts of 2-m temperature (°C) for each version of HRRR, (b) boxplots of the loss differential by hour, and (c) the resulting *p*-values of the HG test applied to each hour individually.


To compare intensities, the HG test (Hering and Genton 2011) is employed, which is simply the usual paired *t* test but where the standard error is estimated by a weighted average of lag terms from having fit a parametric autocovariance function^{4} to the empirical one of the loss differential series (Hering and Genton 2011). That is, this test directly accounts for the temporal dependence, and was found to be robust to contemporaneous correlation. As with the frequency-of-better test, this test is conducted for each hour separately, which in addition to avoiding Simpson’s paradox, also avoids issues with the cyclic trend. The *p*-values from these tests are shown in Fig. 6c. The test rejects the null hypothesis that the expected loss differential is zero (no difference between the two HRRR versions) for parts of the day, but for the hours of 0800–1200 UTC the test clearly fails to reject this hypothesis. The power-divergence test does reject the null hypothesis that version 3 is no better (or worse) than version 4 more often at each hour of the day, except between 0900 and 1200 UTC where there is no significant difference in the frequency of being better. That is, apart from a couple of hours in the late morning, version 4 is better more often than version 3 with statistical significance.
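The idea of a paired test whose standard error accounts for temporal dependence can be illustrated with a simplified stand-in. The sketch below is not the HG test: instead of fitting a parametric autocovariance function, it sums empirical autocovariances up to a user-chosen lag, and it uses a normal approximation for the two-sided *p*-value. The function name and the `max_lag` argument are hypothetical.

```python
import math


def dependence_adjusted_mean_test(differential, max_lag):
    """Test H0: mean loss differential = 0 with an autocovariance-based variance.

    Simplification (an assumption of this sketch): empirical autocovariances
    up to max_lag replace the fitted parametric model of the HG test.
    Returns (test statistic, two-sided normal-approximation p-value).
    """
    n = len(differential)
    mean = sum(differential) / n
    centered = [d - mean for d in differential]

    def autocovariance(lag):
        return sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n

    # Variance of the sample mean under dependence: (g0 + 2 * sum of g_k) / n.
    long_run_variance = autocovariance(0) + 2 * sum(
        autocovariance(lag) for lag in range(1, max_lag + 1)
    )
    if long_run_variance <= 0:
        raise ValueError("estimated variance is not positive; try another max_lag")
    statistic = mean / math.sqrt(long_run_variance / n)
    p_value = math.erfc(abs(statistic) / math.sqrt(2))
    return statistic, p_value
```

With `max_lag=0` the variance estimate ignores dependence and the statistic reduces to a paired test under independence (with the variance divided by *n* rather than *n* − 1). Positive autocovariances at later lags inflate the standard error, which is how this kind of test guards against overstating significance for temporally dependent series.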

The bar plots in Fig. 7a show the numbers of times each version of HRRR is better than the other for 2-m temperature (°C). Clearly, version 4 is better more often at most valid times. However, at some valid times, it is only better slightly more often. At 1000, 1100, and 1200 UTC, version 3 is better more often, though only slightly more often at 1200 UTC.

Bar plots showing the number of times each HRRR version is better (for each valid time) than the other for 12-h forecasts of (a) 2-m temperature (°C) and (b) 10-m wind speed (m s^{−1}) using ME loss.


Figures 7b and 8 show the results for the loss differential between the two HRRR models for 10-m wind speed data. Again, a cyclic pattern in the errors is present (Fig. 8), if not as pronounced. Curiously, however, it shows that version 3 is better more often than version 4 because the median, and most of the box, of the box plots for the loss-differential series falls below the zero-line for most hours of the day; only from about 1900 to 0200 UTC do the box plots shift up, and even then the medians only go above zero from about 2100 to 0000 UTC.

As in Fig. 6, but for 10-m wind speed data.


The power-divergence test for frequency-of-better results in the same *p*-values at each time point but differs depending on the choice of *λ*. For *λ* ∈ {−5, −2, −1}, the test fails to reject the null hypothesis that neither model is better more often than the other. On the other hand, for *λ* ∈ {−1/2, 0, 1/2, 2/3, 1, 2, 5} the test rejects the null hypothesis. Read and Cressie (1988, chapter 5) summarize the conditions when each choice of *λ* should be used. For the purposes of the current test, and because there are enough data in each of the two cells (*k* = 1, 2) at each time point, and because the null hypothesis is the equiprobable one, the choice of larger positive values of *λ* is recommended. Indeed, for most purposes of employing this frequency-of-better test, a choice of *λ* = 2/3 (i.e., the CR test) is ideal as it is accurate, in terms of test size, powerful, and the most robust to situations where one cell has either few or no entries (i.e., zero percent) or where one cell has a considerably larger number of entries. Moreover, per Fig. 5, in this setting, tests for more negative values of *λ* fail to reject a false null hypothesis too often.

## 5. Conclusions

It is very common in weather forecast verification to perform statistical hypothesis tests to compare two competing forecast models. Most of the time, however, these tests only evaluate whether one model is better based on an average (intensity) error, such as a reduced RMS error relative to observations. While that is important, we propose that another important measure concerns how often one model is better than the other, even if the differences in the loss functions between the models are only slight. The power-divergence testing procedure is shown empirically to be an accurate and powerful test of the “frequency-of-better” for at least some choices of its indexing parameter *λ*. In particular, it is shown that these test results are reasonably robust to even severe departures from the assumption of independence, both temporal and contemporaneous.

As for which choice of *λ* should be used, for most purposes *λ* = 2/3, the so-called Cressie–Read (CR) test, is the ideal choice. In our empirical testing, it has good power as well as high accuracy in terms of test size; that is, the false rejection rate is close to the a priori chosen test size *α*. These results are in accord with previous results by Cressie and Read (1984), who introduced this choice as a new goodness-of-fit statistic.
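The notion of test size can be illustrated with a small simulation. The following Python sketch is not the simulation design used in this paper: it assumes independent, equiprobable outcomes, and the function names, sample size, number of replications, and seed are all illustrative choices. It draws many samples under the null hypothesis, applies the *λ* = 2/3 test with a chi-squared reference distribution on one degree of freedom, and reports how often the null hypothesis is falsely rejected.

```python
import math
import random

def cr_statistic(wins_a, wins_b, lam=2 / 3):
    """Power-divergence statistic for two cells under the equiprobable null."""
    n = wins_a + wins_b
    total = 0.0
    for count in (wins_a, wins_b):
        p = count / n
        if p > 0:  # an empty cell contributes zero when lam > 0
            total += p * ((p / 0.5) ** lam - 1.0)
    return 2.0 * n * total / (lam * (lam + 1.0))

def p_value(stat):
    """Upper-tail chi-squared probability with one degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

def empirical_size(n=200, reps=2000, alpha=0.05, seed=1):
    """Fraction of null simulations in which the test falsely rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Each case is an independent coin flip: model A is better or not.
        wins_a = sum(rng.random() < 0.5 for _ in range(n))
        if p_value(cr_statistic(wins_a, n - wins_a)) < alpha:
            rejections += 1
    return rejections / reps

size = empirical_size()
print(f"empirical size: {size:.3f} (nominal 0.05)")
```

Because the counts are discrete, the empirical rate will generally not equal *α* exactly, even with many replications. Studying robustness to dependence would require replacing the independent coin flips with a dependent generator, which this sketch does not attempt.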

Two separate modeling examples are employed to demonstrate the technique and illustrate the types of issues that need to be considered. In the first, turbulence is considered for different categories of severity according to eddy dissipation rate (EDR) using two versions of the Graphical Turbulence Guidance (GTG) algorithm (Sharman and Pearson 2017; Muñoz-Esparza and Sharman 2018; Muñoz-Esparza et al. 2020). An issue to consider with these data concerns the effects of rare occurrences, which can otherwise be diluted by the numerous nonturbulent environments that make up most of the dataset.

A second example involves two versions of the High-Resolution Rapid Refresh (HRRR), where 12-h forecasts of both 2-m temperature and 10-m wind speed are investigated. These data are marked by strong contemporaneous correlation between the two versions of the model, and a loss differential series with strong temporal dependence and a diurnal cycle in its magnitude. An intensity test known as the Hering–Genton (HG) test is employed for comparison against the power-divergence frequency-of-better test because it is also found to be robust to contemporaneous correlation and it directly accounts for the temporal dependence. The cyclic trend needs to be removed or, as is done here, each hour needs to be tested separately. The HG test finds significant results for either temperature or wind speed at only a few hours. However, significant results in terms of frequency-of-better are found for most hours and both variables.
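One way to organize the hour-by-hour testing is sketched below in Python. The data are synthetic, the record layout `(hour, loss_a, loss_b)` and the function names are hypothetical, and the chi-squared reference distribution on one degree of freedom assumes the two-category setting. The sketch is meant only to show the bookkeeping, not to reproduce the HRRR analysis.

```python
import math
import random
from collections import defaultdict

def cr_p_value(wins_a, wins_b, lam=2 / 3):
    """Cressie-Read p-value for H0: both models are better equally often."""
    n = wins_a + wins_b
    total = sum(
        (c / n) * (((c / n) / 0.5) ** lam - 1.0)
        for c in (wins_a, wins_b) if c > 0
    )
    stat = max(2.0 * n * total / (lam * (lam + 1.0)), 0.0)
    return math.erfc(math.sqrt(stat / 2.0))  # chi-squared, 1 degree of freedom

def frequency_by_hour(records):
    """Map each hour to (wins_a, wins_b, p_value) from (hour, loss_a, loss_b)."""
    counts = defaultdict(lambda: [0, 0])
    for hour, loss_a, loss_b in records:
        diff = loss_a - loss_b          # loss differential
        if diff < 0:
            counts[hour][0] += 1        # model A has the smaller loss
        elif diff > 0:
            counts[hour][1] += 1        # model B has the smaller loss
        # exact ties are ignored
    return {h: (a, b, cr_p_value(a, b)) for h, (a, b) in sorted(counts.items())}

# Synthetic losses with a made-up diurnal shift in model B's loss.
rng = random.Random(0)
records = []
for day in range(300):
    for hour in range(24):
        cycle = 0.3 * math.sin(2 * math.pi * hour / 24)
        loss_a = abs(rng.gauss(0, 1))
        loss_b = abs(rng.gauss(0, 1)) + cycle
        records.append((hour, loss_a, loss_b))

results = frequency_by_hour(records)
for hour, (a, b, p) in results.items():
    print(f"{hour:02d} UTC  A better: {a:3d}  B better: {b:3d}  p = {p:.3f}")
```

Testing each hour separately sidesteps the diurnal cycle because every subseries contains only cases from the same point in the cycle. It also produces 24 tests rather than one, so the results are best read hour by hour rather than as a single overall verdict.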

^{1} Previous studies have found that contemporaneous correlation can be problematic for many common statistical tests in the comparative forecast verification domain (cf. Hering and Genton 2011; DelSole and Tippett 2014; Gilleland et al. 2018).

^{2} It is not truly an exact test because *p* is unknown and must be estimated from the data.

^{3} A lagged term means that it is the same data series but at a different point in time. So, a scatter plot of the data against the lag-1 data means that the data series is plotted against itself but one time point back in time. A completely dependent series would make a straight line with slope 1.

^{4} An autocovariance function is analogous to the ACF described earlier, but where the covariances are used instead of correlations.

^{5} In this paper, the number of times one model is better than the other is used rather than the proportion, but either can be used, and the language of proportion is perhaps clearer in this context.

## Acknowledgments.

This material is based upon work supported by the National Center for Atmospheric Research, which is a major facility sponsored by the National Science Foundation under Cooperative Agreement 1852977. DME’s contributions were partly supported by the Federal Aviation Administration (FAA). The views expressed are those of the authors and do not necessarily represent the official policy or position of the FAA. A portion of this work was supported by NOAA’s Atmospheric Science for Renewable Energy (ASRE) Program. The first author thanks Noel Cressie for conversations that led to this paper, as well as subsequent feedback during the process.

## Data availability statement.

The turbulence dissipation forecasts from GTG are available from the authors upon request. The HRRR data, and the associated observations from the operational ASOS stations, are averages over the eastern CONUS. These data are available from the Model Analysis Tool Suite (MATS; https://www.esrl.noaa.gov/gsd/mats/) using the “Surface” application. The date range used in this analysis is from 1 August 2019 to 1 December 2022, which is when version 3 of the HRRR was being run operationally at NCEP and version 4 was frozen as part of the release process. See Dowell et al. (2022) for details on these two modeling systems.

## APPENDIX

### Background on the Power-Divergence Statistic

The power-divergence statistic encompasses many common goodness-of-fit statistics for comparing the distributions between two populations, or between a sample and a parametric distribution. The Kullback–Leibler statistic, for example, has been used considerably in weather forecast verification (e.g., Weijs et al. 2010). The power-divergence statistic is a categorical summary in which each member of the family is essentially a combination of either ratios or differences between the cell frequencies for the two distributions being compared, where the specific combination is determined by a user-chosen parameter *λ* (see Table A1 for a list of some common choices).

Table A1. Some properties of the power-divergence statistic of Eq. (A1) summarized from Read and Cressie (1988). The cases of *λ* = −1 and 0 are defined by continuity.

The application of interest here concerns testing whether one forecast model is “better” more often than a competing forecast model. Ties are ignored because the sense of better is defined in terms of a loss function [e.g., root-mean-square error (RMSE), critical success index (CSI), etc.], which is a continuous statistic, so a tie occurs with probability zero. Thus, there are two categories, *k* = 1, 2: 1) forecast A is better and 2) forecast B is better. The null hypothesis is the equiprobable hypothesis that the proportion of times forecast A is better,^{5} labeled *q*_{1}, is equal to the proportion of times that forecast B is better, labeled *q*_{2}; that is, *q*_{1} = *q*_{2} = 0.5. The observed proportion of times that forecast A is better is labeled *p*_{1}, and the observed proportion of times that forecast B is better is labeled *p*_{2}. The power-divergence statistic can be applied to an arbitrary number of categories, *k*, and not just *k* = 2. The description that follows is for the more general setting of *k* > 1 categories.

In Eq. (A1), the index parameter satisfies −∞ < *λ* < ∞, **q** = (*q*_{1}, …, *q*_{k}) is the null-hypothesis distribution, and **p** = (*p*_{1}, …, *p*_{k}) is the distribution estimated from the data. In the present case, *k* = 2 and **q** = (*q*_{1}, *q*_{2}) = (*q*, 1 − *q*) = (50%, 50%). Table A1 summarizes some of the properties of the power-divergence statistic from Read and Cressie (1988).
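For reference, the family introduced by Cressie and Read (1984) is commonly written in the form below. This is a sketch in the notation of this appendix, under the assumption that *n* (a symbol introduced here) denotes the total number of cases, so that *np*_{i} is the observed count in cell *i*. The placement of the 2*n* scaling is a convention and may differ from the one used in Eq. (A1).

```latex
2n\,I^{\lambda}(\mathbf{p}:\mathbf{q})
  = \frac{2n}{\lambda(\lambda+1)} \sum_{i=1}^{k} p_i
    \left[ \left( \frac{p_i}{q_i} \right)^{\lambda} - 1 \right],
  \qquad -\infty < \lambda < \infty .

% Special case lambda = 1 (Pearson's chi-squared statistic):
2n\,I^{1}(\mathbf{p}:\mathbf{q})
  = n \sum_{i=1}^{k} \frac{p_i^{2}}{q_i} - n
  = \sum_{i=1}^{k} \frac{(n p_i - n q_i)^{2}}{n q_i} .

% Two cells with the equiprobable null q = (1/2, 1/2):
2n\,I^{1}(\mathbf{p}:\mathbf{q}) = n\,(2 p_1 - 1)^{2} .
```

With this form, the values at *λ* = 0 and *λ* = −1 are obtained as limits, which is the sense in which those cases are defined by continuity. In the two-cell case, the *λ* = 1 statistic depends only on how far the observed proportion *p*_{1} lies from one-half.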

Cressie and Read (1984) showed by way of a Taylor-series expansion that *I*^{λ}(**p** : **q**) follows the same asymptotic *χ*^{2} distribution regardless of the choice of *λ*; indeed, under the classical assumptions when the null model is true, all of the members of the power-divergence family are asymptotically equivalent as the sample size approaches infinity. It is possible, however, to obtain seemingly conflicting results among the different choices, because each test is optimal for different hypotheses. The optimal choice of which *λ* to use is partially summarized in Table A1. Generally, *λ* should be chosen within the range of about (−1, 5), with larger values of *λ* used to detect a departure from the null hypothesis model that involves one cell with a large ratio of the alternative/null expected frequency, and more negative values used when one cell has a near-zero value for this ratio (Read and Cressie 1988, sections 5.5 and 6.7). When values of Eq. (A1) are similar for different choices of *λ*, it indicates that no single cell provides a major contribution to the lack of fit. Otherwise, it can be deduced that one or two cells have a large departure from their expected frequencies. When it is not desired to weight a single cell too heavily, choices of *λ* closer to zero are warranted. See Read and Cressie (1988) for a more thorough discussion regarding the choice of *λ*.
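The effect of *λ* can be explored numerically. The Python sketch below assumes the standard Cressie and Read (1984) form of the statistic with the 2*n* scaling, handles *λ* = 0 and *λ* = −1 by their limits, and uses a chi-squared reference distribution with *k* − 1 = 1 degree of freedom. The counts are hypothetical, and the function name is an illustrative choice rather than anything defined in this paper.

```python
import math

def power_divergence(counts, null_probs, lam):
    """2n * I^lam(p : q) for observed counts and null cell probabilities."""
    n = sum(counts)
    p = [c / n for c in counts]
    if lam <= -1 and any(pi == 0 for pi in p):
        return math.inf                # an empty cell gives an infinite value
    pairs = [(pi, qi) for pi, qi in zip(p, null_probs) if pi > 0]
    if lam == 0:                       # limiting case: loglikelihood ratio
        total = sum(pi * math.log(pi / qi) for pi, qi in pairs)
    elif lam == -1:                    # limiting case: modified loglikelihood
        total = sum(qi * math.log(qi / pi) for pi, qi in pairs)
    else:
        total = sum(pi * ((pi / qi) ** lam - 1) for pi, qi in pairs)
        total /= lam * (lam + 1)
    return 2 * n * total

counts = [230, 170]                    # hypothetical: A better 230 times, B 170
null = [0.5, 0.5]                      # equiprobable null hypothesis
lambdas = [-5, -2, -1, -0.5, 0, 0.5, 2 / 3, 1, 2, 5]

stats = {lam: power_divergence(counts, null, lam) for lam in lambdas}
for lam, stat in stats.items():
    p_val = math.erfc(math.sqrt(stat / 2))   # chi-squared tail, 1 df
    print(f"lambda = {lam:6.3f}  statistic = {stat:7.3f}  p-value = {p_val:.4f}")
```

When neither cell is close to empty, as in this made-up example, the statistics for different *λ* tend to be of similar size; changing `counts` so that one cell is nearly empty shows how the more negative choices of *λ* react much more strongly. SciPy's `scipy.stats.power_divergence`, which accepts observed counts and a `lambda_` argument, offers a library implementation for general *k*.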

One may note that Eq. (A1) is not always symmetric; that is, *I*^{λ}(**p** : **q**) ≠ *I*^{λ}(**q** : **p**) for some choices of *λ*. For example, the Kullback–Leibler (KL) and loglikelihood (*G*^{2}) statistics differ only in the order in which **p** and **q** are entered, and are thus sometimes referred to as directed divergence measures. A symmetrical form is readily attained as KL + *G*^{2}, which is attributed to Jeffreys (1998). An even more general statistic that includes the power-divergence statistic is known as the phi-divergence statistic (Jager and Wellner 2007); it encompasses even more familiar goodness-of-fit statistics, such as the Anderson–Darling statistic (Anderson and Darling 1952).
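The asymmetry is easy to see numerically. The Python sketch below computes the logarithmic divergence with the two distributions entered in each order and then forms their symmetric sum. The proportions are hypothetical, the function names are illustrative, the constant 2*n* scaling is omitted, and all cells are assumed to be nonzero.

```python
import math

def directed_divergence(a, b):
    """Sum of a_i * log(a_i / b_i): divergence of distribution a from b."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def symmetric_divergence(a, b):
    """Sum of the two directed divergences; unchanged if a and b are swapped."""
    return directed_divergence(a, b) + directed_divergence(b, a)

p = [0.7, 0.3]   # hypothetical observed proportions
q = [0.5, 0.5]   # equiprobable null distribution

forward = directed_divergence(p, q)    # p entered first
backward = directed_divergence(q, p)   # q entered first
print(f"p:q = {forward:.5f}, q:p = {backward:.5f}")
print(f"symmetric sum = {symmetric_divergence(p, q):.5f}")
```

The two directed values differ because each weights the log ratios by a different distribution, whereas the symmetric sum gives the same value regardless of which distribution is entered first.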

## REFERENCES

Anderson, T. W., and D. A. Darling, 1952: Asymptotic theory of certain “goodness-of-fit” criteria based on stochastic processes. *Ann. Math. Stat.*, **23**, 193–212, https://doi.org/10.1214/aoms/1177729437.

Brockwell, P. J., and R. A. Davis, 2010: *Introduction to Time Series and Forecasting*. 2nd ed. Springer, 437 pp.

Cortina, J. M., and W. P. Dunlap, 1997: On the logic and purpose of significance testing. *Psychol. Methods*, **2**, 161–172, https://doi.org/10.1037/1082-989X.2.2.161.

Cox, D. R., E. Spjøtvoll, S. Johansen, W. R. van Zwet, J. F. Bithell, O. Barndorff-Nielsen, and M. Keuls, 1977: The role of significance tests. *Scand. J. Stat.*, **4**, 49–70.

Cressie, N., and T. R. C. Read, 1984: Multinomial goodness-of-fit tests. *J. Roy. Stat. Soc.*, **46**, 440–464.

DelSole, T., and M. K. Tippett, 2014: Comparing forecast skill. *Mon. Wea. Rev.*, **142**, 4658–4678, https://doi.org/10.1175/MWR-D-14-00045.1.

Dowell, D. C., and Coauthors, 2022: The High-Resolution Rapid Refresh (HRRR): An hourly updating convection-allowing forecast model. Part I: Motivation and system description. *Wea. Forecasting*, **37**, 1371–1395, https://doi.org/10.1175/WAF-D-21-0151.1.

Drezner, Z., and N. Farnum, 1993: A generalized binomial distribution. *Commun. Stat. Theory Methods*, **22**, 3051–3063, https://doi.org/10.1080/03610929308831202.

Freeman, M. F., and J. W. Tukey, 1950: Transformations related to the angular and the square root. *Ann. Math. Stat.*, **21**, 607–611, https://doi.org/10.1214/aoms/1177729756.

Gilleland, E., 2020: Bootstrap methods for statistical inference. Part I: Comparative forecast verification for continuous variables. *J. Atmos. Oceanic Technol.*, **37**, 2117–2134, https://doi.org/10.1175/JTECH-D-20-0069.1.

Gilleland, E., A. S. Hering, T. L. Fowler, and B. G. Brown, 2018: Testing the tests: What are the impacts of incorrect assumptions when applying confidence intervals or hypothesis tests to compare competing forecasts? *Mon. Wea. Rev.*, **146**, 1685–1703, https://doi.org/10.1175/MWR-D-17-0295.1.

Hering, A. S., and M. G. Genton, 2011: Comparing spatial predictions. *Technometrics*, **53**, 414–425, https://doi.org/10.1198/TECH.2011.10136.

Jager, L., and J. A. Wellner, 2007: Goodness-of-fit tests via phi-divergences. *Ann. Stat.*, **35**, 2018–2053, https://doi.org/10.1214/0009053607000000244.

James, E. P., and Coauthors, 2022: The High-Resolution Rapid Refresh (HRRR): An hourly updating convection-allowing forecast model. Part II: Forecast performance. *Wea. Forecasting*, **37**, 1397–1417, https://doi.org/10.1175/WAF-D-21-0130.1.

Jeffreys, H., 1998: *Theory of Probability*. 3rd ed. Oxford University Press, 470 pp.

Kullback, S., and R. A. Leibler, 1951: On information and sufficiency. *Ann. Math. Stat.*, **22**, 79–86, https://doi.org/10.1214/aoms/1177729694.

Meehl, P. E., 1967: Theory-testing in psychology and physics: A methodological paradox. *Philos. Sci.*, **34**, 103–115, https://doi.org/10.1086/288135.

Muñoz-Esparza, D., and R. Sharman, 2018: An improved algorithm for low-level turbulence forecasting. *J. Appl. Meteor. Climatol.*, **57**, 1249–1263, https://doi.org/10.1175/JAMC-D-17-0337.1.

Muñoz-Esparza, D., R. Sharman, and W. Deierling, 2020: Aviation turbulence forecasting at upper levels with machine learning techniques based on regression trees. *J. Appl. Meteor. Climatol.*, **59**, 1883–1899, https://doi.org/10.1175/JAMC-D-20-0116.1.

Neyman, J., 1949: Contribution to the theory of the *χ*^{2} test. *Proc. First Berkeley Symp. on Mathematical Statistics and Probability*, Berkeley, CA, University of California, 239–273, https://digitalassets.lib.berkeley.edu/math/ucb/text/math_s1_article-14.pdf.

Pearson, K., 1900: On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. *Philos. Mag.*, **50**, 157–175, https://doi.org/10.1080/14786440009463897.

Read, T. R. C., and N. A. C. Cressie, 1988: *Goodness-of-Fit Statistics for Discrete Multivariate Data*. 1st ed. Springer-Verlag, 212 pp.

Rubin, M., 2020: That’s not a two-sided test! It’s two one-sided tests! *Significance*, **17**, 38–41, https://doi.org/10.1111/1740-9713.01405.

Sharman, R. D., and J. M. Pearson, 2017: Prediction of energy dissipation rates for aviation turbulence. Part I: Forecasting nonconvective turbulence. *J. Appl. Meteor. Climatol.*, **56**, 317–337, https://doi.org/10.1175/JAMC-D-16-0205.1.

Shumway, R. H., and D. S. Stoffer, 2017: *Time Series Analysis and its Applications: With R Examples*. 4th ed. Springer International Publishing, 562 pp.

Singh, D., and S. Kumar, 2020: Limit theorems for sums of dependent and non-identical Bernoulli random variables. *Amer. J. Math. Manage. Sci.*, **39**, 150–165, https://doi.org/10.1080/01966324.2019.1673266.

Tukey, J. W., 1991: The philosophy of multiple comparisons. *Stat. Sci.*, **6**, 100–116, https://doi.org/10.1214/ss/1177011945.

Turner, D. D., and Coauthors, 2020: A verification approach used in developing the Rapid Refresh and other numerical weather prediction models. *J. Oper. Meteor.*, **8**, 39–53, https://doi.org/10.15191/nwajom.2020.0803.

Weijs, S. V., R. van Nooijen, and N. van de Giesen, 2010: Kullback–Leibler divergence as a forecast skill score with classic reliability–resolution–uncertainty decomposition. *Mon. Wea. Rev.*, **138**, 3387–3399, https://doi.org/10.1175/2010MWR3229.1.

Woodbury, M. A., 1949: On a probability distribution. *Ann. Math. Stat.*, **20**, 311–313, https://doi.org/10.1214/aoms/1177730043.