Corrigendum

Benedikt Schulz (a) (https://orcid.org/0000-0003-1367-1543) and Sebastian Lerch (a,b)

(a) Karlsruhe Institute of Technology, Karlsruhe, Germany
(b) Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

© 2023 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

This article is included in the Waves to Weather (W2W) Special Collection.

Corresponding author: Benedikt Schulz, benedikt.schulz2@kit.edu

In Schulz and Lerch (2022), there was an error in the calculation of the Diebold–Mariano (DM; Diebold and Mariano 1995) test statistics that affected Table 5 and Fig. 7 of that paper but not the overall conclusions drawn in the study. In section d of appendix A, the third equation (the variance term) is incorrect because a square root transformation is missing; it should instead read
$$\hat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^{n}\left[S(F_i, y_i) - S(G_i, y_i)\right]^2$$
(i.e., a square or power of 2 has been added to the left-hand-side term). Note that we estimate the variance without centering [following, e.g., Gneiting and Katzfuss (2014)]. With centering, the DM test is equivalent to a standard t test for the score difference being equal to 0 in expectation. Both variants are valid estimators of the variance under the null hypothesis. We refer to Jordan (2016) for details.1
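
For readers who want to reproduce the correction, the following is a minimal Python sketch of the DM statistic $t_n = \sqrt{n}\,\bar{d}_n/\hat{\sigma}_n$ (with $\bar{d}_n$ the mean score difference), using the corrected, uncentered variance estimator above. The function and variable names (dm_test, scores_f, scores_g) and the one-sided orientation are our own illustrative assumptions, not code from the repository in footnote 2.

```python
import numpy as np
from scipy.stats import norm

def dm_test(scores_f, scores_g):
    """One-sided DM test that forecast F outperforms forecast G.

    scores_f, scores_g: arrays of scores S(F_i, y_i) and S(G_i, y_i)
    (e.g., the CRPS) for the same n verification cases.
    """
    d = np.asarray(scores_f) - np.asarray(scores_g)  # score differences
    n = d.size
    # Corrected (uncentered) variance estimator from the corrigendum:
    # sigma_n^2 = (1/n) * sum(d_i^2). The original error was the missing
    # square root, i.e., this mean of squares was used directly as sigma_n.
    sigma_n = np.sqrt(np.mean(d**2))
    t_n = np.sqrt(n) * d.mean() / sigma_n
    # Under the null of equal predictive performance, t_n is compared
    # against a standard normal; a small p value favors forecast F.
    p_value = norm.cdf(t_n)
    return t_n, p_value
```

With centering, one would replace the mean of squares by the empirical variance of d, which, as noted above, makes the test equivalent to a standard t test for the score difference being zero in expectation.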

The code was also subject to this error,2 and hence the original Table 5 and Fig. 7 were recalculated using the correct formula; the corrected versions are displayed here as Table 5 and Fig. 7. The conclusions drawn in Schulz and Lerch (2022) still hold, but the numbers stated in one paragraph need to be updated. The third paragraph of section 4d should now read as follows:

Table 5. Ratio of lead time–station combinations (%) where pairwise DM tests indicate statistically significant CRPS differences after applying a Benjamini–Hochberg procedure to account for multiple testing at a nominal level of α = 0.05 for the corresponding one-sided tests. The (i, j) entry in the ith row and jth column indicates the ratio of cases where the null hypothesis of equal predictive performance of the corresponding one-sided DM test is rejected in favor of the model in the ith row when compared with the model in the jth column. The remainder of the sum of the (i, j) and (j, i) entries to 100% is the ratio of cases for which the score differences are not significant.
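
The significance ratios in Table 5 rest on the Benjamini–Hochberg (BH) adjustment named in its caption. As an illustration, here is a minimal sketch of the BH step-up procedure applied to the p values of the one-sided DM tests; the function name and inputs are hypothetical, in the same spirit as the sketch above.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of null hypotheses rejected by the BH procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                         # ranks of ascending p values
    thresholds = alpha * np.arange(1, m + 1) / m  # BH step-up thresholds (k/m) * alpha
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()            # largest rank meeting the criterion
        rejected[order[:k + 1]] = True            # reject that and all smaller p values
    return rejected
```

For a given pair of models, the fraction of rejections across all station–lead time combinations would then correspond to one entry of Table 5.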
Fig. 7. Best method at each station in terms of the CRPS, averaged over all lead times. The point sizes indicate the level of statistical significance of the observed CRPS differences relative only to the methods from the other groups of methods for all lead times. Three point sizes are possible: the smallest indicates statistically significant differences for at most 90% of the performed tests, the middle size for up to 99%, and the largest for 100%, meaning that all differences are statistically significant.

Citation: Monthly Weather Review 151, 5; 10.1175/MWR-D-23-0010.1

We find that the observed score differences are statistically significant for a high ratio of stations and lead times; see Table 5. In particular, DRN and BQN significantly outperform the basic models at more than 94%, and even significantly outperform QRF and EMOS-GB at more than 50% of all combinations of stations and lead times. Among the locally estimated methods, QRF performs best but only provides significant improvements over the NN-based methods for around 1% of the cases.

That is, the three percentages have been reduced from 97%, 80%, and 5% to 94%, 50%, and 1%, respectively.

In general, the ratio of significant tests in Table 5 decreases. The comparisons among the network-based methods DRN, BQN, and HEN are especially affected: DRN and BQN now perform significantly better than the respective other method in less than 2% of the cases instead of in ∼44%–46%. For HEN, QRF, and EMOS-GB, the ratio of significant tests indicating the superiority of DRN and BQN also decreases, by ∼25–40 percentage points.

Comparing the original Fig. 7 in Schulz and Lerch (2022) with the revised Fig. 7, the symbol sizes decrease because of the reduced ratio of significant tests. The best-performing method at each location is not affected and does not change.

1 We thank Ron McTaggart-Cowan, Michael Scheuerer, and Alexander Jordan for constructive discussions on the different variants of the DM test.

2 In addition to this Corrigendum, we updated the code on the corresponding GitHub repository (https://github.com/benediktschulz/paper_pp_wind_gusts).

REFERENCES

Diebold, F. X., and R. S. Mariano, 1995: Comparing predictive accuracy. J. Bus. Econ. Stat., 13, 253–263.

Gneiting, T., and M. Katzfuss, 2014: Probabilistic forecasting. Annu. Rev. Stat. Appl., 1, 125–151.

Jordan, A., 2016: Facets of forecast evaluation. Ph.D. thesis, Karlsruhe Institute of Technology.

Schulz, B., and S. Lerch, 2022: Machine learning methods for postprocessing ensemble forecasts of wind gusts: A systematic comparison. Mon. Wea. Rev., 150, 235–257.

