  • Arcones, M. A., and E. Giné, 1989: The bootstrap of the mean with arbitrary bootstrap sample. Ann. Inst. Henri Poincaré, 25, 457–481.

  • Arcones, M. A., and E. Giné, 1991: Additions and corrections to “The bootstrap of the mean with arbitrary bootstrap sample.” Ann. Inst. Henri Poincaré, 27, 583–595.

  • Athreya, K. B., 1987a: Bootstrap of the mean in the infinite variance case. Proc. First World Congress of the Bernoulli Society, Utrecht, Netherlands, Bernoulli Society, 95–98.

  • Athreya, K. B., 1987b: Bootstrap of the mean in the infinite variance case. Ann. Stat., 15, 724–731, https://doi.org/10.1214/aos/1176350371.

  • Azzalini, A., and A. Genz, 2016: mnormt: The Multivariate Normal and t Distributions, version 1.5-5. R package, https://azzalini.stat.unipd.it/SW/Pkg-mnormt.

  • Bickel, P. J., and D. A. Freedman, 1981: Some asymptotic theory for the bootstrap. Ann. Stat., 9, 1196–1217, https://doi.org/10.1214/aos/1176345637.

  • Bickel, P. J., F. Götze, and W. R. van Zwet, 1997: Resampling fewer than n observations: Gains, losses, and remedies for losses. Stat. Sin., 7, 1–31.

  • Brockwell, P. J., and R. A. Davis, 1996: Introduction to Time Series and Forecasting. Springer, 420 pp.

  • Canty, A., and B. Ripley, 2017: boot: Bootstrap R (S-Plus) Functions, version 1.3-20. R package, http://statwww.epfl.ch/davison/BMA/.

  • Carpenter, J., 1999: Test inversion bootstrap confidence intervals. J. Roy. Stat. Soc., 61B, 159–172, https://doi.org/10.1111/1467-9868.00169.

  • Carpenter, J., and J. Bithell, 2000: Bootstrap confidence intervals: When, which, what? A practical guide for medical statisticians. Stat. Med., 19, 1141–1164, https://doi.org/10.1002/(SICI)1097-0258(20000515)19:9<1141::AID-SIM479>3.0.CO;2-F.

  • Davison, A., and D. Hinkley, 1997: Bootstrap Methods and Their Application. Cambridge University Press, 582 pp.

  • Deheuvels, P., D. M. Mason, and G. R. Shorack, 1993: Some results on the influence of extremes on the bootstrap. Ann. Inst. Henri Poincaré Probab. Stat., 29, 83–103.

  • DelSole, T., and M. K. Tippett, 2014: Comparing forecast skill. Mon. Wea. Rev., 142, 4658–4678, https://doi.org/10.1175/MWR-D-14-00045.1.

  • Efron, B., 1979: Bootstrap methods: Another look at the jackknife. Ann. Stat., 7, 1–26, https://doi.org/10.1214/aos/1176344552.

  • Efron, B., and R. Tibshirani, 1993: An Introduction to the Bootstrap. Vol. 57. Chapman and Hall, 436 pp.

  • Feigin, P., and S. I. Resnick, 1997: Linear programming estimators and bootstrapping for heavy-tailed phenomena. Adv. Appl. Probab., 29, 759–805, https://doi.org/10.2307/1428085.

  • Fukuchi, J.-I., 1994: Bootstrapping extremes of random variables. Ph.D. thesis, Dept. of Statistics, Iowa State University, 96 pp., https://doi.org/10.31274/rtd-180813-10322.

  • Garthwaite, P. H., and S. T. Buckland, 1992: Generating Monte Carlo confidence intervals by the Robbins-Monro process. Appl. Stat., 41, 159–171, https://doi.org/10.2307/2347625.

  • Gilleland, E., 2010: Confidence intervals for forecast verification. NCAR Tech. Rep. NCAR/TN-479+STR, 71 pp.

  • Gilleland, E., 2013: Testing competing precipitation forecasts accurately and efficiently: The spatial prediction comparison test. Mon. Wea. Rev., 141, 340–355, https://doi.org/10.1175/MWR-D-12-00155.1.

  • Gilleland, E., 2017: distillery: Method Functions for Confidence Intervals and to Distill Information from an Object, version 1.0-4. R package, https://www.ral.ucar.edu/staff/ericg.

  • Gilleland, E., A. S. Hering, T. L. Fowler, and B. G. Brown, 2018: Testing the tests: What are the impacts of incorrect assumptions when applying confidence intervals or hypothesis tests to compare competing forecasts? Mon. Wea. Rev., 146, 1685–1703, https://doi.org/10.1175/MWR-D-17-0295.1.

  • Giné, E., and J. Zinn, 1989: Necessary conditions for the bootstrap of the mean. Ann. Stat., 17, 684–691, https://doi.org/10.1214/aos/1176347134.

  • Hall, P., 1990: Asymptotic properties of the bootstrap for heavy-tailed distributions. Ann. Probab., 18, 1342–1360, https://doi.org/10.1214/aop/1176990748.

  • Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167, https://doi.org/10.1175/1520-0434(1999)014<0155:HTFENP>2.0.CO;2.
  • Hamilton, J. D., 1994: Time Series Analysis. Princeton University Press, 799 pp.

  • Hannig, J., 2009: On generalized fiducial inference. Stat. Sin., 19, 491–544.

  • Hannig, J., H. K. Iyer, and P. Patterson, 2006: Fiducial generalized confidence intervals. J. Amer. Stat. Assoc., 101, 254–269, https://doi.org/10.1198/016214505000000736.

  • Hannig, J., H. K. Iyer, and C. M. Wang, 2007: Fiducial approach to uncertainty assessment accounting for error due to instrument resolution. Metrologia, 44, 476–483, https://doi.org/10.1088/0026-1394/44/6/006.

  • Hering, A. S., and M. G. Genton, 2011: Comparing spatial predictions. Technometrics, 53, 414–425, https://doi.org/10.1198/TECH.2011.10136.

  • Jolliffe, I. T., 2007: Uncertainty and inference for verification measures. Wea. Forecasting, 22, 637–650, https://doi.org/10.1175/WAF989.1.

  • Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley, 254 pp.

  • Kabaila, P., 1993: Some properties of profile bootstrap confidence intervals. Aust. J. Stat., 35, 205–214, https://doi.org/10.1111/j.1467-842X.1993.tb01326.x.

  • Kinateder, J. G., 1992: An invariance principle applicable to the bootstrap. Exploring the Limits of Bootstrap, R. Lepage and L. Billard, Eds., Wiley Series in Probability and Mathematical Statistics, Wiley, 157–181.

  • Knight, K., 1989: On the bootstrap of the sample mean in the infinite variance case. Ann. Stat., 17, 1168–1175, https://doi.org/10.1214/aos/1176347262.

  • Lahiri, S. N., 2003: Resampling Methods for Dependent Data. Springer, 374 pp.

  • Lee, S., 1999: On a class of m out of n bootstrap confidence intervals. J. Roy. Stat. Soc., 61B, 901–911, https://doi.org/10.1111/1467-9868.00209.

  • LePage, R., 1992: Bootstrapping signs. Exploring the Limits of Bootstrap, R. Lepage and L. Billard, Eds., Wiley Series in Probability and Mathematical Statistics, Wiley, 215–224.

  • Lidong, E., J. Hannig, and H. Iyer, 2008: Fiducial intervals for variance components in an unbalanced two-component normal mixed linear model. J. Amer. Stat. Assoc., 103, 854–865, https://doi.org/10.1198/016214508000000229.
  • Liu, Y., and Coauthors, 2009: An operational mesoscale ensemble data assimilation and prediction system: E-RTFDDA—System design and verification. 19th Numerical Weather Prediction Conf./23rd Weather Forecasting Conf., Omaha, NE, Amer. Meteor. Soc., 16A.4, https://ams.confex.com/ams/23WAF19NWP/techprogram/paper_154271.htm.

  • Mood, A. M., F. A. Graybill, and D. C. Boes, 1963: Introduction to the Theory of Statistics. 3rd ed. McGraw Hill, 564 pp.

  • Okkan, U., and U. Kirdemir, 2018: Investigation of the behavior of an agricultural-operated dam reservoir under RCP scenarios of AR5-IPCC. Water Resour. Manage., 32, 2847–2866, https://doi.org/10.1007/s11269-018-1962-0.
  • R Core Team, 2017: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, https://www.R-project.org/.

  • Resnick, S. I., 2007: Heavy-Tail Phenomena: Probabilistic and Statistical Modeling. Springer Series in Operations Research and Financial Engineering, Springer, 404 pp.

  • Ripley, B. D., 2002: Time series in R 1.5.0. R News, No. 2(2), R Foundation, Vienna, Austria, 2–7, https://www.r-project.org/doc/Rnews/Rnews_2002-2.pdf.

  • Robbins, H., and S. Monro, 1951: A stochastic approximation method. Ann. Math. Stat., 22, 400–407, https://doi.org/10.1214/aoms/1177729586.

  • Rubin, D. B., 1981: The Bayesian bootstrap. Ann. Stat., 9, 130–134, https://doi.org/10.1214/aos/1176345338.

  • Schendel, T., and R. Thongwichian, 2015: Flood frequency analysis: Confidence interval estimation by test inversion bootstrapping. Adv. Water Resour., 83, 1–9, https://doi.org/10.1016/j.advwatres.2015.05.004.

  • Schendel, T., and R. Thongwichian, 2017a: Confidence intervals for return levels for the peaks-over-threshold approach. Adv. Water Resour., 99, 53–59, https://doi.org/10.1016/j.advwatres.2016.11.011.

  • Schendel, T., and R. Thongwichian, 2017b: Considering historical flood events in flood frequency analysis: Is it worth the effort? Adv. Water Resour., 105, 144–153, https://doi.org/10.1016/j.advwatres.2017.05.002.

  • Shao, J., and T. Dongsheng, 1995: The Jackknife and the Bootstrap. Springer-Verlag, 123 pp.

  • von Storch, H., and F. W. Zwiers, 2001: Statistical Analysis in Climate Research. 1st ed. Cambridge University Press, 484 pp.

  • Wandler, D., and J. Hannig, 2012: Generalized fiducial confidence intervals for extremes. Extremes, 15, 67–87, https://doi.org/10.1007/s10687-011-0127-9.

  • Wasserman, L., 2005: All of Statistics: A Concise Course in Statistical Inference. Springer, 442 pp.

  • Wilks, D. S., 1997: Resampling hypothesis tests for autocorrelated fields. J. Climate, 10, 65–82, https://doi.org/10.1175/1520-0442(1997)010<0065:RHTFAF>2.0.CO;2.
  • Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. An Introduction. 2nd ed. Academic Press, 627 pp.



Bootstrap Methods for Statistical Inference. Part I: Comparative Forecast Verification for Continuous Variables

  • 1 National Center for Atmospheric Research, Boulder, Colorado

Abstract

When making statistical inferences, bootstrap resampling methods are often appealing because of less stringent assumptions about the distribution of the statistic(s) of interest. However, the procedures are not free of assumptions. This paper addresses a specific situation that occurs frequently in atmospheric sciences where the standard bootstrap is not appropriate: comparative forecast verification of continuous variables. In this setting, the question to be answered concerns which of two weather or climate models is better in the sense of some type of average deviation from observations. The series to be compared are generally strongly dependent, which invalidates the most basic bootstrap technique. This paper also introduces new bootstrap code from the R package “distillery” that facilitates easy implementation of appropriate methods for paired-difference-of-means bootstrap procedures for dependent data.

Supplemental information related to this paper is available at the Journals Online website: https://doi.org/10.1175/JTECH-D-20-0069.s1.

© 2020 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Eric Gilleland, ericg@ucar.edu

This article has a companion article which can be found at http://journals.ametsoc.org/doi/abs/10.1175/JTECH-D-20-0070.1.


1. Introduction

A common situation in atmospheric science is the need to test a new forecast model, or a modification of an existing one, against the currently used, or other baseline, model. Such comparisons present hidden challenges that are often overlooked. Even when it is recognized that standard parametric statistical tests might not be appropriate, bootstrap methods are often seen as a fix for any situation. However, bootstrap methods still require assumptions. For example, the most commonly used bootstrap procedure, known as the independent and identically distributed (iid) bootstrap, fails to produce accurate results if dependence is present; the false rejection rate is too high when the data are serially dependent.

This paper has two primary objectives. The first is to review statistical inference with a focus on bootstrap methodology. This part concentrates on comparative forecast verification of continuous variables, which typically involves series that are highly dependent in time and that are also contemporaneously dependent on one another. Bootstrap methods that will produce accurate results in this setting are described. The second aim is to describe new R software (R Core Team 2017), available in the “distillery” package (Gilleland 2017), for performing bootstrap inference in this setting, and to explain how to use the code. A companion paper describes another common situation in atmospheric science that presents challenges of its own for bootstrap methodology: extreme-value analysis.

2. Background and notation

The bootstrap, introduced by Efron (1979), is a relatively simple procedure, but its explanation can become complicated because of the plug-in principle that effectively changes the roles of sample and population. In this section, a brief introduction to statistical inference is given, along with notation, in an attempt to alleviate any difficulty for the reader. While some of what follows in section 2a might seem trivial, the distinctions among random variables, estimators, and estimates become important when the bootstrap paradigm is introduced in section 2b, which can be confusing if the notation is not fully understood.

a. Statistical language

In science, progress usually comes through experimentation where conclusions from a given experiment are necessarily generalized to a broader scope of similar experiments, a process known as inductive inference. Inductive inference is well known to be fraught with hazard (cf. Mood et al. 1963, p. 220). Nevertheless, inductive inference attempts to answer the more interesting and important questions, so it is important to understand that uncertainty is inherently present. Statistical analysis aims to quantify aleatory, or sampling, uncertainty by modeling phenomena through probability distributions. Often these distributions involve one or more parameters that must be estimated using observed data, and it is these parameters that generally provide explanations about the question to be answered through the experiment. For example, a normal distribution has two parameters: a mean and a variance. Interest might be in whether the difference between a forecast and an observed series is zero, on average, or not. If the difference between the two series follows a normal distribution, then it is the mean that is of interest for answering the question.

Importantly, in statistical language, a parameter is a quantity that indexes a family of distribution functions; that is, a group of probability distributions that have the same functional form but whose properties vary according to the values of these parameters. A random variable is a set function whose domain is the set of all possible events and whose range is the real line. Thus, what atmospheric modelers refer to as a parameter is, in statistical language, a random variable. For example, a random variable may represent any phenomenon of interest, such as 24-h accumulated precipitation amount, rain rate, surface air temperature, or streamflow. A statistic is a function of observable random variables, which strictly speaking is governed by a probability distribution that is free of any unknown parameters (cf. Mood et al. 1963, p. 226), although modern statistical inference methods, such as bootstrapping, allow for so-called nuisance parameters in the statistic’s distribution.

A random variable is generally described using capital letters, X, for example, where X is a value that is not yet realized, so that it represents potential values. The probability of a specific value’s being realized is governed by its probability distribution. When a value is finally observed, a lowercase letter, say x, is used to distinguish it from the unrealized random variable X. The observed value is not random, but is a realization of a random variable. Generally, interest is in a sequence of random variables, X1, X2, …, with an unknown joint distribution function that may depend on parameter(s) θ,
$$F(x_1, x_2, \ldots; \theta) = \Pr\{X_1 \le x_1, X_2 \le x_2, \ldots\}.$$

Because observed data are finite in nature, it is generally supposed that they can be modeled as a realization of the first n random variables Xn = {X1, …, Xn}.

Typically, it is not a specific realization of Xn that is of interest, but rather one or more parameters θ of the joint distribution F. For example, Fig. 1 shows an observed temperature series from Utah along with two competing forecast models (left column; Liu et al. 2009). A typical question concerns which of the two models is better, in some sense, on average. In this example, absolute-error loss (bottom two panels of the middle column) is used as the loss metric, where lower values are better. One way to assess which model is better on average is to calculate the mean of the observed loss differential series, dt = (|Model 1 − Observed| − |Model 2 − Observed|)t, using the t subscript to emphasize that dt is a time series. While it is useful to visually inspect the loss differential series, it is not always possible in practice. Therefore, it is not the particular instance of the entire series of values that is of interest, but rather the population average μD.¹ That is, letting Xn represent the values of the loss differential series, Xn ~ F[x1, x2, …; θ = (μD, ψ)] for some (typically unknown) distribution F with a mean parameter μD and possibly other parameters ψ. It is the mean parameter that is of interest, but the other possible parameters may be important once inferences about μD are made.
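As a concrete illustration of the loss differential, the following is a minimal Python sketch (not the R code from “distillery”). The function name `loss_differential` and the five-point series are made up for illustration and are not the Utah data of Fig. 1.

```python
from statistics import fmean

def loss_differential(obs, model1, model2):
    """Absolute-error loss differential d_t = |m1_t - o_t| - |m2_t - o_t|."""
    if not (len(obs) == len(model1) == len(model2)):
        raise ValueError("all three series must have the same length")
    return [abs(m1 - o) - abs(m2 - o) for o, m1, m2 in zip(obs, model1, model2)]

# Made-up temperatures for five time points (hypothetical values).
obs = [10.0, 12.0, 15.0, 14.0, 11.0]
model1 = [11.0, 12.5, 13.0, 15.0, 10.0]
model2 = [10.5, 13.0, 14.0, 14.5, 12.0]

d = loss_differential(obs, model1, model2)
d_bar = fmean(d)  # sample mean, the estimate of the population mean mu_D

print(d)      # [0.5, -0.5, 1.0, 0.5, 0.0]
print(d_bar)  # approximately 0.3
```

Because lower loss is better, a positive mean differential indicates that model 1 has the larger average loss in this toy sample; whether such a difference is statistically meaningful is the inference question addressed in the remainder of the section.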

Fig. 1.

(left) (top) Observed temperature series from a station in Utah (Liu et al. 2009) and (middle),(bottom) two competing forecasts with 48-h lead times beginning at 0000 UTC. (center) Scatterplot of (top) model 1 against model 2 temperature series, (middle) absolute-error loss |Model 1 − Observed| and (bottom) |Model 2 − Observed|. (right column) Loss differential series [(top) i.e., |Model 1 − Observed| − |Model 2 − Observed|], and (middle) the loss differential’s autocorrelation function and (bottom) partial autocorrelation function graphs.

Citation: Journal of Atmospheric and Oceanic Technology 37, 11; 10.1175/JTECH-D-20-0069.1

An estimator t(Xn) is a function of the random variables, and thus is itself a random variable, that provides a way to estimate μ. The estimate, denoted t(xn), is the realized value of the random estimator t(Xn). Note that distributional parameters are typically denoted by Greek letters, e.g., θ, and an associated estimator is often written with a hat, e.g., $\hat{\theta}$, so that an estimator of μ would be written t(Xn) = $\hat{\mu}$. Another alternative that is sometimes used for the estimator and estimate is simply T = t(Xn) and t = t(xn).

Perhaps the most common estimator for the mean is $\hat{\mu} = \bar{X} = \sum_{i=1}^{n} X_i / n$, and the estimate is $\bar{x} = \sum_{i=1}^{n} x_i / n$. However, $\bar{X}$ is not the only possible estimator for μ. For example, if the distribution F is symmetric about μ, such as in the case when F is the normal distribution, then the median might be used as an estimator. Choice of an estimator is usually made based on optimality and robustness considerations, which are not discussed here [for more details, see, e.g., Mood et al. (1963) and Wasserman (2005)].

The mean is an example of a level-1 parameter. A parameter that pertains to the uncertainty of an estimator for a level-1 parameter is a level-2 parameter. Bootstrap resampling procedures can be considered to be general methods for finding estimators of level-2 parameters. Functionals associated with the sampling distribution of an estimator for a level-2 parameter are, in turn, level-3 parameters. Examples of level-2 parameters are the bias and mean-square error (MSE) of $\hat{\theta}$; that is, the expected value of the estimation error $\hat{\theta} - \theta$ and the expected value of its square, respectively, under the unknown distribution F.

Note that in this context, bias and MSE are not the usual bias and MSE from weather forecast verification (e.g., Okkan and Kirdemir 2018), where they are the difference (or ratio) and mean-square difference between an observation and a physically based forecast model. In this context, the bias and MSE pertain to the expected estimation error between $\hat{\theta}$ and the true unknown (population) parameter θ. To avoid confusion in this paper, the term loss, instead of error, is used when referring to the difference of the forecast model output minus the observed value.

b. The bootstrap paradigm

In the classical statistical inference setting, a single sample is observed so that only one value of the test statistic is available, and an assumption must be made about the distribution from which that test statistic was realized. Bootstrap methods involve obtaining a sample of the test statistic, typically by resampling from the original sample, so that an assumption about the specific distribution for the test statistic can be avoided. It is helpful to begin the discussion with a concrete example.

Suppose the verification set from Fig. 1 were resampled with replacement such that the same time points are drawn from the observed series and from both model series (a paired test). For example, although there are 297 time points, suppose there were only 5. One possible resample of length 5 might be $(x_4, \hat{x}_{1,4}, \hat{x}_{2,4})$, $(x_3, \hat{x}_{1,3}, \hat{x}_{2,3})$, $(x_4, \hat{x}_{1,4}, \hat{x}_{2,4})$, $(x_2, \hat{x}_{1,2}, \hat{x}_{2,2})$, $(x_2, \hat{x}_{1,2}, \hat{x}_{2,2})$, where $x_i$ is the observed value at the ith time point and $\hat{x}_{1,i}$ ($\hat{x}_{2,i}$) is the value at the ith time point for model 1 (model 2). Figure 2 shows one such resample of length 297. Comparing Fig. 2 with Fig. 1, the resampled series clearly have very different temporal properties than the original series.
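The paired iid resample described above can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `paired_iid_resample`, the seed, and the five-point series are hypothetical and are not part of the “distillery” package.

```python
import random

def paired_iid_resample(obs, model1, model2, rng):
    """Draw n time indices with replacement and apply the same
    indices to the observed series and to both model series."""
    n = len(obs)
    idx = [rng.randrange(n) for _ in range(n)]
    triples = [(obs[i], model1[i], model2[i]) for i in idx]
    return idx, triples

# Made-up values for five time points (not the data in Fig. 1).
obs = [10.0, 12.0, 15.0, 14.0, 11.0]
model1 = [11.0, 12.5, 13.0, 15.0, 10.0]
model2 = [10.5, 13.0, 14.0, 14.5, 12.0]

rng = random.Random(2020)  # fixed seed so the sketch is repeatable
idx, triples = paired_iid_resample(obs, model1, model2, rng)
print(idx)      # resampled time indices (0-based), repeats allowed
print(triples)  # each triple keeps observation and forecasts together
```

Keeping each (observation, model 1, model 2) triple intact preserves the contemporaneous dependence between the series, but drawing the time indices independently discards the temporal ordering.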

Fig. 2.

As in Fig. 1, but for one realization of an iid resample with replacement. Interest, here, is in (top right) the mean of the loss differential series, so in practice, only this series need be resampled, but the corresponding resampled observation and models 1 and 2 series are shown for illustration.


Resampling from blocks of data with replacement maintains the temporal dependence better. Figure 3 shows a circular block (CB) resample with replacement from the series in Fig. 1 using blocks of length fifty. Clearly, this resample looks more like the original in terms of the temporal structure, although the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the loss differential suggest a shorter range of dependence, which is not surprising because the range of dependence of the original series is fairly long. In general, block lengths should be chosen to be much longer than the length of dependence, but much shorter than the entire series. For the data in the example, the series is too short to accomplish this goal.

Fig. 3.

As in Fig. 2, but for one realization of a circular block (CB) resample with replacement, using block lengths of 50. For a single series of data x1, x2, …, xn, a CB resample with blocks of length $\ell$ would be a sample with replacement from $y_1 = \{x_1, \ldots, x_\ell\}$, $y_2 = \{x_2, \ldots, x_{\ell+1}\}$, …, $y_{n-\ell+1} = \{x_{n-\ell+1}, \ldots, x_n\}$, …, $y_n = \{x_n, x_1, \ldots, x_{\ell-1}\}$.
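The circular block construction in the caption above can be sketched as follows. This is a minimal Python illustration, not the “distillery” implementation; the function names are hypothetical, and drawing ceil(n / block length) blocks and trimming the result to length n is an assumption made here for simplicity.

```python
import math
import random

def circular_blocks(x, block_len):
    """All n overlapping blocks of length block_len, wrapping the end
    of the series around to the beginning."""
    n = len(x)
    return [[x[(start + j) % n] for j in range(block_len)] for start in range(n)]

def cb_resample(x, block_len, rng):
    """Sample whole blocks with replacement and concatenate them."""
    n = len(x)
    blocks = circular_blocks(x, block_len)
    n_blocks = math.ceil(n / block_len)  # assumption: enough blocks to cover n
    resample = []
    for _ in range(n_blocks):
        resample.extend(rng.choice(blocks))
    return resample[:n]  # assumption: trim to the original length

x = list(range(1, 11))          # stand-in series x_1, ..., x_10
blocks = circular_blocks(x, 3)
print(blocks[0], blocks[7], blocks[9])  # [1, 2, 3] [8, 9, 10] [10, 1, 2]

rng = random.Random(3)
x_star = cb_resample(x, 3, rng)
print(x_star)
```

Within each sampled block, consecutive values retain their original ordering, which is how the block resample preserves short-range temporal dependence that the iid resample destroys.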


The principle behind the bootstrap paradigm is to try to recreate the relationship between the population and the sample by considering the observed sample, x1, …, xn, as a surrogate for the underlying population and, by resampling from this surrogate, to form the bootstrap sample that now plays the role that the observed sample ordinarily plays. Subsequently, the observed sample is often written with upper case letters, X1, …, Xn, to denote its changed role, and $\hat{\theta}$ is the bootstrap estimate of θ. The notation $\mathbf{X}_n^* = X_1^*, \ldots, X_n^*$ describes the (potential) bootstrap resample and $\mathbf{x}_n^* = x_1^*, \ldots, x_n^*$ the realized bootstrap resample, and the bootstrap estimate of θ for the bth resample is denoted $\hat{\theta}_n^*(b)$, where the n subscript emphasizes the size of the resample. A bootstrap estimate for $\hat{\theta}$ is given by $\hat{\theta}^* = (1/B) \sum_{b=1}^{B} \hat{\theta}_n^*(b)$, where B is the number of resamples. Based on this bootstrap principle, the problem of dealing directly with an unknown population is avoided. Instead, only the sample and resamples are used, with the advantage that they are either known or from a known distribution. The fundamental assumption for any bootstrap resampling inference is that the relationship between $\hat{\theta}$ and θ (i.e., between the population and the sample) is the same as that between $\hat{\theta}^*$ and $\hat{\theta}$ (i.e., the resamples and the sample; Lahiri 2003). As seen by the example in Figs. 1–3, how the resampling is performed can greatly affect the veracity of this assumption.

The general bootstrap procedure can then be described by the following steps. Suppose θ is a level-1 parameter of interest, and that it is a functional of the joint distribution F [i.e., θ = θ(F)].

  1. Decide on an estimator $\tilde{F}_n$ of F from the available observations xn.

  2. Generate B resamples $\mathbf{x}_n^*$ from the estimator $\tilde{F}_n$, which is conditional on the original observations, xn.

  3. For each resample, compute the bootstrap estimator $\hat{\theta}_n^*$ by replacing xn with $\mathbf{x}_n^*$ to obtain the sample of estimators $\hat{\theta}_n^*(1), \ldots, \hat{\theta}_n^*(B)$.
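The three steps can be sketched generically in Python, with the choice of resampler standing in for the estimator of F in step 1. The names `bootstrap` and `iid_resampler`, the data values, and B = 500 are hypothetical choices for illustration; they do not reproduce the interface of the “distillery” package.

```python
import random
from statistics import fmean

def iid_resampler(x, rng):
    """Step 1 (one possible choice): sample with replacement from x."""
    return rng.choices(x, k=len(x))

def bootstrap(x, estimator, resampler, B, rng):
    """Steps 2 and 3: draw B resamples and evaluate the estimator on each."""
    return [estimator(resampler(x, rng)) for _ in range(B)]

# Made-up loss differential values (hypothetical data).
x = [0.5, -0.5, 1.0, 0.5, 0.0, 0.2, -0.1, 0.7]

rng = random.Random(1)
theta_star = bootstrap(x, fmean, iid_resampler, B=500, rng=rng)

theta_hat = fmean(x)                 # estimate from the original sample
theta_hat_star = fmean(theta_star)   # average of the B bootstrap estimates
print(theta_hat, theta_hat_star)
```

Passing the resampler as an argument is a design choice made here so that a block or parametric resampler could be substituted in step 1 without changing steps 2 and 3.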

The bootstrap version of the level-1 parameter, θ = θ(F), is $\hat{\theta} = \theta(\tilde{F}_n)$, meaning that the parameter estimated from the original sample plays, in the bootstrap world, the role of the unknown parameter in the usual inference setting. The bootstrap estimator of the unknown sampling distribution, $G_n$, of the centered estimator $\hat{\theta}_n - \theta$ is given by the conditional distribution, $\hat{G}_n$, of its bootstrap version, $\hat{\theta}_n^* - \theta(\tilde{F}_n)$. The bootstrap estimator of the level-2 parameter, $\eta_n \equiv \eta(G_n)$, is simply the plug-in estimator $\hat{\eta}_n \equiv \eta(\hat{G}_n)$. For example, the bootstrap estimators of the bias and MSE for $\hat{\theta}_n$ are
$$\widehat{\mathrm{bias}}(\hat{\theta}_n) = \int x \, \hat{G}_n(dx) = E_*(\hat{\theta}_n^*) - \theta(\tilde{F}_n),$$
$$\widehat{\mathrm{MSE}}(\hat{\theta}_n) = \int x^2 \, \hat{G}_n(dx) = E_*\{[\hat{\theta}_n^* - \theta(\tilde{F}_n)]^2\},$$
where $E_*$ is the conditional expectation given Xn (Lahiri 2003). For example, suppose the parameter of interest is the mean-square loss (MSL) for a given verification set, and it is desired to estimate the bias for the MSL. The estimated bias is the average of the MSL over the B resamples minus the MSL estimated from the original sample xn.
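The MSL example can be sketched in Python by approximating the conditional expectation with an average over B resamples. The function names, data, seed, and B are hypothetical; the iid paired resampling is used only to keep the sketch short and would not be appropriate for serially dependent series.

```python
import random
from statistics import fmean

def msl(forecast, obs):
    """Mean-square loss between a forecast series and the observations."""
    return fmean([(f - o) ** 2 for f, o in zip(forecast, obs)])

def bootstrap_bias_mse(forecast, obs, B, rng):
    """Monte Carlo approximation of the bootstrap bias and MSE of the MSL."""
    n = len(obs)
    theta_hat = msl(forecast, obs)  # plays the role of theta(F-tilde_n)
    stars = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # paired iid resample
        stars.append(msl([forecast[i] for i in idx], [obs[i] for i in idx]))
    bias_hat = fmean(stars) - theta_hat
    mse_hat = fmean([(s - theta_hat) ** 2 for s in stars])
    return theta_hat, bias_hat, mse_hat

# Made-up verification set (hypothetical values).
obs = [10.0, 12.0, 15.0, 14.0, 11.0]
forecast = [11.0, 12.5, 13.0, 15.0, 10.0]

rng = random.Random(11)
theta_hat, bias_hat, mse_hat = bootstrap_bias_mse(forecast, obs, B=1000, rng=rng)
print(theta_hat)          # 1.45 for these made-up values
print(bias_hat, mse_hat)  # depend on the seed and on B
```

Both estimates are centered on the MSL from the original sample rather than on the unknown population value, which is the plug-in step in the two formulas above.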

1) Sampling methods

Step 1 of the algorithm above is usually carried out by choosing $\tilde{F}_n$ to be the empirical distribution function, $\hat{F}_n(\cdot) \equiv (1/n) \sum_{i=1}^{n} I(X_i \le \cdot)$, where I(A) is the indicator function giving one if A is true and zero otherwise. That is, the resampling is carried out using simple random sampling with replacement from Xn. An assumption that X1, …, Xn are iid is implicit with this choice for $\tilde{F}_n$. If this assumption is not met, then the estimated level-2 parameters will tend to be too low. To understand the reason for this phenomenon, consider taking a measurement of temperature once every second. In most cases, the value from one second to another will not change much, if at all, and while sixty measurements are recorded in one minute, effectively only one measurement has been recorded. Therefore, the variance of these measurements will surely be small, if not identically zero.
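For readers who prefer code to notation, the empirical distribution function can be written as a short Python closure. The name `ecdf` and the four-point sample are hypothetical and serve only to illustrate the definition.

```python
import random

def ecdf(sample):
    """Return F_hat, where F_hat(x) is the fraction of sample values <= x."""
    n = len(sample)

    def F_hat(x):
        return sum(1 for xi in sample if xi <= x) / n

    return F_hat

sample = [3.0, 1.0, 2.0, 2.0]  # made-up values
F_hat = ecdf(sample)
print(F_hat(0.0), F_hat(2.0), F_hat(3.0))  # 0.0 0.75 1.0

# Drawing from F_hat is the same as sampling the observed values
# with replacement, each with probability 1/n.
rng = random.Random(5)
resample = rng.choices(sample, k=len(sample))
print(resample)
```

Because every draw is made independently of the others, this choice carries the iid assumption noted above.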

More precisely, suppose that $\mathbf{X}_n$ is a sequence of weakly dependent random variables whose mean and variance do not vary from one variable to the next. Then
$$\mathrm{Var}(\bar{X}_n) = \frac{\mathrm{Var}(X_1)}{n} + \frac{2}{n}\sum_{i=1}^{n-1}(1 - i/n)\,\mathrm{Cov}(X_1, X_{1+i}).$$

The variance depends on the bivariate distribution of $X_1$ and $X_i$ for all $1 \le i \le n$, but because of the weak dependence, the higher-order lag terms are negligible so that accurate approximations of the level-2 parameter can be obtained from the knowledge of a small subset of the lag covariance terms. These terms depend on the joint distributions of the shorter series $X_1, \ldots, X_k$ with $k < n$. One way to choose $\tilde{F}_n$ to capitalize on this fact is to resample blocks of $\mathbf{X}_n$ as described in the caption of Fig. 3. In practice, it is best if the blocks overlap and the end of the series is tied to the beginning, to avoid undersampling observations near the beginning or end of the series and oversampling those in the middle. Additionally, the blocks should be considerably longer than the length of dependence, but also sufficiently shorter than $n$ that the resamples are rich in diversity. Using blocks of length $\ell = \lfloor\sqrt{n}\rfloor$ (the largest integer less than or equal to $\sqrt{n}$) is usually a good rule of thumb, but in the example given in Fig. 1, the length of dependence appears to be larger than $\lfloor\sqrt{n}\rfloor = \lfloor\sqrt{297}\rfloor = 17$. Figure 3 uses $\ell = 50$, and while the resample captures some of the dependence, it does not capture its full strength.
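The block resampling idea can be sketched in base R as below. This is only an illustration under stated assumptions: the function `cb.resample` is a hypothetical helper written for this sketch (it is not a "distillery" function), the series is a simulated MA(1) process, and the block length is simply taken near $\sqrt{n}$.

```r
set.seed(2)
n <- 200
e <- rnorm(n + 1)
x <- e[-1] + 0.5 * e[-(n + 1)]   # hypothetical MA(1) series

# Resample overlapping blocks, wrapping the end of the series
# around to the beginning (circular block resampling).
cb.resample <- function(x, block.length) {
  n <- length(x)
  n.blocks <- ceiling(n / block.length)
  starts <- sample.int(n, size = n.blocks, replace = TRUE)
  idx <- unlist(lapply(starts, function(s) {
    ((s - 1 + 0:(block.length - 1)) %% n) + 1
  }))
  x[idx[seq_len(n)]]             # trim to the original length
}

l <- floor(sqrt(n))              # block length near sqrt(n)
B <- 500
xbar.cb  <- replicate(B, mean(cb.resample(x, l)))
xbar.iid <- replicate(B, mean(sample(x, replace = TRUE)))

# Standard error of the mean under each resampling scheme.
c(se.block = sd(xbar.cb), se.iid = sd(xbar.iid))
```

Comparing the two standard errors gives a rough sense of how much the iid assumption matters for a particular series.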

An alternative to the above choices for $\tilde{F}_n$ is to employ a parametric model in step 1 of the procedure. Doing so requires more stringent assumptions about the joint distribution $F$; it requires assuming the specific distributional form. However, it allows for sampling values that might not have been observed, which can be very useful in some situations. For example, suppose it is desired to make inferences about the maximum value of a series. With nonparametric resampling, it is not possible to sample a value higher than the observed maximum, so in this setting, a parametric resampling procedure is beneficial. It is also possible to directly model any dependence in the series within the parametric form. While an assumption about the distribution for $\mathbf{X}_n$ is required, no assumption about the specific distribution for the associated estimator(s) is required for the parametric bootstrap; estimates from the original sample are generally used for the parameters of the distribution from which the resamples are taken. In some literature, the parametric bootstrap refers to more than sampling from a parametric model. Rather, estimates of the parameters are taken to be the “true” parameter values of those distributions. In the “distillery” package, only the resampling part is parametric so that it is on the user to employ the true parameter values, if desired.
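The point about the maximum can be illustrated with a short base R sketch. The exponential model used here is purely an assumption for illustration, and the object names are hypothetical.

```r
set.seed(3)
n <- 100
x <- rexp(n, rate = 2)           # hypothetical original sample
rate.hat <- 1 / mean(x)          # exponential rate estimated from x

B <- 500
# Parametric resampling: simulate new samples from the fitted model.
max.par <- replicate(B, max(rexp(n, rate = rate.hat)))
# Nonparametric resampling: draw with replacement from x itself.
max.np  <- replicate(B, max(sample(x, replace = TRUE)))

c(observed.max = max(x),
  largest.parametric = max(max.par),
  largest.nonparametric = max(max.np))
```

The nonparametric maxima can never exceed the observed maximum, whereas the parametric ones can.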

Another possibility for $\tilde{F}_n$ is to weight values more or less heavily in the resampling scheme, a method known as the Bayesian bootstrap (Rubin 1981). This procedure is not taken up here.

2) Number of resamples

Ideally, all possible resamples would be obtained, and accuracy claims about the bootstrap procedure are based on doing so. Generally, however, it is not possible to obtain all possible resamples, so only B such resamples are taken. The choice of B is a trade-off between accuracy and computational efficiency. A guideline for choosing B is to start with a modest value, say B = 100, and perform the procedure. Then, increase B, say to B = 500, and check whether the results differ considerably or only slightly (they will not be identical). Although it is advised to repeat this last step multiple times, one repetition generally suffices. If B = 100 is found not to be large enough, then repeat the procedure with an even higher value, say B = 1000, to determine whether or not B = 500 is sufficient. Note that most practitioners suggest that B be between 1000 and 2000 (Carpenter and Bithell 2000), but such a large value for B is not always necessary, and can greatly affect the computation time.

3) Resample sizes

In most cases, each resample from step 2 is of size $n$, but there are known situations where a different choice is desirable, primarily related to inferences for extremes where the random variables may follow a heavy-tailed distribution; that is, the probability of observing increasingly high (or low) values decays polynomially. In this case, resamples of size $m < n$ are used, where $m/n \to 0$ as $m \to \infty$ and $n \to \infty$; e.g., $m = \sqrt{n}$ (Bickel and Freedman 1981; Arcones and Giné 1989, 1991; Athreya 1987a,b; Giné and Zinn 1989; Knight 1989; Hall 1990; Kinateder 1992; LePage 1992; Deheuvels et al. 1993; Fukuchi 1994; Bickel et al. 1997; Feigin and Resnick 1997; Lee 1999; Shao and Dongsheng 1995; Resnick 2007).
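A minimal base R sketch of the resampling step with a reduced resample size follows. The heavy-tailed sample is simulated from a Student's t distribution with 1.5 degrees of freedom purely as an assumption for illustration; any normalization of the resulting statistics is problem specific and is not shown.

```r
set.seed(4)
n <- 400
x <- rt(n, df = 1.5)             # hypothetical heavy-tailed sample

m <- floor(sqrt(n))              # resample size m < n
B <- 500
# Each resample has only m values drawn with replacement from x.
xbar.m <- replicate(B, mean(sample(x, size = m, replace = TRUE)))

c(n = n, m = m, resamples = length(xbar.m))
```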

Many good texts on bootstrapping are available for further reading, for example, in the iid setting, Efron and Tibshirani (1993) and Davison and Hinkley (1997) and in the dependence case, Lahiri (2003).

c. Statistical inference

Wilks (1997, 2006), Hamill (1999), Jolliffe and Stephenson (2003), Jolliffe (2007), von Storch and Zwiers (2001), and Gilleland (2010) all give good reviews of statistical inference in the realm of meteorology and climatology. Much of the discussion in section 2b concerns estimation of level-2 parameters, which is one essential ingredient for any statistical inference, and bootstrap resampling methods excel in this context. To set the stage, it is useful to begin by reviewing the usual hypothesis test procedure [in section 2c(1)] where it is desired to determine whether or not the true population parameter, θ, is equal to a certain value or values, say θ = θ0 (e.g., θ might be the rate of change of the global temperature anomaly and it could be of interest to test whether or not θ = θ0 = 0). A generally more useful method of inference utilizes confidence intervals, which will be discussed in section 2c(2). Because bootstrap estimation is a frequentist method, the focus of this section is on frequentist inference.

1) Hypothesis testing

Given a null and an alternative hypothesis ($H_0$ and $H_a$, respectively), a test procedure is specified by two components: a test statistic and a rejection region.2 The former is a function of the sample $\mathbf{x}_n$, upon which the decision to reject $H_0$ or not is based. The latter is the set of all test statistic values for which $H_0$ will be rejected. Naturally, in such a setting, two types of errors can be made: a type-I error occurs when $H_0$ is true but the test rejects it, and a type-II error occurs when the test fails to reject $H_0$ when it is actually false.

A test statistic is a random variable that will involve an estimator, and it should have a distribution that is free of any unknown parameters. For example, theoretical justification exists to suggest that $\bar{X}_n$ should follow a normal distribution with mean $\mu$ and variance $\sigma^2/n$ for large enough $n$. Therefore, the usual test statistic for the mean is given by
$$Z = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}.$$
Under the assumption that $\bar{X}_n \sim N(\mu, \sigma^2/n)$, the statistic $Z$ follows a standard normal distribution with mean zero and variance one; importantly, there are no unknown parameters in the distribution of $Z$. Note that $Z$ itself involves unknown parameters ($\mu$ and $\sigma$), but its distribution [i.e., $N(0, 1)$] does not. In practice, of course, the level-2 parameter, $\sigma$, must be estimated. When replacing $\sigma$ with its sample estimate, the normal distribution may no longer be an accurate probability model for this statistic. The sample standard deviation, $S_n = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2}$, can be used to estimate $\sigma$, leading to the test statistic
$$T = \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}},$$
which follows a Student’s t distribution with $n - 1$ degrees of freedom. The population parameter $\mu$ in the numerator is the one whose “true” value is to be tested, so it is replaced by $\mu = \mu_0$ and is determined by the specific $H_0$ of interest. In the case of testing if the mean loss differential is zero, $\mu_0 = 0$ and the test statistic, $S$, reduces to $\sqrt{n}\bar{X}_n/\sigma$.
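As a small worked example, the statistic $T$ can be computed by hand in base R and compared with the built-in `t.test` function from the stats package. The data vector `d` is a hypothetical set of loss differentials simulated only for illustration.

```r
set.seed(5)
n <- 30
d <- rnorm(n, mean = 0.3)        # hypothetical loss differentials
mu0 <- 0                         # value of the mean under H0

# Test statistic with sigma replaced by the sample standard deviation.
T.stat <- (mean(d) - mu0) / (sd(d) / sqrt(n))
# Two-sided p value from the Student's t distribution with n - 1 df.
p.val <- 2 * pt(-abs(T.stat), df = n - 1)

# The same test using the built-in function.
tt <- t.test(d, mu = mu0)
c(T = T.stat, p.value = p.val)
```

The hand calculation and `t.test` should agree up to numerical precision.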

In the classical statistical hypothesis test, it is important that the test statistic have a distribution function that is free of any unknown parameters. However, there are several methods for testing hypotheses that relax this requirement. When the test statistic’s distribution function has parameters, they are generally called nuisance parameters. In fact, some methods, such as the profile likelihood and the test-inversion bootstrap, utilize the nuisance parameters directly.

2) Confidence intervals

A (1 − α) × 100% confidence interval (CI) seeks to find the values $\theta_L$ and $\theta_U$ such that
$$\Pr\{\theta_L \le S \le \theta_U\} = 1 - \alpha, \tag{3}$$
where $S$ is the test statistic as in section 2c(1), and α is the size of the hypothesis test. When talking about CI’s, it is usual to talk about the (1 − α) × 100% confidence level, rather than the α × 100% size. However, when exploiting the duality of the confidence interval and the hypothesis test, it is permissible to discuss the results in terms of hypothesis testing.

A duality exists between hypothesis testing and confidence intervals, but a CI provides considerably more information than a hypothesis test. In particular, the interval gives a more intuitive sense of the amount of uncertainty, in practical terms, surrounding the estimated parameter. Moreover, a CI explicitly allows for hypothesis tests based on any value for the population parameter to be performed instead of just one at a time. That is, in the example for the rate of change in the temperature anomaly, one could test whether μ0 = 0, μ0 = 0.5, μ0 = 1, etc., using the same CI; whereas a new hypothesis test would need to be constructed for each choice of μ0. Most confidence intervals are two sided, and therefore correspond to two-sided hypothesis tests, but it is possible to construct a one-sided CI. Here, however, only two-sided CI’s are discussed.

In this frequentist setting, it is important to note that the interval obtained by Eq. (3) does not say anything about the probability of the true population parameter. Rather, it is a statement about the test statistic, S. The interpretation of the interval in terms of the true population parameter is relatively awkward. If the test were rerun under the same conditions 100 times, and the (1 − α) × 100% CI were to be constructed for each repetition, then the true parameter would be expected to fall inside (1 − α) × 100% of these intervals.3,4
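The repeated-sampling interpretation can be checked with a short simulation in base R. This is a sketch under the assumption of iid normal samples with a known true mean; the number of repetitions and all object names are hypothetical choices made for illustration.

```r
set.seed(6)
n <- 25
mu <- 1                          # true mean, known only in simulation
alpha <- 0.05
n.rep <- 2000

covered <- logical(n.rep)
for (r in seq_len(n.rep)) {
  x <- rnorm(n, mean = mu)
  # Half-width of the Student's t interval for the mean.
  half <- qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)
  covered[r] <- (mean(x) - half <= mu) && (mu <= mean(x) + half)
}

# Proportion of intervals containing the true mean.
mean(covered)
```

The resulting proportion should be near the nominal confidence level of 0.95.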

A useful class of CI’s are the normal approximation CI’s. The underlying assumption is that the random sample is iid normally distributed, and the interval is given by
$$[\hat{\theta} - z_{1-\alpha/2}\,\widehat{se}(\hat{\theta}),\; \hat{\theta} - z_{\alpha/2}\,\widehat{se}(\hat{\theta})], \tag{4}$$
where $z_k$ is the $k$ quantile of the standard normal distribution. Notice that the interval in Eq. (4) is symmetric because the standard normal distribution is symmetric about zero so that $z_{\alpha/2} = -z_{1-\alpha/2}$. Other parametric CI’s will depend on the assumed distribution function for the test statistic, $\hat{\theta}$, and may or may not be symmetric. The interval (4) may use the quantiles from the appropriate Student’s t distribution function instead, which provides a more accurate interval in the usual setting where the standard error term must be estimated from the sample.

3) Bootstrap confidence intervals

The result from step 3 in the bootstrap algorithm described in section 2b is a sample of test statistic(s), $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$. This sample is considered to be a realization from the distribution of the test statistic(s), and CI’s must be estimated therefrom. Below are descriptions for the intervals available within the R (R Core Team 2017) package “distillery” (Gilleland 2017) that will be discussed here. Important considerations for these intervals include whether or not the intervals are range preserving and transformation respecting, as well as their accuracy. Other considerations are more pragmatic, such as the computational efficiency.

A CI is range preserving if the interval can only contain values within the support of the parameter(s) that the test statistic(s) estimate. For example, the variance must be nonnegative, so any interval for the variance that is range preserving would not include negative numbers. A CI that is transformation respecting is one that will be a valid CI for a transformation by a monotone function. For example, suppose an interval is found for a parameter θ, but the desired interval is for g(θ), where g is a monotone function. If the CI is transformation respecting, and $(\theta_L, \theta_U)$ is the CI for θ, then $[g(\theta_L), g(\theta_U)]$ is a valid CI for g(θ).

Accuracy may vary according to the specific test statistic(s). The idea is that a (1 − α) × 100% CI should have probability of not covering the true parameter from above of α/2, and from below of α/2. A second-order accurate interval means that the error in these probabilities tends to zero at a rate that is inversely proportional to the sample size, while first-order accuracy means that the error tends to zero more slowly, at a rate inversely proportional to the square root of the sample size. Table 1 summarizes these properties for the CI methods described in this section.

Table 1.

Summary of the properties for the bootstrap CI methods.


(i) Basic bootstrap CI
The basic bootstrap CI, also called the non-Studentized pivotal method, is obtained by
$$(2\hat{\theta} - \hat{\theta}^*_{[1-\alpha/2]},\; 2\hat{\theta} - \hat{\theta}^*_{[\alpha/2]}), \tag{5}$$
where $\hat{\theta}$ is the estimated test statistic value(s) from the original sample, and $\hat{\theta}^*_{[k]}$ denotes the $k$ quantile of the sample $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B$ of bootstrap sampled test statistic(s).

The basic bootstrap CI from Eq. (5) is only first-order accurate, and it is neither range preserving nor transformation respecting. Nevertheless, the basic method is a very natural approach to constructing a CI for θ because it seeks a function of the estimator and parameter whose distribution is known, and then uses the quantiles of this known distribution to construct a CI for the parameter. In particular, consider the function $\hat{\theta} - \theta$. The bootstrap principle implies that while it may not be possible to know the distribution $F$ from Eq. (1), it is possible to learn about the relationship between the true parameter and its estimator by thinking of $\hat{\theta}$ as the true parameter and looking at the relationship between it and its estimates $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B$. Moreover, a bias corrected $\hat{\theta}$ is given by $\hat{\theta} + (\hat{\theta} - \hat{\theta}^*) = 2\hat{\theta} - \hat{\theta}^*$. Computationally, once the bootstrap sample is found, the basic CI is very efficient to construct.
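A minimal base R sketch of the basic interval for a sample mean follows, assuming iid resampling. The sample `x` is simulated only for illustration and the object names are hypothetical; this is not the "distillery" implementation.

```r
set.seed(7)
x <- rexp(60)                    # hypothetical original sample
theta.hat <- mean(x)

B <- 1000
theta.star <- replicate(B, mean(sample(x, replace = TRUE)))

alpha <- 0.05
# Upper bootstrap quantile gives the lower limit, and vice versa.
q <- quantile(theta.star, probs = c(1 - alpha / 2, alpha / 2),
              names = FALSE)
basic.ci <- 2 * theta.hat - q
basic.ci
```

Note that the upper bootstrap quantile determines the lower limit of the interval, which reflects the pivoting about $\hat{\theta}$.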

(ii) Bootstrap t (Studentized) bootstrap CI’s
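The bootstrap t CI is a modification of the basic interval (5) that attempts to be more accurate by accounting for differences between $\hat{\theta} - \theta$ and $\hat{\theta}^* - \hat{\theta}$, which, in practice, can be very different. The idea is to employ the test statistic $T^* = (\hat{\theta}^* - \hat{\theta})/\widehat{se}(\hat{\theta}^*)$ instead of just $\hat{\theta}^* - \hat{\theta}$. The method requires estimating the standard error of $\hat{\theta}^*$ at each iteration of the bootstrap algorithm in order to obtain $\widehat{se}(\hat{\theta}^*)$:
$$(\hat{\theta} - \widehat{se}(\hat{\theta})\,t^*_{1-\alpha/2},\; \hat{\theta} - \widehat{se}(\hat{\theta})\,t^*_{\alpha/2}), \tag{6}$$
where, again, $\hat{\theta}$ and $\widehat{se}(\hat{\theta})$ are the estimates of $\theta$ and $se(\hat{\theta})$ based on the original data, and $t^*_k$ is the $k$ quantile of $T^*_1, \ldots, T^*_B$ (not the quantiles of the Student’s t distribution).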

The bootstrap t interval is second-order accurate, but it is neither range preserving nor transformation respecting. If an estimator for the standard error is not available, it can be estimated through a double bootstrap, where the bth resample is resampled in order to estimate the sample standard error for resample b, and is performed for each b = 1, …, B. Double bootstrapping is not performed by the functions in “distillery,” as it ultimately leads to a computationally more expensive interval.
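The bootstrap t interval can be sketched in base R for the sample mean, for which a closed-form standard error estimate is available so that no double bootstrap is needed. The helper `se.mean` and the simulated sample are assumptions made for this illustration.

```r
set.seed(8)
x <- rexp(60)                    # hypothetical original sample
se.mean <- function(z) sd(z) / sqrt(length(z))

theta.hat <- mean(x)
se.hat <- se.mean(x)

B <- 1000
# Studentized statistic computed within each resample.
t.star <- replicate(B, {
  xs <- sample(x, replace = TRUE)
  (mean(xs) - theta.hat) / se.mean(xs)
})

alpha <- 0.05
tq <- quantile(t.star, probs = c(1 - alpha / 2, alpha / 2),
               names = FALSE)
boott.ci <- theta.hat - se.hat * tq
boott.ci
```

The quantiles come from the bootstrap distribution of the Studentized statistic rather than from the Student's t distribution.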

(iii) Percentile bootstrap CI
Perhaps the simplest CI to estimate is the percentile method CI, which is simply
$$(\hat{\theta}^*_{[\alpha/2]},\; \hat{\theta}^*_{[1-\alpha/2]}), \tag{7}$$
where $\hat{\theta}^*_{[k]}$ is as in Eq. (5); that is, the interval consists simply of the α/2 and 1 − α/2 quantiles of the bootstrap sample $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B$.

The percentile method CI’s are first-order accurate, and both range preserving and transformation respecting. They are also very easy to explain and efficient to calculate. However, there are some additional assumptions about the distribution of the test statistic that may result in too-narrow intervals if those assumptions are not met. To understand these assumptions, it is helpful to understand the rationale for the percentile method CI, which may seem a bit strange at first, but again, recall that the fundamental assumption for the bootstrap approach is that the relationship between $\hat{\theta}^*$ and $\hat{\theta}$ mimics the relationship between $\hat{\theta}$ and $\theta$. Because the percentile CI does not involve $\hat{\theta}$ in its calculation, its justification requires some finesse.

For a monotonically increasing function g(⋅), let $\phi = g(\theta)$, $\hat{\phi} = g(\hat{\theta})$, and $\hat{\phi}^* = g(\hat{\theta}^*)$, and suppose it is possible to find g(⋅) such that the following relation holds (Carpenter and Bithell 2000):
$$\hat{\phi}^* - \hat{\phi} \sim \hat{\phi} - \phi \sim N(0, \sigma^2). \tag{8}$$
Then the interval for θ is
$$(g^{-1}(\hat{\phi} - \sigma z_{1-\alpha/2}),\; g^{-1}(\hat{\phi} - \sigma z_{\alpha/2})). \tag{9}$$

Equation (8) implies that $\hat{\phi} - \sigma z_{1-\alpha/2} = F^{-1}_{\hat{\phi}^*}(\alpha/2)$ and $\hat{\phi} - \sigma z_{\alpha/2} = F^{-1}_{\hat{\phi}^*}(1 - \alpha/2)$. Because g(⋅) is monotonically increasing, $F^{-1}_{\hat{\phi}^*}(k) = g(F^{-1}_{\hat{\theta}^*}(k))$, and the interval from Eq. (9) becomes $(F^{-1}_{\hat{\theta}^*}(\alpha/2),\, F^{-1}_{\hat{\theta}^*}(1 - \alpha/2))$, which importantly does not involve g(⋅) so that it is not necessary to know the actual transformation that allows for the normal distribution assumption, only that such a transformation exists.

The accuracy of the percentile interval relies on the existence of g(⋅) such that Eq. (8) holds, and on the estimator being unbiased on the transformed scale so that the replacement of quantile estimates is correct. Additionally, it is possible that $\tilde{F}$ has a different shape from $F$, and while $g(\tilde{F})$ might be symmetric, it does not necessarily correct the shape difference. The next procedure attempts to correct for these potential assumption violations (Davison and Hinkley 1997; Carpenter and Bithell 2000).
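The percentile interval and its transformation-respecting property can be illustrated in base R. In this sketch, the monotone function is taken to be the logarithm, and `type = 1` in `quantile` (the inverse empirical distribution function, with no interpolation) is used so that the property holds exactly; both choices are assumptions made for the illustration.

```r
set.seed(9)
x <- rexp(60)                    # hypothetical original sample
B <- 1000
theta.star <- replicate(B, mean(sample(x, replace = TRUE)))

alpha <- 0.05
probs <- c(alpha / 2, 1 - alpha / 2)

# Percentile interval for theta.
perc.ci <- quantile(theta.star, probs = probs, type = 1,
                    names = FALSE)
# Percentile interval for log(theta), from the transformed sample.
perc.ci.log <- quantile(log(theta.star), probs = probs, type = 1,
                        names = FALSE)

# Transforming the interval gives the interval of the transform.
all.equal(log(perc.ci), perc.ci.log)
```

With an interpolating quantile definition, the two intervals would agree only approximately.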

(iv) BCa bootstrap CI
The bias-corrected and accelerated (BCa) method is an attempt to modify the percentile method so that it is more robust. It again simply takes quantiles from the sample $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B$, but modifies the specific quantiles taken in order to account for potential bias, asymmetry, skewness and shape changes in the distribution of $\hat{\phi} - \phi$. Specifically, the interval is given by
$$(\hat{\theta}^*_{[\alpha_1]},\; \hat{\theta}^*_{[\alpha_2]}), \tag{10}$$
where
$$\alpha_1 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right)$$
and
$$\alpha_2 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right),$$

where Φ(⋅) is the standard normal cumulative distribution function, $z_k$ is the $k$ quantile of the standard normal distribution, and $\hat{z}_0$ and $\hat{a}$ are estimated bias correction and acceleration terms, respectively. If $\hat{a} = \hat{z}_0 = 0$, then the interval (7) is recovered; i.e., if $\hat{\phi} - \phi \sim N(0, \sigma^2)$, then the BCa interval is the same as the percentile one.

The estimators used to calculate $\hat{z}_0$ and $\hat{a}$ in the “distillery” package, and thus here, are those given in Efron and Tibshirani (1993) by Eqs. (14.14) and (14.15) therein, although the estimate used in “distillery” for the former also includes an adjustment for ties. Namely,
$$\hat{z}_0 = \Phi^{-1}\left\{\frac{\sum_{b=1}^{B}\left[I(\hat{\theta}^*_b < \hat{\theta}) + I(\hat{\theta}^*_b = \hat{\theta})/2\right]}{B}\right\},$$
where $\Phi^{-1}(\cdot)$ is the inverse of the standard normal cumulative distribution function, and I(⋅) is the indicator function taking value one if the statement is true and zero otherwise, and
$$\hat{a} = \frac{\sum_{i=1}^{n}(\tilde{\theta} - \hat{\theta}_{-i})^3}{6\left[\sum_{i=1}^{n}(\tilde{\theta} - \hat{\theta}_{-i})^2\right]^{3/2}},$$
where $\hat{\theta}_{-i}$ is the estimated θ with the ith data point removed and $\tilde{\theta} = (1/n)\sum_{i=1}^{n}\hat{\theta}_{-i}$. The estimate for $\hat{a}$ might seem strange, but recall that it is an adjustment for skewness in the distribution of $\hat{\phi} - \phi$, so it is just an estimate for this parameter.

The BCa interval (10) enjoys second-order accuracy, range preservation and respects transformations. On the other hand, the additional jackknife calculations in estimating $\hat{a}$ add to the computational complexity, so the method can be slow to compute. Further, for small α (typically α < 0.025), the coverage error increases and can even exceed that of the percentile method (Davison and Hinkley 1997; Carpenter and Bithell 2000). However, the issue is primarily with one-sided, rather than two-sided, CI’s and only occurs for rather extreme combinations of $z_0$ and $a$ (A. Davison 2020, personal communication; see also Davison and Hinkley 1997, their Table 5.8).

The estimates for $\hat{z}_0$ and $\hat{a}$ used by “distillery” are for nonparametric bootstrap resampling. Equations specific to the statistic’s distribution function may exist for estimating the acceleration term, $\hat{a}$. When using parametric bootstrap sampling, where the parameter of interest is only one of several parameters, the distribution for $\hat{\phi} - \phi$ will now depend on the specific parametric model (Davison and Hinkley 1997, section 5.3). Currently, “distillery” does not use the modified equation, but it will still calculate the BCa with these equations even for parametric resampling because they are still valid in the case where no nuisance parameters exist.

It is possible to achieve a correction with the BCa interval that results in a value of $\alpha_1$ or $\alpha_2$ that is much closer to zero or one than α so that, even with interpolation, $(B + 1)\alpha_i$, with $i = 1, 2$, could be less than one or greater than $B$. It is then possible that the relevant quantile cannot be found. In such a case, the recommendation is to use the extreme value (max or min, resp.) of $\hat{\theta}^*$ and quote the implied value of the quantile (Davison and Hinkley 1997, p. 205).
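The BCa calculation can be sketched in base R for the sample mean. This is a minimal illustration of the bias-correction and acceleration formulas under iid resampling, not the "distillery" code; the helper `bca.level` and all object names are hypothetical.

```r
set.seed(10)
x <- rexp(60)                    # hypothetical original sample
n <- length(x)
theta.hat <- mean(x)

B <- 2000
theta.star <- replicate(B, mean(sample(x, replace = TRUE)))

# Bias correction, including the adjustment for ties.
z0 <- qnorm(mean((theta.star < theta.hat) +
                 (theta.star == theta.hat) / 2))

# Acceleration from the leave-one-out (jackknife) estimates.
theta.jack <- vapply(seq_len(n), function(i) mean(x[-i]), numeric(1))
theta.tilde <- mean(theta.jack)
a <- sum((theta.tilde - theta.jack)^3) /
  (6 * sum((theta.tilde - theta.jack)^2)^(3 / 2))

# Adjusted probability level for a standard normal quantile z.
bca.level <- function(z, z0, a) {
  pnorm(z0 + (z0 + z) / (1 - a * (z0 + z)))
}

alpha <- 0.05
a1 <- bca.level(qnorm(alpha / 2), z0, a)
a2 <- bca.level(qnorm(1 - alpha / 2), z0, a)
bca.ci <- quantile(theta.star, probs = c(a1, a2), names = FALSE)
bca.ci
```

Setting both correction terms to zero in `bca.level` returns the unadjusted levels α/2 and 1 − α/2, so the percentile interval is recovered.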

(v) Normal approximation bootstrap CI
The normal approximation bootstrap CI requires an assumption that the test statistic(s) be at least approximately normally distributed. The interval is constructed by
$$(2\hat{\theta} - \bar{\theta}^* + \widehat{se}(\hat{\theta}^*)\,z_{\alpha/2},\; 2\hat{\theta} - \bar{\theta}^* - \widehat{se}(\hat{\theta}^*)\,z_{\alpha/2}),$$
where $\bar{\theta}^* = (1/B)\sum_{b=1}^{B}\hat{\theta}^*_b$ and $\widehat{se}(\hat{\theta}^*) = \sqrt{\sum_{b=1}^{B}(\hat{\theta}^*_b - \bar{\theta}^*)^2/(B - 1)}$. In other words, it is the usual normal approximation interval, but where it is bias corrected using the bootstrap estimate of the mean of the estimator(s), and the level-2 standard error parameter is estimated from the bootstrap sample $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B$. As such, it is generally more accurate than the usual normal approximation interval (4). However, it is only first-order accurate and it is neither range preserving nor transformation respecting.
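A short base R sketch of the normal approximation bootstrap interval follows. It assumes that the standard error is taken as the sample standard deviation of the bootstrap replicates (the `sd` function, which divides by $B - 1$); the simulated sample and object names are hypothetical.

```r
set.seed(11)
x <- rexp(60)                    # hypothetical original sample
theta.hat <- mean(x)

B <- 1000
theta.star <- replicate(B, mean(sample(x, replace = TRUE)))

alpha <- 0.05
theta.bar <- mean(theta.star)    # bootstrap mean of the estimator
se.star <- sd(theta.star)        # bootstrap standard error
z <- qnorm(alpha / 2)            # negative for alpha < 1

# Bias-corrected center, 2 * theta.hat - theta.bar, plus and
# minus the normal quantile times the bootstrap standard error.
norm.ci <- c(2 * theta.hat - theta.bar + se.star * z,
             2 * theta.hat - theta.bar - se.star * z)
norm.ci
```

The interval is symmetric about the bias-corrected estimate rather than about $\hat{\theta}$ itself.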
(vi) Test inversion bootstrap CI

The test-inversion bootstrap (TIB) presents a very different procedure for finding CI’s from those discussed above. The idea is to take advantage of the duality between test inversion and CI’s (Garthwaite and Buckland 1992; Kabaila 1993; Carpenter 1999; Carpenter and Bithell 2000; Schendel and Thongwichian 2015, 2017a,b). That is, the correct endpoints, $\theta_U$ and $\theta_L$, for a (1 − α) × 100% CI $(\theta_L, \theta_U)$ must satisfy $F_{\hat{\theta}}(\hat{\theta}; \theta_U, \psi) = 1 - \alpha/2$ and $F_{\hat{\theta}}(\hat{\theta}; \theta_L, \psi) = \alpha/2$, where ψ represents one or more nuisance parameters. The method is parametric because it requires simulating data from a parametric distribution. It is similar in some ways to the profile-likelihood method for finding confidence intervals in that it involves varying a single parameter, in this case a nuisance parameter ψ (and optimizing over any additional nuisance parameters). In the profile-likelihood method, however, it is the parameter that is varied (i.e., ψ) whose CI’s are sought, whereas the TIB method seeks CI’s for θ.

To obtain a TIB CI, it is necessary to perform multiple bootstrap resampling steps.

  1. Select a value of the nuisance parameter, ψ = ψ0 (optimizing over any other nuisance parameters so that ψ0 represents a vector of nuisance parameters), and simulate data, Z, from the parametric model.

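  2. Estimate $\hat{\theta}$ from the sample, Z, and perform bootstrap resampling on Z in order to obtain a sample $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$.

  3. Estimate $\Pr_{\psi = \psi_0}\{\hat{\theta}^* \le \hat{\theta} \mid \theta = \theta_U\}$ (and similarly for $\theta_L$).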

  4. Repeat the above steps as necessary.

The estimated probability in step 3 is computed in tibber and tibberRM from package “distillery” by
$$\frac{1}{B}\sum_{b=1}^{B}\left[I(\hat{\theta}^{*(b)} < \hat{\theta}) + I(\hat{\theta}^{*(b)} = \hat{\theta})/2\right], \tag{12}$$
with I(⋅) the indicator function as before. Ultimately, it is necessary to find $\theta_U$ and $\theta_L$ such that this estimated probability is as close as possible to 1 − α/2 and α/2, respectively.

The condition in the probability of step 3 that $\theta = \theta_U$ ($\theta = \theta_L$) simply emphasizes that if $\hat{\theta}$ is the limit sought, then the probability should be 1 − α/2 (or α/2). In practice, multiple repetitions are necessary, and some form of root-finding algorithm must be implemented. That is, an estimate for $\theta_U$ is sought so that the estimated probability in Eq. (12) is at least approximately 1 − α/2 (or α/2 for the lower bound). For a given sample from step 1, Eq. (12) is calculated and if it is close enough to 1 − α/2 (within a user-specified tolerance), then the estimate $\hat{\theta}$ from step 2 is used as the estimate for $\theta_U$.

The tibber function uses a predefined sequence of nuisance parameters and repeats the process for each parameter choice; an interpolation algorithm is then applied to try to find the best choice. The tibberRM function, on the other hand, uses the Robbins–Monro algorithm (Robbins and Monro 1951) to try to find the solution as suggested by Garthwaite and Buckland (1992). In the former case, the two endpoints can usually be obtained from the same sample of bootstrap p values acquired in step 3 of the algorithm. For the latter approach, the procedures must generally be run separately for each end point. Regardless of the root-finding method employed, the TIB intervals are more computationally burdensome than the CI’s previously described. Furthermore, it can be difficult to obtain a solution, making the intervals impractical for some problems. However, in certain situations, they have been found to be the most accurate, particularly for estimating parameters and return levels from extreme-value distributions for the stationary case (cf. Schendel and Thongwichian 2015, 2017a,b).

3. Comparative forecast verification

Each time a new forecast, or a modified version of an existing one, is proposed, interest lies in determining if the new forecast is better than the one currently in use. In this scenario, call the old forecast $\hat{X}_1 = \{\hat{X}_{11}, \ldots, \hat{X}_{1n}\}$ and the new forecast $\hat{X}_2 = \{\hat{X}_{21}, \ldots, \hat{X}_{2n}\}$, and suppose the sense of better is in terms of an average loss function g(⋅) that is a function of the observations, $X = \{X_1, \ldots, X_n\}$. Note, again, the use of upper case letters for random variables and lower case for realized values. In this setting, the hat notation is being used for the forecasts because they are estimators of the observations, even though they are generally not strictly statistical estimators. For the discussion here, each forecast takes values in time, but more generally, it is possible to analyze model output in space or both time and space. The first index of the subscripts identifies which forecast it is, and the second indexes which point in time. One useful summary is the expectation, $E[D] = \mu_D$, of the loss differential series, $D = g_1 - g_2 \equiv \{g(\hat{X}_{1i}, X_i) - g(\hat{X}_{2i}, X_i)\}_{i=1}^{n}$, which can be estimated statistically by $\bar{D}$. Examples of the loss function, $g$, include: simple loss, $g_j = g_j(\hat{X}_j, X) = \hat{X}_j - X$, square-error loss, $g_j = g_j(\hat{X}_j, X) = (\hat{X}_j - X)^2$, and absolute-error loss, $g_j = g_j(\hat{X}_j, X) = |\hat{X}_j - X|$, for $j = 1, 2$. Simple loss is less useful because it ultimately just compares the straight difference between the two forecasts as $g_1 - g_2 = \hat{X}_1 - X - (\hat{X}_2 - X) = \hat{X}_1 - \hat{X}_2$.
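The three loss functions and the mean loss differential can be illustrated in base R as follows. The observation vector `x` and the two forecasts `f1` and `f2` are simulated here purely as assumptions for the sketch.

```r
set.seed(12)
n <- 200
x  <- rnorm(n)                   # hypothetical observations
f1 <- x + rnorm(n, sd = 1.0)     # hypothetical old forecast
f2 <- x + rnorm(n, sd = 0.8)     # hypothetical new forecast

# Simple, square-error and absolute-error loss functions.
loss <- list(
  simple   = function(f, x) f - x,
  square   = function(f, x) (f - x)^2,
  absolute = function(f, x) abs(f - x)
)

# Mean loss differential, D.bar, under each loss function.
D.bar <- sapply(loss, function(g) mean(g(f1, x) - g(f2, x)))
D.bar
```

Under simple loss, the observations cancel and the mean loss differential is just the mean difference between the two forecasts.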

It is desired to test $H_0$: $\mu_D = 0$ against $H_a$: $\mu_D \ne 0$. If the loss differential series is independent and identically distributed, then the test is relatively straightforward. However, it is common that $D$ is not independent in time, which means that care must be taken in estimating the standard error of $\bar{D}$. Moreover, it is likely that $\hat{X}_1$ and $\hat{X}_2$ will be correlated with each other because they are both predicting $X$; in this setting, such correlation is said to be contemporaneous. Even if $\hat{X}_1$ were independent of $\hat{X}_2$, because the loss series, $g_1$ and $g_2$, both involve $X$, these loss series will have at least some contemporaneous correlation depending on the loss function $g$. Contemporaneous correlation can have an impact on the accuracy of statistical hypothesis tests (cf. Hering and Genton 2011; Gilleland 2013; DelSole and Tippett 2014; Gilleland et al. 2018). Therefore, this important scenario represents a challenging example for bootstrap CI’s, which will be investigated herein.

Gilleland et al. (2018) compared empirical hypothesis tests of several testing procedures where pairs of loss vectors were simulated to have certain known dependencies in time and between each other (following Hering and Genton 2011). The testing procedures analyzed included the parametric Student’s t and normal z tests, the normal z test with a variance inflation factor applied, as well as all of the bootstrap procedures detailed in section 2c(3), here, except for the TIB interval. The bootstrap testing was performed with both iid bootstrap resampling [i.e., using $\hat{F}_n(\cdot) = \sum_{i=1}^{n} I(X_i \le \cdot)/n$] and with CB bootstrap resampling. The iid bootstrap performed very well when no temporal dependence was simulated and poorly when temporal dependence was introduced, as expected. The CB bootstrap performed poorly when no temporal dependence was present and reasonably well otherwise; it was not affected by the contemporaneous correlation. Only the parametric testing procedure introduced in Hering and Genton (2011) outperformed the CB bootstrap in the face of temporal dependence and contemporaneous correlation.

In this paper, it will be shown how to duplicate the testing procedures utilized in Gilleland et al. (2018) using the bootstrap functions available in the R package “distillery.” The code to make the simulations, which were originally applied in Hering and Genton (2011), is also given. In addition to those methods, it is also shown, here, how to apply the TIB method to these simulations. Because of the additional computational complexity and the difficulty of automating it, however, it is not feasible to test the method as in Hering and Genton (2011) and Gilleland et al. (2018).

4. Simulation experiments of temporally dependent and contemporaneously correlated loss series

The following code can be used to simulate two loss series where each series has temporal dependence, and where the two are correlated contemporaneously. The simulations follow a first-order moving average model, denoted MA(1), with zero mean and unit variance (cf. Hamilton 1994; Brockwell and Davis 1996; Ripley 2002). They follow the simulation experiments utilized in Hering and Genton (2011) and Gilleland et al. (2018). All code below is written in R. Here, rho is the contemporaneous correlation, tau is a value between zero and one where values closer to one have stronger temporal dependence, and n is the sample size. The code depends on the mnormt package (Azzalini and Genz 2016), which can be installed from the R prompt using install.packages. For checking that the simulations follow an MA(1) process, the function arima from the R package stats (R Core Team 2017) is used. Anything on a line after the “#” symbol is a comment.

# The set-up.
library("mnormt")
rho <- 0.5
tau <- 0.5
n <- 1000
f <- sqrt(1 + tau**2)
P <- matrix(c(1, 0, rho, sqrt(1 - rho**2)), byrow = TRUE, ncol = 2)
I2 <- diag(2)

# Simulate two random series that are correlated with each other,
# but independent in time.
vt <- matrix(0, ncol = 2, nrow = n)
for(i in 1:n) {
  ut <- matrix(rmnorm(1, mean = rep(0, 2), varcov = I2), ncol = 2)
  vt[ i, ] <- t(P %*% t(ut))
} # end of for 'i' loop.

# Introduce temporal dependence into both series, and
# ensure variance = 1 by dividing by f
# (Gilleland et al. 2018, section 3b).
et <- matrix(0, ncol = 2, nrow = n)
vt <- rbind(c(0, 0), vt)
for(i in 1:n) et[ i, ] <- (vt[ i + 1, ] + tau * vt[ i, ]) / f

# Check the dependencies.
# Contemporaneous correlation
cor(et[, 1 ], et[, 2 ])

# Temporal correlations.
cor(et[ 1:(n - 1), 1 ], et[ 2:n, 1 ])
cor(et[ 1:(n - 1), 2 ], et[ 2:n, 2 ])

# Fit an MA(1) model to the simulations.
mod1 <- arima(et[, 1 ], order = c(0, 0, 1))
mod1 # cf. Table 2.
mod2 <- arima(et[, 2 ], order = c(0, 0, 1))
mod2 # cf. Table 2.

# Check that the variances of each series are approximately 1.
var(et[, 1 ])
var(et[, 2 ])

# plots of the two simulated series as in Figure 4.
r <- range(c(et), finite = TRUE) + c(-1, 1)
par(mfcol = c(2, 3))
plot(et, main = "Scatter Plot")
plot(1:n, et[, 1 ], type = "l", main = "Time series", ylim = r,
     xlab = "Time", ylab = "et")
lines(1:n, et[, 2 ], lty = 2, col = "gray")
acf(et[, 1 ])
pacf(et[, 1 ])
acf(et[, 2 ])
pacf(et[, 2 ])

Figure 4 shows plots from one such simulation as above. In the case plotted in the figure, the actual estimated contemporaneous correlation is approximately 0.51, and while the lag-1 temporal correlations are about 0.39 and 0.42 for the first and second simulated series, respectively, the MA(1) parameters τ (cf. Table 2) are approximately 0.51 and 0.53, with innovation variance (displayed in the printout for mod1 and mod2 as sigma2) of about 0.76 and 0.82.

Fig. 4.

Simulated set of two contemporaneously correlated MA(1) series from code example in section 4. Contemporaneous correlation and temporal dependence parameter are both set to 0.5.

Citation: Journal of Atmospheric and Oceanic Technology 37, 11; 10.1175/JTECH-D-20-0069.1

Table 2.

Results from one instance of fitting an MA(1) model to the simulations of contemporaneously correlated competing forecast loss series. Estimated contemporaneous correlation between et[,1] and et[,2] is $\hat{\rho} \approx 0.51$.


5. Calculating CI’s with the R software package “distillery”

Now, the interest for this example is in testing whether the differences in mean loss (ML), mean absolute loss (MAL), and MSL between the two series are statistically significantly different from zero. To run the test, a function that calculates these differences is needed. The function must minimally take the arguments data and ’...’, where the three dots inform R that additional arguments, beyond those explicitly stated in the argument list between the parentheses, can be passed to the function; this is useful if a function with additional arguments is called within this function. It should output a vector giving the three average values: ML, MAL, and MSL, respectively. The simulated series, et, will ultimately be passed into the function via the data argument. These series are simulated loss, so to obtain absolute loss and square-error loss, simply take the absolute value or square, respectively:

lossdiff.fun <- function(data, ...) {

lossdiff <- data[, 1 ] - data[, 2 ]

ALlossdiff <- abs(data[, 1 ]) - abs(data[, 2 ])

SLlossdiff <- (data[, 1 ])^2 - (data[, 2 ])^2

out <- c(mean(lossdiff), mean(ALlossdiff),

mean(SLlossdiff))

names(out) <- c("ML", "MAL", "MSL")

return(out)

} # end of 'lossdiff.fun' function.

In the above lossdiff.fun function, the second column of data is subtracted from the first and assigned to the object lossdiff; the analogous differences in absolute and squared loss are assigned to ALlossdiff and SLlossdiff. The names given to the components of the returned vector (using the names function) make the output printed to the screen easier to read.
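
As a quick check of what the function returns, it can be applied to a small hand-made matrix. This is only an illustrative sketch: the object toy is hypothetical data, not part of the simulations, and the function is repeated with plain ASCII quotes so that the block runs on its own:

```r
lossdiff.fun <- function(data, ...) {
  lossdiff <- data[, 1] - data[, 2]
  ALlossdiff <- abs(data[, 1]) - abs(data[, 2])
  SLlossdiff <- data[, 1]^2 - data[, 2]^2
  out <- c(mean(lossdiff), mean(ALlossdiff), mean(SLlossdiff))
  names(out) <- c("ML", "MAL", "MSL")
  return(out)
}

# Hypothetical toy data: three time points (rows), two loss series (columns).
toy <- cbind(c(1, -2, 3), c(0, 1, -1))

toy_out <- lossdiff.fun(toy)
toy_out
```

For these toy values, the three components work out by hand to 2/3, 4/3, and 4.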

Next, it is necessary to create bootstrap samples of ML, MAL, and MSL, which is handled by the booter function from “distillery.” The data are known to be dependent in time, which is confirmed by the ACF and PACF plots in Fig. 4, by the MA(1) model fits, and by the lag-1 correlations, so it is important to apply the CB bootstrap approach. For this approach, it is necessary to choose a block length, which should be much larger than the range of dependence but much shorter than the length of the series. Generally, the square root of the sample size is a good choice, which in this case is about 32. The default block length is 1, so if the block.length argument is not specified, the iid bootstrap resampling procedure is carried out:

library("distillery")

res <- booter(x = et, statistic = lossdiff.fun, B = 500,

block.length = 32)

summary(res)
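
To make the resampling step concrete, the idea of circular block resampling can be sketched in base R. This is an illustration of the general technique only; cb_indices is a hypothetical helper and is not a function from the “distillery” package, whose internals may differ:

```r
# Sketch of circular block resampling of time indices.
cb_indices <- function(n, block.length) {
  n.blocks <- ceiling(n / block.length)
  # Random block starting points, drawn with replacement.
  starts <- sample.int(n, size = n.blocks, replace = TRUE)
  # Each block runs forward from its start for block.length time points.
  idx <- unlist(lapply(starts, function(s) s + seq_len(block.length) - 1))
  # Wrap indices that run past n back to the beginning (the circular part).
  idx <- ((idx - 1) %% n) + 1
  # Trim to the original series length.
  idx[seq_len(n)]
}

set.seed(1)
id <- cb_indices(n = 10, block.length = 3)
id
```

Resampling whole rows with such indices, for example x[id, ] for a two-column matrix x, keeps each pair of contemporaneous values together while preserving the short-range temporal ordering within blocks.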

The summary output shows the original call to the booter function along with the estimated mean loss differential values and their bootstrap estimated (statistical) bias. The ci method function from package “distillery” is called to obtain the bootstrap CI’s. Because estimates for the standard errors of the loss values are not returned by lossdiff.fun, the Studentized bootstrap t method cannot be used. The other methods, however, can be applied. The following code will take a few moments to run because the BCa method requires another round of (jackknife) resampling to calculate. For this call, the default confidence level of 95% (α = 0.05) is used, but can be changed by using the alpha argument in the call to ci. For this example, the two loss series are simulated to have an average of zero (i.e., neither model is better than the other on average), so that $E[\bar{D}] = \mu_D = 0$. Thus, one would expect to find zero within the CI’s:

ci(res)

The printout for one implementation of the above code is shown in Table 3. In each case, the CB bootstrap correctly identifies that the ML is not significantly different from zero regardless of the CI method used because zero is included in each interval. However, MAL and MSL both show significant differences for each CI method because zero falls below the lower limit of each interval.
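
The decision rule used here, whether zero falls inside an interval, can be sketched for two simple interval types. This is a generic illustration and not the code behind ci; theta_hat and theta_star are hypothetical stand-ins for an estimate and its bootstrap replicates, simulated purely for demonstration:

```r
set.seed(7)
alpha <- 0.05

# Hypothetical estimate and B = 500 bootstrap replicates of it.
theta_hat <- 0.02
theta_star <- rnorm(500, mean = theta_hat, sd = 0.1)

# Percentile interval: quantiles of the replicates themselves.
ci_perc <- quantile(theta_star, probs = c(alpha / 2, 1 - alpha / 2),
                    names = FALSE)

# Basic interval: replicate quantiles reflected about the estimate.
ci_basic <- 2 * theta_hat - rev(ci_perc)

# TRUE means zero lies inside the interval (no significant difference).
zero_in_perc <- ci_perc[1] <= 0 && 0 <= ci_perc[2]

ci_perc
ci_basic
zero_in_perc
```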

Table 3.

Sample printout from ci for one implementation of the booter function on MA(1) simulated data. The last two rows of the table contain the results for ci from tibber and tibberRM. Results will vary.


To use the Studentized bootstrap t method, it is necessary to also return variance estimates of the estimated parameters. Therefore, the lossdiff.fun function must be modified, which is done below and called lossdiff.fun2:

lossdiff.fun2 <- function(data, ...) {

N <- dim(data)[ 1 ]

lossdiff <- data[, 1 ] - data[, 2 ]

AL <- abs(data[, 1 ]) - abs(data[, 2 ])

SL <- data[, 1 ]^2 - data[, 2 ]^2

out <- c(mean(lossdiff), var(lossdiff) / N, mean(AL),

var(AL) / N, mean(SL), var(SL) / N)

names(out) <- c("ML", "var(ML)",

"MAL", "var(MAL)", "MSL", "var(MSL)")

return(out)

} # end of 'lossdiff.fun2' function.

Because of the way subsequent functions work, the order of the variance terms matters: there must be a variance estimate for each level-1 parameter, and the variance estimates must appear in the same relative order as the level-1 parameters. The variance terms need not fall immediately after their level-1 parameters. Resampling is carried out as before, but the v.terms argument must be specified, which tells the function which values in the returned vector correspond to variance estimates for the level-1 parameters. The subsequent call to ci can now include the Studentized bootstrap t method (the “stud” option) among the bootstrap CI methods it calculates:

res2 <- booter(x = et, statistic = lossdiff.fun2, B = 500,

block.length = 32, v.terms = c(2, 4, 6))

summary(res2)

ci(res2)

Now the result of the summary command gives the additional bootstrap variance estimate along with the bootstrap estimates of the level-1 parameter and its (statistical) bias, and the ci function returns the bootstrap t CI along with the others.
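
The reason a variance estimate is needed for every resample can be seen from a generic bootstrap t construction. The sketch below is an assumption-laden illustration, not the “distillery” implementation: it uses a hypothetical iid sample x and simple iid resampling for brevity, whereas dependent data would call for block resampling:

```r
set.seed(10)
x <- rnorm(50, mean = 1)   # hypothetical iid sample
n <- length(x)
B <- 500
alpha <- 0.05

theta_hat <- mean(x)
se_hat <- sqrt(var(x) / n)

# For each resample, Studentize the estimate with its own standard error.
t_star <- replicate(B, {
  xb <- sample(x, size = n, replace = TRUE)
  (mean(xb) - theta_hat) / sqrt(var(xb) / n)
})

# Upper quantile first: the quantiles enter the limits in reversed order.
q <- quantile(t_star, probs = c(1 - alpha / 2, alpha / 2), names = FALSE)

# Bootstrap t limits (lower, upper).
ci_t <- theta_hat - q * se_hat
ci_t
```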

Parametric bootstrapping can also be performed with “distillery” by way of the pbooter function. This form of parametric bootstrap deviates from the more traditional method (cf. Davison and Hinkley 1997). It simulates from the parametric model in place of resampling from the data, but does not use distributional properties of the known distribution (i.e., the parametric model) to estimate level-1 and level-2 parameters, or CI’s. While more accurate estimates can be obtained by doing so, pbooter currently calculates these values with the traditional methods from nonparametric bootstrap sampling. For a given model, however, a user can easily obtain more accurate intervals by using the distribution-based methods (see Davison and Hinkley 1997, p. 16).

First, a function to simulate from the parametric model is required. It must take arguments size, the desired sample size, and ’...’, which allows for any additional arguments to be passed into the function:

simmer <- function(size, ...) {

args <- list(...)

tau <- args$tau

f <- sqrt(1 + tau**2)

rho <- args$rho

P <- matrix(c(1, 0, rho, sqrt(1 - rho**2)), byrow = TRUE,

ncol = 2)

I2 <- diag(2)

vt <- matrix(0, ncol = 2, nrow = size)

for(i in 1:size) {

ut <- matrix(rmnorm(1, mean = c(args$m1, args$m2),

varcov = I2), ncol = 2)

vt[ i, ] <- t(P %*% t(ut))

} # end of for 'i' loop.

out <- matrix(0, ncol = 2, nrow = size)

vt <- rbind(c(0, 0), vt)

for(i in 1:size) out[ i, ] <- (vt[ i + 1, ] + tau * vt[ i, ]) / f

return(out)

} # end of 'simmer' function.
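
The loops in simmer can also be written in vectorized form. The following is a sketch under stated assumptions: simmer_vec and its explicit arguments are hypothetical names, and it draws the normal variates with base R's rnorm rather than rmnorm, so it will not reproduce simmer draw for draw under the same seed:

```r
# Vectorized sketch of the same recipe using only base R.
simmer_vec <- function(size, tau, rho, m1 = 0, m2 = 0) {
  f <- sqrt(1 + tau^2)
  P <- matrix(c(1, 0, rho, sqrt(1 - rho^2)), byrow = TRUE, ncol = 2)
  # Independent unit-variance normal draws: one row per time point.
  u <- matrix(rnorm(2 * size), ncol = 2)
  u <- sweep(u, 2, c(m1, m2), "+")
  # Row-wise equivalent of t(P %*% t(ut)) from the loop version.
  v <- u %*% t(P)
  # Lagged copy of v, with zeros before the first time point.
  v_lag <- rbind(c(0, 0), v[-size, , drop = FALSE])
  (v + tau * v_lag) / f
}

set.seed(3)
sim <- simmer_vec(size = 1000, tau = 0.5, rho = 0.5)
dim(sim)
cor(sim[, 1], sim[, 2])
```

To be used as an rmodel function, such a helper would still need the (size, ...) argument form described above.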

Next, use pbooter in a similar fashion as booter. The variables tau, rho, m1, and m2 are passed into the simmer function by way of the ’...’ argument:

parRes <- pbooter(x = et, statistic = lossdiff.fun, B = 500,

rmodel = simmer, tau = 0.5, rho = 0.5, m1 = mean(et[, 1]),

m2 = mean(et[, 2 ]), verbose = TRUE)

summary(parRes)

ci(parRes)

For this example, results are similar to those from other methods and are not shown here for brevity.

For the TIB method, a parametric model with nuisance parameter(s) is also required, but it takes a slightly different form than for pbooter. The reason for the differences stems from the manner in which the parameters of the model are allowed to vary in making the simulations under the TIB paradigm. Because the model for this example is known, a simulation function can be generated to produce it, but it will also be of interest to generate a well-fitted, if incorrect, model to check the sensitivity of model selection. The functions must minimally take the arguments: data, par, n and ’...’. The first argument passes in the dataset (et in this example), which might not be used by the simulation function. The second argument, par, is the value of the nuisance parameter(s), n is the sample size, and ’...’ are any additional arguments that might be needed. For this example, the model generates two contemporaneously correlated MA(1) series, where the parameters of interest are functionals of the underlying distribution. So, one possible choice of nuisance parameter to vary, which will be used here, is the innovation variance for the second series. Other nuisance parameters include: the innovation variance of the first series, contemporaneous correlation between the two series, moving average parameter (assuming the same one for each series), and the mean of the series. These parameters will be estimated from the original sample in order to obtain series that follow the original sample as closely as possible:

simmerTIB <- function(data, par, n, ...) {

tau <- arima(data[, 1 ], order = c(0, 0, 1))$coef[ 1 ]

f <- sqrt(1 + tau**2)

rho <- cor(data[, 1 ], data[, 2 ])

P <- matrix(c(1, 0, rho, sqrt(1 - rho**2)), byrow = TRUE,

ncol = 2)

S2 <- diag(c(var(data[, 1 ]), par))

vt <- matrix(0, ncol = 2, nrow = n)

for(i in 1:n) {

ut <- matrix(rmnorm(1, mean = c(mean(data[, 1 ]),

mean(data[, 2 ])), varcov = S2), ncol = 2)

vt[ i, ] <- t(P %*% t(ut))

} # end of for 'i' loop.

out <- matrix(0, ncol = 2, nrow = n)

vt <- rbind(c(0, 0), vt)

for(i in 1:n) out[ i, ] <- (vt[ i + 1, ] + tau * vt[ i, ]) / f

return(out)

} # end of 'simmerTIB' function.

Next, because of the need to perform a root-finding algorithm with the TIB method, CI’s can be computed for only one statistic at a time. Therefore, it is necessary to modify lossdiff.fun slightly. In this case, the MAL will be returned:

tlossdiff.fun <- function(data, ...) {

lossdiff <- abs(data[, 1 ]) - abs(data[, 2 ])

return(mean(lossdiff))

} # end of 'tlossdiff.fun' function.

Now the tibber function can be called in order to produce 95% CI’s for the MAL. This function will make repeated calls to the booter function before finally identifying the CI’s by way of linear interpolation from having stepped through simulations using a series of values of the nuisance parameter. Choosing the nuisance parameter values to step through may take trial and error, but some idea of what the values should focus around is a good place to start. In this example, the innovation variances are close to unity, so a sequence of 100 values from 0.85 to 1.15 is chosen. Because the algorithm can take time to run, setting verbose to TRUE allows for checking the progress of the algorithm:

look <- tibber(x = et, statistic = tlossdiff.fun, B = 500,

rmodel = simmerTIB,

test.pars = seq(0.85, 1.15, length.out = 100),

verbose = TRUE)

look

The penultimate row of Table 3 displays the output from one run of this function. Actual values will vary because of the nature of simulating data. The estimated 95% CI’s are in line with those found by the other bootstrap methods, and suggest no statistically significant difference between the two series according to MAL.
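
The interpolation step can be illustrated with base R's approx function. This is a generic sketch, not the internals of tibber: est and p_val are hypothetical, idealized values that decrease in a backward S-shape, and the assignment of the 1 − α/2 crossing to the lower limit is an assumption that depends on how the p value is defined:

```r
# Hypothetical grid of bootstrap estimates and their estimated p values
# (illustrative numbers only).
est <- seq(-0.3, 0.3, by = 0.05)
p_val <- 1 - pnorm(est, mean = 0, sd = 0.1)

alpha <- 0.05

# Linearly interpolate the estimate at which the p-value curve crosses
# 1 - alpha / 2 (taken here as the lower limit) and alpha / 2 (upper limit).
tib_ci <- approx(x = p_val, y = est, xout = c(1 - alpha / 2, alpha / 2))$y
tib_ci
```

With real output the p values are noisy rather than smooth, which is why a finer grid of nuisance parameter values near the crossings tends to help.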

The tibberRM function finds the TIB CI by way of the Robbins–Monro algorithm. Instead of passing in a sequence of values to try for the nuisance parameter, initial values for each limit (lower and upper) are used along with a step size. These values must be found by trial and error, but setting the tolerance a little higher can help the algorithm find CI’s that are reasonably close to the desired confidence level:

lookRM <- tibberRM(x = et, statistic = tlossdiff.fun, B = 500,

rmodel = simmerTIB, startval = c(1.5, 0.75),

step.size = 0.03, tol = 0.001,

verbose = TRUE)

lookRM

The last row of Table 3 displays the printed output for one implementation of tibberRM on these simulations. The achieved bootstrap-estimated p values, 0.006 (lower) and 1.00 (upper), are not where they need to be; for a 95% CI, they should be 0.025 and 0.975, respectively. It might be possible to improve on these estimates through trial and error by choosing different values for the startval, step.size, and tol arguments.
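
The sensitivity to starting values and step size is a general feature of the Robbins–Monro recursion, which can be seen in a toy problem. The sketch below is a generic stochastic-approximation example, not the code used by tibberRM; target, c0, the starting value, and the noisy function p_hat are all hypothetical choices made for illustration:

```r
# Generic Robbins-Monro sketch: find theta with Pr(Z <= theta) = target
# for Z ~ N(0, 1), using only noisy Monte Carlo estimates of the probability.
set.seed(42)

target <- 0.975   # desired probability level
B <- 500          # Monte Carlo draws per noisy evaluation
c0 <- 10          # hypothetical step-size constant
theta <- 1.5      # hypothetical starting value

# Noisy estimate of Pr(Z <= theta).
p_hat <- function(theta, B) mean(rnorm(B) <= theta)

for (k in 1:500) {
  # Move up if the estimated probability is too small, down if too large,
  # with steps that shrink like 1 / k.
  theta <- theta + (c0 / k) * (target - p_hat(theta, B))
}

theta            # should settle near the true root
qnorm(target)    # true root for comparison
```

Because the probability curve is nearly flat in its tails, a poor starting value or an unsuitable step constant leaves the recursion far from the root even after many iterations.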

Because, in general, it can be difficult to find adequate estimated values from the numerical root-finding algorithms, both tibber and tibberRM have plot methods associated with them, which graphically display the estimated p values against the estimates:

plot(look)

plot(lookRM)

Figures 5 and 6 display the results of the above plot commands. The plots take the backward S-shape, and should have the estimated value plotted near where the outer vertical lines cross the horizontal lines. In this case, the interpolation method seems to have found appropriate limits. Figure 6 confirms that the Robbins–Monro algorithm requires further trial and error, especially for the upper limit, to find an appropriate 95% CI.

Fig. 5.

Resulting p values plotted against bootstrap estimates for MAL from using the tibber function to find 95% CI’s for MAL applied to contemporaneously correlated MA(1) simulations. Horizontal lines go through the bootstrap estimated p values associated with the 95% TIB CI’s, which are shown via the leftmost and rightmost vertical blue lines. The center vertical line shows the estimated MAL value.


Fig. 6.

As in Fig. 5, but using the tibberRM function. Here, light blue symbols are MAL estimates from trying to identify the upper limit, and dark blue for the lower limit.


6. Discussion and conclusions

This paper demonstrates how to use the bootstrap functions available in the R package “distillery” with application to competing forecast verification of continuous variables. Bootstrap methods have been shown to be highly accurate in this area relative to other methods often used such as the normal z or Studentized t tests with or without variance inflation factor applied to account for dependence. However, the most commonly employed bootstrap procedure is not valid because of an assumption of independence in the data that is generally not met. Although not as accurate as the Hering–Genton test (Hering and Genton 2011), the bootstrap method is generally easier to implement.

While the “boot” package (Davison and Hinkley 1997; Canty and Ripley 2017) in R provides excellent utility for performing bootstrap resampling and estimating CI’s, the functions in “distillery” make certain operations easier, some of which are not possible with “boot.” For example, this paper demonstrates how to perform a test-inversion bootstrap (TIB), which does not appear to be possible in “boot” to the best of this author’s knowledge. The functions here are aimed at geophysical applications where tests for differences in mean, for example, generally have paired data.

Acknowledgments

NCAR is sponsored by NSF and managed by the University Corporation for Atmospheric Research.

Data availability statement

All data used in this paper are available from the distillery package in R.

REFERENCES

  • Arcones, M. A., and E. Giné, 1989: The bootstrap of the mean with arbitrary bootstrap sample. Ann. Inst. Henri Poincaré, 25, 457–481.

  • Arcones, M. A., and E. Giné, 1991: Additions and corrections to “The bootstrap of the mean with arbitrary bootstrap sample.” Ann. Inst. Henri Poincaré, 27, 583–595.

  • Athreya, K. B., 1987a: Bootstrap of the mean in the infinite variance case. Proc. First World Congress of the Bernoulli Society, Utrecht, Netherlands, Bernoulli Society, 95–98.

  • Athreya, K. B., 1987b: Bootstrap of the mean in the infinite variance case. Ann. Stat., 15, 724–731, https://doi.org/10.1214/aos/1176350371.

  • Azzalini, A., and A. Genz, 2016: mnormt: The Multivariate Normal and t Distributions, version 1.5-5. R package, https://azzalini.stat.unipd.it/SW/Pkg-mnormt.

  • Bickel, P. J., and D. A. Freedman, 1981: Some asymptotic theory for the bootstrap. Ann. Stat., 9, 1196–1217, https://doi.org/10.1214/aos/1176345637.

  • Bickel, P. J., F. Götze, and W. R. van Zwet, 1997: Resampling fewer than n observations: Gains, losses, and remedies for losses. Stat. Sin., 7, 1–31.

  • Brockwell, P. J., and R. A. Davis, 1996: Introduction to Time Series and Forecasting. Springer, 420 pp.

  • Canty, A., and B. Ripley, 2017: boot: Bootstrap R (S-Plus) Functions, version 1.3-20. R package, http://statwww.epfl.ch/davison/BMA/.

  • Carpenter, J., 1999: Test inversion bootstrap confidence intervals. J. Roy. Stat. Soc., 61B, 159–172, https://doi.org/10.1111/1467-9868.00169.

  • Carpenter, J., and J. Bithell, 2000: Bootstrap confidence intervals: When, which, what? A practical guide for medical statisticians. Stat. Med., 19, 1141–1164, https://doi.org/10.1002/(SICI)1097-0258(20000515)19:9<1141::AID-SIM479>3.0.CO;2-F.

  • Davison, A., and D. Hinkley, 1997: Bootstrap Methods and Their Application. Cambridge University Press, 582 pp.

  • Deheuvels, P., D. M. Mason, and G. R. Shorack, 1993: Some results on the influence of extremes on the bootstrap. Ann. Inst. Henri Poincaré Probab. Stat., 29, 83–103.

  • DelSole, T., and M. K. Tippett, 2014: Comparing forecast skill. Mon. Wea. Rev., 142, 4658–4678, https://doi.org/10.1175/MWR-D-14-00045.1.

  • Efron, B., 1979: Bootstrap methods: Another look at the jackknife. Ann. Stat., 7, 1–26, https://doi.org/10.1214/aos/1176344552.

  • Efron, B., and R. Tibshirani, 1993: An Introduction to the Bootstrap. Vol. 57. Chapman and Hall, 436 pp.

  • Feigin, P., and S. I. Resnick, 1997: Linear programming estimators and bootstrapping for heavy-tailed phenomena. Adv. Appl. Probab., 29, 759–805, https://doi.org/10.2307/1428085.

  • Fukuchi, J.-I., 1994: Bootstrapping extremes of random variables. Ph.D. thesis, Dept. of Statistics, Iowa State University, 96 pp., https://doi.org/10.31274/rtd-180813-10322.

  • Garthwaite, P. H., and S. T. Buckland, 1992: Generating Monte Carlo confidence intervals by the Robbins-Monro process. Appl. Stat., 41, 159–171, https://doi.org/10.2307/2347625.

  • Gilleland, E., 2010: Confidence intervals for forecast verification. NCAR Tech. Rep. NCAR/TN-479+STR, 71 pp.

  • Gilleland, E., 2013: Testing competing precipitation forecasts accurately and efficiently: The spatial prediction comparison test. Mon. Wea. Rev., 141, 340–355, https://doi.org/10.1175/MWR-D-12-00155.1.

  • Gilleland, E., 2017: distillery: Method Functions for Confidence Intervals and to Distill Information from an Object, version 1.0-4. R package, https://www.ral.ucar.edu/staff/ericg.

  • Gilleland, E., A. S. Hering, T. L. Fowler, and B. G. Brown, 2018: Testing the tests: What are the impacts of incorrect assumptions when applying confidence intervals or hypothesis tests to compare competing forecasts? Mon. Wea. Rev., 146, 1685–1703, https://doi.org/10.1175/MWR-D-17-0295.1.

  • Giné, E., and J. Zinn, 1989: Necessary conditions for the bootstrap of the mean. Ann. Stat., 17, 684–691, https://doi.org/10.1214/aos/1176347134.

  • Hall, P., 1990: Asymptotic properties of the bootstrap for heavy-tailed distributions. Ann. Probab., 18, 1342–1360, https://doi.org/10.1214/aop/1176990748.

  • Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167, https://doi.org/10.1175/1520-0434(1999)014<0155:HTFENP>2.0.CO;2.

  • Hamilton, J. D., 1994: Time Series Analysis. Princeton University Press, 799 pp.

  • Hannig, J., 2009: On generalized fiducial inference. Stat. Sin., 19, 491–544.

  • Hannig, J., H. K. Iyer, and P. Patterson, 2006: Fiducial generalized confidence intervals. J. Amer. Stat. Assoc., 101, 254–269, https://doi.org/10.1198/016214505000000736.

  • Hannig, J., H. K. Iyer, and C. M. Wang, 2007: Fiducial approach to uncertainty assessment accounting for error due to instrument resolution. Metrologia, 44, 476–483, https://doi.org/10.1088/0026-1394/44/6/006.

  • Hering, A. S., and M. G. Genton, 2011: Comparing spatial predictions. Technometrics, 53, 414–425, https://doi.org/10.1198/TECH.2011.10136.

  • Jolliffe, I. T., 2007: Uncertainty and inference for verification measures. Wea. Forecasting, 22, 637–650, https://doi.org/10.1175/WAF989.1.

  • Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley, 254 pp.

  • Kabaila, P., 1993: Some properties of profile bootstrap confidence intervals. Aust. J. Stat., 35, 205–214, https://doi.org/10.1111/j.1467-842X.1993.tb01326.x.

  • Kinateder, J. G., 1992: An invariance principle applicable to the bootstrap. Exploring the Limits of Bootstrap, R. Lepage and L. Billard, Eds., Wiley Series in Probability and Mathematical Statistics, Wiley, 157–181.

  • Knight, K., 1989: On the bootstrap of the sample mean in the infinite variance case. Ann. Stat., 17, 1168–1175, https://doi.org/10.1214/aos/1176347262.

  • Lahiri, S. N., 2003: Resampling Methods for Dependent Data. Springer, 374 pp.

  • Lee, S., 1999: On a class of m out of n bootstrap confidence intervals. J. Roy. Stat. Soc., 61B, 901–911, https://doi.org/10.1111/1467-9868.00209.

  • LePage, R., 1992: Bootstrapping signs. Exploring the Limits of Bootstrap, R. Lepage and L. Billard, Eds., Wiley Series in Probability and Mathematical Statistics, Wiley, 215–224.

  • Lidong, E., J. Hannig, and H. Iyer, 2008: Fiducial intervals for variance components in an unbalanced two-component normal mixed linear model. J. Amer. Stat. Assoc., 103, 854–865, https://doi.org/10.1198/016214508000000229.

  • Liu, Y., and Coauthors, 2009: An operational mesoscale ensemble data assimilation and prediction system: E-RTFDDA—System design and verification. 19th Numerical Weather Prediction Conf./23rd Weather Forecasting Conf., Omaha, NE, Amer. Meteor. Soc., 16A.4, https://ams.confex.com/ams/23WAF19NWP/techprogram/paper_154271.htm.

  • Mood, A. M., F. A. Graybill, and D. C. Boes, 1963: Introduction to the Theory of Statistics. 3rd ed. McGraw Hill, 564 pp.

  • Okkan, U., and U. Kirdemir, 2018: Investigation of the behavior of an agricultural-operated dam reservoir under RCP scenarios of AR5-IPCC. Water Resour. Manage., 32, 2847–2866, https://doi.org/10.1007/s11269-018-1962-0.

  • R Core Team, 2017: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, https://www.R-project.org/.

  • Resnick, S. I., 2007: Heavy-Tail Phenomena: Probabilistic and Statistical Modeling. Springer Series in Operations Research and Financial Engineering, Springer, 404 pp.

  • Ripley, B. D., 2002: Time series in R 1.5.0. R News, No. 2(2), R Foundation, Vienna, Austria, 2–7, https://www.r-project.org/doc/Rnews/Rnews_2002-2.pdf.

  • Robbins, H., and S. Monro, 1951: A stochastic approximation method. Ann. Math. Stat., 22, 400–407, https://doi.org/10.1214/aoms/1177729586.

  • Rubin, D. B., 1981: The Bayesian bootstrap. Ann. Stat., 9, 130–134, https://doi.org/10.1214/aos/1176345338.

  • Schendel, T., and R. Thongwichian, 2015: Flood frequency analysis: Confidence interval estimation by test inversion bootstrapping. Adv. Water Resour., 83, 1–9, https://doi.org/10.1016/j.advwatres.2015.05.004.

  • Schendel, T., and R. Thongwichian, 2017a: Confidence intervals for return levels for the peaks-over-threshold approach. Adv. Water Resour., 99, 53–59, https://doi.org/10.1016/j.advwatres.2016.11.011.

  • Schendel, T., and R. Thongwichian, 2017b: Considering historical flood events in flood frequency analysis: Is it worth the effort? Adv. Water Resour., 105, 144–153, https://doi.org/10.1016/j.advwatres.2017.05.002.

  • Shao, J., and T. Dongsheng, 1995: The Jackknife and the Bootstrap. Springer-Verlag, 123 pp.

  • von Storch, H., and F. W. Zwiers, 2001: Statistical Analysis in Climate Research. 1st ed. Cambridge University Press, 484 pp.

  • Wandler, D., and J. Hannig, 2012: Generalized fiducial confidence intervals for extremes. Extremes, 15, 67–87, https://doi.org/10.1007/s10687-011-0127-9.

  • Wasserman, L., 2005: All of Statistics: A Concise Course in Statistical Inference. Springer, 442 pp.

  • Wilks, D. S., 1997: Resampling hypothesis tests for autocorrelated fields. J. Climate, 10, 65–82, https://doi.org/10.1175/1520-0442(1997)010<0065:RHTFAF>2.0.CO;2.

  • Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. An Introduction. 2nd ed. Academic Press, 627 pp.

1

The population mean μD is the average obtained if the two models could be run to infinity (or at least as long as Earth exists and has weather to model). Because it is generally not possible to wait through to infinity to calculate this mean, a finite, representative sample of model runs is used in order to infer its value.

2

A test statistic is a type of estimator. In classical hypothesis testing, e.g., as opposed to bootstrap resampling methods, it was more important to make this distinction. In this paper, the term estimator will generally be used except in some cases to emphasize the role of the estimator in hypothesis testing.

3

In the Bayesian estimation framework, it is possible to construct an interval with a more appealing interpretation, namely, about the probability that the true parameter falls inside the interval, because the population parameter is considered a random variable under this paradigm, where in the frequentist setting it is a fixed but unknown quantity. Bayesian estimation is not considered here.

4

It is possible to obtain intervals with a more satisfying interpretation under the frequentist context. In this paradigm, a pivotal quantity, $Q_\theta = q(X_1, \ldots, X_n; \theta)$, is required. A pivotal quantity is a test statistic that is a function of the random sample and possibly unknown parameters where its distribution function is free of any unknown parameters. θ is taken to be a random variable whose distribution function is not known. Instead, based on the sample statistic’s distribution function, it is possible to estimate $\Pr\{q_1 < Q_\theta < q_2\} = 1 - \alpha$. In this expression, the unknown parameter θ can be isolated because of the nature of the pivotal quantity. Thus, the probability still holds but now is valid for that of the unknown parameter, θ. This fiducial approach is also not discussed further here (for more about fiducial testing, see Hannig 2009; Lidong et al. 2008; Hannig et al. 2006, 2007; Wandler and Hannig 2012).
