## 1. Introduction

The estimation of robust parameters that describe change from time series of data, together with their statistical significance, is an important aspect of modern climate research. Sophisticated and appropriate statistical methods are needed for this task. Data for climate studies are collected within large international projects designed for time spans of several decades, for example, the Global Climate Observing System (GCOS). Such data include both in situ measurements, such as radiosondes, and remote sensing measurements from spaceborne instruments. This effort has been generating long-term time series of diverse quantities (Schulz et al. 2009). The most famous climate time series ever recorded is probably the CO_{2} time series at the Mauna Loa Observatory in Hawaii, initiated by the late C. D. Keeling (Keeling et al. 2001). Long-term time series provide extremely useful information about the change of the respective quantities in the form of trends. The separation of the daily, weekly, monthly, seasonal, yearly, decadal, and multidecadal changes, which are often called trends, is challenging, and the significance of these trends depends on the length of the data record, the instrumental noise, the natural variability and noise, and also on autocorrelations in the noise. Additionally, any changes in instruments, calibration, and the location of measurement sites reduce the significance of the derived trends (Weatherhead et al. 1998). To test the robustness of the estimated errors in the derived trends, and to quantitatively establish the representativeness of the trends, it is not sufficient to calculate only the trends and their errors; it is also necessary to compare trends from independent instruments having overlapping measurements.

In this manuscript, we compare trends of total water vapor columns obtained from different sources. One dataset combines the Global Ozone Monitoring Experiment (GOME; Burrows et al. 1999), onboard *ERS-2* since 1995, and the Scanning Imaging Absorption Spectrometer for Atmospheric Chartography (SCIAMACHY; Bovensmann et al. 1999), onboard *ENVISAT* since 2002. In addition, water vapor trends have been derived from the globally distributed radiosonde network stations archived by the Satellite Application Facility on Climate Monitoring (CMSAF), which is hosted by the Deutscher Wetterdienst (DWD). Because the two data sources are independent, they are well suited for comparing trends in water vapor and assessing their significance. However, the datasets differ in sampling time, spatial resolution, and other respects, so the question arises of how best to compare these data and the trends derived from them.

The appropriate comparison of data or quantities such as means and standard deviations from different measurement techniques is a common problem in a variety of scientific disciplines. One possible solution for this issue has a long tradition, going back at least to Behrens (1929), who proposed a solution to the problem of the accurate determination of the difference of means in data from different sources, assuming that the standard deviations of the different datasets are unequal and unknown. Fisher (1937) found a solution to the problem using fiducial inference. The Behrens–Fisher problem has also been solved by Jeffreys (1939) and recently, for example, by Bretthorst (1993) and Moreno et al. (1999), who have used Bayesian probability analysis. Currently, examples of the Behrens–Fisher problem are mostly solved by using the frequentist statistics approach, namely, the Welch test (Welch 1947), which is an adaptation of the Student’s *t* test. Unfortunately, the Behrens–Fisher problem is often incorrectly simplified in such a way that a difference of given quantities is stated as significant if the respective error bars do not overlap, as discussed by Lanzante (2005).

In this study the Behrens–Fisher problem associated with the derivation of trends in water vapor from different time series having different, and possibly unknown, standard deviations has been investigated. The objective of our analysis is twofold. First, methods for trend comparisons in the sense of hypothesis testing under the frequentist and Bayesian framework have been developed. Second, the methods have been applied to water vapor data in a demonstration study, which shows that valuable meteorological information is gained successfully by applying the methods developed in this manuscript.

The manuscript is structured as follows. In section 2, we introduce the databases used, that is, the water vapor measurements from satellites and ground stations. In section 3, we briefly describe the two relevant schools of statistics—“the frequentist” (standard) approach and the Bayesian concepts—and explain their underlying philosophy. The statistical methods used are discussed in section 4, including the trend calculation, the Welch test for trends, the Bayesian model selection applied to trends, a sensitivity study regarding prior information, and the derivation of an approximation of the Bayesian method. The three approaches are applied to the GOME–SCIAMACHY and radiosonde water vapor time series and compared in section 5. Finally, the conclusions are given in section 6.

## 2. The water vapor measurements

### a. GOME–SCIAMACHY

The global total water vapor column amounts used in the present study have been retrieved by the Air Mass Corrected Differential Optical Absorption Spectroscopy approach (AMC–DOAS) (Noël et al. 2004) from observations of the upwelling radiation at the top of the atmosphere in the visible range, measured by the GOME instrument onboard the orbiting platform *ERS-2*, launched in April 1995, and the SCIAMACHY spectrometer onboard *ENVISAT*, launched in March 2002. These satellites fly in sun-synchronous orbits in descending node at a height of about 785 km. GOME crosses the equator at 1030 local time and has a spatial resolution of typically 40 km × 320 km; global coverage is achieved after about three days. SCIAMACHY crosses the equator at 1000 local time and has a somewhat better spatial resolution of typically 30 km × 60 km, whereas global coverage is achieved within six days. Since the instruments depend on sunlight, no water vapor information is retrieved at night, and no data are obtained during the polar night. Scenes with too much cloud cover and regions with extreme elevation (e.g., the Himalayas) are excluded from the analysis (Noël et al. 2004). The GOME–SCIAMACHY dataset is gridded on a 0.5° × 0.5° lattice and then accumulated to monthly means. This yields a cloud-cleared monthly mean water vapor climatology between 1000 and 1030 local time.

### b. Radiosondes

Radiosondes are devices carried on small balloons to heights around 35 km. Typically the parameters pressure, altitude, geographical position, temperature, relative humidity, and wind speed are recorded and sent to a ground station. Comprehensive information on different sensors, designs, calibration, etc. can be found, for example, in Elliott and Gaffen (1991), Garand et al. (1992), Lally (1985), and Vaisala (1989).

Under the framework of the World Meteorological Organization, a network of globally distributed radiosonde stations performs regular observations, which have been quality controlled and archived at the DWD. In this study, 187 high-quality time series (from more than 900 stations in total) are considered. The measured water vapor profiles have been integrated up to 100 hPa to yield the total column amounts. Typically the radiosondes are launched daily at 0600, 1200, 1800, and 0000 local time. These data have been averaged to derive monthly means, which are comparable with those determined from the satellite data. The time span is selected to match the satellite time range, and the comparison is performed for data from January 1996 to December 2007.

There are several differences between the two datasets. One of the most important differences is horizontal spatial resolution, which is relatively low, that is, poor (0.5° × 0.5°) for the satellite data and high (point measurements) for the radiosonde observations. Further differences in the water vapor data are expected because of the AMC–DOAS data being cloud cleared, whereas the radiosonde data contain all-sky measurements. The difference between clear-sky and all-sky relative humidity has been explored, for example, by Kahn et al. (2009) for the upper troposphere, where a large impact has been observed. Since we are dealing with the total column water vapor, these effects in the upper troposphere have only a minor contribution. However, Sohn and Bennartz (2008) have investigated the clear-sky bias for total column water vapor and found a bias of about 0.2 g cm^{−2} for zonal means between clear-sky and all-sky data. Additional biases can be expected due to diurnal variations of water vapor, as the satellite measurements in principle present a morning climatology (equator crossing at about 1000 local time), whereas the radiosondes sample the complete day. To account for such biases we use individual offsets for each dataset in the regression procedure in section 4a.

In spite of the differences in sampling there are important similarities in both datasets, for example, the temporal resolution being monthly. The averaging to monthly means typically smears out small-scale fluctuations and the individual offsets account for biases, which makes a comparison possible between the two datasets.

## 3. The two schools of statistics

The frequentist statistics approach was primarily developed by, for example, Fisher, Neyman, and Pearson at the beginning of the twentieth century. The underlying philosophy of the frequentist statistics is the interpretation of an event probability as the limit of its relative frequency for a large number of trials. A major component of how statistics is used in environmental science is the hypothesis testing, which is used under the framework of induction to make decisions using experimental data. The basic concept of hypothesis testing is to set up a null hypothesis *H*_{0}, which is assumed to be true and an alternative hypothesis *H*_{1}, which is the complementary event of *H*_{0}. Then the probability of exceeding a value of a test statistic (according to *H*_{0}) is inferred. The null hypothesis is typically rejected if the observed probability is below a significance level of, for example, *α* = 0.05. Such a case would confirm the alternative hypothesis.

Bayesian statistics has been developed by Bayes (1763) and de Laplace (1812). It is much older than the frequentist approach but was then largely forgotten until Jeffreys (1939) rediscovered the ideas of Bayes and de Laplace. The Bayesian concepts have undergone a renaissance in the late twentieth century, in part as a result of the increase of computational power. Influence on the Bayesian development in recent times has been contributed by, for example, Jaynes and Bretthorst (2003). An advantage of the Bayesian formalism is that it is based completely on probability theory (Jaynes and Bretthorst 2003), whereas the frequentist statistics represents rather a compilation of a large amount of tests and methods. Hypothesis testing can also be accomplished within the Bayesian framework. However, Bayesian hypothesis testing is better described as a model selection procedure, that is, inferring which model or hypothesis has the higher probability to explain certain data or phenomena.

The major differences between frequentist and Bayesian statistics are as follows.

*Philosophical difference:* The deep philosophical difference is that the parameters are fixed (but unknown) in the former and have some randomness in the form of a prior (or degree of belief) distribution in the latter. Data are used by Bayesians to update the prior knowledge in the form of a prior distribution, resulting in a posterior distribution that expresses the relative evidence of the parameter values given the data and the prior knowledge. In contrast, frequentists calculate statistics from the data to estimate the parameters and calculate the distribution of these statistics in (hypothetical) other datasets generated under the same model and assumed fixed parameter values. This can be elucidated by conditional probabilities. A conditional probability is the probability of an event *X* given the occurrence of another event *Y* and is denoted as *P*(*X*|*Y*). In the frequentist approach *X* could be data or a statistic derived from the data and *Y* a hypothesis, for example, a model with a particular parameter value that is assumed to have generated the data. In hypothesis testing frequentists then calculate the exceedance probability *P*(*X* > *X*_{0}|*Y*), where *X*_{0} is the (test) statistic calculated from the data at hand. Bayesians can give the reverse, *P*(*Y*|*X*), the probability of the hypothesis given the data. The frequentist probability requires the notion of other (hypothetical) datasets generated from the same model and parameters, whereas the Bayesian probability conditions on the particular data at hand.

*Prior information:* Bayesian methods utilize prior information about the truth of a hypothesis or parameter range, which reflects the knowledge (or ignorance) before the data have been analyzed. This enables a new quality of statistical inference, for example, the investigation of the evidence for human-induced climate change, as performed by Lee et al. (2005). In frequentist statistics such prior information does not exist.

The two conditional probabilities are connected by Bayes' theorem,

$$P(Y \mid X, I) = \frac{P(X \mid Y, I)\, P(Y \mid I)}{P(X \mid I)}, \tag{1}$$

where *X* and *Y* are propositions, and *I* denotes the relevant background information. The *I* is often neglected, but it has to be kept in mind that no absolute probabilities exist without certain background assumptions or information. The *P*(*Y*|*X*, *I*) is called the *posterior* probability; *P*(*X*|*Y*, *I*) is the likelihood; *P*(*Y*|*I*) is the *prior* probability; and *P*(*X*|*I*) has formerly been called the marginal likelihood, for which Sivia and Skilling (2006) have introduced the term “evidence.”
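As a minimal numerical illustration of how prior, likelihood, and evidence combine, the following Python sketch computes posterior probabilities for a set of mutually exclusive hypotheses. The function name `posterior` and all numbers are hypothetical and are not taken from the study; the evidence is obtained by summing likelihood times prior over the hypotheses.

```python
def posterior(priors, likelihoods):
    """Posterior probabilities P(Y_k | X, I) for mutually exclusive, exhaustive hypotheses.

    priors[k]      = P(Y_k | I)
    likelihoods[k] = P(X | Y_k, I)
    """
    # Evidence P(X | I): sum of likelihood times prior over all hypotheses.
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / evidence for p, l in zip(priors, likelihoods)]


# Hypothetical numbers: equal priors, data three times more likely under the first hypothesis.
post = posterior([0.5, 0.5], [0.3, 0.1])
print(post)
```

With equal priors the posterior ratio equals the likelihood ratio, which is the situation assumed later for the two trend hypotheses.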

## 4. Methods

### a. Estimating trends from time series

As one quality criterion, the time series have to include at least two-thirds of all (144) data points considered for the comparison. This is because large data gaps are not representative. The disadvantage of the two-thirds criterion is that only 187 radiosonde time series fulfill this requirement, whereas the satellite data fulfill the criterion in all 908 cases.

Global GOME–SCIAMACHY water vapor trends have been calculated for the time span from 1996 to 2006 in Mieruch et al. (2008), where the methods have been adopted and slightly expanded from Weatherhead et al. (1998). We have extended the trend analysis, which now includes the time from 1996 to 2007.

In the following, the water vapor trend estimation is briefly discussed. For a more detailed description we refer to Mieruch et al. (2008) and Weatherhead et al. (1998).

The monthly mean GOME–SCIAMACHY time series is described by the trend model

$$Y_{1t} = \mu_1 C_{1t} + S_{1t} + \omega_1 X_{1t} + \delta U_t + N_{1t}, \qquad t = 1, \ldots, T_1, \tag{2}$$

where *Y*_{1t} contains the monthly mean water vapor data, *μ*_{1} is a constant to be estimated, and *C*_{1t} equals unity for all *t* and is needed when assessing autocorrelations; *S*_{1t} is the seasonal component, *ω*_{1} represents the trend, and *X*_{1t} contains the time. The subscript “1” is used for the satellite data, whereas we will use the subscript “2” for the radiosonde trend model.

The term *δU*_{t} describes a level shift of magnitude *δ* at and after time *t* = *T*_{0} (1 < *T*_{0} < *T*_{1}), where *T*_{0} = 85 represents the intersection of GOME and SCIAMACHY data in January 2003. Here *U*_{t} describes a step function:

$$U_t = \begin{cases} 0, & t < T_0, \\ 1, & t \ge T_0. \end{cases} \tag{3}$$

The noise *N*_{1t} is modeled as an autoregressive process of order one [AR(1)], that is, *N*_{1t} = *ϕ*_{1}*N*_{1t−1} + *ε*_{1t} (Schlittgen and Streitberg 1997) to consider autocorrelations in the data, where *ε*_{1t} is an independent random variable with zero mean and variance *σ*_{1}^{2}. The magnitude of autocorrelation *ϕ*_{1} is restricted to −1 < *ϕ*_{1} < 1 and is estimated using the discrete autocorrelation function (Edelson and Krolik 1988), which can account for gaps in the data.
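The idea of estimating the AR(1) coefficient from a series with missing months can be sketched as follows. This is a deliberately simplified lag-1 estimator that only uses pairs of consecutive months in which both values are present; it is not the discrete correlation function of Edelson and Krolik (1988), and the helper names `lag1_autocorrelation` and `simulate_ar1` as well as the synthetic data are assumptions made for illustration.

```python
import random


def lag1_autocorrelation(series):
    """Estimate the lag-1 autocorrelation of a series that may contain gaps (None)."""
    valid = [x for x in series if x is not None]
    mean = sum(valid) / len(valid)
    var = sum((x - mean) ** 2 for x in valid) / len(valid)
    # Use only pairs of consecutive time steps in which both values are present.
    pairs = [(a - mean) * (b - mean)
             for a, b in zip(series[:-1], series[1:])
             if a is not None and b is not None]
    return (sum(pairs) / len(pairs)) / var


def simulate_ar1(phi, sigma, n, seed=0):
    """Generate AR(1) noise N_t = phi * N_{t-1} + eps_t with Gaussian eps_t."""
    rng = random.Random(seed)
    noise, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, sigma)
        noise.append(prev)
    return noise


# Synthetic test series with an assumed phi = 0.5; every tenth value is removed.
series = simulate_ar1(phi=0.5, sigma=1.0, n=5000)
for k in range(0, len(series), 10):
    series[k] = None
phi_hat = lag1_autocorrelation(series)
print(phi_hat)
```

The long synthetic series is used only so that the estimate is stable; the 144-month records of the study would give a noisier estimate.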

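The seasonal component *S*_{1t} is removed by calculating monthly anomalies; the level shift *δ* and the trend *ω*_{1} are invariant under the calculation of anomalies. Thus, the deseasonalized GOME–SCIAMACHY water vapor is now modeled by

$$Y_{1t} = \mu_1 C_{1t} + \omega_1 X_{1t} + \delta U_t + N_{1t}. \tag{4}$$

Finally, the noise *N*_{1t} is transformed to white noise *ε*_{1t} using the AR(1). The autocorrelations are accommodated by the variables in Eq. (4) (now indicated with the asterisk superscript), thus we observe

$$Y^{*}_{1t} = \mu_1 C^{*}_{1t} + \omega_1 X^{*}_{1t} + \delta U^{*}_{t} + \varepsilon_{1t}, \tag{5}$$

with, for example, *Y*^{*}_{1t} = *Y*_{1t} − *ϕ*_{1}*Y*_{1t−1}.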
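The transformation to white noise and the subsequent least squares fit can be sketched in a few lines of Python. The sketch assumes that the time variable is measured in years (*X*_{t} = *t*/12), uses a hypothetical AR(1) coefficient, and fits noise-free synthetic anomalies so that the recovered parameters can be checked exactly; the helper names `prewhiten`, `solve`, and `least_squares` are illustrative and not part of the original method description.

```python
def prewhiten(values, phi):
    """AR(1) transformation v*_t = v_t - phi * v_{t-1} (the first point is dropped)."""
    return [values[t] - phi * values[t - 1] for t in range(1, len(values))]


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x


def least_squares(columns, y):
    """Ordinary least squares via the normal equations (columns are the regressors)."""
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    return solve(gram, rhs)


T1, T0, phi = 144, 85, 0.3            # series length, level-shift month, assumed AR(1) coefficient
mu, omega, delta = 2.0, 0.01, -0.05   # hypothetical offset, trend per year, level shift
C = [1.0] * T1
X = [t / 12.0 for t in range(1, T1 + 1)]                 # time in years (assumption)
U = [0.0 if t < T0 else 1.0 for t in range(1, T1 + 1)]   # step function
Y = [mu * c + omega * x + delta * u for c, x, u in zip(C, X, U)]  # noise-free anomalies

columns = [prewhiten(v, phi) for v in (C, X, U)]
mu_hat, omega_hat, delta_hat = least_squares(columns, prewhiten(Y, phi))
print(mu_hat, omega_hat, delta_hat)
```

Because the same linear transformation is applied to the data and to every regressor, the fitted offset, trend, and level shift keep their original meaning.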

Comparing the trend results derived here with those shown in Mieruch et al. (2008) results in small differences, indicating that the expansion of the data to 2007 influences the trends very slightly.

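The trend *ω*_{1}, its standard error, and the standard deviation of the noise *σ*_{1} have been determined within the regression procedure. Note that the standard error refers to the standard deviation divided by the square root of the sample size. Since we are interested in trends, only the error of the trend and the standard deviation of the noise are needed for the hypothesis testing. The trend model for the radiosonde measurements,

$$Y^{*}_{2t} = \mu_2 C^{*}_{2t} + \omega_2 X^{*}_{2t} + \varepsilon_{2t}, \tag{6}$$

is treated in the same way, and the trend *ω*_{2}, its standard error, and the standard deviation of the noise *σ*_{2} have been estimated. For the approximation of the Bayesian method, shown in section 4e, the standard deviation of the noise from the pooled data with a single trend is needed, which can be obtained by applying the least squares regression to the pooled data [Eq. (7)]. Biases between the two datasets are accounted for by the individual offsets *μ*_{p1} and *μ*_{p2} in Eq. (7). Here, *l*_{i} are the respective lengths of the time series.

### b. Welch test applied to trends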

The Welch test (Welch 1947) is an adaptation of the Student’s *t* test for independent samples; thus we assume the independence of satellite and radiosonde measurements. This is a reasonable assumption regarding the differences between the datasets, which have been introduced in section 2. In the following, the Welch test is applied to trends from time series. The null hypothesis *H*_{0}: *d* = *ω*_{1} − *ω*_{2} = 0 postulates that the difference of the two trends is equal to zero, whereas the alternative hypothesis is *H*_{1}: *d* = *ω*_{1} − *ω*_{2} ≠ 0. The standard error *σ*_{d} of the difference *d* is obtained from the standard errors of the trends *ω*_{i}, where *i* = 1, 2.

The *t* statistic *t*_{0} is then given by Eq. (12). Accordingly, the *t* distribution with the degrees of freedom given by Eq. (13) [the Welch–Satterthwaite equation (Satterthwaite 1946)] has to be integrated from *t*_{0} to ∞. The result has to be multiplied by the factor 2 because no prior information on the sign of *d* exists, which requires a two-tailed test. Finally, the exceedance probability *P*(*t* > *t*_{0}|*H*_{0}) is derived, where *t*_{0} is the result of Eq. (12) calculated from the data. The integrals of the *t* distribution are available in several high-level programming languages such as Octave (http://www.gnu.org/software/octave/).
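A compact sketch of such a test in Python is given below. It assumes the textbook Welch form, in which the standard errors of the two trends are combined in quadrature and the degrees of freedom follow the Welch–Satterthwaite expression; the per-series degrees of freedom `nu1` and `nu2`, the function names, and the input numbers are hypothetical, and the exact expressions of the original equations may differ. The *t* density is integrated numerically with Simpson's rule so that only the standard library is needed.

```python
import math


def t_pdf(t, nu):
    """Probability density of the Student t distribution with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(t * t / nu))


def two_tailed_p(t0, nu, steps=2000):
    """Two-tailed exceedance probability: twice the integral of the density from |t0| to infinity."""
    upper = abs(t0)
    if upper == 0.0:
        return 1.0
    # Simpson's rule on [0, |t0|]; steps must be even.
    h = upper / steps
    total = t_pdf(0.0, nu) + t_pdf(upper, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    central = total * h / 3.0
    return max(0.0, 1.0 - 2.0 * central)


def welch_trend_test(omega1, se1, nu1, omega2, se2, nu2):
    """Welch-type test of H0: omega1 - omega2 = 0 (assumed textbook form)."""
    d = omega1 - omega2
    sigma_d = math.sqrt(se1 ** 2 + se2 ** 2)               # standard error of the difference
    t0 = d / sigma_d
    nu = sigma_d ** 4 / (se1 ** 4 / nu1 + se2 ** 4 / nu2)  # Welch-Satterthwaite
    return t0, nu, two_tailed_p(t0, nu)


# Hypothetical trend pair (g cm^-2 yr^-1) with equal standard errors.
t0, nu, p = welch_trend_test(0.01, 0.005, 100, 0.0, 0.005, 100)
print(t0, nu, p)
```

For a normalized trend difference of 0.14 and an assumed 200 degrees of freedom, `two_tailed_p(0.14, 200)` gives a probability of about 0.89, so a small normalized difference leaves the null hypothesis far from rejection.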

### c. Bayesian model intercomparison

In the following, a Bayesian method to compare trends in time series is presented. The Bayesian model selection for the difference of trends is based on the works of Bretthorst (1993) and Sivia and Skilling (2006) who estimate the difference of means and standard deviations between two sets of data. For this study, the methods have been extended to compare trends.

We set up two hypotheses:

*A*: Both datasets have a common (unknown) trend *ω*.

*B*: The two datasets have individual (unknown) trends *ω*_{1} and *ω*_{2}.

Note that the magnitudes of the trends do not matter. Hypothesis *A* corresponds to Eq. (7), while hypothesis *B* corresponds to Eqs. (5) and (6).

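The posterior probability of hypothesis *A*, given the respective data using the Bayes theorem, is estimated:

$$P(A \mid \mathbf{D}_1, \mathbf{D}_2, I) = \frac{P(\mathbf{D}_1, \mathbf{D}_2 \mid A, I)\, P(A \mid I)}{P(\mathbf{D}_1, \mathbf{D}_2 \mid I)},$$

where the **D**_{1} and **D**_{2} represent the two datasets, and *I* describes certain relevant background information.

Since the values of the parameters **p**_{1} = (*μ*_{p1}, *μ*_{p2}, *ω*_{p}, *δ*_{p}, *σ*_{p1}, *σ*_{p2}) from Eq. (7) are irrelevant, we can use the marginalization rule (cf. Sivia and Skilling 2006 and Bretthorst 1993) and integrate

$$P(A \mid \mathbf{D}_1, \mathbf{D}_2, I) = \int P(A, \mathbf{p}_1 \mid \mathbf{D}_1, \mathbf{D}_2, I)\, d\mathbf{p}_1.$$

Assuming logical independence of the prior probabilities of the hypothesis *A* and the parameters **p**_{1}:

$$P(A \mid \mathbf{D}_1, \mathbf{D}_2, I) = \frac{P(A \mid I) \int P(\mathbf{p}_1 \mid I)\, P(\mathbf{D}_1, \mathbf{D}_2 \mid A, \mathbf{p}_1, I)\, d\mathbf{p}_1}{P(\mathbf{D}_1, \mathbf{D}_2 \mid I)}. \tag{15}$$

In the same way as in Eq. (15) the posterior for hypothesis *B* is derived:

$$P(B \mid \mathbf{D}_1, \mathbf{D}_2, I) = \frac{P(B \mid I) \int P(\mathbf{p}_2 \mid I)\, P(\mathbf{D}_1, \mathbf{D}_2 \mid B, \mathbf{p}_2, I)\, d\mathbf{p}_2}{P(\mathbf{D}_1, \mathbf{D}_2 \mid I)}, \tag{19}$$

with **p**_{2} = [*μ*_{1}, *μ*_{2}, *ω*_{1}, *ω*_{2}, *δ*, *σ*_{1}, *σ*_{2}] from Eqs. (5) and (6), and *P*(**D**_{1}, **D**_{2}|*I*) is the evidence (cf. section 3), in this case:

$$P(\mathbf{D}_1, \mathbf{D}_2 \mid I) = P(A \mid I) \int P(\mathbf{p}_1 \mid I)\, P(\mathbf{D}_1, \mathbf{D}_2 \mid A, \mathbf{p}_1, I)\, d\mathbf{p}_1 + P(B \mid I) \int P(\mathbf{p}_2 \mid I)\, P(\mathbf{D}_1, \mathbf{D}_2 \mid B, \mathbf{p}_2, I)\, d\mathbf{p}_2.$$

Here *P*(*A*|*I*) is the prior probability for hypothesis *A*. As there is no reason to prefer either this hypothesis or the alternative *P*(*B*|*I*), we assign both with the probability 0.5; thus they cancel out in the ratios given in Eqs. (15) and (19).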

The prior probabilities of the parameters *P*(**p**_{1}|*I*) and *P*(**p**_{2}|*I*) in (15) and (19) do not have to be integrated because they are independent from the parameters themselves and are realized by choosing them as bounded priors in the form of fully normalized uniform distributions. For each parameter *p* in **p**_{i},

$$P(p \mid I) = \begin{cases} \dfrac{1}{p_{\max} - p_{\min}}, & p_{\min} \le p \le p_{\max}, \\ 0, & \text{otherwise}. \end{cases}$$

That is, it is assumed that all **p**_{i} in the interval [**p**_{i min}, **p**_{i max}] have the same probability. All prior probabilities, except the trend priors, occur in the numerator and denominator of Eqs. (15) and (19), respectively; thus they cancel out. The priors of the pooled trend and the separate trends are chosen as *P*(*ω*|*I*) := *P*(*ω*_{p}|*I*) = *P*(*ω*_{1}|*I*) = *P*(*ω*_{2}|*I*) with

$$P(\omega \mid I) = \frac{1}{\omega_{\max} - \omega_{\min}}, \qquad \omega_{\min} \le \omega \le \omega_{\max}. \tag{25}$$

This prior information provides the probabilistic analysis and interpretation of the results. If the range of possible trends is increased, larger differences of trends are probable and vice versa. A sensitivity analysis on the trend priors is given in section 4d. Fortunately, the trend study of Mieruch et al. (2008) provides beneficial information on the range of the trends. The boundaries for the three trend priors in Eq. (25) are chosen as *ω*_{min} = −0.1 and *ω*_{max} = +0.1 g cm^{−2} yr^{−1}. This trend range comprises more than 99.9% of all water vapor trends for the time span 1996 to 2006 and is a lower bound. Any decrease of the trend range would result in a truncation of the probability space. Finally, another trend prior cancels out in each of the ratios of Eqs. (15) and (19).

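The datasets **D**_{1} and **D**_{2} from satellite and radiosonde measurements are assumed to be independent on a noise basis as well:

$$P(\mathbf{D}_1, \mathbf{D}_2 \mid A, \mathbf{p}_1, I) = P(\mathbf{D}_1 \mid A, \mu_{p1}, \omega_p, \delta_p, \sigma_{p1}, I)\, P(\mathbf{D}_2 \mid A, \mu_{p2}, \omega_p, \sigma_{p2}, I)$$

and

$$P(\mathbf{D}_1, \mathbf{D}_2 \mid B, \mathbf{p}_2, I) = P(\mathbf{D}_1 \mid B, \mu_1, \omega_1, \delta, \sigma_1, I)\, P(\mathbf{D}_2 \mid B, \mu_2, \omega_2, \sigma_2, I).$$

A Gaussian likelihood is assumed such that the residuals *ε*_{1t}, *ε*_{2t}, and *ε*_{pt} are normally distributed. The **D**_{1} comprise *l*_{1} independent measurements {*D*_{1t}}, and the **D**_{2} comprise *l*_{2} independent measurements {*D*_{2t}}, leading to

$$P(\mathbf{D}_1 \mid B, \mu_1, \omega_1, \delta, \sigma_1, I) = (2\pi\sigma_1^2)^{-l_1/2} \exp\left(-\frac{1}{2\sigma_1^2} \sum_{t=1}^{l_1} \varepsilon_{1t}^2\right)$$

and

$$P(\mathbf{D}_2 \mid B, \mu_2, \omega_2, \sigma_2, I) = (2\pi\sigma_2^2)^{-l_2/2} \exp\left(-\frac{1}{2\sigma_2^2} \sum_{t=1}^{l_2} \varepsilon_{2t}^2\right),$$

with the residuals *ε*_{1t} and *ε*_{2t} of Eqs. (5) and (6), and analogously for hypothesis *A* with the pooled parameters. The asterisks have been dropped for convenience, but the transformed data remain addressed (cf. section 4a).

The remaining integrals in Eqs. (30) and (31) are evaluated numerically with the DEMC algorithm, using [...] = 10^{5} and nos = 10^{5}.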

### d. Sensitivity analysis

An important step in Bayesian analysis is the choice of prior information. As shown above, we have chosen the prior for the trend parameter as a fully normalized uniform distribution in the range from *ω*_{min} = −0.1 to *ω*_{max} = +0.1 g cm^{−2} yr^{−1}; hence the prior range is Δ*ω* = 0.2 g cm^{−2} yr^{−1} and the prior density is *P*(*ω*|*I*) = 1/Δ*ω* = 5. Because hypothesis *B* contains one more trend parameter than hypothesis *A*, a wider prior range penalizes the more complex hypothesis *B*. This circumstance is known as *Ockham’s razor*, a principle that recommends the selection of an accurate theory or model having the fewest assumptions and postulates when multiple competing theories are equal in describing respective phenomena. Ockham’s razor is naturally implemented in the Bayesian concept in such a way that a theory is automatically penalized for every additional parameter.

We can qualitatively derive the *Ockham factor*, which is also shown in Sivia and Skilling (2006) and Dose and Menzel (2004). If model *B* is the more complex hypothesis and model *A* is the simpler one, there is one more dimension to integrate over, denoted as *ω*_{2}, in the case of model *B*. This contribution to the integral is proportional to the width of the probability density function *P*(*B*|**D**_{1}, **D**_{2}, *I*) in this direction (denoted as *δω*_{2}). With *P*(*ω*_{2}|*I*) = 1/Δ*ω*_{2}, we see that the Ockham factor is ≈*δω*_{2}/Δ*ω*_{2}. This ratio is typically smaller than unity and thus penalizes model *B* for its additional parameter.

On the one hand, prior information is a great advantage of Bayesian methods; however, on the other hand, the prior information represents a certain degree of subjectivity influencing the results. A sensitivity analysis with respect to the trend prior has been undertaken. Figure 1 depicts the results, where we have chosen an exemplary situation in which the probability for hypothesis *A* yields *P*(*A*|**D**_{1}, **D**_{2}, *I*) = 0.7 and the probability for hypothesis *B* yields *P*(*B*|**D**_{1}, **D**_{2}, *I*) = 0.3 using our chosen trend prior. These results are depicted in Fig. 1 as black and gray points. The *x* axis in Fig. 1 represents a change of the prior probability *P*(*ω*|*I*) = 5 in percent. For instance, a change of 10% corresponds to an increase of the prior to *P*(*ω*|*I*) = 5.5 and accordingly to a decrease of the trend prior range of *ω*_{min} = −0.09 and *ω*_{max} = +0.09 g cm^{−2} yr^{−1}. As can be seen from Fig. 1, *P*(*A*|**D**_{1}, **D**_{2}, *I*) decreases with increasing *P*(*ω*|*I*) and decreasing Δ*ω*, and vice versa for *P*(*B*|**D**_{1}, **D**_{2}, *I*). As mentioned in section 4c, the choice of our trend range constitutes a lower bound; thus in principle only an increase of the trend range would make sense, avoiding any truncation of the probability space. This would decrease *P*(*ω*|*I*) and, as can be seen in Fig. 1, increase *P*(*A*|**D**_{1}, **D**_{2}, *I*)—namely, the probability of a common trend. However, in a meaningful range of deviations from our prior, within ±20%, the results are quite insensitive. For unrealistically large deviations of the prior from our choice, large changes of the results are observed. This is shown in the small embedded figure of Fig. 1. In conclusion, a reasonable prior has been selected, and our results are insensitive to changes of this prior—at least in a range of ±20%.
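The dependence on the trend prior can be made concrete with a short Python sketch. It assumes that the likelihood integrals do not change when the prior bounds are moved, so that the posterior odds of *A* against *B* scale inversely with the uniform prior density *P*(*ω*|*I*), because hypothesis *B* carries one more trend prior than hypothesis *A*. The function names are hypothetical, and the scaling is a simplification that need not reproduce Fig. 1 in detail.

```python
def uniform_prior_density(omega_min, omega_max):
    """Fully normalized uniform prior density on [omega_min, omega_max]."""
    return 1.0 / (omega_max - omega_min)


def posterior_common_trend(p_a_ref, prior_ref, prior_new):
    """Rescale P(A | D1, D2, I) when the trend prior density changes.

    Assumes posterior odds A:B proportional to 1 / P(omega | I).
    """
    odds_ref = p_a_ref / (1.0 - p_a_ref)
    odds_new = odds_ref * prior_ref / prior_new
    return odds_new / (1.0 + odds_new)


prior_ref = uniform_prior_density(-0.1, 0.1)                    # reference prior, density 5
p_plus10 = posterior_common_trend(0.7, prior_ref, 1.1 * prior_ref)
p_wide = posterior_common_trend(0.7, prior_ref, uniform_prior_density(-0.5, 0.5))
print(prior_ref, p_plus10, p_wide)
```

Under these assumptions a 10% larger prior density lowers an exemplary probability of 0.7 only slightly, whereas widening the range to ±0.5 g cm^{−2} yr^{−1} raises it above 0.9.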

### e. Analytical approximation

DEMC is a sophisticated and powerful algorithm that goes beyond what is implemented in standard computational programming languages or packages. The disadvantage of DEMC is the need for large computational power. Sivia and Skilling (2006) have derived an approximation for a Bayesian method, which compares means and standard deviations of data. This approximation is adapted to the method for trend comparison shown in the following.

We consider the logarithm of the likelihood function *L*_{A} = log_{e}[*P*(**D**_{1}|*A*, *μ*_{p1}, *ω*_{p}, *δ*_{p}, *σ*_{p1}, *I*) *P*(**D**_{2}|*A*, *μ*_{p2}, *ω*_{p}, *σ*_{p2}, *I*)] with a maximum at ∂*L*_{A}/∂**p**_{1} = 0, solving a set of linear equations in a least squares sense. Thus, we can use the parameters [...]

The second term in Eq. (32) contains the vector [...] *L*_{A} evaluated at [...] *A*, exponentiating *L*_{A}, is [...] *M*-dimensional Gaussian by [...], thus Eq. (34) becomes [...]. This analysis yields [...]

The posterior probability of hypothesis *B*, stating that the time series have individual trends *ω*_{1} and *ω*_{2}, is now derived. The procedure is identical to the previous derivations, using the quadratic Taylor series expansion of the logarithmic likelihood function Eq. (27) [...]. The quantities [...] **K**′ [...], and [...]_{B} [...]

The posterior probability of hypothesis *B* is also proportional to the prior probability of the trends, which is chosen in the same way as in Eq. (25). Eqs. (42) and (43) are analytical functions, which can quite easily be computed, in contrast to Eqs. (30) and (31), which can only be solved numerically.

## 5. Results

The comparison methods for trends in time series described above, that is, the Welch test and the Bayesian model selection, have been applied to measured trends from satellite and radiosonde monthly mean water vapor data. The trends have been calculated using the methods described in section 4a. For the comparison a quality criterion is required, that is, both time series have to contain at least two-thirds of the monthly mean measurements over the time span from January 1996 to December 2007, that is, at least 96 data points from the maximal 144. This constraint assures that the trends are representative for the period investigated and less susceptible to possible outliers.

Figure 2a depicts the results of the Welch test. The probabilities *P*(*t* > *t*_{0}|*H*_{0}), with null hypothesis *H*_{0}: *d* = *ω*_{1} − *ω*_{2} = 0 and the difference of the trends *d*, for the 187 trend pairs are plotted versus *d*. High probabilities are observed for small trend differences, while lower probabilities are found for large trend differences, as expected. Figure 2b shows the *P*(*t* > *t*_{0}|*H*_{0}) plotted versus the trend differences normalized to the error of the difference. From the definition of the Welch test it is clear that *P*(*t* > *t*_{0}|*H*_{0}) is totally determined by (*ω*_{1} − *ω*_{2})/*σ*_{d}, where *ω*_{1} is the GOME–SCIAMACHY trend and *ω*_{2} is the radiosonde trend. In the sense of the frequentist interpretation, the null hypothesis for a single test would be rejected if *P*(*t* > *t*_{0}|*H*_{0}) < *α* = 0.05, which would apply in 20 of 187 cases (~11%). Hence, in about 89% of the tests we cannot reject the null hypothesis. However, we are dealing with 187 independent tests, so a multiple testing procedure may be more appropriate. A common approach used in multiple comparisons is the *Bonferroni* correction (see, e.g., Perneger 1998), which decreases the significance level *α*. The reason for this correction is that the expected number of significant test results that occur by chance when performing *n* tests is up to *nα*. The Bonferroni correction would yield a significance level:

$$\frac{\alpha}{n} = \frac{0.05}{187} \approx 2.7 \times 10^{-4}.$$

This level applies to the *general* null hypothesis that all null hypotheses are true simultaneously, which has to be rejected if one or more *p* values are smaller than the significance level *α*/*n*. [...] *P*(*t* > *t*_{0}|*H*_{0}) from the 187 tests is smaller than [...]
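The Bonferroni rule for the general null hypothesis can be written down in a few lines of Python. The function names and the short lists of *p* values are hypothetical; only the significance level 0.05 and the number of tests, 187, are taken from the text.

```python
def bonferroni_level(alpha, n_tests):
    """Bonferroni-corrected significance level alpha / n."""
    return alpha / n_tests


def rejects_general_null(p_values, alpha=0.05):
    """True if at least one p value falls below the Bonferroni-corrected level."""
    level = bonferroni_level(alpha, len(p_values))
    return any(p < level for p in p_values)


level = bonferroni_level(0.05, 187)
print(level)

# Hypothetical p values: the corrected level for three tests is 0.05 / 3.
print(rejects_general_null([0.03, 0.2, 0.5]))    # 0.03 is not below 0.05 / 3
print(rejects_general_null([0.001, 0.2, 0.5]))   # 0.001 is below 0.05 / 3
```

A single-test rejection at *α* = 0.05 therefore does not by itself imply rejection of the general null hypothesis.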

The 187 probabilities of a common trend *P*(*A*|**D**_{1}, **D**_{2}, *I*) from the exact Bayesian model selection (for each trend pair) are plotted versus the difference of the trends in Fig. 2c and versus the trend difference normalized to the error of the difference (*ω*_{1} − *ω*_{2})/*σ*_{d} in Fig. 2d. Additionally, the results from the approximation of the Bayesian method are shown in Figs. 2e and 2f. High probabilities for a common trend are found for small trend differences, whereas the probability is low for large trend deviations, as in the case of the Welch test. The approximation slightly overestimates the exact probabilities and the mean relative difference is *O*(10%), but the general results from the exact method and the approximation are very similar; thus the use of the approximation is recommended for monthly mean water vapor trend comparison if sophisticated algorithms like DEMC are not available or large datasets have to be analyzed in a short period of time.

To judge the probabilities of different hypotheses, Jeffreys (1939) introduced the scale presented in Table 1. Here *P*(*A*) and *P*(*B*) denote the respective probabilities of the hypotheses, where a value of log_{10}[*P*(*A*)/*P*(*B*)] = 1 means that hypothesis *A* is 10 times more probable than hypothesis *B*.

Table 1. Judgment of evidence against hypothesis *B* according to Jeffreys (1939).

According to the Jeffreys scale (Table 1), the evidence against hypothesis *B* is substantial if the base-10 logarithm of the so-called Bayes factor, here *P*(**D**_{1}, **D**_{2}|*A*, *I*)/[*P*(**D**_{1}, **D**_{2}|*B*, *I*) · *p*(*ω*|*I*)], is larger than 0.5 and smaller than 1, which corresponds to 0.76 < *P*(*A*|**D**_{1}, **D**_{2}, *I*) < 0.91; correspondingly, the evidence against hypothesis *A* is substantial if 0.09 < *P*(*A*|**D**_{1}, **D**_{2}, *I*) < 0.24. Using the exact method, hypothesis *A* is substantially preferred in 49 cases and hypothesis *B* in 9 cases. When the approximation is used, *A* is substantially preferred in 114 cases and *B* in 5 cases. The evidence against *B* is strong to decisive if *P*(*A*|**D**_{1}, **D**_{2}, *I*) > 0.91, which is true in zero cases for the exact solution and in 10 cases for the approximation. Strong to decisive evidence against *A* requires *P*(*A*|**D**_{1}, **D**_{2}, *I*) < 0.09, which is observed 3 times in the exact case and 2 times for the approximation. A rigorous application of the Bayesian model selection would prefer hypothesis *A* if *P*(*A*|**D**_{1}, **D**_{2}, *I*) > 0.5, which is true in 153 of 187 cases (82%) for the exact method and in 165 cases (88%) for the approximation. In Figs. 2c–f, distinct clusters of data points are found at probabilities between 0.7 and 0.9. These are mostly classified as substantially supporting hypothesis *A* of a common trend. Since this is true for both the exact Bayesian method and the approximation, generally similar conclusions are drawn. Nevertheless, since large differences between the exact and the approximate method have been observed for single time series, significant conclusions can only be drawn at the ensemble level.

Furthermore, as for the Welch test, the issue of multiple testing also applies to the Bayesian model selection. Westfall et al. (1997) (and citations therein) suggest a perspective of a Bayesian Bonferroni correction, which acts on the prior information. Accordingly, the bounds of our trend prior *P*(*ω*|*I*) would be changed to *ω*_{min} = −0.5 and *ω*_{max} = +0.5 g cm^{−2} yr^{−1}, which increases *P*(*A*|**D**_{1}, **D**_{2}, *I*) decisively and hence supports hypothesis *A* in general, as shown in Fig. 3.
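The correspondence between the Jeffreys thresholds and the posterior probabilities quoted above can be checked numerically. The following minimal Python sketch assumes that the logarithm is taken to base 10 and that *A* and *B* are the only two competing hypotheses, so that *P*(*A*|**D**_{1}, **D**_{2}, *I*) = *K*/(1 + *K*) for a Bayes factor *K*. The function name is illustrative and not part of the original analysis.

```python
def posterior_from_log10_bayes_factor(log10_k: float) -> float:
    """Posterior probability of hypothesis A for two exhaustive hypotheses.

    Assumption: K = 10**log10_k is the posterior odds P(A|D)/P(B|D),
    hence P(A|D) = K / (1 + K).
    """
    k = 10.0 ** log10_k
    return k / (1.0 + k)


# Thresholds of the "substantial" category on the Jeffreys scale
lower = posterior_from_log10_bayes_factor(0.5)
upper = posterior_from_log10_bayes_factor(1.0)
print(f"{lower:.2f} {upper:.2f}")  # prints: 0.76 0.91
```

By symmetry, the bounds for substantial evidence against *A* follow as 1 − 0.91 = 0.09 and 1 − 0.76 = 0.24.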

In the following, examples of GOME–SCIAMACHY and radiosonde water vapor time series are analyzed. Figure 4a shows the deseasonalized GOME–SCIAMACHY and radiosonde monthly mean water vapor columns together with their linear trends from Nottingham, England. For visual presentation the GOME–SCIAMACHY level shift has been removed. The human visual system is quite sophisticated in the identification of diverse patterns and also in comparing trends. From Fig. 4a it is clear that the trend difference is small and, indeed, the trends are nearly equal with *ω*_{1} − *ω*_{2} = 0.001 g cm^{−2} yr^{−1} and (*ω*_{1} − *ω*_{2})/*σ*_{d} = 0.14 (cf. Fig. 2). The Welch test gives a probability of *P*(*t* > *t*_{0}|*H*_{0}) = 0.89, so the null hypothesis cannot be rejected.
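To make the Welch probability more concrete, the following Python sketch computes the normalized trend difference, the Welch–Satterthwaite degrees of freedom, and a two-sided tail probability of Student's *t* distribution. It is a sketch under stated assumptions rather than the procedure used in this study: the function names, the input values, the per-series degrees of freedom, and the two-sided convention are illustrative, and the tail probability is obtained by a simple Simpson integration so that only the standard library is needed.

```python
import math


def student_t_two_sided_p(t0: float, dof: float, steps: int = 2000) -> float:
    """Two-sided tail probability P(|t| > |t0|) of Student's t distribution.

    The density is integrated from 0 to |t0| with Simpson's rule
    (steps must be even).
    """
    a = abs(t0)
    if a == 0.0:
        return 1.0
    # Logarithm of the normalization constant of the t density
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))

    def pdf(x: float) -> float:
        return math.exp(log_norm - (dof + 1) / 2 * math.log1p(x * x / dof))

    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # integral of the density over [0, |t0|]
    return max(0.0, 1.0 - 2.0 * central)


def welch_trend_test(w1, se1, dof1, w2, se2, dof2):
    """Welch test for the difference of two trends with unequal variances.

    w1, w2:     estimated trends
    se1, se2:   standard errors of the trends
    dof1, dof2: degrees of freedom of each trend estimate (assumed known)
    Returns (t0, effective degrees of freedom, two-sided p value).
    """
    var_d = se1 ** 2 + se2 ** 2            # variance of the trend difference
    t0 = (w1 - w2) / math.sqrt(var_d)      # normalized trend difference
    dof = var_d ** 2 / (se1 ** 4 / dof1 + se2 ** 4 / dof2)  # Welch-Satterthwaite
    return t0, dof, student_t_two_sided_p(t0, dof)


# Hypothetical inputs (not taken from the datasets of this study)
t0, dof, p = welch_trend_test(0.020, 0.008, 120, 0.012, 0.006, 100)
print(f"t0 = {t0:.2f}, dof = {dof:.1f}, p = {p:.3f}")
```

A large *p* value, as for the Nottingham example, means that the observed normalized difference is well within what is expected under the null hypothesis of equal trends.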

The Bayesian hypothesis *B* is visualized schematically in Fig. 4a by modeling the data with two trends. Hypothesis *A* is illustrated in Fig. 4b by pooling the data and applying a single trend. For visual presentation the offsets of GOME–SCIAMACHY and radiosonde data have been removed. From the Bayesian point of view hypothesis *A* is substantially preferred with *P*(*A*|**D**_{1}, **D**_{2}, *I*) = 0.82. The approximation method gives *P*_{approx}(*A*|**D**_{1}, **D**_{2}, *I*) = 0.90. Hence, for small trend differences, both the frequentist and the Bayesian concepts yield quite large probabilities for the respective tests.

Low probabilities are found, for example, at Albany Airport in Australia. The time series are shown in Figs. 4c and 4d. The visual inspection definitely classifies the trends as different. The trend difference is actually *ω*_{1} − *ω*_{2} = 0.04 g cm^{−2} yr^{−1} and after normalization it is (*ω*_{1} − *ω*_{2})/*σ*_{d} = 3.2. The Welch test gives a probability of *P*(*t* > *t*_{0}|*H*_{0}) = 0.002 (rejection of the null hypothesis). The exact Bayesian finds *P*(*A*|**D**_{1}, **D**_{2}, *I*) = 0.02 and the approximation yields *P*_{approx}(*A*|**D**_{1}, **D**_{2}, *I*) = 0.04 (preferring *B*). Thus, low probabilities are found for large trend differences by both statistical methods.

Different probabilities for the frequentist test and the Bayesian method, as can be seen from Fig. 2, are found in the range between small and large trend differences. As an example, a pair of water vapor time series from Meiningen, Germany, is chosen with a trend difference of *ω*_{1} − *ω*_{2} = 0.014 g cm^{−2} yr^{−1} and a normalized trend difference of (*ω*_{1} − *ω*_{2})/*σ*_{d} = 1.3. The probability derived from the Welch test yields *P*(*t* > *t*_{0}|*H*_{0}) = 0.19, which is small, but not small enough to reject the null hypothesis. The exact Bayesian gives *P*(*A*|**D**_{1}, **D**_{2}, *I*) = 0.89, and the approximation is *P*_{approx}(*A*|**D**_{1}, **D**_{2}, *I*) = 0.75. Here it again has to be mentioned that both methods (Welch test/Bayes) reveal different probabilities and are both correct under the respective frameworks of the frequentist philosophy and the Bayesian concept. The exact probabilities of the frequentist and Bayesian method differ; however, the conclusions are nevertheless similar.

The GOME–SCIAMACHY trends are plotted in Fig. 5, in which the 187 radiosonde trends are embedded as circles with black, gray, or white borders. The circles are filled with the color that represents the magnitude of the respective radiosonde trend, according to the same color bar as used for the GOME–SCIAMACHY data. The borders of the circles indicate the Bayesian posterior probabilities *P*(*A*|**D**_{1}, **D**_{2}, *I*) at a specific geolocation. A black border indicates a probability ≤0.5, which means that hypothesis *B* is preferred. Seven of the 34 black-bordered circles are covered by other circles and cannot be seen in the figure. A gray-bordered circle represents probabilities >0.5 and ≤0.76, where hypothesis *A* is favored (104 circles), and a white border indicates that hypothesis *A* is substantially preferred with Bayesian probabilities >0.76 (49 circles).
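The border coding described for Fig. 5 amounts to a simple threshold rule on the posterior probability. The sketch below restates it in Python, using the probabilities quoted in the text for four of the stations; the function name and the dictionary are illustrative only.

```python
def border_color(p_a: float) -> str:
    """Map the posterior probability P(A|D1, D2, I) to the border color of Fig. 5."""
    if p_a <= 0.5:
        return "black"   # hypothesis B (individual trends) is preferred
    if p_a <= 0.76:
        return "gray"    # hypothesis A (common trend) is favored
    return "white"       # hypothesis A is substantially preferred


# Exact Bayesian posterior probabilities quoted in the text
stations = {
    "Nottingham": 0.82,
    "Albany Airport": 0.02,
    "Meiningen": 0.89,
    "Minqin": 0.43,
}
for name, p_a in stations.items():
    print(f"{name}: P(A) = {p_a:.2f} -> {border_color(p_a)} border")
```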

One reason for discrepancies between satellite and radiosonde trends is the presence of data gaps in the radiosonde data. This has been observed, for example, at Minqin, China, shown in Figs. 4g and 4h. Radiosonde data are often missing in summer, especially in 2006 and 2007, when high water vapor was observed by SCIAMACHY. The Welch test gives *P*(*t* > *t*_{0}|*H*_{0}) = 0.01, which implies that the null hypothesis should be rejected. The exact Bayesian method gives *P*(*A*|**D**_{1}, **D**_{2}, *I*) = 0.43, hence preferring hypothesis *B*. Again, the individual probabilities of the two methods differ, but the conclusions are similar.

As mentioned in section 2b, one possibly important reason for discrepancies between the trends observed in satellite and radiosonde water vapor data is the different spatial resolution of the two instruments. Radiosondes can capture local events, whereas the satellite measurement is an average over a large area. This is illustrated in the following with an example from the west coast of Saudi Arabia. A zoom into this region is depicted in Fig. 6. Note that in Fig. 6 the color scale used for the GOME–SCIAMACHY and radiosonde trends is different from the one used in Fig. 5. Here a positive water vapor trend is observed by a radiosonde station located exactly at the city of Jeddah. The satellite trends in the near vicinity of the city are enhanced as well but are not as strong as the very localized radiosonde trend. However, it seems possible that the changes in the total water vapor column observed via satellite can be attributed with high probability to human activities, identifying Jeddah as a source of water vapor. Further, increasing water vapor is also observed exactly at the city of Asmara in Eritrea (cf. Fig. 6), which shows that urban areas, and hence anthropogenic influences on water vapor changes, can be detected using satellite observations. In the satellite measurements, the positive trends over Jeddah are smeared out over a larger region. This is a likely explanation for the relatively low probability of hypothesis *A* for this pair of trends, which is indicated by the gray-bordered circle.

## 6. Conclusions

In this manuscript, we have solved the Behrens–Fisher problem (same means, unequal unknown standard deviations) for the analysis of trends in water vapor time series derived by different methods, using two schools of statistics: the frequentist (standard) approach and the Bayesian concept. We have applied both methods to global water vapor datasets observed by satellite (GOME–SCIAMACHY) and radiosonde instruments.

To utilize and assess the value of frequentist statistics, the widely used Welch test has been applied to address the Behrens–Fisher problem. The individual null hypothesis *H*_{0}: *d* = *ω*_{1} − *ω*_{2} = 0, stating that the difference of the trends is equal to zero, could only be rejected in 10% of the 187 trend comparisons. Additionally, we have accounted for multiple testing by using the Bonferroni correction to test the general null hypothesis stating that all single null hypotheses are true, which could not be rejected. These results increase our confidence, in a climatological sense, that satellites and radiosondes measure the true changes of total column water vapor, assuming no long-term calibration issues for the radiosondes, negligible changes in cloud cover, and an adequate combination of the GOME and SCIAMACHY data.
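The logic of the Bonferroni correction for the general null hypothesis can be sketched in a few lines of Python. The significance level of 0.05 and the list of *p* values below are hypothetical placeholders, not the values of this study; only the number of comparisons (187) is taken from the text.

```python
def bonferroni_rejects_global_null(p_values, alpha=0.05):
    """Reject the general null hypothesis (all single null hypotheses are true)
    if the smallest p value does not exceed alpha divided by the number of tests."""
    m = len(p_values)
    return min(p_values) <= alpha / m


def count_individual_rejections(p_values, alpha=0.05):
    """Number of single null hypotheses rejected without any correction."""
    return sum(p <= alpha for p in p_values)


# Hypothetical example: 187 comparisons, 18 of them individually significant
p_values = [0.002] + [0.03] * 17 + [0.5] * 169
print(count_individual_rejections(p_values))     # individually rejected tests
print(bonferroni_rejects_global_null(p_values))  # decision on the general null
```

With 187 tests the corrected threshold is 0.05/187 ≈ 0.00027, so even an individual *p* value of 0.002 does not suffice to reject the general null hypothesis in this illustration.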

Concerning the Bayesian model selection, we estimated the probabilities for the hypotheses

*A*: Both datasets have a common (unknown) trend *ω*;

*B*: The two datasets have individual (unknown) trends *ω*_{1} and *ω*_{2}.

Most of the resulting posterior probabilities favor hypothesis *A* (cf. Fig. 2). This provides evidence supporting hypothesis *A*, namely, that the observed trends are common for the two datasets. Overall, the analysis of the datasets for the two different hypotheses of having either common or individual trends shows that the common trend is preferred in 153 cases of 187 (82%) and that generally hypothesis *A* has a larger probability than hypothesis *B*. We attribute the cases where the results are poorer to either an inadequate calibration of the long-term datasets from the radiosondes or local-scale changes, which are smoothed in the remote sounding dataset with its lower spatial resolution. In addition, the use of a Bonferroni-like correction in a Bayesian sense results in a clear preference of hypothesis *A*, thus hinting at a common trend in satellite and radiosonde measurements.

In conclusion, both approaches, the frequentist Welch test and the Bayesian model selection, yield similar results when used to compare water vapor trends from the two independent measurement sets on an ensemble basis, whether multiple testing is included or neglected. However, differences can occur when using the two tests to compare trends at a given location. The frequentist method is recommended, provided it is sufficient to work with a given null hypothesis and thereby infer the probability of a parameter and the associated significance test. If this is not the case, then we recommend the Bayesian approach, where it is possible to derive probabilities of competing hypotheses. From a climatological viewpoint, we have found evidence that the trends in the time series from both observing systems, radiosondes and satellites, are real and significant rather than statistical artifacts. The size of the trends is not discussed further because the time series are short and the observed trends therefore do not yet identify a climate signal.

## Acknowledgments

SCIAMACHY is a national contribution to the ESA *ENVISAT* project, funded by Germany, the Netherlands, and Belgium. SCIAMACHY data have been provided by ESA. Radiosonde data have been provided by DWD. This work has been funded in part by the University and state of Bremen, by the German Aerospace Center (DLR), by the EU (6th and 7th framework ACCENT and ACCENT plus projects), and by EUMETSAT in the framework of the Climate Monitoring (CM-SAF) part of the Satellite Application Facilities Network. We thank D. S. Sivia and G. L. Bretthorst for advice concerning prior probabilities. We also thank C. ter Braak for discussion on the DEMC algorithm. We acknowledge fruitful discussions with A. C. Davison and J. A. Freund during the 11th IMSC in Edinburgh. We thank Kirsten Schnülle for proofreading the manuscript. We gratefully acknowledge the critical comments of three anonymous referees, which have improved the manuscript significantly.

## APPENDIX

### Analytical Approximation—The Matrices

## REFERENCES

Bayes, T., 1763: An essay towards solving a problem in the doctrine of chances. *Philos. Trans. Roy. Soc.*, **53**, 330–418.

Behrens, W. V., 1929: Ein Beitrag zur Fehlerberechnung bei wenigen Beobachtungen. *Landwirtschaftliche Jahrbücher*, **68**, 807–837.

Bovensmann, H., J. P. Burrows, M. Buchwitz, J. Frerick, S. Noël, V. V. Rozanov, K. V. Chance, and A. H. P. Goede, 1999: SCIAMACHY–Mission objectives and measurement modes. *J. Atmos. Sci.*, **56**, 127–150.

Bretthorst, G. L., 1993: On the difference in means. *Physics and Probability*, Cambridge University Press, 177–194.

Burrows, J. P., and Coauthors, 1999: The Global Ozone Monitoring Experiment (GOME): Mission concept and first scientific results. *J. Atmos. Sci.*, **56**, 151–175.

de Laplace, P. S., 1812: *Théorie Analytique des Probabilités*. Courcier Imprimeur, 506 pp.

Dose, V., and A. Menzel, 2004: Bayesian analysis of climate change impacts in phenology. *Global Change Biol.*, **10**, 259–272.

Edelson, R. A., and J. H. Krolik, 1988: The discrete correlation function: A new method for analyzing unevenly sampled variability data. *Astrophys. J.*, **333**, 646–659.

Elliott, W. P., and D. J. Gaffen, 1991: On the utility of radiosonde humidity archives for climate studies. *Bull. Amer. Meteor. Soc.*, **72**, 1507–1520.

Fisher, R. A., 1937: The comparison of samples with possibly unequal variances. *Ann. Eugen.*, **9**, 174–180.

Garand, L., C. Grassotti, J. Halle, and G. L. Klein, 1992: On differences in radiosonde humidity-reporting practices and their implications for numerical weather prediction and remote sensing. *Bull. Amer. Meteor. Soc.*, **73**, 1417–1423.

Gilks, W. R., S. Richardson, and D. Spiegelhalter, 1995: *Markov Chain Monte Carlo in Practice*. Chapman & Hall/CRC, 512 pp.

Jaynes, E. T., and L. G. Bretthorst, 2003: *Probability Theory: The Logic of Science: Principles and Elementary Applications*. Vol. 1. Cambridge University Press, 758 pp.

Jeffreys, H., 1939: *Theory of Probability*. 3rd ed. Oxford University Press, 472 pp.

Kahn, B. H., A. Gettelman, E. J. Fetzer, A. Eldering, and C. K. Liang, 2009: Cloudy and clear-sky relative humidity in the upper troposphere observed by the A-train. *J. Geophys. Res.*, **114**, D00H02, doi:10.1029/2009JD011738.

Keeling, C. D., S. C. Piper, R. B. Bacastow, M. Wahlen, T. P. Whorf, M. Heimann, and H. A. Meijer, 2001: Exchanges of atmospheric CO2 and 13CO2 with the terrestrial biosphere and oceans from 1978 to 2000. Observations and carbon cycle implications. *A History of Atmospheric CO2 and Its Effects on Plants, Animals, and Ecosystems: I. Global Aspects*, J. R. Ehleringer et al., Eds., Springer, 83–113.

Lally, V. E., 1985: Upper air in situ observing systems. *Handbook of Applied Meteorology*, John Wiley & Sons, Inc., 352–360.

Lanzante, J. R., 2005: A cautionary note on the use of error bars. *J. Climate*, **18**, 3699–3703.

Lee, T. C. K., F. W. Zwiers, G. C. Hegerl, X. Zhang, and M. Tsao, 2005: A Bayesian climate change detection and attribution assessment. *J. Climate*, **18**, 2429–2440.

Liu, J. S., 2003: *Monte Carlo Strategies in Scientific Computing*. Springer, 360 pp.

Mieruch, S., 2010: Identification and statistical analysis of global water vapour trends based on satellite data. Ph.D. thesis.

Mieruch, S., S. Noël, H. Bovensmann, and J. P. Burrows, 2008: Analysis of global water vapour trends from satellite measurements in the visible spectral range. *Atmos. Chem. Phys.*, **8**, 491–504.

Moreno, E., F. Bertolino, and W. Racugno, 1999: Default Bayesian analysis of the Behrens-Fisher problem. *J. Stat. Plann. Infer.*, **81**, 323–333.

Noël, S., M. Buchwitz, and J. P. Burrows, 2004: First retrieval of global water vapour column amounts from SCIAMACHY measurements. *Atmos. Chem. Phys.*, **4**, 111–125.

Perneger, T. V., 1998: What’s wrong with Bonferroni adjustments. *BMJ*, **316**, 1236–1238.

Robert, C. P., and G. Casella, 2005: *Monte Carlo Statistical Methods*. Springer, 536 pp.

Satterthwaite, F. E., 1946: An approximate distribution of estimates of variance components. *Biom. Bull.*, **2**, 110–114, doi:10.2307/3002019.

Schlittgen, R., and B. H. J. Streitberg, 1997: *Zeitreihenanalyse*. Oldenbourg.

Schulz, J., and Coauthors, 2009: Operational climate monitoring from space: The EUMETSAT Satellite Application Facility on Climate Monitoring (CM-SAF). *Atmos. Chem. Phys.*, **9**, 1687–1709.

Sivia, D. S., and J. Skilling, 2006: *Data Analysis: A Bayesian Tutorial*. Oxford University Press, 208 pp.

Sohn, B. J., and R. Bennartz, 2008: Contribution of water vapor to observational estimates of longwave cloud radiative forcing. *J. Geophys. Res.*, **113**, D20107, doi:10.1029/2008JD010053.

ter Braak, C., 2006: A Markov Chain Monte Carlo version of the genetic algorithm differential evolution: Easy Bayesian computing for real parameter spaces. *Stat. Comput.*, **16**, 239–249.

Vaisala, 1989: RS 80 Radiosondes. Upper-Air Systems product information. Vaisala Inc. Reference R0422-2, 16 pp.

Weatherhead, E. C., and Coauthors, 1998: Factors affecting the detection of trends: Statistical considerations and applications to environmental data. *J. Geophys. Res.*, **103**, 17 149–17 161.

Welch, B. L., 1947: The generalization of “Student’s” problem when several different population variances are involved. *Biometrika*, **34**, 28–35.

Westfall, P. H., W. O. Johnson, and J. M. Utts, 1997: A Bayesian perspective on the Bonferroni adjustment. *Biometrika*, **84**, 419–427.