A comprehensive method is provided for smoothing noisy, irregularly sampled data with non-Gaussian noise using smoothing splines. We demonstrate how the spline order and tension parameter can be chosen a priori from physical reasoning. We also show how to allow for non-Gaussian noise and outliers that are typical in global positioning system (GPS) signals. We demonstrate the effectiveness of our methods on GPS trajectory data obtained from oceanographic floating instruments known as drifters.
In 2011 an array of floating ocean surface buoys (drifters) were deployed in the Sargasso Sea to assess the lateral diffusivity of oceanic processes (Shcherbina et al. 2015). Each drifter was equipped with a global positioning system (GPS) receiver recording locations every 30 min. Addressing the primary goal of understanding the processes controlling lateral diffusivity requires significant processing of the drifter positions, including removing mean flow, accounting for the large-scale strain field, and analyzing the residual spectra for hints of a dynamical process. However, it quickly became clear that the GPS position data, which can have accuracies as low as a few meters (Wide Area Augmentation System T and E Team 2016), were contaminated by outliers with position jumps of hundreds of meters or more. Prior to analysis, the position data require removing outliers as well as interpolating gaps to keep the position data synchronized in time across the drifter array.
The basic problem is ubiquitous: observations from GPS receivers return observed positions xi at times ti that differ from the true positions xtrue(ti) by some noise εi ≡ xi − xtrue(ti) with variance σ2. The goal of smoothing is to find the true position xtrue(ti) that is not contaminated by the noise, whereas the goal of interpolating is to find the true position xtrue(t) between observation times. The approach taken here is to use smoothing splines. This approach is relatively broad (Handcock et al. 1994; Nychka 2000), and is related to the methods in Yaremchuk and Coelho (2015) and Elipot et al. (2016) for smoothing drifter trajectories, as discussed later.
Our model for the “true” path x(t) is specified using interpolating B-splines XK(t) such that
where K is the order (degree S = K − 1) of the spline. For N observations we construct N B-splines such that x(ti) = xi for appropriately chosen coefficients ξi. To smooth the data, we choose new coefficients that minimize the penalty function
for some tension parameter λT ≥ 0. If λT = 0 then ϕ = 0 and because x(ti) = xi, but if λT → ∞ then this forces x(t) to a Tth-order polynomial (e.g., when T = 2, the model is forced to be a straight line because it has no second derivative). The resulting path x(t) is known as a smoothing spline and was first introduced in modern form by Reinsch (1967), but according to De Boor (1978) the idea dates back to Whittaker (1923). Once S and T are chosen, the smoothing spline has one free parameter (λT) and its optimal value can be found by minimizing the expected mean-square error when the true value of σ is known (Craven and Wahba 1978).
Three issues must be addressed before smoothing splines are applied to GPS data:
how to choose S and T—and how these choices affect the recovered power spectrum,
how to modify the spline fit to accommodate the non-Gaussian errors of GPS receivers, and
how to identify and remove outliers.
To address these issues but also to serve as a practical guide to other practitioners, we review B-splines in section 2 and introduce the canonical interpolating spline as the underlying model for path x(t) in (1). We demonstrate the effect that choosing S has on the high-frequency slope of the power spectrum of the interpolated fit.
Section 3 takes a broad look at smoothing splines and the assumptions they make on the underlying process. Many of the ideas presented in this section are known to the statistics community, so here we present these ideas from a more physical perspective. We show that the penalty function in (2) can be formulated as a maximum-likelihood problem and that applying tension is equivalent to assuming a Gaussian distribution on the tensioned derivative of the underlying process.
Section 4 uses ensembles from synthetic data that mimic the oceanographic data to test a number of choices that must be made. We establish that setting T = S is a reasonable choice. We show how the tension parameter can be chosen a priori (without optimization of the mean-square error) when the effective sample size (which we define later) can be estimated from the data. This estimate for effective sample size can be used to reduce the coefficients ξi in the spline fit without increasing mean-square error.
The second half of the paper addresses issues specific to GPS position errors. In section 5 we discuss the assumptions of stationarity and isotropy required for bivariate smoothing splines. In section 6 we show that GPS errors are not Gaussian distributed but rather are t distributed, and we show how to modify our method for a t distribution. Section 7 addresses how to modify our method to make smoothing splines robust to outliers. We compare with alternative methods and conclude in sections 8 and 9, respectively.
A major outcome of this work is the implementation of MATLAB classes for generating B-splines, interpolating splines, smoothing splines, and a class specific to smoothing GPS data (https://github.com/JeffreyEarly/GLNumericalModelingKit). These classes are highlighted throughout in relevant sections.
2. Interpolating spline
Assume that we are given N observations of a particle position (ti, xi) with no errors. The simplest form of interpolation is a nearest-neighbor method that assigns the position of the particle to the nearest observations in time. The resulting interpolated function x(t) is a polynomial of order K = 1 (piecewise constant), shown in the top row of Fig. 1. The next level of sophistication is to assume a constant velocity between any two observations and to use that to interpolate positions between observations, as shown in the second row of Fig. 1. This means we now have a piecewise constant function dx/dt that represents the velocity of the particle, shown in the second row, second column, of Fig. 1. This is a polynomial function of order K = 2.
It is less obvious how to proceed to a polynomial of order K = 3. With N data points we can construct a piecewise constant acceleration (the second derivative) using the N − 2 independent accelerations computed from finite differencing, but where to place knot points that define the boundaries of the regions and how to maintain continuity is less clear. The approach taken here is to use B-splines.
A B-spline (or basis spline) of order K (degree S = K − 1) is a piecewise polynomial that maintains nonzero continuity across S knot points. The knot points are a nondecreasing collection of points in time denoted by τi. The basic theory is well documented in De Boor (1978), but here we present a reduced version tailored to our needs.
The mth B-spline of order K = 1 is defined as
This is the rectangle function as shown in the first row, first column, of Fig. 2. Given P knot points we can construct P − 1 B-splines of order K = 1, although if a knot point is repeated it results in a spline that is zero everywhere. To represent an interpolating function x(t) for the N observations of a particle position (ti, xi) we define N + 1 knot points as
This creates N independent basis functions that provide support for the region t1 ≤ t ≤ tN (provided that the last spline is defined to include the last knot point). The interpolating function x(t) is defined as , where the coefficients ξm are found by solving . The result of this process is shown in Fig. 1 for seven irregularly spaced data points.
All higher-order (K > 1) B-splines are defined by recursion,
This creates splines that span across one additional knot point at each order and maintain continuity across one more derivative. Examples are shown in Fig. 2.
Any knot points that are repeated T times result in a total of T − 1 splines of order 1 that are everywhere zero. This has the effect of introducing discontinuities in the derivatives for higher-order splines. For our purposes, we use this feature to prevent higher-order splines from crossing boundaries. For K = 2 order splines we use N + 2 knot points at locations
This creates a knot point at every observation point but repeats the first and last knot point. This has the effect of terminating the first and last spline at the boundary and creating N second-order B splines, . The interpolating function x(t) is defined as , where the coefficients ξm are found by solving . The second row of Fig. 1 shows an example.
This process can be continued to higher-order B-splines. For splines that are of even order, we create N + K knot points with
and, for splines that are odd order, we create N + K knot points with
The knot points are chosen to create N splines for the N data points such that the interpolated function x(t) crosses all N observations (ti, xi). The path x(t) is the canonical interpolating spline of order K. Examples are shown in Fig. 1.
The knot placements in (7) and (8) are equivalent to the not-a-knot boundary conditions described in De Boor (1978) and used in the cubic spline implementation in MATLAB. In the usual formulation of the not-a-knot boundary condition, the knot positions do not change as a function of spline order, and therefore additional constraints must be added at each order—especially the requirement that the highest derivative maintain continuity near the boundaries. In the formulation here, these constraints are implicit in (7) and (8).
b. Numerical implementation
The root class in our suite of MATLAB classes is the BSpline class, which evaluates a complete B-spline basis set given a set of knot points. This class was used to generate Fig. 2.
The interpolating spline used to generate Fig. 1 is implemented in the InterpolatingSpline class—a subclass of BSpline. This class generates interpolating splines of arbitrary order given a set of data points (ti, xi), thus generalizing the cubic spline command built in to MATLAB.
c. Synthetic data
Throughout this paper we generate synthetic data for both the signal and the noise. The velocity of the signal is generated from a bivariate Gaussian process known as the Matérn (Lilly et al. 2017). The spectrum of the Matérn is given by
where ω is the frequency, A sets the amplitude, p > 1 sets the high-frequency slope, and λ sets the frequency below which the signal looks increasingly white. This spectrum has finite amplitude at low frequencies and power-law falloff at high frequencies, two physically realistic properties observed in ocean surface drifters (Sykulski et al. 2016). Trajectories from this velocity spectrum will be generated using the “maternoise” function available in jLab (Lilly 2019). In our experiments, the parameter A is chosen such that the square root of velocity variance in each direction is urms = 0.20 m s−1 and the damping scale is λ−1 = 30 min. Values of p are varied with p = 2, 3, and 4 so that the high-frequency spectrum is proportional to ω−2, ω−3, and ω−4. Velocities are sampled every minute and are integrated to get positions. Figure 3 shows an example velocity spectrum of the signal with p = 2.
The position data are contaminated with (white) Gaussian noise with σ = 10 m, a value chosen to resemble GPS errors. For all experiments we use a range of strides, that is, subsampled versions of the underlying process as input into the spline fits. A stride of 100 indicates that the signal is subsampled to 1 in every 100 data points. This lets us evaluate the quality of fit against different strides. In analyzing the quality of fits, we use velocities when computing the power spectrum, but report mean-square errors from positions.
d. Spline degree S
We first examine a synthetic signal uncontaminated by noise, to examine the role of the spline degree S on the interpolated fit. As noted in Craven and Wahba (1978), the degree of the spline sets its roughness. In terms of the power spectrum, this corresponds to the high-frequency slope as can be seen in Fig. 3, which shows fits with S = 1, …, 4. Setting S = 1 produces a high-frequency falloff in the spline fit of ω−2. Although this appears to be a desirable feature when fitting to a process with true slope ω−2, the mean-square error is consistently higher (as indicated in the legend of Fig. 3).
The bottom panel of Fig. 3 shows the coherence between the spline fit and the true signal. A coherence of 1 indicates that the signals are perfectly matched at a given frequency, while a coherence of 0 indicates that the signals are unrelated. There is no discernible difference in coherence between spline fits with S = 1, …, 4. The coherence quickly drops to near zero at the same frequency in all three cases. The implication here is that the spline fits are essentially producing noise at frequencies above the loss of coherence. This is why the spline fits with shallower slopes (with more variance at high, incoherent frequencies) produce a larger overall mean-square error than those with steeper slopes (with less variance at high, incoherent frequencies). The conclusion here is that smoother is better: it is better to use an unnecessarily high-order spline to avoid adding extra noise at high frequencies.
3. Smoothing spline
A typical starting point for maximum likelihood is to establish the probability distribution function (PDF) of the errors, εi ≡ xi − xtrue(ti). The canonical example in one dimension (e.g., Press et al. 1992) is to assume errors are independently drawn from a Gaussian with the following probability distribution
where σg is the standard deviation. This assumption alone places no assumptions on the signal, only on the structure of the noise.
The probability of the observed data given model x(t) is
where we have taken σ = σg. Maximizing the probability function in (11) is the same as minimizing its argument—this is the logarithmic likelihood (up to a constant), called the penalty function
Stated in this way this is the same as asking for the “least squares” fit of the errors.
a. Smoothing-spline penalty function
The model used here is the canonical interpolating spline of order K described in section 2. We have chosen our knot points such that the model intersects the observations and this certainly maximizes (11) [and minimizes (12)] because all the errors are zero, but the resulting distribution of errors (a delta function at zero) does not resemble the assumed Gaussian distribution. Thus, additional constraints are required if the assumed error distribution is to be recovered.
The smoothing spline augments the penalty function of (12) by adding a global constraint on the Tth derivative of the resulting function as in (2). If λT → 0 then this reduces to the least squares fit in (12), but if λT → ∞ then this forces the model to a Tth-order polynomial.
To interpret the first term of (2), consider a motionless particle at true position x0. Using the N relevant observations xi, the sample mean estimates the particle’s position x0. The unbiased sample variance estimates the variance σ2 of the noise, and is given by , the expected value of which is .
Now consider the opposite extreme in which the particle is moving so fast (or the observations are so sparse) that each observation is independent of its neighbors. In this case, each observation must be considered separately, so the sample mean at time ti is (i.e., we are summing over the single relevant observation). In this scenario we cannot produce a sample variance, because there is only a single relevant observation at time ti.
In practice, the number of relevant observations is anywhere between 1 and N. Here we use the term effective sample size, denoted by neff, to describe the typical number of observations being used to estimate either the particle’s position or the variance of the noise at any given time. In this context, the first term of (2) is proportional to an ensemble of multiple estimates of the sample variance
which is expected to scale as
where is our definition of the effective sample size of the sample variance. Revisiting the limiting cases, as the sample variance matches the true variance; as , the sample variance vanishes.
There is a simple physical interpretation for the second term in (2). Consider the case T = 1 so that the smoothing spline is a constraint on velocity. When averaged over the integration time, the integral produces the root-mean-square velocity urms such that the second term scales as . In general, where is the root-mean-square of the Tth derivative, this means λT scales like
The interpretation of the smoothing spline is that the two terms are balanced by a relative weighting of the sample variance of the noise and mean-square of the Tth derivative of the physical process. As discussed in section 4, both and can be estimated a priori such that a good initial estimate for λT can be made.
b. Smoothing-spline maximum likelihood
The penalty function in (2) can be restated in terms of maximum likelihood under some conditions (see chapter 3.8 in Green and Silverman 1993). Assume that in addition to knowing the distribution of errors as in (11), we know how the velocity of the underlying physical process is distributed. For example, in geophysical turbulence the velocity probability distribution function is like a Laplace distribution (Bracco et al. 2000). To recover the smoothing spline, we consider the case where the velocity PDF is Gaussian. Stated as maximum likelihood, this means at any given instant (and not just the times of observation) we expect the model velocity to be Gaussian. We discretize the problem by sampling the velocity Q times tq = t1 + qΔtq, where Δtq = (tN − t1)/(Q − 1) and q = 0, …, Q − 1. The maximum likelihood is thus stated as
which is the joint probability of the error distribution from (11) and the velocity distribution of the underlying physical process. We include parameter γ to set the relative weighting between the two distributions, although it could be absorbed into the definition of . Writing (16) as a penalty function (after converting the product of exponentials into exponentials of sums), we have
where C is a constant. Setting γ = N/Q and renormalizing the penalty function by 2/N (which has no effect on the location of its minimum), (17) can be written as
There is an important special case when tension is applied at the same order as the spline, T = S. In this case the spline is piecewise constant for x(T) with exactly N − T unique values. The parameter γ = N/(N − T) ≈ 1 and (16) can be simplified. This case is appealing because only the N − T unique values of the derivative x(T) that can be computed from N data points are being used for tension, which is not the case when T < S.
This maximum-likelihood perspective shows that adding tension to the penalty function is equivalent to assuming a higher-order derivative in the model (e.g., velocity if T = 1) is Gaussian. This is therefore making an assumption about the underlying physical process of the model. This is in contrast to the first term, which is entirely a statement about measurement noise.
Writing the smoothing spline as a maximum-likelihood condition (16), suggests that if the underlying physical process has a nonzero mean value in tension, the fit will not behave as expected. However, smoothing splines can be easily modified to accommodate a mean value in tension, as shown in appendix A.
c. Optimal parameter estimation
For a given choice of T and λT, the minimum solution to (2) can be found analytically [see Teanby (2007) and our appendix A]. Once the solution is found the smoothing matrix λ is defined as the matrix that takes observations x and maps them to their smooth values, .
where || ||2 is the Euclidean norm, Tr indicates the trace, and is the identity matrix.
A significant amount of the literature on smoothing splines is devoted to minimizing the mean-square error when the variance σ2 is not known. Craven and Wahba (1978) and Wahba (1978) use cross validation to estimate σ and minimize mean-square error. Recent work comparing different estimators shows no single technique to be optimal (Lee 2003). For our application, however, the errors in GPS data can be relatively easily established, as shown in section 6.
The mean-square error in (19) is a combination of the sample variance and the variance of the mean. As already discussed in the context of the penalty function ϕ in section 3a, the first term in (19) is an ensemble of sample variances, and therefore, by combining (13), (14), and (19) we obtain
The second term in (19) is proportional to 2 times the squared standard error, that is, the variance of the sample mean. As discussed in Teanby (2007), the quantity λΣ is the covariance matrix with the squared standard error along the diagonal and thus the mean-square standard error is given by (1/N)Tr(λΣ). The variance of the sample mean is known to scale inversely with the number of samples being used to estimate the mean. We use this to define the effective sample size of the variance of the mean, with
If one assumes , then the expected mean-square error from (19) is equal to σ2/neff. Although not shown here, in an empirical analysis we find that and are approximately equal, although becomes highly variable when approaches 1. These measures of effective sample size can be used to estimate the value of λT necessary for optimal tension without minimizing the expected mean-square error.
The definition of effective sample size used here is related to, but not the same as, the notion of degrees of freedom used in Cantoni and Hastie (2002) and references therein.
4. Spline order, tension order, and the spectrum
With a model path [(1)], a penalty function [(2)], and a minimization condition [(19)], we have all of the primary pieces to create a smoothing-spline interpolant to the data. However, a number of choices still must be made. In this section we use synthetically generated data to represent our physical process, and contaminate the process with Gaussian noise as described in section 2c. We test our ability to recover the signal and examine the effects of changing the spline and tension order on the mean-square error and the resulting spectrum.
The results of this section are empirical, and we acknowledge upfront that any conclusions reached may depend on our particular choice of physical model generating the signal. Nevertheless, our expectation is that the conclusions are “O(1)” correct and are applicable, at least, to our GPS-tracked drifter dataset.
a. Tension degree T
Given a smoothing spline of degree S, the tension in the penalty function (2) can be applied at any degree T ≤ S. We use the synthetic data for the three different slopes to empirically establish the relationship between the tension degree T and the spline degree S.
For S = 1, …, 5 and all T ≤ S we minimize the mean-square error against the true values. The minimization is performed for 200 ensembles of noise and signal with three slopes (ω−2, ω−3, and ω−4) and five different strides. For a given slope, stride, and realization of noise, we identify the minimum mean-square error across S and T and compare all values of S and T as a percentage increase relative to that minimum. After aggregating across slopes, strides, and ensembles, the 68% confidence range is shown in Table 1. The table shows that setting T = S is not always optimal but it is never significantly worse than the optimal choice. Thus for the remainder of the paper we set T = S.
b. Loss of coherence
The loss of coherence defines the time scale below which the smoothing spline is not providing useful information. A reasonable hypothesis is that this scale is related to the effective sample size, neff because this indicates how many points are being used to estimate the true value. Therefore, the loss of coherence occurs at the effective Nyquist, which we define as
In practice, we use because it is less variable than for values near 1 and is the more direct measure of how many points are being used to estimate the model path. Figure 4 shows the power spectrum and coherence of optimal tension fits for three different strides of the data. In each case (23) indicates the approximate value where the coherence drops below 0.5.
c. Reduced spline coefficients
One practical consideration when working with large datasets is the computational cost of creating the spline fit, which is limited by the rate of solving for the spline coefficients. It is beneficial to reduce knot points (and therefore total splines) where possible. A reasonable strategy is that when the effective sample size is large, as measured by (21), we avoid placing a knot point at every data point—essentially “skipping” data points.
To test this idea, we find the optimal fit over a range of different strides (which varies the effective sample size) and increase the number of skipped knot points until the mean-square error starts to rise. We find that we can skip max[1, floor(2neff/3)] knot points without sacrificing precision. The column labeled “optimal mse” in Table 2 indicates the optimal fit in which one knot point is created for every observation point, whereas the reduced degrees of freedom (“reduced dof”) column indicates a fit in which the number of knot points is reduced. In some cases the optimal mean-square error improves with fewer knot points. This means that, when handling large datasets, we can reduce the number of splines being used if the effective sample size is large, and we can “chunk” the data (split them into multiple independent pieces) when the effective sample size is small.
d. Interpolation condition
To estimate λT from (15), we estimate the mean-square value of a derivative of the process, (see appendix C) and the effective sample size, neff. We argue that effective sample size should vary based on the relative size of the measurement errors to the speed of motion. For example, if the position errors are only 1 m, but a particle typically travels 10 m between measurements, then it is hardly justifiable to increase the tension so that the smoothing spline misses the observation points by 1 m. There is not enough statistical evidence to suggest that the particle did not go right through the observation point. On the other hand, if the position errors are 1 m, but the particle typically travels 10 cm between measurements, nearby measurements provide more information about the particle’s true position during that time, so our estimate of the particle’s true position is closer to a mean of the nearby observations.
This idea can be made more rigorous by stating that a change in position, Δx, is statistically significant if it exceeds the position errors σ by some factor. Assuming the physical process has a characteristic velocity scale, urms, we use this concept to define Γ as
where Δt is the typical time between observations. This argument suggests that effective sample size should be proportional to Γ; that is,
where C and m are unknown constants, and we prevent the effective sample size from dropping below 1. Intuitively this means as long as the particle does not move too far between observations, nearby observations help to estimate the true position of the particle.
To test the relationship between Γ and effective sample size, we compute the optimal smoothing spline for a range of values of Γ (created by subsampling the signal) for three different spectral slopes (ω−2, ω−3, and ω−4). The value is computed from the optimal solution for 50 ensembles and shown in Fig. 5. The fits are remarkably good, but depend on the slope. Processes with shallower slopes (rougher trajectories) yield a smaller effective sample size for a given value of Γ.
Using the interpolation condition Γ to estimate effective sample size, we set , the empirically determined best fit for slope ω−3. For all spline fits we use
as an initial estimate for the optimal smoothing parameter, where in (26) and urms in (24) are estimated using the method described in appendix C. The scaling law for can be estimated analytically. Let position observations be given by xi, where
If the effective sample size is ⟨n⟩, then the particle changes position by ⟨n⟩urmsΔt between samples. Applying the two-sample z test, two positions will be considered different for z > zmin, where
e. Optimal fits
Table 2 summarizes the key results of this section by applying a smoothing spline (S = 3) to 200 ensembles with three different slopes (ω−2, ω−3, and ω−4) and five different strides. When the algorithm uses true values, uncontaminated by noise, we consider the process to be “unblinded,” in contrast to “blind” methods, in which the algorithm only uses noisy data. The second and third columns show the effective sample size and average mean-square error when the smoothing spline is applied using the true values (i.e., unblinded) to minimize the mean-square error—this is the lower bound. The fourth column shows average increase in mean-square error when reducing the number of spline coefficients as documented in section 4c. There is almost no change in mean-square error, and therefore, all subsequent methods (whether blind or unblind) use this technique. The fifth column uses (26) from section 4d to provide a (blind) initial guess of the tension parameter. The results are mixed—a typical increase in mean-square error is 30%–50% when the effective sample size is large. While this seems large, this is a small fraction of the total noise variance; for example, an optimal mean-square error of 6 m2 increases to 8 m2 when the total variance is 100 m2. Nearly optimal fits can be found using (19), as shown in the last column of the table.
f. Numerical implementation
The numerical implementation of the methods in this section are available in the SmoothingSpline class, which subclasses BSpline. This class is initialized with three required parameters: a set of data points (ti, xi) and an error distribution.
5. Bivariate smoothing splines and stationarity
Up to this point we have considered univariate data, (ti, xi), but GPS position data are fundamentally bivariate. The term “bivariate” in the context of splines is often used to denote splines defined on two independent variables—however, in this context we define bivariate to mean two dependent variables (e.g., x and y) and one independent variable (e.g., t).
The trivial approach to working with bivariate data is to treat each direction independently—that is, minimize and independent of each other. However, the underlying physical process is often isotropic. In the context of the maximum-likelihood formulation of smoothing splines in (18), this means we expect (the rms value of the tensioned variable) to be the same in all directions (invariant under rotation). This however does not mean that λx should necessarily equal λy. To be explicit, if
then, even if , the effective sample sizes and may differ if there is any mean velocity because, as shown in section 4d, effective sample size depends on velocity.
Therefore, to assume isotropy in λT and use a bivariate smoothing spline, the mean velocity from the underlying process must be removed. What qualifies as mean and fluctuation rarely has a clear answer, but a reasonable option is letting a polynomial of degree T + 1 define the mean. This has the added benefit of removing a constant nonzero tension value, which as shown in section 3b, changes the problem formulation.
It is stationarity, not isotropy, that requires removing the mean velocity. The effective sample size is shown to be dependent on rms velocity, so if velocity varies in time, then the optimal effective sample size varies as well. This means not only do smoothing splines require stationarity in the tensioned variable x(T) as shown in section 3b, but they also require stationarity in the velocity x(1) to be effective. This last requirement can be solved by either removing the mean (as suggested here), or segmenting observations into locally stationary chunks.
a. Assessing errors
Removing the mean or some other low-passed version of the data means the total smoothing matrix is a combination of the low-passed and high-passed smoothing matrices. Once this matrix is computed, it can be used to compute the standard errors.
We first create a low pass filter to capture the mean component of the flow using a simple polynomial fit and then define the residual as our stationary part, . We now compute the smoothing spline as usual on the residual, . So the total, smoothed path is
From this we can compute the covariance matrix and the standard error.
b. Numerical implementation
The BivariateSmoothingSpline class is initialized with data (ti, xi, yi) and a distribution. For a spline of degree S = T, a spline of degree S + 1 is used to remove the mean in each direction. With a Gaussian distribution this is simply a least squares polynomial fit. By assumption, the residual data are stationary and isotropic, so the tension parameter λT is applied equally in each direction. Minimization is performed on the sum of the expected mean-square errors in each direction.
6. GPS dataset
The primary dataset considered here is nine surface drifters deployed in the Sargasso Sea in the summer of 2011 (Shcherbina et al. 2015). In the past, such drifters used the Argos positioning system, which has significantly poorer temporal coverage and position accuracy (Elipot et al. 2016), but recently most surface drifters have employed GPS receivers and transmitted their data back through Argos or Iridium satellites.
The GPS receiver sits on the surface drifter and collects position data, but because of atmospheric conditions or ocean waves, the receivers are sometimes unable to obtain a position, or when they do, it is highly inaccurate. Despite nominal accuracies of a few meters, it is often the case that some positions are off by more than 1000 m, as can be seen in Fig. 8. Applying a smoothing-spline fit using the method in section 3 produces an extremely poor fit, with clear overshoots to bad data points.
GPS error distribution
We characterize the GPS errors by considering data from a motionless GPS receiver allowed to run for 12 h. The GPS receiver used in this test is not the same as the one used for the drifters (because it was no longer available) but should produce errors similar enough for this analysis.
The position recorded by the motionless GPS are assumed to have isotropic errors with mean zero, which means the positions themselves are the errors. The PDF of the combined x and y position errors are shown in Fig. 6.
The error distribution is first fit to a zero-mean Gaussian PDF (10). The maximum-likelihood fit is found by computing the standard deviation of the sample, which is found to be σ ≈ 10 m and shown as the gray line in Fig. 6. However, it is clear the error distribution shows much longer tails than the Gaussian PDF.
The Student’s t distribution is a generalization of the Gaussian that produces longer tails and is defined as
where the σs parameter scales the distribution width and the ν parameter sets the number of degrees of freedom [as ν → ∞, (31) becomes a Gaussian]. The variance is and only exists for ν > 2. Minimizing the Anderson–Darling test to find the best-fit t distribution to the data, we find parameters σs ≈ 8.5 m and ν ≈ 4.5 shown as the black line in Fig. 6. Different choices in GPS receivers and using the Kolmogorov–Smirnoff test results in very similar parameters; that is, σs ≈ 8–10 m and ν ≈ 4–6.
The position error distributions imply a combined distance error distribution by computing and is shown in the lower panel of Fig. 6. With Gaussian noise this results in a Rayleigh distribution,
as shown by the gray line. The distance error distribution from t-distributed noise is computed numerically and is shown by the black line. Around 95% of distance errors are within 30 m.
Figure 7 shows the autocorrelation function of the GPS position errors and the 99% confidence intervals. We find a rough empirical fit to be ρ(τ) = exp[max(−τ/t0, −τ/t1 − 1.35)], where t0 = 100 s and t1 = 760 s, which reflects an initially rapid falloff in correlation, followed by a slower decline. The smallest sampling interval of the GPS drifters in question is 30 min and the correlation indistinguishable from zero according to Fig. 7. It is therefore safe to assume the errors are uncorrelated for our real-data example. Although the drifter sampling rate allows us to avoid further discussion of the autocorrelation function of GPS errors, accounting for autocorrelation is a relatively easy extension (and is implemented in the code).
The smoothing-spline algorithms described in section 3 are modified to use the t distribution as described in appendix B. Table 3 shows that the conclusions reached for Gaussian data in section 3 still apply with t-distributed data.
7. Minimization with outliers
The goal here is to find a smooth solution in the presence of outliers—points that do not appear to be of the known error distribution for the GPS receiver shown in section 6. These points are obviously problematic as can be seen in Fig. 8, where individual data points jump hundreds of meters and even several kilometers away from its neighbors. Errors of this size are inconsistent with the noise analysis of the preceding section, so the goal here is to find a model path x(t) robust to this uncharacterized noise. What makes outliers “obvious” to the eye is they appear as unexpectedly large motions, inconsistent with the other motion for that path. The smoothing-spline formulation is therefore useful, as it assumes the motion at some order (e.g., acceleration) is Gaussian, as shown in section 3b. In the nine drifters we are analyzing here, one drifter shows no obvious outliers, suggesting the issue may be related to how the antennas are configured. This particular drifter serves as a useful point of comparison.
Minimizing with the expected mean-square error (19) produces a fit so poor it is not worth showing. Because outliers add enormous amounts of variance, the expected mean-square error vastly underestimates the spline tension—essentially chasing every outlier shown in Fig. 8. Because some of the noise is uncharacterized, this suggests using a method such as cross validation might be effective. The orange line in Fig. 8 uses a smoothing-spline fit, assuming Student’s t distributed errors, but minimized with cross validation. This fit performs relatively well, but, when compared with the drifter 7, it is clear it still chases some outliers. The goal in this section is to develop a method robust to outliers in cases where we know something about the noise.
The basic problem formulation is as follows: we define a new “robust distribution,” probust, that includes the known noise distribution, pnoise, plus an unknown (or assumed) form of an outlier distribution, poutlier,
We consider a t distribution for pnoise with parameters found from the GPS errors in section 6. The distribution of poutlier is also set to be a t distribution, but with ν = 3 and σ = 50σgps, which roughly matches the total variance of the observed outliers. In our tests we varied α from 0 up to 0.25, approximately the range of observed outliers from the drifter datasets.
Throughout our attempts to smooth the noisy GPS data we tried many different approaches to modifying smoothing splines for robustness to outliers, but ultimately found enormous gains are made by simply discarding outliers while minimizing the expected mean-square error (19). The results of this approach are shown in section 7a, and we document our method to reliably estimate the outlier distribution in section 7b.
a. Robust minimization
The challenge with outliers is we do not know their distribution, so minimizing the expected mean-square error using (19) with the expected variance from the robust distribution defined in (33) does not work. Outliers add extra variance and therefore cause the spline to be undertensioned (λT too small). Our method excludes the outliers from the calculation of (19), where outliers are defined as points unlikely to arise with the known noise distribution. The ranged expected mean-square error replaces σ2 with
and discards all rows (and columns) of λ where (λ − )x < cdf−1(β/2) or (λ − )x > cdf−1[1 − (β/2)].
To test this approach we generated data as before but allowed a certain percentage of outliers α to be generated with an outlier distribution following (33). We considered five values of β (1/50, 1/100, 1/200, 1/400, and 1/800) as well as β = 0, which is just (19). Testing across a number of ensembles with outlier ratios α = 0.0, 0.05, 0.10, and 0.25 we found that β = 1/100 is overall the best choice.
b. Full-tension solution and outlier distribution
The full-tension solution is defined as the maximum allowable value of λ given the known noise distribution. That is, the spline fit is pulled away from the observations so that the distribution of observed errors xi − x(ti) matches the expected distribution pnoise(ε). In cases where the effective sample size neff is large, the full-tension solution approximately matches the optimal (minimal mean-square error) solution. In cases in which the effective sample size is small, the full-tension solution is more akin to a low-pass solution [as increasing λ is equivalent to decreasing ].
In the simplest case where there are no outliers, the full-tension solution can be found by requiring the sample variance match the variance of pnoise(ε). When outliers are present, a more robust method of estimation is required. After some experimentation, we found the most reliable method of achieving full tension is to minimize the Anderson–Darling test of pnoise(ε) on the interquartile range of observed errors. This method can be used to estimate the outlier distribution and further refine both the full-tension solution and the range over which the expected mean-square error is computed.
The outlier distribution is estimated as follows. We first assume the outlier distribution follows a t distribution with ν = 3 and α < 0.5. If the spline is in full tension, then the observed total variance can be used to find σo for the outlier distribution. From (33),
which, given some α, can be solved for σo. Our method uses 100 values of α logarithmically spaced from 0.01 to 0.5 and chooses the value that minimizes the Anderson–Darling test. With an estimate for probust(ε), the full-tension solution can be refined by minimizing the Anderson–Darling test of probust(ε) on the interquartile range of observed errors. This iterative process converges very quickly to a good estimate for the outlier distribution and the full-tension solution.
c. Extension to bivariate data
The strategies in this section are relatively easily extended to bivariate data. All error distributions are assumed isotropic, and the outlier distribution can be estimated by including the errors from both independent directions. The ranged expected mean-square error calculation defined in section 7a uses the distance of the error for its cutoff to remain invariant under rotation.
Application of this method to one of the GPS drifters (drifter 6) is shown in Fig. 8. Although it is impossible to know exactly how well the spline fit performed, comparison with drifter 7 (with no apparent outliers) suggests our method successfully avoids chasing outliers.
d. Numerical implementation
The GPSSmoothingSpline inherits from the BivariateSmoothingSpline class and assumes errors follow a t distribution found in section 6. The class projects latitude and longitude using a transverse Mercator projection with the central meridian set to the center of the dataset.
The methods discussed in this paper are related to other methods used to smooth and interpolate drifter trajectories.
Yaremchuk and Coelho (2015) formulate a cost function, their (9), based on PDFs of the drifter accelerations and the GPS errors. Setting their μ = 1 (they choose μ = 0.9) this is equivalent to the special case of (18) when S = T = 2, where they have implicitly chosen λT by assuming an infinite effective sample size, neff. Their method for isolating outliers is nearly equivalent to the iteratively reweighted least squares method detailed in appendix B using a weight function similar to Tukey’s biweight, (B6).
Elipot et al. (2016) apply their method to the Argos-tracked surface drifters, which are significantly noisier positions than GPS errors but also follow a t distribution. They assume a linear model for positions, equivalent to assuming S = T = 2 with λT → ∞. In the numerical implementation of this paper, this special case is implemented in the ConstrainedSpline class. The time-dependent weight function used in Elipot et al. (2016) requires manually specifying a weight for each point used, and this method is therefore somewhat different than the approach taken here.
Another technique used for smoothing and interpolating drifter positions is kriging (Hansen and Poulain 1996); however, its relationship to smoothing splines is less clear. In response to a study empirically comparing kriging with smoothing splines (Laslett 1994), Handcock et al. (1994) point out that kriging and smoothing splines are just two specific parameter choices of a more general class of splines defined by their covariance functions. In the context of the maximum-likelihood equation for smoothing splines [(18)], this generalization could be modeled by including a covariance structure on the physical process.
Overall, the method of this paper (in a loose sense) generalizes a number of existing approaches for interpolation, especially in terms of flexibly allowing different levels of smoothness and tension, and in terms of application to non-Gaussian noise structures.
The method in this paper solves our problem of finding smoothed, interpolated positions from a noisy GPS drifter dataset with outliers. In more general terms, for signals with second-order structure similar to a Matérn process we found that
For GPS data, there appear to be three key steps for using smoothing splines:
Thanks are given to Miles Sundermeyer, whose drifters were used in this analysis. The work of J. J. Early was funded by ONR through the Scalable Lateral Mixing and Coherent Turbulence Departmental Research Initiative (LatMix) and National Science Foundation Award 1658564. The work of A. M. Sykulski was funded by the Engineering and Physical Sciences Research Council (Grant EP/R01860X/1).
The B-splines are generated using the algorithm described in De Boor (1978) with knot points determined by (7) and (8). The matrix with components denotes the mth B-spline at time ti. The column vector ξm represents the coefficients of the splines such that positions at time ti are given by , where .
The smoothing spline condition in (16) can be augmented to include a nonzero mean tension μu,
where we have taken T = 1 for this calculation. The discretized penalty function is
where Σ denotes the covariance matrix describing the measurement errors and we absorbed several constants into λ1. The matrix is the spline velocity matrix such that is the model velocity at time tq. To find the coefficients that minimize this function, we take the derivative with respect to ξ, set it to zero, and solve for ξ,
where ι is a vector of 1s. The operation ι essentially integrates the m splines and results in a column vector with the integrated values.
We define the smoothing matrix as the linear operator that takes observations x to their smoothed values , . From this definition and (A3),
when μ = 0.
Iteratively Reweighted Least Squares
Using the t distribution is challenging because it does not result in a linear solution for the coefficients as in (A3). One solution is to use a search algorithm to directly look for maximum values. Alternatively, one can use iteratively reweighted least squares (IRLS).
The idea with IRLS is to reweight the coefficients of the Gaussian, σg in (10), so that the resulting distribution looks like the desired distribution, for example, (31). With the recollection that εi ≡ xi − x(ti, ξ), the minimization condition dpg/dξ = 0 implies that
for the Gaussian distribution, whereas for the t distribution this implies
This means that one can set
to get a matching distribution. Of course, this is only true if εi is already known, which initially it is not. So the method becomes iterative—one starts with εi determined from the Gaussian fit and then determines a new εi after reweighting σg. This method iterates until σg stops changing. We can rewrite (B3) as a function of εi,
From (B4) it is clear that if εi < σs then it is reweighted to a smaller value, making the observation point more strongly weighted. On the other hand, if εi > σs, then its relative weighting decreases, and it is treated more as an outlier.
In more general terms, the weight function w(z) for a PDF p(z) is found by setting −∂z logp(z) equal to −∂z logpg(z) of a Gaussian PDF, where w(z) replaces , and then solving for w(z). The result is
The same strategy could be used to reshape the PDF of a Gaussian to match the desired distribution, but here we simply match the minimization conditions of the PDFs.
As a point of reference, Tukey’s biweight is given by
which as a weight function is
In a practical sense, Σ−1 in (A4) is replaced with the diagonal matrix ≡ diag[1/w(εi)] populated with the reweighted values for each observation such that
This operator is used to compute the standard error from the variances, λΣ, where the variance is assumed to be for each observation when using a t distribution.
The smoothing-spline solution does depend on the initial value of w(εi) used in the IRLS method. However, we find that for uniform initial weightings (e.g., all values start with the square root of the variance), the differences are not statistically significant from other initial values.
Estimating the Variance of the Signal
Our method requires good estimates of the root-mean-square velocity urms of the signal, to determine the effective sample size and variance of the tensioned derivative. Our approach is to compute the power spectrum of the signal at the derivative of interest, and sum the variance that is statistically significantly greater than the expected variance of the noise.
Given a process observed with values xn at times tn = nΔ where n = 1, …, N, we estimate the mean of its mth derivative by performing a least squares fit to the polynomial . The detrended time series is defined as . The power spectrum of this time series is
where the frequencies fk are given by fk = k/(NΔ). By Plancherel’s theorem,
The power spectrum of the mth derivative of the process is computed as
It is important to detrend the signal prior to computing the derivative because, by assumption, the signal is periodic and has no secular trend.
The noise, εi, has total variance . Because the noise is assumed to be uncorrelated, the variance distributes evenly across all frequencies. The spectrum of the noise is therefore
which immediately satisfies Plancherel’s theorem (C2). The mth derivative of the noise has power spectrum
The technique used here sums the variance of the signal for a given frequency if it exceeds the expected variance of the noise at the frequency by some threshold. The estimate of power at each frequency follows a χ2 distribution with 2 degrees of freedom, so we choose the threshold based on the 95th percentile of the expected distribution. And thus,
where q ≈ 20 for the 95th-percentile confidence.
Denotes content that is immediately available upon publication as open access.
This article is included in the LatMix: Studies of Submesoscale Stirring and Mixing Special Collection.