## Abstract

A fully nonparametric (NP) version of the sea state bias (SSB) estimation problem in radar altimetry was first presented and solved by Gaspar and Florens (GF) using the statistical technique of kernel smoothing. This solution requires solving a large linear system and thus comes with a significant computational burden. In addition, examination of GF SSB estimates reveals a marked bias close to the boundaries of the estimation domain. This paper presents efforts to improve both the skill and the computational efficiency of the GF SSB estimation method. Computational efficiency is rather easily improved by an appropriate kernel choice that transforms the linear system to be solved into a very sparse system for which fast solution algorithms exist. The estimation bias proves to be due to the GF choice of a rudimentary NP estimator for conditional expectations. Use of a more elaborate estimator appears to be possible after a slight adaptation of the method. This solves the bias problem. Further improvement of the estimation skill is obtained by a local tuning of the kernel bandwidth. The refined estimation method is finally used to obtain a new NP estimate of the TOPEX SSB. This estimate yields larger SSB values than most previous estimates, in better agreement with recent in situ observations.

## 1. Introduction

It has long been known that the sea level measured by radar altimeters is lower than the true sea level because the backscattered power per unit area is greater for the wave troughs than for the wave crests (Yaplee et al. 1971). This effect is known as the electromagnetic bias. Other, often smaller, sea state–related biases affect the altimeter range measurements (e.g., Chelton et al. 1989). These are generally combined with the electromagnetic bias to form the sea state bias (SSB). This bias ranges between a few centimeters and a few decimeters. Its physical modeling remains a theoretical challenge (e.g., Rodriguez et al. 1992; Elfouhaily et al. 1999, 2000). Therefore, its estimation still largely relies on empirical models, calibrated on the altimeter data themselves (e.g., Born et al. 1982; Zlotnicki et al. 1989; Fu and Glazman 1991; Ray and Koblinsky 1991; Gaspar et al. 1994; Chelton 1994; Glazman et al. 1994). These models assume that the SSB can be assigned a parametric form:

$$\mathrm{SSB} = \phi(\mathbf{x}, \boldsymbol{\theta}), \tag{1}$$

where *ϕ* is an a priori specified function of **x**, a vector of pertinent sea state–related variables, and *θ*, a vector of parameters. A simple polynomial form of *ϕ* is generally assumed and the components of **x** are always chosen among the few sea state–related parameters that can be directly measured by an altimeter: the significant wave height (SWH), the surface wind speed (*U*), the backscatter coefficient (*σ*_{0}), or any combination thereof. A least squares fitting procedure is then used to estimate the value of *θ* that minimizes the variance of the measured sea surface height (SSH) differences at crossover points or along collinear tracks. However, Gaspar and Florens (1998, hereafter referred to as GF) demonstrated that so-fitted models are not true least squares approximations of the SSB. They differ from it by typically a few millimeters to a few centimeters, depending on sea state. This problem proves to be a consequence of the inevitably imperfect specification of the model's parametric form, which corrupts a fitting process performed on SSH differences rather than directly on SSH measurements.

To avoid specification of a parametric formulation of the SSB model, GF reformulated and solved the SSB estimation problem in a totally nonparametric way based on the statistical technique of kernel smoothing. This method was used to compute the first fully nonparametric (NP) estimate of the TOPEX SSB as a function of *U* and SWH. This estimate was obtained using (i) a very limited dataset, (ii) an elementary NP estimator for conditional expectations, and (iii) a crude selection of smoothing parameters. Still, it appears to be better than the previously published parametric estimates.

Restriction to a limited dataset was justified on the ground of computational limitations. The solution of the NP SSB estimation problem indeed requires solving a large linear system whose size is proportional to the number of observations used for estimation. Therefore, GF only used a small fraction (less than 10%) of the available data to obtain a TOPEX SSB estimate while keeping the computation time within reasonable limits. This demonstrated the pertinence of the NP approach, but further analysis or refinement of the results was not attempted in such a computationally limited context. This limitation is especially frustrating as the SSB is a small signal to be extracted from altimetric measurements that contain other much stronger signals acting as a large noise in the estimation process. Use of the largest possible number of data is thus highly desirable to help reduce the impact of the noise without having to unduly increase the amount of smoothing.

Another weak point of the GF solution is the use of the so-called Nadaraya–Watson (NW) kernel estimator for conditional expectations. Use of this estimator, actually an elementary locally weighted average, eased the derivation and implementation of the GF solution. But the NW estimator is known to yield significantly biased estimates in regions where the data density varies rapidly (e.g., Chu and Marron 1991). This problem likely affects the GF SSB estimation since altimetric observations are far from homogeneously distributed in the (*U*, SWH) plane. In particular, large data density gradients are present close to the boundaries of the estimation domain.

This paper describes our efforts to improve the skill of the GF SSB estimation method and to overcome its computational limitations. In section 2, we briefly revisit the GF SSB estimation method and show that it is easily generalized to accommodate a wide range of NP estimators of conditional expectations among which a replacement for the NW estimator can be chosen. We present, in section 3, the numerical setup and the datasets used for testing improvements of the estimation method. Methodological improvements are theoretically investigated and numerically tested in section 4. The best estimator is finally used to obtain a new NP estimation of the TOPEX SSB, which is analyzed in section 5.

## 2. Revisiting nonparametric estimation of the sea state bias

### a. Principle

Statistical estimation of the SSB is classically based on the following observation equation:

$$\mathrm{SSH}'_{2} - \mathrm{SSH}'_{1} = \mathrm{SSB}(\mathbf{x}_{2}) - \mathrm{SSB}(\mathbf{x}_{1}) + \varepsilon, \tag{2}$$

where indices 1 and 2 denote measurements taken at different times *t*_{1} and *t*_{2} but at the same geographic location; SSH′ is the sea surface height above the reference ellipsoid, not corrected for the SSB; and *ɛ* is a zero-mean error term that includes altimetric measurement errors, residual geophysical correction errors, and the dynamic topography variation between *t*_{1} and *t*_{2}. The observations SSH^{′}_{2} − SSH^{′}_{1} can be either crossover or collinear differences.

In this section, we briefly revisit, and slightly generalize, the method devised by GF to nonparametrically estimate the SSB based on series of altimetric observations (2). As a starting point, GF simply assume that the SSB is a nonspecified function *ϕ* of **x**, a vector of *p* pertinent variables. In practice, GF use **x** = (*U*, SWH) but the theory holds for any set of sea state–related variables. Setting *y* = SSH^{′}_{2} − SSH^{′}_{1}, (2) can be rewritten:

$$y = \phi(\mathbf{x}_{2}) - \phi(\mathbf{x}_{1}) + \varepsilon, \tag{3}$$

so that the conditional expectation of *y*, given **x**_{2} = **x**, is

$$E[y \mid \mathbf{x}_{2} = \mathbf{x}] = \phi(\mathbf{x}) - E[\phi(\mathbf{x}_{1}) \mid \mathbf{x}_{2} = \mathbf{x}]. \tag{4}$$

Nonparametric estimation theory provides a variety of estimators of the regression function *r*(**x**) = *E*[*ζ* | **x**], where *ζ* is any random scalar variable jointly distributed with **x** (see, e.g., Härdle 1990). Gaspar and Florens directly selected the simple Nadaraya–Watson kernel estimator (Nadaraya 1964; Watson 1964). But one can remain more general and simply assume that the selected estimator can be written under the form

$$\hat{r}(\mathbf{x}) = \sum_{i=1}^{n} \alpha_{n}(\mathbf{x}, \mathbf{x}_{i})\,\zeta_{i}, \tag{5}$$

where the weights *α*_{n} have to satisfy

$$\sum_{i=1}^{n} \alpha_{n}(\mathbf{x}, \mathbf{x}_{i}) = 1. \tag{6}$$

Using such an estimator and a set of *n* altimetric crossover or collinear measurements (*y*_{i}, **x**_{1i}, **x**_{2i}), the conditional expectations in (4) can be estimated to yield

$$\hat{\phi}(\mathbf{x}) = \sum_{i=1}^{n} \alpha_{n}(\mathbf{x}, \mathbf{x}_{2i})\left[y_{i} + \phi(\mathbf{x}_{1i})\right]. \tag{7}$$

Indices 1 and 2 still denote measurements taken at different times (*t*_{1} and *t*_{2}) but at the same location, while index *i* simply denotes the *i*th measurement in the dataset. Equation (7) provides a practical means of estimating *ϕ*(**x**), for any **x**, provided that the values of *ϕ*(**x**_{1i}) are known. To estimate these, one rewrites (7) for **x** equal to each of the **x**_{1j}:

$$\hat{\phi}(\mathbf{x}_{1j}) = \sum_{i=1}^{n} \alpha_{n}(\mathbf{x}_{1j}, \mathbf{x}_{2i})\left[y_{i} + \phi(\mathbf{x}_{1i})\right], \quad j = 1, \ldots, n, \tag{8}$$

or, in matrix form:

$$(\mathsf{I} - \mathsf{A})\boldsymbol{\phi}_{1} = \mathsf{A}\mathbf{y}, \tag{9}$$

where 𝗜 is the (*n* × *n*) identity matrix, 𝗔 is the (*n* × *n*) matrix with elements *a*_{ji} = *α*_{n}(**x**_{1j}, **x**_{2i}), *ϕ*^{T}_{1} = [*ϕ*(**x**_{11}), … , *ϕ*(**x**_{1n})], and **y**^{T} = [*y*_{1}, … , *y*_{n}].

The estimation problem is now in a closed form, but the linear system (9) cannot be solved for *ϕ*_{1} since 𝗜 − 𝗔 is singular. Indeed, the definition of the weights (6) implies that the sum of the elements in each line of this matrix is zero. The rank of 𝗜 − 𝗔 is thus equal to *n* − 1. This is the numerical expression of the obvious indeterminacy linked to observation equation (2): only SSB differences are observed, so the SSB can only be determined to within a constant. To eliminate this indeterminacy we have to impose (rather than estimate) one value of the SSB. In this case, we impose the value of one of the components of *ϕ*_{1}, say, *ϕ*(**x**_{11}), as the dataset can always be rearranged to place the selected component first. The constraint is thus

$$\phi(\mathbf{x}_{11}) = \phi_{0}, \tag{10}$$

where *ϕ*_{0} is the imposed value,

so that (9) can be rewritten

$$\mathsf{B}_{1}\boldsymbol{\phi} = \mathsf{A}\mathbf{y} - \phi(\mathbf{x}_{11})\,\mathsf{B}_{0}, \tag{11}$$

where *ϕ* is the vector made of the *n* − 1 components of *ϕ*_{1} that remain to be determined: *ϕ*^{T} = [*ϕ*(**x**_{12}), … , *ϕ*(**x**_{1n})] and (𝗕_{0}, 𝗕_{1}) is a partition of 𝗜 − 𝗔, 𝗕_{0} being simply the first column of 𝗜 − 𝗔. We now have a system of *n* equations with *n* − 1 unknowns. A linear least squares solution is immediately obtained:

$$\hat{\boldsymbol{\phi}} = (\mathsf{B}_{1}^{T}\mathsf{B}_{1})^{-1}\mathsf{B}_{1}^{T}\left[\mathsf{A}\mathbf{y} - \phi(\mathbf{x}_{11})\,\mathsf{B}_{0}\right]. \tag{12}$$

This provides the missing estimates of *ϕ*(**x**_{1i}), *i* = 2, … , *n*. These can then be plugged into (7) to estimate *ϕ*(**x**) for any value of **x**. The NP SSB estimation problem is thus formally solved. This solution is that of GF, with a slight generalization concerning the form (5) of the NP estimator used for the conditional expectations.
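To make the algebra concrete, the following is a minimal numerical sketch of the solution procedure (our own illustration in Python, not the GF code): Nadaraya–Watson-type Gaussian weights build the matrix 𝗔, the singular system is partitioned as in (11), and the least squares solution is computed. Function names and the synthetic data are ours.

```python
import numpy as np

def nw_weights(x, xs, h):
    """Normalized weights alpha_n(x, x_i): Gaussian kernel, scalar bandwidth h.
    Purely illustrative; any weights summing to one fit the framework."""
    k = np.exp(-0.5 * np.sum(((xs - x) / h) ** 2, axis=1))
    return k / k.sum()

def solve_ssb(y, x1, x2, h, phi0=0.0):
    """Least squares solution of the constrained system: build A with
    a_ji = alpha_n(x_1j, x_2i), partition I - A into (B0, B1), impose
    phi(x_11) = phi0, and solve B1 phi = A y - phi0 B0."""
    n = len(y)
    A = np.vstack([nw_weights(x1[j], x2, h) for j in range(n)])
    M = np.eye(n) - A                  # singular: each row sums to zero
    B0, B1 = M[:, 0], M[:, 1:]
    phi, *_ = np.linalg.lstsq(B1, A @ y - phi0 * B0, rcond=None)
    return np.concatenate(([phi0], phi))   # estimates of phi(x_1i)

# Sanity check: with y = 0 the SSB is determined only up to a constant,
# so imposing phi(x_11) = phi0 must return the constant field phi0.
rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 10.0, size=(40, 2))
x2 = rng.uniform(0.0, 10.0, size=(40, 2))
phi_hat = solve_ssb(np.zeros(40), x1, x2, h=3.0, phi0=-0.05)
```

The constant-field case is exact and exercises the rank-(*n* − 1) structure discussed above; with real, noisy differences the recovery is of course only approximate.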

### b. Selection of a nonparametric estimator

As previously mentioned, GF used the simple Nadaraya–Watson kernel estimator of the regression function *r*(**x**) = *E*[*ζ* | **x**]. This estimator determines *r̂*(**x**) as a simple weighted average of the measurements *ζ*_{i}. The weighting scheme is governed by a so-called kernel function (*K*) that assigns weights as a function of a scaled distance 𝗛^{–1}(**x** − **x**_{i}) between **x**_{i} and **x**. In its general multivariate form, the NW estimator reads

$$\hat{r}_{\mathrm{NW}}(\mathbf{x}) = \frac{\sum_{i=1}^{n} K_{\mathsf{H}}(\mathbf{x}_{i} - \mathbf{x})\,\zeta_{i}}{\sum_{j=1}^{n} K_{\mathsf{H}}(\mathbf{x}_{j} - \mathbf{x})}, \tag{13}$$

with

$$K_{\mathsf{H}}(\mathbf{u}) = |\mathsf{H}|^{-1} K(\mathsf{H}^{-1}\mathbf{u}), \tag{14}$$

where | 𝗛 | is the determinant of 𝗛, the (*p* × *p*) nonsingular scaling matrix, generally called bandwidth matrix. Individual scaling of each component of **x**_{i} − **x** is usually applied so that 𝗛 is, in most practical cases, a diagonal matrix. The kernel function *K* is a nonnegative scalar function normalized to satisfy ∫ *K*(**x**) *d***x** = 1. It is usually chosen to be a symmetric probability density function.
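As a concrete reference, a minimal implementation of (13)–(14) for a diagonal bandwidth matrix might look as follows (our sketch; the normalizing constants of the Gaussian kernel cancel in the ratio):

```python
import numpy as np

def nw_estimate(x, xs, zs, H):
    """Nadaraya-Watson estimate of r(x) = E[zeta | x]: a weighted average
    of the zs, with weights K(H^{-1}(x_i - x)) for a Gaussian kernel K."""
    u = (xs - x) @ np.linalg.inv(H).T          # scaled distances H^{-1}(x_i - x)
    k = np.exp(-0.5 * np.sum(u * u, axis=1))   # kernel values (constants cancel)
    return np.sum(k * zs) / np.sum(k)

# A weighted average with weights summing to one reproduces a constant exactly.
rng = np.random.default_rng(1)
xs = rng.uniform(0.0, 10.0, size=(200, 2))
H = np.diag([1.0, 0.4])                        # diagonal bandwidth matrix
r_hat = nw_estimate(np.array([5.0, 5.0]), xs, np.full(200, 0.07), H)
```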

The complete specification of the NW estimator, or any other kernel estimator, is thus based on three elements: the formulation of the estimator [e.g., Eq. (13)], the choice of a kernel function *K*, and the determination of the bandwidth matrix 𝗛. As will be shown in section 4, careful selection of each of these three elements can significantly improve the skill and the computational efficiency of the GF SSB estimation method.

## 3. Numerical setup

### a. Datasets

#### 1) TOPEX data

Improvements of the NP SSB estimation technique are conveniently checked using the TOPEX crossover dataset previously gathered by GF. This dataset of over 600 000 valid crossover differences includes 100 cycles of TOPEX side A altimetric measurements obtained between 4 April 1993 and 15 April 1996. Data quality is homogeneous throughout this period. In particular, the dataset includes neither the very first repeat cycles nor the end of the TOPEX side A operation period, during which a progressive degradation of the measurements was observed.

The crossover differences (*y* = SSH^{′}_{2} − SSH^{′}_{1}) correspond to the descending arc measurement minus the ascending arc measurement, inside the same repeat cycle. Therefore, in the rest of this paper, index 1 (index 2) denotes measurements taken on the ascending (descending) arc. Each crossover difference is associated with the corresponding altimeter-derived wind speed and wave height measurements (*U*_{1}, SWH_{1}, *U*_{2}, SWH_{2}), where *U* is computed using the modified Chelton–Wentz algorithm (Witter and Chelton 1991a). More details about this data set (hereafter denoted as TP dataset) and its quality control can be found in GF.

#### 2) Simulated data

The skill of SSB estimators cannot be precisely evaluated using the TP dataset alone. Indeed, classical diagnostics such as analysis of estimation residuals or of the explained variance can only be applied to crossover residuals, that is, to crossover differences corrected for the estimated SSB. As explained by GF, an SSB estimate that minimizes the variance of crossover residuals is not necessarily a true least squares approximation of the SSB. In addition, diagnostics on crossover residuals can only be related to wind speed and wave height differences (Δ*U*, ΔSWH) at the crossover points, but not to one specific value of (*U*, SWH). Any problem detected in the residuals is thus difficult to trace back to the corresponding estimation problem in a well-identified zone of the (*U*, SWH) plane.

To allow easier and more detailed evaluation of our new SSB estimates, we created a set of simulated crossover measurements containing only a known SSB signal plus some noise. SSB estimates obtained using this simulated (SIM) dataset can be directly compared to the known “true” SSB and estimation errors are directly diagnosed in the (*U*, SWH) plane.

The SIM dataset is simply obtained by replacing, in the TP dataset, the actual crossover differences (SSH^{′}_{2} − SSH^{′}_{1}) by simulated values *z*_{2} − *z*_{1}:

$$z_{2} - z_{1} = B(U_{2}, \mathrm{SWH}_{2}) - B(U_{1}, \mathrm{SWH}_{1}) + w, \tag{15}$$

where *w* is a simulated noise, and *B* is an analytical expression of the SSB as a function of (*U*, SWH). For this purpose, we use the BM4 model of Gaspar et al. (1994) fitted over the TP dataset; that is,

For simulation realism, the standard deviation of the random noise *w* varies regionally and is taken to be the actual standard deviation of the TP crossover residuals, after correction for the BM4 SSB. This standard deviation varies from 3 cm in the quietest ocean regions up to about 20 cm in the most energetic areas.

### b. Numerical solution procedure

To solve (12), we first have to choose the measurement **x**_{11}, where the constraint (10) is imposed. For the TP dataset, the only obvious choice is to impose that the SSB is zero over a flat sea with no wind, that is, *ϕ*(0, 0) = 0. However, the data density close to (0, 0) is so low (Fig. 1) that imposing the constraint in that neighborhood can create numerical problems. As GF suggested, a numerically effective procedure to satisfy this constraint is the following.

1. Select the ascending arc measurement **x**_{11} closest to (*U*_{m}, SWH_{m}), the mean of the (*U*, SWH) distribution. In the TP dataset, this mean is (8 m s^{–1}, 2.7 m).
2. Impose *ϕ*(**x**_{11}) = *ϕ*_{0}, with any reasonable value of *ϕ*_{0}, for example, *ϕ*_{0} = −5 cm, and compute *ϕ̂* using (12).
3. Then, using (7), compute *ϕ*(**x**) over a regular grid of (*U*, SWH) values, including (0, 0). We typically use a grid size of 0.25 m s^{–1} in *U* and 0.25 m in SWH.
4. Finally, subtract *ϕ*(0, 0) from *ϕ*(**x**) at all grid points. The so-shifted solution obviously satisfies SSB(0, 0) = 0.

For the SIM dataset, the problem is somewhat different as the SSB is known everywhere. Thus, the exact value of *ϕ*(**x**_{11}), rather than a guess, can be imposed in step 2 of the above procedure. By doing so and skipping step 4, one obtains SSB estimates that are nearly exact close to the mean of the data distribution (*U*_{m}, SWH_{m}) but are not necessarily equal to zero at (0, 0). Use of such nonshifted-to-zero solutions proves to be interesting for error analysis, as shown in section 3c.

### c. Computing the final SSB estimate and its uncertainty

Even if we intend to significantly improve the numerical implementation of the NP SSB estimation method, we cannot reasonably envision solving (12) at once for the whole TP or SIM dataset, as it means solving a linear system with over 600 000 unknowns. We will thus follow the GF approach and compute SSB estimates repeat cycle by repeat cycle. Unlike GF though, complete cycles will be used, without data subsampling. This requires solving a linear system with typically 6000 unknowns for each cycle. Then the final SSB estimate is obtained as the average of the individual cycle estimates computed at each grid point **x**. In the usual case where the solution is shifted to zero, the final solution is thus

$$\widehat{\mathrm{SSB}}(\mathbf{x}) = \overline{\phi(\mathbf{x}) - \phi(0, 0)} = \overline{\phi(\mathbf{x})} - \overline{\phi(0, 0)}, \tag{17}$$

where the overbar denotes an average over individual cycle estimates. Assuming independence of the *m* individual cycle estimates, the variance of the final SSB estimate is given by

$$\mathrm{var}\bigl[\widehat{\mathrm{SSB}}(\mathbf{x})\bigr] = m^{-1}\left[\sigma^{2}_{\phi}(\mathbf{x}) + \sigma^{2}_{\phi}(0, 0) - 2\,\mathrm{cov}\bigl(\phi(\mathbf{x}), \phi(0, 0)\bigr)\right], \tag{18}$$

where *σ*^{2}_{ϕ}(**x**) is the variance of the nonshifted cycle estimates *ϕ*(**x**). This expression is especially interesting as it clearly shows the impact of the solution-shifting procedure on the uncertainty of the final SSB estimate. This procedure actually adds the variance of the shift estimate *ϕ*(0, 0) to the variance of the nonshifted solution. This added variance is constant, but unfortunately large as the estimation uncertainty is large in the data-poor region near (0,0). An additional covariance term is also present but generally small compared to *σ*^{2}_{ϕ}(0, 0), except for values of **x** close to (0, 0).

In practice, as the variance of the final shifted solution (18) is a rather convoluted diagnostic including both estimation and shifting effects, it is useful to separately analyze the variance of the nonshifted estimate: *m*^{–1}*σ*^{2}_{ϕ}(**x**). This variance is a direct measure of the estimator skill unaffected by shifting effects, even if its value at (0, 0) is the constant variance added by the shifting process: *m*^{–1}*σ*^{2}_{ϕ}(0, 0).
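This variance bookkeeping is easy to verify numerically. The following fragment (a synthetic check of ours, not altimeter data) builds correlated cycle estimates and compares the direct variance of the shifted, cycle-averaged solution with its decomposition into variance, shift variance, and covariance terms:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 100                                          # number of repeat cycles
phi_x = rng.normal(0.0, 0.002, m)                # nonshifted cycle estimates at some x
phi_0 = 0.5 * phi_x + rng.normal(0.0, 0.01, m)   # correlated estimates at (0, 0)

shifted = phi_x - phi_0                          # shifted-to-zero cycle estimates

# Variance of the final (cycle-averaged) shifted estimate, two ways:
direct = np.var(shifted, ddof=1) / m
formula = (np.var(phi_x, ddof=1) + np.var(phi_0, ddof=1)
           - 2.0 * np.cov(phi_x, phi_0, ddof=1)[0, 1]) / m
```

The two expressions agree to machine precision, since var(*a* − *b*) = var *a* + var *b* − 2 cov(*a*, *b*) holds exactly for sample moments.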

### d. Validity of the results

Any statistical estimate of the SSB as a function of (*U*, SWH) is meaningful only in the zone of (*U*, SWH) plane where a sufficient number of observations exists. To help readers visualize this zone, the data boxes (as defined in Fig. 1) containing less than 30 measurements will be shaded in all figures presenting results in the (*U*, SWH) plane. As an example, Fig. 2 shows such a display of the BM4 SSB (16). Note that the shaded area contains slightly over 1% of the TP (and SIM) data.

## 4. Toward an improved nonparametric estimation of the SSB

### a. Characterizing errors of the GF SSB estimates

To precisely characterize errors likely to be present in the GF estimate of the TOPEX SSB, we first estimated the SSB of the SIM dataset under exactly the same technical conditions as those used by GF to estimate the TOPEX SSB. We thus subsampled SIM to retain only 500 measurements per repeat cycle and used the NW estimator with the same Gaussian kernel and bandwidth as GF:

$$K_{\mathsf{H}}(\mathbf{u}) = \frac{1}{2\pi h_{U} h_{\mathrm{SWH}}}\exp\left(-\frac{u_{U}^{2}}{2h_{U}^{2}} - \frac{u_{\mathrm{SWH}}^{2}}{2h_{\mathrm{SWH}}^{2}}\right), \tag{19}$$

with *h*_{U} = 1 m s^{–1} and *h*_{SWH} = 0.4 m. This first experiment is denoted EXNW. Its numerical characteristics are summarized in Table 1, as are those of all subsequent experiments. The only difference between EXNW and the GF solution procedure is that we did not shift the solution to zero but rather imposed the exact value of BM4 at the ascending arc measurement closest to the mean of the (*U*, SWH) distribution. As previously explained, this allows a clear identification of the errors associated with the estimation procedure with no superimposed effect due to the shifting procedure.

The EXNW solution and its error (difference from the simulated BM4) are shown in Figs. 3a and 3b, respectively. The error is disappointingly large, in the range [−3 cm, 3 cm]. More worrisome is the fact that this error is far from random: it exhibits well-organized variations especially close to the boundaries of the nonshaded region. This appears to be a manifestation of the known bias of the NW estimator in regions of the estimation domain where the data density varies rapidly.

To understand the origin of this bias one must realize that, from a function approximation point of view, *r̂*_{NW}(**x**) is the value of the constant that locally best fits *r*(**x**), in a weighted least squares sense. Indeed, basic calculus shows that *r̂*_{NW}(**x**) as defined by (13) is the value of the constant *r*_{0} that minimizes

$$\sum_{i=1}^{n} (\zeta_{i} - r_{0})^{2}\, K_{\mathsf{H}}(\mathbf{x}_{i} - \mathbf{x}). \tag{20}$$

The NW estimator is therefore called a local constant estimator. A useful measure of the error associated with this, or any other, regression estimator *r̂*(**x**) is the pointwise mean-square error (MSE):

$$\mathrm{MSE}(\mathbf{x}) = E\bigl\{[\hat{r}(\mathbf{x}) - r(\mathbf{x})]^{2} \mid X\bigr\}, \tag{21}$$

where *X* = (**x**_{1}, **x**_{2}, … , **x**_{n}). The MSE is easily decomposed under the form

$$\mathrm{MSE}(\mathbf{x}) = \bigl\{E[\hat{r}(\mathbf{x}) \mid X] - r(\mathbf{x})\bigr\}^{2} + \mathrm{var}[\hat{r}(\mathbf{x}) \mid X]. \tag{22}$$

The first component of the MSE is often referred to as the square of the bias and the second as the variance of the regression estimator. In the one-dimensional case (i.e., for a scalar *x*), the leading terms in the asymptotic expansion of the bias and variance of the NW estimator are (e.g., Fan 1992)

$$\mathrm{Bias}[\hat{r}_{\mathrm{NW}}(x)] \simeq h^{2}\left[\frac{r''(x)}{2} + \frac{r'(x)f'(x)}{f(x)}\right]\int u^{2}K(u)\,du, \tag{23}$$

$$\mathrm{var}[\hat{r}_{\mathrm{NW}}(x)] \simeq \frac{\sigma^{2}(x)}{nhf(x)}\int K^{2}(u)\,du, \tag{24}$$

where *f* is the design density, that is, the density of the random variable *X*; *h* is the scalar bandwidth and *σ*^{2}(*x*) is the conditional variance of *ζ*; that is, var(*ζ* | *X* = *x*). Equivalent results exist in the multivariate case (Ruppert and Wand 1994), but the one-dimensional version suffices here to highlight the basic problem of the NW estimator. Indeed, the bias equation (23) shows that the NW estimator contains a natural smoothing error proportional to *r*″ (the well-known peak and trough flattening effect of any smoother) plus a spurious error term (*r*′*f*′/*f*) due to the local constant fit. This term simply reflects the fact that, in the presence of a slope in *r* and in the data density, a constant fit will be biased toward the *r* values found in the most densely populated zone of the local estimation window. Use of a local linear approximation of *r* instead of a local constant approximation is a known, simple and elegant, solution to this problem (e.g., Chu and Marron 1991).

### b. Improving the estimator formulation

Let us thus assume a local linear model for the regression function *r*:

$$r(\mathbf{z}) \simeq \beta_{0} + \sum_{j=1}^{p} \beta_{j}(z_{j} - x_{j}), \tag{25}$$

with **z** in the neighborhood of **x**. The value of ***β*** = (*β*_{0}, *β*_{1}, … , *β*_{p})^{T} that minimizes

$$\sum_{i=1}^{n}\Bigl[\zeta_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_{j}(x_{ij} - x_{j})\Bigr]^{2} K_{\mathsf{H}}(\mathbf{x}_{i} - \mathbf{x}) \tag{26}$$

is the well-known solution of the weighted least squares regression problem,

$$\hat{\boldsymbol{\beta}} = (\mathsf{X}_{D}^{T}\mathsf{W}\mathsf{X}_{D})^{-1}\mathsf{X}_{D}^{T}\mathsf{W}\boldsymbol{\zeta}, \tag{27}$$

where 𝗪 = diag{*K*_{H}(**x**_{i} − **x**)} is the (*n* × *n*) matrix of the weights, ***ζ***^{T} = (*ζ*_{1}, … , *ζ*_{n}), and 𝗫_{D} is the [*n* × (*p* + 1)] design matrix whose *i*th row is [1, (**x**_{i} − **x**)^{T}].

The so-called local linear regression (LLR) estimator of *r*(**x**) is *β̂*_{0}, while the other components of ***β̂*** are estimators of the first-order partial derivatives of *r* (Fan 1992). One can thus write

$$\hat{r}_{\mathrm{LLR}}(\mathbf{x}) = \mathbf{e}_{1}^{T}\hat{\boldsymbol{\beta}} = \mathbf{e}_{1}^{T}(\mathsf{X}_{D}^{T}\mathsf{W}\mathsf{X}_{D})^{-1}\mathsf{X}_{D}^{T}\mathsf{W}\boldsymbol{\zeta}, \tag{28}$$

where **e**_{1} is the [(*p* + 1) × 1] unit vector: **e**^{T}_{1} = (1, 0, … , 0). Obviously, *r̂*_{LLR} conforms to (5) and, with a little algebra, it can be shown to satisfy (6). This estimator is thus directly usable in the SSB estimation procedure as a replacement for *r̂*_{NW}. The asymptotic bias and variance of *r̂*_{LLR} are (Fan 1992)

$$\mathrm{Bias}[\hat{r}_{\mathrm{LLR}}(x)] \simeq \frac{h^{2}}{2}\,r''(x)\int u^{2}K(u)\,du, \tag{29}$$

$$\mathrm{var}[\hat{r}_{\mathrm{LLR}}(x)] \simeq \frac{\sigma^{2}(x)}{nhf(x)}\int K^{2}(u)\,du. \tag{30}$$

This shows that the spurious *r*′*f*′/*f* term has been eliminated from the bias expression with strictly no impact on the asymptotic variance, an excellent result that also holds in the multivariate case (Ruppert and Wand 1994). This improved estimator comes with a rather elaborate weighting scheme (28) involving tedious matrix multiplications and the inversion of 𝗫^{T}_{D}𝗪𝗫_{D}, a [(*p* + 1) × (*p* + 1)] matrix. In our case, however, *p* = 2 so that the additional time needed to compute weights only represents a small fraction of the total computation time. The bulk of the computational effort remains in the least squares solution of the large linear system (11).
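The contrast between the local constant and local linear fits is easy to demonstrate in one dimension. In the sketch below (our own construction, noise-free for clarity), the regression function is linear and the design density increases sharply toward *x* = 0: the NW estimate is pulled toward the densely sampled side, while the LLR estimate reproduces a linear function exactly, whatever the design.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u * u)          # Gaussian kernel (constants cancel)

def nw(x0, xs, zs, h):
    """Local constant (Nadaraya-Watson) fit at x0."""
    w = gauss((xs - x0) / h)
    return np.sum(w * zs) / np.sum(w)

def llr(x0, xs, zs, h):
    """Local linear fit at x0: weighted LS on the basis [1, x - x0]."""
    w = gauss((xs - x0) / h)
    X = np.column_stack([np.ones_like(xs), xs - x0])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * zs))
    return beta[0]                       # beta_1 estimates r'(x0)

xs = np.linspace(0.0, 1.0, 400) ** 2     # design density ~ 1/sqrt(x): peaked at 0
zs = xs                                  # linear regression function r(x) = x
err_nw = abs(nw(0.5, xs, zs, 0.2) - 0.5)
err_llr = abs(llr(0.5, xs, zs, 0.2) - 0.5)
```

Here `err_llr` is at machine-precision level while `err_nw` is a few percent of the signal, the analog of the organized errors seen in EXNW near density gradients.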

To test *r̂*_{LLR} we performed a new numerical experiment (EXLLR1) under the same conditions as EXNW except that *r̂*_{NW} was replaced by *r̂*_{LLR}. The resulting SSB estimate and its error are shown in Fig. 4. The estimation error is considerably smaller than in EXNW, generally below 1 cm in the nonshaded area, except in the vicinity of (0, 0). This error no longer exhibits well-organized variations, as in EXNW. This confirms that use of *r̂*_{LLR} largely solves the bias problem, leaving essentially random errors whose magnitude increases as the local data density decreases. Thus, either more data or more smoothing is required to further reduce the estimation error.

### c. Improving the kernel choice

More data are at hand. But, so far, the number of data actually used has been limited for computational reasons. Indeed, the GF SSB estimation method requires solving, in the least squares sense, a linear system (11) whose size is proportional to the number of data used. Gaspar and Florens implemented a classical solution based on the Cholesky decomposition of the 𝗕^{T}_{1}𝗕_{1} matrix. Other numerical solution techniques might prove to be slightly more efficient, but the real breakthrough comes from recognizing that 𝗕_{1} is almost a sparse matrix. Indeed, the matrix coefficients are the weights *α*_{n}(**x**_{1j}, **x**_{2i}). Many of them concern far apart measurements **x**_{1j} and **x**_{2i} and are thus very small, but never exactly zero in the case of a Gaussian kernel. However, other popular kernels have a compact support and thus yield truly zero weights when the distance between **x**_{1j} and **x**_{2i} exceeds the bandwidth. Since the few commonly used kernels have very similar statistical properties (e.g., Simonoff 1996), replacing one by another is known to have little effect on the skill of the estimator provided that the bandwidth is correspondingly adapted (see next section). We can thus readily replace the Gaussian kernel by a kernel with a compact support. The obvious choice is the spherical Epanechnikov kernel, which, under certain conditions, proves to minimize the asymptotic MSE of the LLR estimator (Fan and Gijbels 1996). In our two-dimensional case, this kernel reads

$$K(\mathbf{x}) = \frac{2}{\pi}\,(1 - \mathbf{x}^{T}\mathbf{x}) \quad \text{if } \mathbf{x}^{T}\mathbf{x} \leq 1, \qquad K(\mathbf{x}) = 0 \text{ otherwise}. \tag{31}$$

Using this kernel, (11) becomes a truly sparse linear system, for which very efficient iterative solution methods exist. We tested a number of these and finally selected the LSQR algorithm of Paige and Saunders (1982), which was specifically developed for large sparse linear systems to be solved in the least squares sense. This algorithm was implemented together with the Modified Sparse Row (MSR) storage scheme that allows compact storage of the 𝗕_{1} matrix (for a description, see Saad 1996). The new version of the estimation software was then tested in a simple experiment (EXLLR2) analogous to EXLLR1 but with an Epanechnikov instead of a Gaussian kernel, and with adapted bandwidths: *h*_{U} = 2.2 m s^{–1} and *h*_{SWH} = 0.9 m. The reason for multiplying the bandwidth by a factor of 2.2 when switching from a Gaussian to an Epanechnikov kernel will be explained in the next section. The SSB estimate obtained in EXLLR2 (not shown here) is very close to that of EXLLR1. Differences between the two estimates are random, ranging from nearly zero close to the peak of the data distribution up to 4–5 mm at the boundaries of the nonshaded region where the estimation uncertainty is large due to data scarcity. From an estimation point of view, switching from a Gaussian to an Epanechnikov kernel is thus essentially a blank operation. From a computational point of view, however, the effect is quite impressive: on the same workstation EXLLR1 required over 2 h of CPU time, while EXLLR2 ran in less than 15 min. The storage gain is also significant as over 85% of the elements of 𝗕_{1} are zero in the EXLLR2 experiment. Thanks to the achieved reduction in both computer time and storage requirements, the new software version allows us to reasonably consider solving the NP SSB estimation problem without subsampling repeat cycles, that is, using complete cycles, each containing 6 to 7 × 10^{3} valid crossovers. This is achieved in the next section, together with the bandwidth tuning.
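The sparsity argument can be illustrated with SciPy's implementation of LSQR. The sketch below uses our own synthetic geometry and densifies the weight matrix before converting it, whereas an operational code would assemble 𝗕₁ directly in sparse storage; it is an illustration of the technique, not the GF software.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lsqr

def epan_weights(x, xs, h):
    """Spherical Epanechnikov weights: exactly zero outside the bandwidth,
    so most entries of the weight matrix vanish."""
    u2 = np.sum(((xs - x) / h) ** 2, axis=1)
    k = np.where(u2 <= 1.0, 1.0 - u2, 0.0)    # (2/pi)|H|^{-1} cancels on normalization
    return k / k.sum()

rng = np.random.default_rng(3)
n = 300
x1 = rng.uniform(0.0, 10.0, size=(n, 2))
x2 = x1 + rng.normal(0.0, 0.5, size=(n, 2))   # paired measurements, nearby sea states
h = np.array([2.2, 2.2])

A = np.vstack([epan_weights(x1[j], x2, h) for j in range(n)])
M = np.eye(n) - A
B0, B1 = M[:, 0], csr_matrix(M[:, 1:])        # sparse system matrix
y = np.zeros(n)                               # constant-SSB sanity case
phi0 = -0.05
sol = lsqr(B1, A @ y - phi0 * B0, atol=1e-10, btol=1e-10)[0]
sparsity = 1.0 - B1.nnz / (n * (n - 1))
```

With this geometry most of 𝗕₁ is exactly zero, and LSQR recovers the constant solution imposed by the constraint without ever forming 𝗕^{T}_{1}𝗕_{1}.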

### d. Improving the bandwidth selection

Bandwidth selection in the context of multivariate NP regression is a difficult and still very active field of research (for an excellent overview, see the textbook of Fan and Gijbels 1996). Before trying to implement complex bandwidth determination procedures, we performed a few simple numerical experiments in which a first guess of the bandwidths was progressively refined to reduce the difference between the estimated SSB and its known exact value (BM4). These experiments were performed using the complete SIM dataset, the LLR estimator and the Epanechnikov kernel. After a few trials we obtained a very satisfactory estimate using *h*_{U} = 2 m s^{–1} and *h*_{SWH} = 0.9 m. This estimate (EXLLR3) and its error are shown in Fig. 5. The estimation error is close to zero in most of the nonshaded zone. It reaches values close to, or slightly above, 2 mm near the boundaries of that zone but then increases very rapidly when entering the shaded zone. The estimation uncertainty at (0, 0) is close to 1 cm. This rapid increase of the estimation error as one enters the shaded zone is essentially due to the compact support of the Epanechnikov kernel. Indeed, this kernel provides markedly less smoothing than a Gaussian kernel in data-poor regions as only the very few measurements inside the kernel support area (i.e., inside the bandwidth) have a nonzero weight and hence contribute to smoothing.

Even if these results are satisfying over most of the data distribution, one is tempted to find a way to reduce the rapid error growth in the data-sparse regions, especially close to (0, 0). Additional smoothing would certainly help, but the use of unnecessarily large bandwidths should be avoided in data-rich regions as the estimator bias increases with the bandwidth (29). We are thus naturally led to consider use of a locally variable bandwidth instead of a single global bandwidth, a quite common approach (Ruppert et al. 1995). One theoretical option for determining a locally optimal bandwidth is to look for the bandwidth that minimizes the asymptotic MSE. Using (22), (29), and (30), it is easily demonstrated that, in the one-dimensional case, the value of *h* that minimizes the asymptotic MSE of *r̂*_{LLR} is

$$h_{\mathrm{opt}}(x) = C^{1/5}(K)\left[\frac{\sigma^{2}(x)}{nf(x)\,[r''(x)]^{2}}\right]^{1/5}, \tag{32}$$

where *C*(*K*) is a kernel-dependent constant:

$$C(K) = \frac{\int K^{2}(u)\,du}{\left[\int u^{2}K(u)\,du\right]^{2}}.$$

The interpretation of (32) is simple. A larger local variance *σ*^{2}(*x*) or a smaller local data density *nf*(*x*) requires more smoothing and hence a larger bandwidth. A larger local curvature *r*″(*x*) requires a smaller bandwidth to minimize the peak and trough flattening effect of smoothing. For a Gaussian and an Epanechnikov kernel, the value of the constant *C*^{1/5}(*K*) is 0.776 and 1.719, respectively. This explains why we multiplied the bandwidth by a factor of 2.2 when switching from a Gaussian to an Epanechnikov kernel in EXLLR2. Beyond such simple proportionality rules, practical use of (32) requires considerable work as *σ*^{2}(*x*), *f*(*x*), and *r*″(*x*) are, a priori, unknown. One approach is to replace these unknowns by some estimates, leading to the so-called plug-in bandwidth selection technique. This is not a simple technique: the estimation of *r*″(*x*) alone is already more difficult than the original problem of estimating *r*. One can, however, take advantage of expressions like (32) to design simple local bandwidth schemes that have, at least, some properties of the optimal choice. In our case, we wish to modulate the bandwidth as a function of the local data density. Equation (32) indicates that, in the one-dimensional case, the optimal modulation factor is *f*(*x*)^{–1/5}. The generalization to the multivariate case can be found in Fan and Gijbels (1996). It shows that the optimal bandwidth is actually proportional to *f*(*x*)^{–1/(p+4)}. A simple proxy for the local density *f*(*x*) is the number of measurements per (*U*, SWH) box, as shown in Fig. 1. This suggests the following simple local bandwidth rule in our two-dimensional problem:

$$(h_{U}, h_{\mathrm{SWH}})(\mathbf{x}) = (h_{U}, h_{\mathrm{SWH}})_{0}\left[\frac{n(\mathbf{x})}{\bar{n}}\right]^{-1/6}, \tag{33}$$

where (*h*_{U}, *h*_{SWH})_{0} is a reference bandwidth, *n*(**x**) is the number of altimetric measurements in the box where **x** is found, and *n̄* is the mean value of *n*(**x**), taken over all boxes that contain at least one measurement. As in Fig. 1, we use (*U*, SWH) boxes of width (0.25 m s^{–1}, 0.25 m). The mean *n̄* is a normalization factor that yields (*h*_{U}, *h*_{SWH})(**x**) = (*h*_{U}, *h*_{SWH})_{0} in boxes with average data density. The resulting local modulation of the bandwidth is shown in Fig. 6. It varies from about 0.7 near the peak of the data distribution up to 2.0 at the boundary of the nonshaded region.
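Both the kernel constants and the box-count bandwidth rule are easy to check numerically. The sketch below is a hedged illustration: the kernel moment values are standard closed-form results, and the box counts, reference bandwidth defaults, and the function name `local_bandwidth` are invented for the example.

```python
import math

# C(K) = (integral of K^2) / (integral of u^2 K)^2, using standard
# closed-form kernel moments:
#   Gaussian:      integral K^2 = 1/(2*sqrt(pi)),  second moment = 1
#   Epanechnikov:  K(u) = 0.75*(1 - u^2) on [-1, 1],
#                  integral K^2 = 3/5,             second moment = 1/5
C_gauss = (1.0 / (2.0 * math.sqrt(math.pi))) / 1.0 ** 2
C_epan = (3.0 / 5.0) / (1.0 / 5.0) ** 2
print(round(C_gauss ** 0.2, 3), round(C_epan ** 0.2, 3))  # 0.776 1.719

# Local bandwidth rule: scale a reference bandwidth by
# (n_bar / n(x))^(1/6), the p = 2 analog of the f(x)^(-1/5) factor.
def local_bandwidth(u, swh, counts, h0=(2.0, 0.9), box=(0.25, 0.25)):
    """counts maps (i, j) box indices to numbers of measurements per box."""
    key = (int(u / box[0]), int(swh / box[1]))
    n_bar = sum(counts.values()) / len(counts)  # mean over non-empty boxes
    scale = (n_bar / counts[key]) ** (1.0 / 6.0)
    return (h0[0] * scale, h0[1] * scale)

# Invented box counts: one data-rich box and one data-poor box.
counts = {(28, 8): 5000, (2, 2): 20, (10, 4): 980}
print(local_bandwidth(7.0, 2.0, counts))  # dense box: bandwidth shrinks
print(local_bandwidth(0.6, 0.6, counts))  # sparse box: bandwidth grows
```

Note that the factor of 2.2 quoted for EXLLR2 is simply the ratio of the two kernel constants, 1.719/0.776.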

A new SSB estimation (EXLLR4) was obtained using the local bandwidth rule (33). As in EXLLR3, we used the complete SIM dataset, the LLR estimator, and the Epanechnikov kernel. The global bandwidth of EXLLR3 is used as the reference; that is, (*h*_{U}, *h*_{SWH})_{0} = (2 m s^{–1}, 0.9 m). Results are shown in Fig. 7. In data-dense regions, the estimation error remains very small, as was the case in EXLLR3. The EXLLR4 estimate is clearly better than EXLLR3 only in regions with very scarce data coverage, that is, with 10 to 30 measurements per box, just outside the nonshaded zone. In that region, the estimation error now remains well below the centimeter level. Figure 8 confirms that, while the standard deviation of this (nonshifted) estimate is typically between 1 and 2 mm in the nonshaded region, it also keeps low values (3 to 4 mm) immediately outside this region. Note that the magnitude of the standard deviation is very consistent with the magnitude of the observed errors shown in Fig. 7b.

From a statistical point of view, this improvement of the SSB estimation in data-poor regions typically concerns 1% of the altimetric measurements and may thus seem of little interest. Still, for studies of extreme events, it can be important to have reasonable SSB estimates even in very unusual sea surface conditions. In addition, the use of a local bandwidth reduces the estimation uncertainty at (0, 0) and thus the uncertainty of the shifting process. We will therefore keep the simple local bandwidth rule (33) for further work with the TOPEX data.

## 5. New TOPEX SSB estimate

The series of experiments performed with the SIM dataset shows that the revised NP SSB estimation method, based on the LLR estimator, the spherical Epanechnikov kernel, and the local bandwidth rule (33), leads to significantly improved results compared to those previously published by GF. The last step is thus to apply this revised method to obtain a new NP estimate of the TOPEX SSB. This is readily obtained by performing an estimation experiment (EXTP) in the same technical conditions as EXLLR4 but with the TP dataset instead of SIM. Note, however, that, in this case, the final solution has to be shifted to satisfy the constraint SSB(0, 0) = 0.

Let us first examine the standard deviation of the EXTP solution before it is shifted: *m*^{–1/2}*σ*_{ϕ}(**x**) (Fig. 9a), and the standard deviation of the final shifted solution *σ*_{SSB}(**x**) (Fig. 9b). As expected, Fig. 9a closely resembles Fig. 8. For both the SIM and TP datasets, the estimation uncertainty remains close to 1 mm near the peak of the data distribution and increases to 2 or 3 mm as one approaches the boundaries of the nonshaded area. Unfortunately, the estimation uncertainty is still rather large in the vicinity of (0, 0), where it reaches a value of 9 mm. This value is the standard deviation of the shift estimate. Figure 9b accordingly shows that, in most of the nonshaded area, the uncertainty of the final SSB estimate is close to 1.1 cm, roughly corresponding to a constant 9-mm uncertainty due to the solution shift and a 1–3-mm uncertainty due to the estimation itself. The covariance between the shift and estimation effects has little impact, except close to (0, 0) where it naturally causes the uncertainty to become zero, as all individual SSB estimates are constrained to be zero.
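The way the shift and estimation uncertainties combine follows from the variance of a difference of two correlated estimates. A toy sketch (only the 9-mm shift standard deviation comes from the text; the other numbers and the helper name `shifted_std` are placeholders):

```python
import math

# Shifted solution: SSB(x) = phi(x) - phi(0, 0), so
# Var[SSB(x)] = Var[phi(x)] + Var[phi(0, 0)] - 2 Cov[phi(x), phi(0, 0)].
def shifted_std(sigma_x, sigma_0, cov):
    return math.sqrt(sigma_x ** 2 + sigma_0 ** 2 - 2.0 * cov)

sigma_shift = 9.0  # mm: standard deviation of the shift estimate (from the text)

# Far from (0, 0) the covariance is negligible and the shift term dominates:
print(shifted_std(3.0, sigma_shift, 0.0))  # about 9.5 mm

# At (0, 0) itself phi(x) = phi(0, 0): the covariance equals the common
# variance, and the uncertainty of the shifted estimate vanishes.
print(shifted_std(sigma_shift, sigma_shift, sigma_shift ** 2))  # 0.0
```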

The new TOPEX SSB estimate is shown in Fig. 10a with the corresponding relative bias (SSB/SWH) in Fig. 10b. The wind- and wave-related variability of this new estimate is consistent with the known properties of the Ku-band SSB (see Chelton et al. 2001 for a review). In particular, for any given SWH, the new estimate confirms that both the SSB and the relative bias increase with *U* for wind speeds up to 11 or 12 m s^{–1} and then decrease at higher wind speeds. Also, the relative bias decreases with SWH, as first noted by Witter and Chelton (1991b).

The estimated magnitude of the SSB is slightly more surprising. Most previous empirical estimates of the TOPEX relative bias are in the range −1% to −3% of SWH, with an exception for the BM4 estimate (Gaspar et al. 1994), which reaches somewhat larger values, close to −4%, for underdeveloped seas. The newly estimated relative bias is typically in the range −3% to −4% of SWH, with values as high as −5% for markedly underdeveloped seas and as low as −2% for the highest observed waves. This range of values better matches in situ measurements of the Ku-band SSB made by Melville et al. (1991) during the SAXON experiment off the Chesapeake Bay, by Arnold et al. (1995) in the Gulf of Mexico, and by W. K. Melville (1999, personal communication) in the Bass Strait. Millet et al. (2001a) actually demonstrated that the differences between our new SSB estimate and in situ measurements in the Gulf of Mexico and Bass Strait are on the order of 1 cm. This difference is well within the error bars of both the in situ measurements and our SSB estimate.

This new SSB estimate is directly available, as a lookup table, from the authors (gaspar@cls.fr).

## 6. Summary and conclusions

The refined NP SSB estimation method presented in this paper significantly improves upon the first version of the method devised by GF. The main changes are as follows.

The refined method uses an LLR estimator for conditional expectations instead of the NW estimator used by GF. This eliminates the bias affecting GF SSB estimates in areas with large data density variations, such as the boundaries of the estimation domain.
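The boundary bias difference between the two estimators is easy to reproduce in one dimension. A minimal sketch (the data, bandwidth, and function names are arbitrary; the weights use an Epanechnikov kernel as in the refined method): with data confined to [0, 1] and a linear trend, the Nadaraya-Watson (NW) estimate at the left boundary can only average points to its right and is biased high, while local linear regression (LLR) fits a weighted line and recovers the trend.

```python
import random

def epan(u):
    # Epanechnikov kernel, compact support on [-1, 1].
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def nw(x0, xs, ys, h):
    # Nadaraya-Watson: locally weighted mean.
    w = [epan((x - x0) / h) for x in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

def llr(x0, xs, ys, h):
    # Local linear regression: weighted least squares line, evaluated at x0.
    w = [epan((x - x0) / h) for x in xs]
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    t0 = sum(wi * y for wi, y in zip(w, ys))
    t1 = sum(wi * (x - x0) * y for wi, x, y in zip(w, xs, ys))
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)

random.seed(0)
xs = [random.random() for _ in range(2000)]   # data only on [0, 1]
ys = [2.0 * x for x in xs]                    # noise-free linear trend

# At the boundary x = 0 the true value is 0: NW is biased high, LLR is not.
print(nw(0.0, xs, ys, 0.3), llr(0.0, xs, ys, 0.3))
```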

The refined method uses an Epanechnikov kernel instead of a Gaussian kernel. This has little impact on the estimation itself but transforms the large linear system to be solved into a sparse system for which the very efficient LSQR solution method (Paige and Saunders 1982) is implemented together with a compact matrix storage scheme. The resulting gain in computational efficiency allows us to obtain SSB estimates based on complete repeat cycles.
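The computational point can be sketched independently of the altimetric system: with a compact-support kernel most weights vanish, so only the nonzero entries need be stored, and each iteration of an iterative least squares solver costs one sparse matrix-vector product. The illustration below substitutes a plain conjugate gradient iteration on the normal equations (CGNR) for LSQR, applied to a made-up tridiagonal system; the storage scheme and matrix are invented for the example, not taken from the paper.

```python
import math

# Sparse matrix stored row-wise as {row: [(col, value), ...]}: only the
# nonzero entries are kept, as with a compact-support kernel.
def matvec(rows, x, n):
    y = [0.0] * n
    for i, entries in rows.items():
        y[i] = sum(v * x[j] for j, v in entries)
    return y

def matvec_t(rows, y, n):
    x = [0.0] * n
    for i, entries in rows.items():
        for j, v in entries:
            x[j] += v * y[i]
    return x

def cgnr(rows, b, n, iters=200):
    # Conjugate gradient on the normal equations A^T A x = A^T b; this
    # targets the same least squares solution as LSQR, in its simplest form.
    x = [0.0] * n
    r = matvec_t(rows, b, n)  # gradient A^T (b - A x) at x = 0
    p, rs = r[:], sum(ri * ri for ri in r)
    for _ in range(iters):
        AtAp = matvec_t(rows, matvec(rows, p, n), n)
        alpha = rs / sum(pi * qi for pi, qi in zip(p, AtAp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, AtAp)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < 1e-24:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Made-up tridiagonal system: each row overlaps only its two neighbors.
n = 50
rows = {i: [(j, 1.0 if i == j else 0.2)
            for j in (i - 1, i, i + 1) if 0 <= j < n] for i in range(n)}
x_true = [math.sin(0.1 * i) for i in range(n)]
b = matvec(rows, x_true, n)
x_sol = cgnr(rows, b, n)
print(max(abs(a - t) for a, t in zip(x_sol, x_true)))  # essentially zero
```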

The refined method uses a local bandwidth determination rule instead of a global one. This allows adaptation of the amount of smoothing to the local data density, which is especially beneficial in the data-poor regions of the (*U*, SWH) plane.

All these technical refinements are validated through numerical experiments with simulated data containing an exactly known SSB plus a realistic geographically varying noise. A new TOPEX SSB estimate is finally obtained using the refined estimation technique. This estimate exhibits a wind- and wave-related variability in agreement with the latest findings on this topic. The magnitude of the corresponding relative bias is in the range −2% to −5% of SWH, larger than most previous empirical estimates based on TOPEX data but in good agreement with different sets of in situ measurements (Melville et al. 1991; Arnold et al. 1995; W. K. Melville 1999, personal communication).

The uncertainty of the new TOPEX SSB estimate is close to 1.1 cm. It combines the estimation uncertainty itself with the (constant) uncertainty of the solution shift needed to impose SSB(0, 0) = 0. The uncertainty of the solution shift is, by far, the largest (9 mm), while the estimation uncertainty is in the range 1 to 3 mm. The only way to significantly reduce the uncertainty of the solution shift is to obtain and use more good-quality measurements close to (0, 0) in order to better constrain the estimation of SSB(0, 0).

Finally, it should be clear that the uncertainties mentioned here concern only the fraction of the SSB that can be modeled as a function of (*U*, SWH). Other factors influencing the SSB, such as the fetch or the wave slope, are only partly related to *U* and SWH. Their impact on the SSB will therefore be only partly reproduced in our estimate. Significant work is now under way to identify the most pertinent missing factors needed to reproduce the largest possible fraction of the SSB variability. The long-wave slope is certainly one of them (Millet et al. 2001b; Chapron et al. 2001). If this, or another, pertinent parameter can be accurately estimated at the global scale, the NP SSB estimation technique is now mature enough to help reliably quantify its impact on the SSB.

## Acknowledgments

The authors wish to thank Jean-Pierre Florens for useful discussions on nonparametric estimation methods and Alain Huard for his expert recommendations on numerical solution techniques for large linear systems. This study is part of the preparation activities for the JASON and ENVISAT missions. It was supported by the Centre National d'Etudes Spatiales (CNES) under Contract 731/CNES/00/8251/00 and by the European Space Agency (ESA) under Contract 12132/96/NL/GS.


## Footnotes

*Corresponding author address:* Dr. Philippe Gaspar, Collecte Localisation Satellites, Space Oceanography Division, 8-10 rue Hermes, Parc Technologique du Canal, Ramonville Cedex 31526, France. Email: philippe.gaspar@cls.fr