1. Introduction
We present a technique for generating the covariance operators of an optimal data assimilation (DA) scheme that overcomes several limitations of the present operational recursive-filter-based technique (Wu et al. 2002; De Pondeca et al. 2011) used at the National Centers for Environmental Prediction (NCEP). The variational assimilation of atmospheric data requires that the error in the forecast background field (generally supplied by a previous short-term forecast) be modeled. Extensive reviews of methods for modeling the background error covariance are found, for example, in Fisher (2003) and Bannister (2008a,b). At NCEP, the basis for the modulation of the amplitude of the background error variance has for years been the so-called NMC method of Parrish and Derber (1992), which uses the differences of a pair of lagged forecasts. The covariance, being a function of the dynamical field errors at two spatial locations, serves as the kernel of a filtering operator, and it is in this specialized role that it appears in the formulation of the cost-function minimization process at the core of any implementation of the variational principle that defines the intent of the DA process. This filter is generally considered to be spatially smooth, though not necessarily spatially homogeneous, or even horizontally isotropic. Since the meteorological fields, and therefore their increments, are known to exhibit multivariate regularities associated with dynamical balance, we demand that the covariance filter also exhibit, as weak or strong constraints, these same multivariate regularities. Another expected property of the covariance of background error is its eventual decay with increasing spatial separation.
The analogy between a covariance and the solution of a diffusion operator was exploited in Derber and Rosati (1989), and has been developed into more versatile and computationally efficient generalizations by Weaver and Courtier (2001), Mirouze and Weaver (2010), Weaver and Mirouze (2013), and Guillet et al. (2019), where the iterated diffusion with an anisotropic and inhomogeneous diffusivity tensor, being based on a local operator, is particularly well-adapted to ocean model assimilation, where the lateral boundaries of each quasi-horizontal layer can be quite irregular.
In NCEP’s Gridpoint Statistical Interpolation (GSI) the multivariate features mentioned above have been accommodated by projecting the idealized dynamically balanced and unbalanced contributions to the analysis increments into separate scalar fields, which can then be treated as if they are statistically independent. In this way, the linearized aspects of dynamical balance become an implicit characteristic of the analysis. These quasi-independent combinations are then smoothed using a filter of the recursive kind. The recursive filter (Hayden and Purser 1995; Wu et al. 2002; Purser et al. 2003a) is a sequence of one-sided line filters applied across large regular-gridded domains, in which the sequence involves both directions of travel along each line orientation, and along a sufficient set of line orientations to enable the net result of such filtering to fill out the full dimensionality of the given grid.
a. Limitations associated with the recursive filter
The recursive filter method strives to emulate the Gaussian profile in each opposing pair of back-and-forth applications of each basic recursive filtering operator along each line. When this pair, applied along each of the generalized grid lines threading through the lattice in a common parallel direction, is further compounded sequentially with other correspondingly back-and-forth sweeps along transverse families of lines, the resulting response retains the quasi-Gaussian character in the higher dimensionality of the grid. On a serial computer architecture, this gives rise to an exceptionally efficient numerical algorithm for generating quasi-Gaussian smoothing responses (much more efficient than could be done by a direct point-by-point evaluation of the filter response whenever the characteristic scale of this response is significantly larger than the grid scale), and it allows variations in the shape of the local quasi-Gaussian response to occur over the spatial extent of the domain that would not be easily achieved otherwise, for example, by spectral means. Moreover, the quasi-Gaussian shapes need not be restricted to kernels of a locally isotropic shape in the horizontal directions; for the Real-Time Mesoscale Analysis (RTMA; e.g., Carley et al. 2020) it is especially important that the quasi-Gaussian contribution to the covariance operator be allowed to exhibit a significantly anisotropic form, with stretching of the shape along the contours of the terrain, for example. In the general formulation of the recursive filter method, a number (though typically not more than three) of such quasi-Gaussian contributions, computed independently and at different characteristic scales, are combined additively to allow covariances whose spatial profiles are of non-Gaussian, fat-tailed form, which thus comprise a more realistic broad range of spatial scales.
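The back-and-forth construction can be illustrated with a minimal first-order recursive smoother (a sketch only, not the operational higher-order filters of Purser et al. 2003a; the smoothing coefficient alpha and the single-pass structure are illustrative assumptions):

```python
def recursive_smooth_1d(x, alpha):
    """One forward and one backward first-order recursive pass.

    The compounded back-and-forth application yields a symmetric,
    quasi-Gaussian impulse response with infinite support.
    """
    y = list(x)
    # Forward (left-to-right) sweep.
    for i in range(1, len(y)):
        y[i] = alpha * y[i] + (1.0 - alpha) * y[i - 1]
    # Backward (right-to-left) sweep restores symmetry.
    for i in range(len(y) - 2, -1, -1):
        y[i] = alpha * y[i] + (1.0 - alpha) * y[i + 1]
    return y

# Response to a unit impulse in the middle of a line.
n = 41
impulse = [0.0] * n
impulse[n // 2] = 1.0
response = recursive_smooth_1d(impulse, alpha=0.4)
assert response.index(max(response)) == n // 2  # centered, bell-shaped
assert response[5] > 0.0                        # infinite support: nonzero far away
```

The nonzero value far from the impulse makes concrete the "infinite support" property discussed below, which is what complicates domain decomposition.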
The recursive filter algorithms have proven to be a successful approach to addressing the challenging problem of numerically simulating the background error covariance in a computationally efficient manner on machines with small to moderate degrees of parallelism. However, being intrinsically sequential and having infinite impulse-response functions (e.g., Hamming 1989), the recursive procedures are not optimally adapted to the modern, massively parallel architectures of the new generation of computers. Neither is the recursive filter formulation easily able to accommodate the more generic multivariate correlations that must clearly be present between dynamically “balanced” and “unbalanced” components in an evolving adaptive scenario, since these components are presently treated, erroneously, as statistically independent.
b. The beta filter and its advantages
The new massively parallel computer architectures therefore require us to review the basic procedure of covariance generation and find a reformulation that overcomes the inherent limitations of the recursive filters. It is to address these challenges that we have formulated the multigrid beta filter covariance method described in this paper. The “support” of a filter denotes the portion of its range where the response to an impulse is nonzero. Thus, the Gaussian function itself has infinite support—its calculation also involves evaluating exponential functions, so it is doubly handicapped as a practical choice for a covariance component. Recursive filters involve only simple computations, but unfortunately they also have the handicap of infinite support. The filters at the core of our new scheme remain quasi-Gaussian but, instead of being recursive, and therefore burdened with effectively infinite support, the new filters are based on finite-impulse-response “beta distributions” applied as explicit filters. In this respect they are like the explicit compact-support covariance functions proposed by Gaspari and Cohn (1999) who, by modeling families of covariances as piecewise rational analytic functions, were successful in constructing useful covariances, allowing general point-to-point evaluations, exhibiting realistic departures from the thin-tailed Gaussian. In our case, however, we have the multigrid machinery at our disposal to take care of the shaping of the final covariance profiles from quasi-Gaussian ingredients (together with their second derivatives in a slightly more sophisticated “Helmholtz-weighting” extension that we shall briefly touch upon), and we have the advantage of a regular grid to operate on. Therefore, we need only emulate, in the simplest algebraic way, the compact-support gridded approximations to the Gaussian in our basic filters. These filters can consequently be of an algebraically and computationally simpler polynomial form.
A beta distribution, in probability and statistics, is one with support of finite width, which, by a judicious choice of the pair of shape-controlling parameters, can be made a filter of a bell-shaped profile. The “beta” name arises from the fact that, in order to ensure that the integral of a beta distribution over the standardized centered interval of unit width is one, the normalizing constant is exactly the Euler beta function of the pair of filter parameters (Abramowitz and Stegun 1972). In this sense, it might be legitimate to regard the distribution as the “incomplete Euler beta function.” When, as we always assume, the pair of standard parameters of the beta function are equal, then one less than this shared standard parameter will be taken to be our own beta filter parameter p, and the profile is symmetrical within its interval of support. When the parameter p is not too small, the shape resembles a Gaussian (the resemblance improving as the parameter increases). When the parameter is an integer, the functional form is simply a polynomial of degree 2p, which is therefore quite easy and cheap to apply; thus, we always assume the shape parameter to be some positive integer. Since the beta-function line filter mimics the Gaussian, it can be used in place of the recursive filter as a basic component of the same compounded filters to which the recursive filters led. Although, by itself, iterating a beta line filter is computationally more expensive than the recursive filter, its advantage of having a finite impulse response enables the domain to be dissected into overlapping regions to which we can apply efficient parallelization, which becomes the decisive factor in determining the overall speed of the algorithm.
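As a sketch of this idea, the discrete weights of a symmetric beta filter with integer parameter p can be generated and applied as a short explicit convolution (the sampling convention and the zero-padded boundary below are illustrative assumptions, not the paper's exact discretization):

```python
def beta_filter_weights(p, half_width):
    """Discrete weights of a symmetric beta filter with integer shape
    parameter p on a support of 2*half_width + 1 grid points.

    The profile is the even polynomial (1 - u**2)**p, u in (-1, 1),
    which approaches a Gaussian shape as p grows.
    """
    w = []
    for j in range(-half_width, half_width + 1):
        u = j / (half_width + 1)          # keep u strictly inside (-1, 1)
        w.append((1.0 - u * u) ** p)
    s = sum(w)
    return [v / s for v in w]             # normalize to unit sum

def apply_beta_filter(x, w):
    """Explicit (finite-impulse-response) convolution, zero padding."""
    h = len(w) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, wj in enumerate(w):
            k = i + j - h
            if 0 <= k < len(x):
                acc += wj * x[k]
        out.append(acc)
    return out

w = beta_filter_weights(p=4, half_width=8)
assert abs(sum(w) - 1.0) < 1e-12
assert w == w[::-1]                       # symmetric, compact support

x = [0.0] * 21
x[10] = 1.0
y = apply_beta_filter(x, w)
assert abs(sum(y) - 1.0) < 1e-12          # unit-sum kernel conserves the impulse
```

Unlike the recursive case, the response here is identically zero beyond the chosen half-width, which is what permits the overlapping domain decomposition mentioned above.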
The beta distribution with common integer parameter, being symmetric, is clearly an even polynomial evaluated about the center of its support interval, so it can be immediately generalized to a radially symmetric distribution in any number of higher dimensions. Since the parameter is an integer, the response function remains a polynomial, in the more general multivariate sense, in the Cartesian spatial coordinates. A linear transformation of these coordinates, by application of a shape-controlling symmetric “aspect tensor,” will result in another quasi-Gaussian response, but of the anisotropic kind. The computational cost of applying an explicit multidimensional filter whose characteristic scale significantly exceeds the grid scale is spectacularly high, but for a smoothing component that comprises a relatively coarse scale, this cost can be significantly mitigated by engaging the second characteristic feature of the new approach—the multigrid feature.
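A minimal sketch of the anisotropic generalization, assuming the aspect tensor enters through the quadratic form of its inverse (an illustrative convention; the paper's exact normalization may differ):

```python
def aniso_beta_kernel(p, a11, a22, a12, half):
    """2D anisotropic beta kernel: (1 - q)**p, where q = x^T S x is the
    quadratic form of the inverse of a symmetric, positive-definite
    2 x 2 "aspect tensor" A = [[a11, a12], [a12, a22]].

    The support is the ellipse q < 1; inside it the response is a
    polynomial in the grid coordinates (i, j).
    """
    det = a11 * a22 - a12 * a12
    assert det > 0.0 and a11 > 0.0        # A must be positive definite
    s11, s22, s12 = a22 / det, a11 / det, -a12 / det   # S = A^{-1}
    kernel = {}
    for i in range(-half, half + 1):
        for j in range(-half, half + 1):
            q = s11 * i * i + 2.0 * s12 * i * j + s22 * j * j
            if q < 1.0:
                kernel[(i, j)] = (1.0 - q) ** p
    return kernel

# Isotropic case: circular footprint of radius 5 grid points.
iso = aniso_beta_kernel(p=3, a11=25.0, a22=25.0, a12=0.0, half=6)
# Stretched case: elliptical footprint elongated along x (radii 6 and 3).
ani = aniso_beta_kernel(p=3, a11=36.0, a22=9.0, a12=0.0, half=7)
assert (5, 0) in ani and (0, 5) not in ani   # support follows the ellipse
```

A nonzero off-diagonal a12 would rotate the ellipse, e.g., to align the footprint with terrain contours as in the RTMA application.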
c. Multigrid formulation
In a multigrid method (Brandt 1977, 1997; Hackbusch 2013), whether it is to solve an elliptic problem (its traditional and most familiar application) or an integral equation (such as the present filtering problem), the portions of the problem that can be dealt with at a coarse scale are dealt with on a grid of commensurate spacing; portions of the increment solution that require a fine scale are efficiently computed on a grid of commensurate spacing. Additive contributions of the covariances that have small-scale details, and are therefore composed of quasi-Gaussians of small scale, are also efficiently computed because, although they are calculated on a finer grid, the smaller physical scale of the elliptical or ellipsoidal footprints of these quasi-Gaussians still involves about the same number of grid points in each impulse response function. In this way, the combination of a multigrid computational structure with explicit quasi-Gaussian filters is able to overcome the limitations that accompany single-grid implementations of an explicit smoothing filter when the increments possess some characteristic scales of components that are large compared to the grid spacing and some that are comparable with it.
Although this is a topic whose detailed presentation will be reserved for a future publication, we mention that the ability of a finite-support filter to allow domain decomposition also makes it computationally easier to introduce nontrivial multivariate couplings between pairs of the analysis variables. In the simplest instances, this can be done by generalizing the multigrid “scale-weights” at each grid generation. Instead of restricting these weight fields to being scalars operating separately on each of the analysis variables, they can be made matrix weight fields that couple the different variables. As we indicate briefly in section 3a, we can formally generalize these weights further to be differential operators, at the highest generation at least, which provides a practical way of producing covariances possessing negative side lobes, if these are desired.
d. Outline of the paper
Section 2 presents the beta filter in more detail. Our approach to the multigrid (MG hereafter) paradigm for application of the beta filter is discussed in section 3, where we also show a few preliminary examples of performance derived in standalone test cases. An efficient organization of the calculation, which enables generalization and practical application of the MG filtering, is described in section 4. Several examples of the method’s performance, derived in a standalone version where the filter is tested outside of the data assimilation framework, and a preliminary implementation in the GSI, are shown in section 5. The paper concludes with a discussion, summary, and brief outline of plans in section 6. More technical details describing options for further speedup, and a strategy for application of the MG beta filter on the cubed sphere, are reserved for appendices.
2. The beta filter
a. Background error covariance
The background error covariance (B) is an operator whose inverse defines the effective weight attributed to the background field in the variational DA, just as the inverse of the observation error covariance defines the effective weights applied to the observation values. It is impractical to hold an explicit representation of B when this covariance is spatially inhomogeneous, and equally impractical to attempt to apply direct matrix methods in the solution of a variational problem. Instead, we break down the approximate representation of the matrix into small manageable filters and apply iterative methods, such as a conjugate gradient method (e.g., Gill et al. 1981), which approach the desired solution in about 100–150 iterative steps. Each step requires only work whose dominant part is equivalent to multiplying vectors by B, or by the simpler factors into which it can be broken. This process does require some conditions to be upheld on the structure of the approximation for the operator, such as strict self-adjointness (essentially equivalent to matrix symmetry) and nonnegativity (in terms of the signs of the eigenvalues of the approximation to B).
In incremental form, the variational analysis minimizes the cost function

J(δx) = (1/2) δx^{T} B^{−1} δx + (1/2) (Hδx − d)^{T} R^{−1} (Hδx − d),

where
δx = x^{a} − x^{b} is the analysis increment,
d = y − H(x^{b}) is the innovation,
y is the observation vector,
B is the background error covariance matrix,
R is the observation error covariance matrix,
H is the linearized observation (forward) operator.
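A toy numerical check of the standard incremental cost function built from these ingredients, using diagonal B and R and list-based linear algebra (the sizes and values are purely illustrative):

```python
def quadratic_form_diag(v, diag):
    """v^T D^{-1} v for a diagonal covariance with entries 'diag'."""
    return sum(vi * vi / di for vi, di in zip(v, diag))

def cost_function(dx, d, H, B_diag, R_diag):
    """Incremental cost: J = 1/2 dx^T B^{-1} dx
    + 1/2 (H dx - d)^T R^{-1} (H dx - d), with diagonal B and R."""
    Hdx = [sum(hij * xj for hij, xj in zip(row, dx)) for row in H]
    r = [hv - dv for hv, dv in zip(Hdx, d)]
    return (0.5 * quadratic_form_diag(dx, B_diag)
            + 0.5 * quadratic_form_diag(r, R_diag))

# Two grid points, one observation located halfway between them.
H = [[0.5, 0.5]]
d = [1.0]                  # innovation y - H(x_b)
B_diag = [1.0, 1.0]
R_diag = [0.5]
# With a zero increment, all of the cost comes from the observation term.
assert cost_function([0.0, 0.0], d, H, B_diag, R_diag) == 1.0
```

The minimization then trades the background term against the observation term; an increment that fits the observation exactly, dx = [1, 1], shifts all of the cost into the background term instead.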
b. Recursive filters
Recursive filters (e.g., Wu et al. 2002; Purser et al. 2003a,b), which have for years been used in the GSI, represent an efficient and exceptionally good approximation to the Gaussian (see Fig. 1), allowing inhomogeneity and anisotropy of the analyzed fields to be accounted for.
On the downside, recursive filters are inherently sequential operators, making them exceedingly difficult to parallelize successfully, and they have infinite support. In addition, without the benefits of a multigrid structure, the filters, as presently formulated:

cannot describe covariances across various scales,

nor can they accommodate cross correlations,

nor do they provide negative lobes, which realistic covariances in some situations may possess.
c. Beta filters
d. Direct and adjoint beta filtering of scalar fields
3. Multigrid approach to the application of the beta filter
Examples of the application of the multigrid method in DA are found in Kang et al. (2014), as a method for minimization of the cost function; Li et al. (2008), for assimilation of ocean temperature; Li et al. (2010), for two-dimensional Doppler radar radial velocity assimilation; and Zhang and Tian (2018) and Zhang et al. (2020), for a nonlinear least squares four-dimensional variational data assimilation. In our case, the beta filter is used at a hierarchy of different scales, combined into a parallel multigrid scheme to achieve a larger coverage and potentially a more versatile synthesis of anisotropic covariances, allowing greater control over the shape. In this section, we first describe the core of our multigrid approach, and then discuss a technique that we developed to ensure the general applicability of the code. By that, we refer to the capability of the code to perform optimally under various grid decompositions across a multiprocessing computing platform.
a. Selfadjoint operators
Recall that B needs to be a self-adjoint operator which converts an impulse into a weighted superposition of quasi-Gaussian shapes, or of derivatives of these shapes. The construction of B achieving nonnegativity and self-adjointness in a factorization, B = B^{†} = CC^{†}, does not require the operator C, as a matrix, to be “square.” The weighting operator in the superposition may be a positive scalar function of position, sandwiched between the conserving and smoothing beta filter stages. This construction obviously remains compatible with the desired B = CC^{†} factoring.
The advantage of introducing a Helmholtz weighting of this form is that it opens up a greater range of covariance shapes that can be formed from positive superpositions of such contributions at each generation of the multigrid combination. For example, the derivative terms, sufficiently weighted, give rise to “side lobes” of negative values in the response to a positive impulse input, which cannot possibly be obtained when the individual multigrid contributions to this response are restricted to being purely the self-convolved beta functions at each scale (since these are positive-valued functions within their support).
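The essential point of the B = CC^{†} construction is easy to verify numerically: for any rectangular C, the product is symmetric and nonnegative. In the sketch below, a small random matrix stands in for the composed interpolation, filtering, and weighting operators (an illustrative stand-in only):

```python
import random

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

random.seed(1)
# A rectangular C (here 5 x 3), standing in for the composed
# interpolation / filtering / weighting operators: C need not be square.
C = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5)]
B = matmul(C, transpose(C))            # B = C C^T

# Self-adjointness (symmetry) holds by construction.
for i in range(5):
    for j in range(5):
        assert abs(B[i][j] - B[j][i]) < 1e-12

# Nonnegativity: v^T B v = |C^T v|^2 >= 0 for any vector v.
v = [random.uniform(-1.0, 1.0) for _ in range(5)]
Ctv = [sum(C[i][j] * v[i] for i in range(5)) for j in range(3)]
vBv = sum(v[i] * sum(B[i][j] * v[j] for j in range(5)) for i in range(5))
assert vBv >= 0.0
assert abs(vBv - sum(x * x for x in Ctv)) < 1e-9
```

Because the identity v^{T}Bv = |C^{†}v|² holds for every v, nonnegativity survives any choice of the sandwiched positive weighting, which is the property the text relies on.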
A schematic representation of that procedure is shown in Fig. 3.
b. Grids and decompositions
It may be useful at this point to clarify some of the conventions used throughout the paper:

Grid: a distribution of grid points over a computational domain

Analysis grid: the grid containing the control variables

Filter grids: the grids handling the filter variables
Filter grids are defined at several different resolutions, referred to as “generations,” and denoted here as g_{1}, g_{2}, …, where generation 1, that is, g_{1}, has the highest resolution; generation 2, or g_{2}, has half the resolution of g_{1} in each direction, etc. In the simplest case, all grids of this multigrid structure cover the same computational domain. In the generalized version of the algorithm, higher generations of the filter grid may cover somewhat larger, fictive domains, created by padding with zeros, for reasons that will become clear later on. Generally, the filter grid at generation 1 has a resolution equal to, or lower than, that of the analysis grid, conveniently denoted by g_{0}. In addition, we will use the term “decomposition” or “distribution” to describe the arrangement of grid points over processing elements (PEs) of a parallel-computing platform.
Generally, the computational domain is decomposed in such a way that all PEs at all generations have the same number of grid points. Mapping from the analysis to the filter grid is done using adjoint interpolations, and mapping back following filtering is done using direct interpolations. In our case, we use linearly weighted quadratic interpolation so as to preserve continuity of derivatives. The whole cycle of filtering in a variational optimization procedure is enclosed within each inner iteration of the minimization procedure, as shown in Fig. 3. It starts with mapping from the analysis grid to the first generation of the filter grid, filtering, and mapping back to the analysis grid. Filtering itself consists of two successive steps: adjoint (the “conserving” stage) and direct (the “smoothing” stage). The “conserving” and “smoothing” terminology comes from the attributes of the normalized (unweighted) inhomogeneous filters. A conserving filter, like an “adjoint interpolator,” has the property of ensuring that the integrated substance after the filter’s application is identical to the integrated substance of the input distribution upon which it acts, even as the aspect tensor varies. This “conserving” attribute may be thought of as the consequence of the filter, or adjoint interpolation, being the exact adjoint of a filter, or interpolator, whose action upon an input that happens to be perfectly uniform leads to an identical uniform value in the output. Of course, it would be desirable in many applications if a single filter could be fashioned to always possess both the exactly “conserving” and “smoothing” attributes simultaneously (e.g., with recursive filters, this is possible even when the response is required to be both anisotropic and spatially inhomogeneous), but for our purposes, this dual attribute is not a necessary requirement and, in the case of our beta filters, is not something that is easily achieved either.
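Both attributes can be checked directly on a toy operator. Below, a 1D linear interpolation from a three-point coarse grid to a five-point fine grid stands in for the paper's quadratic interpolation (an illustrative simplification): the direct operator maps a uniform field to the identical uniform field ("smoothing"), while its adjoint conserves the integrated substance ("conserving"):

```python
# Linear interpolation from a 3-point coarse grid to a 5-point fine grid.
# Every row sums to one, which is the "smoothing" property: a uniform
# input yields the same uniform output.
I = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]

def apply(A, v):
    """Direct operator: v on the coarse grid -> fine grid."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in A]

def apply_adjoint(A, v):
    """Adjoint operator: v on the fine grid -> coarse grid."""
    return [sum(A[i][j] * v[i] for i in range(len(A)))
            for j in range(len(A[0]))]

# Smoothing attribute: uniform in, identical uniform out.
assert apply(I, [2.0, 2.0, 2.0]) == [2.0] * 5

# Conserving attribute: the adjoint preserves the integrated substance.
fine = [0.3, 1.1, 0.0, 0.7, 0.2]
coarse = apply_adjoint(I, fine)
assert abs(sum(coarse) - sum(fine)) < 1e-12
```

The two checks are really one fact seen from both sides: the column sums of the adjoint equal the row sums of the direct operator, so "rows sum to one" for the interpolator is exactly "conserves the total" for its adjoint.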
In the context of building a practical covariance operator to characterize the spatial distribution of errors of a suitably chosen variable in an atmospheric forecast background field, apart from the smooth positive weighting that always modulates the individual quasi-Gaussian contributions in order to match the intended background variance (and, to some degree, the intended covariance “shape”), it is the attribute of “smoothness” that we most want to preserve in the final output. For this reason, we invariably perform the self-adjoint filtering combinations in such a way that the “conserving” adjoints of the “smoothing” filters and interpolators are applied first in each factored self-adjoint combination, and the corresponding filters and direct interpolators themselves are applied last. These guaranteed self-adjoint contributions are then further combined, but now additively, or “in superposition,” to synthesize the final covariance operators for each control variable of the assimilation, thus guaranteeing smoothness, self-adjointness, and nonnegativity.
It is acceptable for the B operator to possess a null space as long as the null degrees of freedom do not correspond to scales or structures where a nontrivial analysis increment is required. As a spectral analysis of the continuum beta filters reveals (Purser 2020a), there will be wavenumbers where the spectral transform changes sign or momentarily vanishes, implying a null space for the corresponding self-convolved filter. But such sets of wavenumbers are of zero measure and are unlikely to play a role in the discrete filters, especially once a few such self-convolved filters, at different characteristic scales, have been positively superposed in the multigrid scheme. Where there almost certainly is a substantial null space is at those scales unresolved by the generation-1 filter grid of the multigrid’s structure, but resolved by the analysis grid itself, since these barely resolved components of the analysis grid will be invisible to (i.e., in the null space of) the adjoint interpolation operator that goes from the analysis grid to the first generation of the filter grids.
In its simplest form, our parallel multigrid scheme assumes that all generations cover the same area and that each consecutive grid generation halves the resolution (in each direction) of the previous, lower generation. The computational domain is decomposed in such a way that all PEs at all generations have the same number of grid points.
A typical scheme fitting this situation is shown in Fig. 4, where successive grid generations, g_{1}, g_{2}, g_{3}, and g_{4}, cover the same domain with successively 64, 16, 4, and 1 PEs, respectively, each with the same number of grid points.
c. Stages of multigrid procedure
Our multigrid algorithm, as schematically described in the boxes in Fig. 3, consists of the following steps:

Adjoint-interpolate (or “upsend”) the filter fields from g_{1} to g_{2}, then from g_{2} to g_{3}, and so on, all the way to g_{n} (I^{†})

Apply the adjoint beta filter at all generations (F^{†})

Apply weighting at all generations (W = GG^{†})

Apply the direct beta filter at all generations (F)

Interpolate (or “downsend”) the result of the beta filtering from g_{n} to g_{n−1}, and add it to the existing filtered field at that lower generation. Then, repeat the procedure all the way to g_{1} (I)
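The five steps compose, per generation, into contributions of the form I F W F^{†} I^{†}. A two-generation 1D sketch, with simple linear interpolation for the upsend/downsend and a tiny symmetric filter standing in for the beta filter (all operators here are illustrative simplifications):

```python
def smooth(x, w):
    """Symmetric explicit filter with odd-length weights w, zero padding."""
    h = len(w) // 2
    return [sum(w[j] * x[i + j - h]
                for j in range(len(w)) if 0 <= i + j - h < len(x))
            for i in range(len(x))]

def upsend(x):
    """Adjoint of linear interpolation: halves resolution, conserves sum."""
    c = [0.0] * ((len(x) + 1) // 2)
    for i, v in enumerate(x):
        if i % 2 == 0:
            c[i // 2] += v
        else:
            c[i // 2] += 0.5 * v
            c[i // 2 + 1] += 0.5 * v
    return c

def downsend(c, n):
    """Direct linear interpolation back to an n-point fine grid."""
    return [c[i // 2] if i % 2 == 0 else 0.5 * (c[i // 2] + c[i // 2 + 1])
            for i in range(n)]

w = [0.25, 0.5, 0.25]     # tiny symmetric quasi-Gaussian stand-in filter
W1, W2 = 1.0, 4.0         # scale weights for generations 1 and 2

def B_apply(x):
    """Two-generation version of the five steps: the fine-grid term
    F W1 F^T x plus the upsent, filtered, weighted, filtered, and
    downsent coarse-grid term (F is symmetric here, so F^T = F)."""
    fine = [W1 * v for v in smooth(smooth(x, w), w)]
    c = upsend(x)                       # step 1: upsend (I^T)
    c = smooth(c, w)                    # step 2: adjoint filter (F^T)
    c = [W2 * v for v in c]             # step 3: weighting (W)
    c = smooth(c, w)                    # step 4: direct filter (F)
    coarse = downsend(c, len(x))        # step 5: downsend (I) and add
    return [a + b for a, b in zip(fine, coarse)]

e = [0.0] * 17
e[8] = 1.0
resp = B_apply(e)
assert resp.index(max(resp)) == 8      # response peaks at the impulse
assert resp[4] > 0.0                   # coarse generation broadens the support
```

The fine-grid term alone would vanish four points from the impulse; the larger weight on the coarse generation supplies the broad, fat-tailed part of the response, which is exactly the division of labor the multigrid structure is designed to exploit.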
Steps 1 and 5, upsending and downsending, are sequential, while step 2 (adjoint beta filtering), step 3 (weighting, possibly of the Helmholtz style), and step 4 (direct beta filtering) are parallel. Technically, in the upsending stage we halve the resolution and correspondingly reduce the number of processing elements, while in the downsending stage we double the resolution and increase the number of processing elements.
To give some insight into the effect that these operators have, Fig. 5 shows the response function of a three-generation MG scheme to a unit impulse, in the 1D case. Figure 5a shows the effect at the fine grid of operating with the combination only of adjoint and direct interpolation operators from an input pulse at the fine grid that does not coincide with the points of the coarser grids. Figure 5b shows, in the red curve, the application of a broad beta filter combination, FF^{†}, on the fine grid, together with the equivalent amount of smoothing of the adjoint-interpolated initial impulse done at the coarse grid and interpolated back down again to the fine grid. Figure 5c shows the same comparison except with the inclusion of a Helmholtz operator. The effect of the adjoint and direct interpolation operators on these smooth and broad functions is minimal. The ability of the Helmholtz operators to form side lobes is apparent in Fig. 5c.
Since we always deal with several 2D and 3D variables, it is useful to create a composite variable for filtering, which simplifies the coding and speeds up the execution by avoiding short messages and redundant repetitions.
4. Organization of computation
The basic MG structure, as described in the previous section, assumes that all grid generations reserve separate processors, which guarantees a parallel execution of the beta filter. However, though appealing, a direct implementation of such a paradigm is impractical, and is especially hard to apply across a large range of processing elements. Thus, we had to investigate carefully how best to apply the MG beta filter while efficiently utilizing the available processing resources. To illustrate the nature of the problem, we will first briefly describe the various solutions, preliminarily outlined in Rančić et al. (2020), that we considered along the way, before showing the one that best satisfied the necessary requirements.
Let us first denote by N and M the numbers of processors that hold the control fields in the x and y directions, respectively. Let us also assume that each of the analysis fields is evenly spread among the processors, which can always be accomplished by padding toward the ends of the domain using one or another method of extrapolation. Alternatively, one can slightly modify the size of the input background fields so that they can be uniformly spread across the given processors. Thus, we assume that after such regularization, the analysis variables (g_{0}) will be uniformly spread across p_{0} = N × M processors.
a. Option 1
Figure 6 shows one solution for the organization of the calculation. For simplicity, we assume that the analysis grid (g_{0}) is decomposed over 10 × 10 PEs. The 64 processors handling grid g_{1} are arranged from the lower-left corner of the arrangement for g_{0} in a topological order that corresponds to the geometry of the domain. We developed a relatively efficient code for remapping and redecomposition of g_{0} to g_{1} and back. The higher generations are then fit among the free PEs of g_{0}. The problem with this approach is that too much of the computation time is spent on data motion in the redecomposition between grids g_{0} and g_{1}.
b. Option 2
Since the RTMA mainly runs on rectangular domains, with typically one dimension significantly exceeding the other, we next tried another solution, shown schematically in Fig. 7. This solution assumes that g_{0} and g_{1} use the same number of processors in one direction (typically the y direction), which eliminates the need for redecomposition in that direction, thereby reducing the computation time. If we want to use more processors for g_{0}, we can do so by adding them to the right side of the presented arrangement.
c. Option 3
d. Option 4—Generalized solution
None of these solutions turned out to be fully satisfactory. Essentially, the main problem with all of them is the lack of flexibility needed to enable an efficient generalization of the code for operation on a different number of processors. A truly effective code must not be hardwired but must be able to adjust effectively to different numbers of processors and various resolutions. The final option, which we show next, satisfies all these important criteria. However, to that end, we had to give up full parallelism of the multigrid generations, and instead require that the higher generations be calculated in parallel among themselves, but sequentially with generation 1.
We explain this solution, which will be referred to as “generalized,” with the aid of stencils in Fig. 9. We assume that the analysis grid (g_{0}) operates on 11 × 8 PEs (as in option 2). However, the generalized method will work with any rectangular arrangement of processors at generation 0, with the only condition that it is possible to uniformly distribute the grid boxes among them. The essence of the method is that generation g_{1} of the filter grid is distributed across all given processors, just as the analysis grid (g_{0}) is. Thus, mapping between them (the first and the last tasks in Fig. 3) is done using only adjoint and direct interpolations within the same processors, no longer requiring redecomposition and motion of data, which eliminates the main slowdown of all the previously shown options. Additionally, this allows the calculation of generation g_{1} of the filter grid to be spread among all processing elements, not just a portion of them. In the solutions discussed so far, we group four processors from the lower generation to access the next higher generation. To this end, the number of processors of generation 1 had to be divisible by 4. Within the generalized method, we again form higher generations by grouping the content of 2 × 2 processors through adjoint interpolations. However, if the number of processors in one or both directions is odd, we simply add an extra virtual processor in that direction by supplying its content with zeros, as shown in Fig. 9.
Note that the effective computational domain size stays the same, and that there is no computation in the portion of processors that cover the fictive (or “ghost”) space padded with zeros. All higher filter generations are executed concurrently with each other, but sequentially with generation 1. Thus, we can physically place the content of these processors in the processors of generation 1, as is done in Fig. 10, which supplements Fig. 9.
In this paradigm, all generations have the same data load, and g_{0} is executed sequentially with the higher generations. However, since the generation-1 workload is spread among all available processors, and since there is no need for redecomposition in mapping between g_{0} and g_{1}, based on timings obtained in preliminary testing we estimate that this method, in addition to providing a full generalization of MG beta filtering, actually performs faster than any of the previous versions. The coding of this approach can be organized in such a way that the identification of the higher generations, and of the processors in charge of them, is done automatically. Practically, once the requirement that the data of g_{0} be evenly spread across the given processors is met, and the resolution of the filter grid is set, we can request any rectangular arrangement of processors, and the code will automatically handle the higher generations. The only other requirement is that the resolution of the filter grid within a processor (defined in this case as the number of grid spaces in each direction) be divisible by 2^{n−1}, where n is the number of generations. This option guarantees that we can use mirror boundary conditions for the increments at the same physical location for all generations, which is a simple way of defining them in a manner that guarantees self-adjointness.
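The bookkeeping just described can be sketched as follows (the function names and the ceiling-division formulation are illustrative; the 2 × 2 grouping, virtual-processor padding, and the 2^{n−1} divisibility requirement follow the text):

```python
def generation_layouts(n_pe_x, n_pe_y, n_gen):
    """PE-grid sizes for each filter generation: each step groups 2 x 2
    PEs, padding an odd count with one virtual (zero-filled) PE."""
    layouts = [(n_pe_x, n_pe_y)]
    px, py = n_pe_x, n_pe_y
    for _ in range(1, n_gen):
        px = (px + 1) // 2             # ceiling division = virtual padding
        py = (py + 1) // 2
        layouts.append((px, py))
    return layouts

def check_per_pe_size(nx, ny, n_gen):
    """The per-PE filter-grid size (grid spaces in each direction) must
    be divisible by 2**(n_gen - 1), so that every generation's cells
    align with the same physical boundaries."""
    d = 2 ** (n_gen - 1)
    return nx % d == 0 and ny % d == 0

# The 11 x 8 PE arrangement of the generalized option, four generations:
assert generation_layouts(11, 8, 4) == [(11, 8), (6, 4), (3, 2), (2, 1)]
assert check_per_pe_size(16, 16, 4)        # 16 is divisible by 2**3 = 8
assert not check_per_pe_size(12, 16, 4)    # 12 is not divisible by 8
```

Note how the odd x count of 11 is padded to 12 before halving (giving 6), so no arrangement is rejected; only the per-PE grid-size divisibility is a hard requirement.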
Assuming that the data of both the analysis grid and the first generation of the filter grid can be evenly distributed among the given N × M processors, these two numbers become the only input to the code. In addition to increasing the versatility of the method, option 4 showed the best computational efficiency, as will be demonstrated in section 5, and it therefore became our primary solution for organizing the computation of the beta filter for modeling of covariances. Various opportunities for further speedup of the code are discussed in appendix A.
5. Test results
In this section we present results of several preliminary tests run during the development of this method. All tests were run on the Weather and Climate Operational Supercomputing System (WCOSS).
a. Case 1
In this set of tests, we were concerned not with the decomposition but with the various effects that the parameters of the aspect tensor and the scale-weight operators may have on the shape of the covariances. The results of the idealized homogeneous 2D MG beta filter acting on a delta-function impulse, summarized in Fig. 11, are derived using the simplest multigrid scheme shown earlier in Fig. 4. The parameters A_{0}, ϕ, and θ come from the definition of the aspect tensor in (4), and the coefficients a and b, defining the isotropic Helmholtz style of the scale-weight operators, are introduced in (11). Note that only the last of the illustrated examples uses a nontrivial Helmholtz operator (and then only at the highest generation). Large weights at the higher generations spread the covariance resulting from filtering of an initial delta-function impulse and at the same time increase its amplitude. The definition of the aspect tensor can produce various rotations of a covariance subjected to stretching. The areal parameter A_{0} spreads or shrinks the effective area of the covariance, but larger values require larger halos, increasing the cost of filtering.
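As a toy illustration of the compact-support, quasi-Gaussian character of the underlying beta kernel (Purser 2020a), the following 1D sketch convolves a delta-function impulse with a kernel of the form b(x) ∝ (1 − (x/R)²)^p. The aspect tensor, scale weights, and multigrid generations of the actual filter are deliberately omitted, and the function name is ours:

```python
import numpy as np

def beta_kernel(halfwidth, p=2.0):
    """Toy 1D compact-support beta kernel, proportional to
    (1 - (x/R)**2)**p for |x| < R and zero outside, normalized to
    unit sum (a sketch of the quasi-Gaussian shape only, not the
    operational multigenerational filter)."""
    x = np.arange(-halfwidth, halfwidth + 1, dtype=float)
    w = np.maximum(1.0 - (x / halfwidth) ** 2, 0.0) ** p
    return w / w.sum()

# The response to a delta-function impulse is the kernel itself,
# and it vanishes identically beyond the halfwidth:
delta = np.zeros(33)
delta[16] = 1.0
response = np.convolve(delta, beta_kernel(8, p=2.0), mode="same")
```

Unlike a Gaussian or a recursive-filter response, this kernel is exactly zero outside its halfwidth, which is what limits the halo a processor must exchange with its neighbors.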
b. Case 2
In the next set of tests, we investigate the 3D problem in a setting that closely fits the realistic conditions in the GSI. We ran a series of tests of a standalone version of the multigrid beta filter with an analysis grid resolution corresponding to the RTMA domain (1804 × 1072 grid intervals) and a filter grid at a resolution of 1760 × 960 grid intervals, both with 50 vertical levels. In terms of the decomposition of the aspect tensor given by (4), the characteristic areal parameter A_{0} was 1, which required eight points of halo in all directions. The anisotropy parameters θ and ϕ of the aspect tensor were set to zero. A filtering option applying the 2D version of the radial filter in the horizontal and the 1D version in the vertical direction was used. The code used six 3D variables and four 2D variables, one of which did not share common boundaries with its neighbors, mimicking the situation that exists in the GSI. The six 3D variables used for filtering in the GSI are streamfunction, velocity potential, temperature, specific humidity, ozone, and cloud condensate. The four 2D variables are surface pressure, sea surface temperature, land surface temperature, and ice surface temperature. At the end of each inner iteration the GSI combines the last three of these variables, with the help of land/sea/ice masks, to obtain the skin temperature.
Figure 12 shows timings from a test run on 11 × 8 PEs in which the generalized option 4 is compared against an identical test run with option 2, the most promising of the other options considered. (Recall that the other two options were presented only to better explain the problem of decomposition that we encountered while developing this method.) In option 2, all filter generations run in parallel, but this requires a redecomposition of data in the mapping between the analysis grid and the first generation of the filter grid and back, and the first generation of the filter grid runs on fewer processors. The motivation for introducing option 4 was to generalize the MG beta filter. However, it turned out that this approach also increases the efficiency of the calculation, in this case by about 10%. Note that option 2 would be very difficult to apply across various arrangements of processors because the code describing the multigrid structure would have to be tailored specifically to each new decomposition. In contrast, the generalized version can run on various arrangements of processors without any need for additional intervention.
Figure 13 shows how option 4 scales with the number of processors in a test of the generalized standalone version, using 2430 × 1080 grid intervals on the analysis grid and 2160 × 900 on the filter grid, with up to 540 PEs running on WCOSS.
c. Case 3
The MG beta filter was implemented in the GSI over the RTMA domain, and a series of preliminary integrations were run for a single-observation test case, using only the generalized option 4. At this stage, we focused on the potential gains in computational efficiency relative to parallel tests run with the recursive filter; other potential benefits, expected to come from the better and more versatile description of covariances across spatial scales offered by the MG beta filter, will be analyzed at a later stage.
In this exercise we used a slightly reduced version of the RTMA domain, with a resolution of 1792 × 1056 grid intervals, and prescribed various uniform decompositions across processors, as summarized in Table 1. The beta filter parameter p was set to 2, and the aspect tensor was adjusted so that the resulting covariance had a shape similar to that in the test case with the recursive filter. In this case, the size of the halo, found experimentally, was eight grid intervals. Note that we used a slightly lower resolution for the filter grid, making sure that in each direction the number of intervals is divisible by 8, which enables the running of four multigrid generations in all tests.
Table 1. Decompositions (N × M) and resolutions of the analysis (n × m) and filter (i × j) grids in each processor for the different decompositions used in tests of the GSI.
The derived results are presented in Table 2 and Fig. 14, which show, respectively, the times spent on filtering per iteration and a plot of the inverses of those times, as derived in the described tests for the recursive filter (RF) and the MG beta filter. As Fig. 14 shows, the MG beta filter scales much better, continuing to increase in efficiency as the number of processors increases. In contrast, the RF quickly reaches a saturation point, and a further increase in the number of processors does not significantly improve its performance. Thus, with an increasing number of processors, the MG beta filter becomes more and more efficient relative to the RF (see the last row in Table 2). With the chosen parameters, the MG beta filter in these tests is as much as 7 times more efficient when running with 1056 PEs.
Table 2. Times per single iteration (s) derived in a single-observation test of the GSI using recursive filtering (RF) and the multigrid beta filter (MF) running with different numbers of processors (PEs).
6. Conclusions
This paper summarizes work on the development of a new method for modeling the background error covariance for application in the 3D RTMA system, as a replacement for the recursive filter. The key prerequisite for the success of this enterprise, which consists of a 3D analysis at a horizontal resolution of 2.5 km at frequent 15-min intervals, is a vastly improved efficiency. The new approach to modeling the background error covariance discussed here is one of the key components of that effort.
The new approach uses a beta function for the construction of the filter; because the beta function has compact support, the filter is much better adapted to parallelization than the recursive filters. The effect of various scales is taken into account through a multigrid procedure, which is generalized so that it can run on an arbitrary number of processing elements, assuming only that the number of grid points of the filter grid in each direction can be presented as a product of two integers. A more detailed analysis of the effects of the various parameters of the MG beta filter on the covariances, separate from the computational implementation, would require much more space; we hope to provide such an analysis in the future, perhaps in the context of the response to a single-observation forcing within the GSI.
The described MG beta filter comes in several flavors, some of them still in the final stage of development, which we summarize below.
The radial beta filter itself has 1D, 2D and 3D versions. Thus, we can apply filtering in:

Horizontal directions only

All three directions, using a sequence of 2D filtering in horizontal and 1D in vertical for 3D variables. (Timings shown in Figs. 12 and 13 are derived with this combination.)

All three directions, using a 3D version of the radial beta filter
Potentially highly efficient “line” versions of the suite of beta filters, the “Triad” (three sequentially applied components in 2D) and the “Hexad” (six sequentially applied components in 3D), are also being developed and may replace the present “radial” versions of our beta filter. These advantageous alternatives are achieved by exploiting the symmetries of a regular grid using the powerful and elegant methods of group theory (Purser 2020b,c), and will be described in a future article. All described types of the radial beta filter (two- and three-dimensional, isotropic and anisotropic) are also available with the line filter. A more detailed comparison of the effects of these various filtering flavors is likewise reserved for future investigations.
We foresee applying the method only on structured grids, a category that includes cubic and icosahedral global grids, since these grids are well adapted to multigrid treatments. Although the radial beta filter could, with some effort, be made to apply to an unstructured grid, it is not clear how the multigrid architecture would generalize in the unstructured case.
The primary objective here was to improve efficiency through better utilization of multiprocessing computing resources. Yet the developed method has the potential to improve the performance of the analysis more generally. For example, a consistent extension of this approach led us to a fully 4D formulation (the so-called “Decad” algorithm; Purser 2020b), giving us a tool that would enable a future extension of the RTMA procedure into a fully 4D scheme.
The scalar versions of the scale weights W for each generation can be generalized to supply the scheme with cross covariances among the different analysis variables. This can be accomplished by replacing, at each generation, the scalar scale weight, W = G^{2}, separately defined for each analysis variable, by positive-definite symmetric matrices, allowing the factoring W = GG^{†}, that simultaneously couple the in situ vector of the different analysis variables together, assuming that all analysis variables are operated on in parallel. Furthermore, the more general factored Helmholtz weights themselves can also be generalized into corresponding factored matrices, G and their formal adjoints G^{†}, containing both scalar and differential-operator elements that cross-couple the different analysis variables. In principle, this should enable the coupled increments to be correlated and to exhibit directional phase shifts between pairs of coupled increments, as is generally expected for realistic error increments. The detailed elaboration of these generalizations of our approach is, however, left for future studies. Among other things, we also plan in the future to apply machine learning (ML) machinery for the estimation of cross covariances, using ensembles for training. This, we hope, will allow us to step toward a new paradigm of running a pure variational DA instead of a hybrid, with potentially huge savings in computational time at some future stage. We have also started development of a cubed-sphere version of the MG beta filter, targeting application within the Joint Effort for Data assimilation Integration (JEDI) (https://jointcenterforsatellitedataassimilation-jedi-docs.readthedocs-hosted.com/en/latest/index.html), which will allow us to add our approach to the “BUMP” covariance initiative (https://github.com/benjaminmenetrier/bump). More details of this development are supplied in appendix B.
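A minimal numerical sketch of the matrix factoring W = GG^{†} (the 3 × 3 matrix below is an invented example, not a tuned cross covariance; for a real symmetric matrix the adjoint G^{†} is simply the transpose):

```python
import numpy as np

# Invented positive-definite symmetric scale-weight matrix coupling
# three analysis variables at one generation (illustrative values only).
W = np.array([[1.0, 0.4, 0.1],
              [0.4, 1.0, 0.3],
              [0.1, 0.3, 1.0]])

# One admissible factoring W = G G^T is the Cholesky factorization.
G = np.linalg.cholesky(W)
assert np.allclose(G @ G.T, W)

# Applying G to an impulse in the first variable spreads the increment
# into the other two, i.e., the variables are cross-coupled:
coupled = G @ np.array([1.0, 0.0, 0.0])  # -> [1.0, 0.4, 0.1]
```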
The implementation of the MG beta filter in the GSI (as a backup version for running the 3D RTMA) is in its final stage. Since the primary goal of the 3D RTMA is application within JEDI, a preliminary implementation of the MG beta filter in JEDI is also under way. An analysis of the effects of the new beta filter on the analysis will be reported elsewhere.
Acknowledgments.
This material is based upon work initially supported by the EPIC Program within the NOAA/OAR Office of Weather and Air Quality under the title “Development of a multigrid background error covariance model for high resolution data assimilation,” and presently by the Unified Forecast System Research to Operations (UFS R2O) Project, which is jointly funded by NOAA’s Office of Science and Technology Integration (OSTI) of the National Weather Service (NWS) and the Weather Program Office (WPO) [Joint Technology Transfer Initiative (JTTI)] of the Office of Oceanic and Atmospheric Research (OAR). We are grateful to Dr. Jacob Carley for his unwavering encouragement and support of this project and to Drs. Ting Lei and Kristen Bathmann for their helpful suggestions following internal reviews of the manuscript. We also would like to recognize the anonymous reviewers for their dedicated efforts, which helped us improve the paper.
APPENDIX A
Further Speedup of Generalized Version of MG Filter
Further potential speedup of the generalized version of the MG beta filter may come from vertically splitting the 3D arrays of the higher generations and sharing their load among processors. Moreover, we can combine this method with a judiciously chosen lower vertical resolution of the higher generations, which gives us a chance to devise a scheme that better utilizes the processors and further speeds up the execution.
In the considered example with 11 × 8 PEs, generation g_{1}, which occupies all 88 PEs, has 50 vertical levels. Let us assume that generation g_{2}, which in Fig. 10 occupies 24 PEs, is given 45 vertical levels instead of 50 and divided into three layers, each with 15 levels. Generation g_{2} can now be evenly distributed among 72 PEs. Similarly, let us assume that generation g_{3}, which occupies 6 PEs, and generation g_{4}, which occupies 2 PEs, are given 30 vertical levels, and that each is divided into two layers of 15 levels. This budgeting is shown in Table A1.
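The arithmetic of this budgeting can be written out explicitly (the dictionary keys are just labels for the generations):

```python
# Sketch of the budgeting of Table A1: each higher generation's
# horizontal block count is multiplied by the number of 15-level
# vertical layers into which its (reduced) column is split.
base_pes = {"g2": 24, "g3": 6, "g4": 2}  # PEs occupied before splitting
layers = {"g2": 3, "g3": 2, "g4": 2}     # vertical layers per generation
split_pes = {g: base_pes[g] * layers[g] for g in base_pes}
total_pes = sum(split_pes.values())
# split_pes == {'g2': 72, 'g3': 12, 'g4': 4}; total_pes == 88 (= 11 x 8)
```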
Using this method, we can fully exploit all available processors, as shown in Fig. A1.
Table A1. Budgeting of the computational load in the case with 11 × 8 PEs, where the higher grid generations use lower vertical resolutions and vertically split arrays. Originally, the higher grid generations g_{2}, g_{3}, and g_{4}, with the same vertical resolution, occupy 24, 6, and 2 PEs, respectively. After reduction of their vertical resolutions and vertical partitioning, they can be evenly redistributed among 72, 12, and 4 PEs, respectively, populating all 88 PEs, as shown in Fig. A1.
APPENDIX B
Extension of the MG Beta Filter for the Cubed Sphere
Following an early suggestion by Sadourny (1972), the cubed sphere became one of the standard grid geometries in numerical modeling of the atmosphere and is used as the underlying grid geometry in the global atmospheric model of the Unified Forecast System (http://ufsdev.rap.ucar.edu/), the finite-volume model FV3 (e.g., Putman and Lin 2007). The straight grid lines of a gnomonic projection map back to great-circle arcs on the sphere, but there is some freedom in choosing the profile of spacing between the grid lines in each family. An example of modeling the background error covariance directly in cubed-sphere geometry is found in Song et al. (2017). The performance of the MG beta filter in DA for a global version of the FV3 will be presented and discussed elsewhere. Here, we only outline the approach used to apply the MG beta filter to this grid geometry.
Since the filtering can be performed on a slightly different member of the gnomonic family of cubed-sphere grids from the one used for the final analysis increments, we can choose the computationally advantageous “equiangular” version as the grid of choice for filtering. The reason is that, when applying a beta filter, one must always supply a sufficiently large halo (whose extent depends on the spatial span of the filter) and populate it with data from the surrounding processors. In the gnomonic cubed sphere used in the FV3, the coordinate lines are arcs of great circles that are interrupted at the cube edges. Not only are the grid lines interrupted where they intercept the cube edges, but the smooth progression of the family of grid arcs approximately parallel to one edge exhibits an interruption of the profile of spacing in passing from one side of that edge to the other. This means that matching the results of filtering spilling over from one side of an edge to the other would, in the general case, involve two-dimensional interpolations. However, as Ronchi et al. (1996) pointed out, with the equiangular version of the gnomonic grid, where this second kind of interruption is not present, one is able to perform the needed interpolations one-dimensionally, along those same grid arcs. Nevertheless, since the analysis grid is defined on an equal-along-edges version of the gnomonic cubed-sphere grid (slightly different from the equiangular version), and at a higher resolution, we do need to remap (within the same processors), in fully two dimensions, between these two grids at the beginning and at the end of the filtering procedure.
The multigrid scheme for filtering used in the regional 3D RTMA extends the domains of the higher generations, if the number of processors is odd, only in one (east or north) direction, which on the cubed sphere would clearly be inappropriate. We therefore devised the following scheme: If the number of processors in each direction (N in the x direction and M in the y direction) of one generation is divisible by 2, the domain of the next generation stays the same and the number of processors in each direction is halved, so that the number of processors dealing with that next generation is (N/2) × (M/2). However, if the number of processors at one generation in, for example, the x direction is not divisible by 2, the domain of the next generation is slightly extended in that direction (with the extra domain populated with zeros), and the number of processors for that generation is [(N + 1)/2] × (M/2). We illustrate this situation in Fig. B1.
In this case, generation 1 has 5 × 5 = 25 PEs, and the next, generation 2, has 9 PEs. Processors 0, 2, 4, 22, 24, 26, 44, 46, and 48 are, after adjoint interpolation to the lower resolution, completely mapped into the middle portions of the virtual processors of the next generation. Processors 1, 3, 23, 25, 45, and 47 are divided vertically; 11, 13, 15, 33, 35, and 37 are divided horizontally; and 12, 14, 34, and 36 are divided into four pieces.
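The rule just described can be sketched as follows (the helper name is ours, for illustration only); it reproduces the Fig. B1 example, in which 5 × 5 = 25 PEs at one generation contract to 3 × 3 = 9 at the next:

```python
def next_generation_pes(n, m):
    """Processor counts for the next generation under the cubed-sphere
    rule described in the text: an even count is simply halved, while
    an odd count first gains one zero-filled virtual processor,
    giving (N + 1) / 2."""
    def step(k):
        return k // 2 if k % 2 == 0 else (k + 1) // 2
    return step(n), step(m)

# The Fig. B1 example: a 5 x 5 arrangement (25 PEs) contracts to
# a 3 x 3 arrangement (9 PEs) at the next generation.
print(next_generation_pes(5, 5))  # (3, 3)
```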
REFERENCES
Abramowitz, M., and I. A. Stegun, 1972: Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, 1046 pp.
Bannister, R. N., 2008a: A review of forecast error covariance statistics in atmospheric variational data assimilation. I: Characteristics and measurements of forecast error covariances. Quart. J. Roy. Meteor. Soc., 134, 1951–1970, https://doi.org/10.1002/qj.339.
Bannister, R. N., 2008b: A review of forecast error covariance statistics in atmospheric variational data assimilation. II: Modelling the forecast error covariance statistics. Quart. J. Roy. Meteor. Soc., 134, 1971–1996, https://doi.org/10.1002/qj.340.
Brandt, A., 1977: Multi-level adaptive solutions of boundary-value problems. Math. Comput., 31, 333–390, https://doi.org/10.1090/S0025-5718-1977-0431719-X.
Brandt, A., 1997: Multiscale algorithm for atmospheric data assimilation. SIAM J. Sci. Comput., 18, 949–956, https://doi.org/10.1137/S106482759528942X.
Carley, J. R., and Coauthors, 2020: A description of the v2.8 RTMA/URMA upgrade and progress toward 3D RTMA. 10th Conf. on Transition of Research to Operations, Boston, MA, Amer. Meteor. Soc., 8A.3, https://ams.confex.com/ams/2020Annual/webprogram/Paper364378.html.
De Pondeca, M. S., and Coauthors, 2011: The real-time mesoscale analysis at NOAA’s National Centers for Environmental Prediction: Current status and development. Wea. Forecasting, 26, 593–612, https://doi.org/10.1175/WAF-D-10-05037.1.
Derber, J. C., and A. Rosati, 1989: A global ocean data assimilation system. J. Phys. Oceanogr., 19, 1333–1347, https://doi.org/10.1175/1520-0485(1989)019<1333:AGODAS>2.0.CO;2.
Fisher, M., 2003: Background error covariance modelling. ECMWF Seminar on Recent Developments in Data Assimilation for Atmosphere and Ocean, Reading, United Kingdom, ECMWF, 45–64.
Gaspari, G., and S. Cohn, 1999: Construction of correlation functions in two and three dimensions. Quart. J. Roy. Meteor. Soc., 125, 723–757, https://doi.org/10.1002/qj.49712555417.
Gill, P. E., W. Murray, and M. H. Wright, 1981: Practical Optimization. Academic Press, 401 pp.
Guillet, O., A. T. Weaver, X. Vasseur, Y. Michel, S. Gratton, and S. Gürol, 2019: Modelling spatially correlated observation errors in variational data assimilation using a diffusion operator on an unstructured mesh. Quart. J. Roy. Meteor. Soc., 145, 1947–1967, https://doi.org/10.1002/qj.3537.
Hackbusch, W., 2013: MultiGrid Methods and Applications. Springer, 376 pp.
Hamming, R. W., 1989: Digital Filters. 3rd ed. Dover, 284 pp.
Hayden, C. M., and R. J. Purser, 1995: Recursive filter objective analysis of meteorological fields: Applications to NESDIS operational processing. J. Appl. Meteor., 34, 3–15, https://doi.org/10.1175/1520-0450-34.1.3.
Ide, K., P. Courtier, M. Ghil, and A. C. Lorenc, 1997: Unified notation for data assimilation: Operational, sequential and variational. J. Meteor. Soc. Japan, 75, 181–189, https://doi.org/10.2151/jmsj1965.75.1B_181.
Kang, Y.-H., D. Y. Kwak, and K. Park, 2014: Multigrid methods for improving the variational data assimilation in numerical weather prediction. Tellus, 66A, 20217, https://doi.org/10.3402/tellusa.v66.20217.
Li, W., Y. Xie, Z. He, G. Han, K. Liu, J. Ma, and D. Li, 2008: Application of the multigrid data assimilation scheme to the China Seas’ temperature forecasting. J. Atmos. Oceanic Technol., 25, 2106–2116, https://doi.org/10.1175/2008JTECHO510.1.
Li, W., Y. Xie, S.-M. Deng, and Q. Wang, 2010: Application of the multigrid method to the two-dimensional Doppler radar radial velocity data assimilation. J. Atmos. Oceanic Technol., 27, 319–332, https://doi.org/10.1175/2009JTECHA1271.1.
Mirouze, I., and A. T. Weaver, 2010: Representation of correlation functions in variational assimilation using an implicit diffusion operator. Quart. J. Roy. Meteor. Soc., 136, 1421–1443, https://doi.org/10.1002/qj.643.
Parrish, D. F., and J. C. Derber, 1992: The National Meteorological Center’s Spectral Statistical-Interpolation analysis system. Mon. Wea. Rev., 120, 1747–1763, https://doi.org/10.1175/1520-0493(1992)120<1747:TNMCSS>2.0.CO;2.
Purser, R. J., 2020a: Description and some formal properties of beta filters: Compact-support quasi-Gaussian convolution operators with applications to the construction of spatial covariances. NOAA/NCEP Office Note 498, 21 pp., https://doi.org/10.25923/qvfmjs76.
Purser, R. J., 2020b: A formulation of the Decad algorithm using the symmetries of the Galois field, GF(16). NOAA/NCEP Office Note 500, 28 pp., https://doi.org/10.25923/4nhyx681.
Purser, R. J., 2020c: A formulation of the Hexad algorithm using the geometry of the Fano projective plane. NOAA/NCEP Office Note 499, 13 pp., https://doi.org/10.25923/xrzpx016.
Purser, R. J., W.-S. Wu, D. F. Parrish, and N. M. Roberts, 2003a: Numerical aspects of the application of recursive filters to variational statistical analysis. Part I: Spatially homogeneous and isotropic Gaussian covariances. Mon. Wea. Rev., 131, 1524–1535, https://doi.org/10.1175//1520-0493(2003)131<1524:NAOTAO>2.0.CO;2.
Purser, R. J., W.-S. Wu, D. F. Parrish, and N. M. Roberts, 2003b: Numerical aspects of the application of recursive filters to variational statistical analysis. Part II: Spatially inhomogeneous and anisotropic general covariances. Mon. Wea. Rev., 131, 1536–1548, https://doi.org/10.1175//2543.1.
Putman, W., and S.-J. Lin, 2007: Finite-volume transport on various cubed-sphere grids. J. Comput. Phys., 227, 55–78, https://doi.org/10.1016/j.jcp.2007.07.022.
Rančić, M., M. Pondeca, R. J. Purser, and J. R. Carley, 2020: MPI redecomposition and remapping algorithms used within a multigrid approach to modeling of the background error covariance for highresolution data assimilation. 30th Conf. on Weather Analysis and Forecasting/26th Conf. on Numerical Weather Prediction, Boston, MA, Amer. Meteor. Soc., J59.3, https://ams.confex.com/ams/2020Annual/webprogram/Paper363505.html.
Ronchi, C., R. Iacono, and P. S. Paolucci, 1996: The “cubed sphere”: A new method for the solution of partial differential equations in spherical geometry. J. Comput. Phys., 124, 93–114, https://doi.org/10.1006/jcph.1996.0047.
Sadourny, R., 1972: Conservative finite-difference approximations of the primitive equations on quasi-uniform spherical grids. Mon. Wea. Rev., 100, 136–144, https://doi.org/10.1175/1520-0493(1972)100<0136:CFAOTP>2.3.CO;2.
Song, H.-J., I.-H. Kwon, and J. Kim, 2017: Characteristics of a spectral inverse of the Laplacian using spherical harmonic functions on a cubed-sphere grid for background error covariance modeling. Mon. Wea. Rev., 145, 307–322, https://doi.org/10.1175/MWR-D-16-0134.1.
Weaver, A., and P. Courtier, 2001: Correlation modelling on the sphere using a generalized diffusion equation. Quart. J. Roy. Meteor. Soc., 127, 1815–1846, https://doi.org/10.1002/qj.49712757518.
Weaver, A., and I. Mirouze, 2013: On the diffusion equation and its application to isotropic and anisotropic correlation modelling in variational assimilation. Quart. J. Roy. Meteor. Soc., 139, 242–260, https://doi.org/10.1002/qj.1955.
Wu, W.-S., R. J. Purser, and D. F. Parrish, 2002: Three-dimensional variational analysis with spatially inhomogeneous covariances. Mon. Wea. Rev., 130, 2905–2916, https://doi.org/10.1175/1520-0493(2002)130<2905:TDVAWS>2.0.CO;2.
Zhang, H. Q., and X. J. Tian, 2018: Multigrid nonlinear least squares fourdimensional variational data assimilation scheme with the Advanced Research Weather Research and Forecasting Model. J. Geophys. Res. Atmos., 123, 5116–5129, https://doi.org/10.1029/2017JD027529.
Zhang, H. Q., X. J. Tian, W. Cheng, and L. P. Jiang, 2020: System of multigrid nonlinear least-squares four-dimensional variational data assimilation for numerical weather prediction (SNAP): System formulation and preliminary evaluation. Adv. Atmos. Sci., 37, 1267–1284, https://doi.org/10.1007/s00376-020-9252-1.