## Abstract

Fast, accurate computation of geophysical fluid dynamics is often very challenging. This is due to the complexity of the PDEs themselves and their initial and boundary conditions. There are several practical advantages to using a relatively new numerical method, the *spectral-element method* (SEM), over standard methods. SEM combines spectral-method high accuracy with the geometric flexibility and computational efficiency of finite-element methods.

This paper is intended to augment the few descriptions of SEM that aim at audiences besides numerical-methods specialists. Advantages of SEM with regard to flexibility, accuracy, and efficient parallel performance are explained, including sufficient details that readers may estimate the benefit of applying SEM to their own computations.

The *spectral element atmosphere model* (SEAM) is an application of SEM to solving the spherical shallow-water or primitive equations. SEAM simulated decaying Jovian atmospheric shallow-water turbulence up to resolution T1067, producing jets and vortices consistent with Rhines theory. SEAM passes the Held–Suarez primitive-equation test case and exhibits excellent parallel performance. At T171L20, SEAM scales up to 292 million floating-point operations per second (Mflops) per processor (29% of supercomputer peak) on 32 Compaq ES40 processors (93% efficiency relative to 1 processor), allocating 49 spectral elements per processor. At T533L20, SEAM scales up to 130 billion floating-point operations per second (Gflops) (8% of peak) and 9 wall-clock minutes per model day on 1024 IBM POWER3 processors (48% efficiency relative to 16 processors), allocating 17 spectral elements per processor. *Local element-mesh refinement* with 300% stretching enables conformally embedding T480 resolution within T53 resolution, inside a region containing 73% of the forcing but only 6% of the area. Thereby the authors virtually reproduced a uniform-mesh T363 shallow-water computation at 94% lower cost.

## 1. Introduction

Our purpose is to describe the spectral element atmosphere model (SEAM), with attention to the qualities that make it an attractive alternative to existing models. These qualities include its high-resolution accuracy, efficiency on parallel computers, and simple provision for local resolution enhancement by *local mesh refinement* (LMR). In particular, we show how LMR can be used to study regional dynamics within a global model, without the usual recourse to interpolation, boundary-value or boundary-flux fixing, etc. To help the reader judge the suitability of this model and the spectral-element method (SEM) to solve a given system of PDEs, we include most of the mathematical formulation details in appendixes or refer to appropriate literature. The mathematical formulation includes numerical interpolation and quadrature, coordinate maps, and representation of differential operators, etc. The research presented here extends the research presented by Taylor et al. (1997, 1998, hereafter TTI and TLT) and Fournier et al. (2000b), which formed the basis of recent extensions by Thomas et al. (2000, 2001) and Thomas and Loft (2002).

### a. Background

#### 1) Numerical solution of PDEs

Perhaps the most straightforward, intuitive methods to discretize and solve PDEs are finite-difference methods (e.g., Morton and Mayers 1994). Spectral methods (SMs) generally are more accurate, and their operator diagonalizations make implicit time steps more efficient (Gottlieb and Orszag 1977; Canuto et al. 1988; Boyd 2000). More generally, both these and other methods may be described in a *variational* framework, that is, as a minimization of the root-mean-square error (rmse) w.r.t. coefficients defining the approximate solution.

It is well known that in some fluid dynamics problems there is very strong interaction over a large range of spatial scales. In the case of the earth, these scales range from the planetary scale *at least* down to the mesoscale, five decades smaller. These interactions tend to be localized in physical and wavenumber space, yet most operational models tend to use uniform resolution. Increasing computer power will probably continue to make it more possible to simulate a larger range of scales using uniform resolution, but that power will also be needed for additional physical processes and parallel ensemble prediction. Ideally, we would like models to economize by devoting computations to *regions* where smaller scales play a more important role. This is the chief motivation behind LMR methods.

Geophysical problems need fields to be represented on the sphere. Spherical-harmonic functions (e.g., Mathews and Walker 1971) offer all the usual SM advantages, as well as a uniquely *isotropic* representation (i.e., a given triangular truncation of modes that approximate a spherical field and also approximate any rotation of that field with the same accuracy). The isotropy property is very important for building spherical symmetry into operators, and thus is also important for the success of sensitive calculations such as maintaining a balanced steady state. Although spherical harmonics are most efficient per digit of accuracy for globally smooth fields, to represent locally varying fields without “ringing,” they require many constructively or destructively interfering global modes, and thus may not be practically more efficient or accurate than a filtered fourth-order method (Spotz et al. 1998).

Spherical coordinates introduce additional difficulties. Nonlinear terms are most easily computed on grid points, so SM models usually transform between spectral and gridpoint representations during each time step. On a latitude–longitude mesh, curves of constant longitude *λ* converge as latitude *ϕ* approaches the poles. The zonal grid length Δ*λ* cos*ϕ* → 0 as *ϕ* → ±π/2, so the Courant–Friedrichs–Lewy stability-limited explicit time step Δ*t*_{CFL} → 0 also. We shall refer to this, and the fact that spherical metric factors ∝sec*ϕ* diverge, together as the *spherical pole problem,* part of the more general group of sphere-meshing problems.
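
The severity of the pole problem is easy to quantify with a few lines of arithmetic (an illustrative sketch; the 512-point zonal mesh and 300 m s⁻¹ gravity-wave speed are assumed values, not taken from this paper):

```python
import numpy as np

# Zonal grid length dx = a * dlam * cos(phi) on a latitude-longitude mesh,
# and the explicit time step it permits, dt_CFL <= dx / c.
a = 6.371e6                # earth radius (m)
c = 300.0                  # fast gravity-wave speed (m/s), assumed
dlam = 2 * np.pi / 512     # 512 points around each latitude circle
for phi_deg in (0.0, 60.0, 89.0):
    dx = a * dlam * np.cos(np.radians(phi_deg))
    print(f"phi = {phi_deg:4.0f} deg: dx = {dx/1e3:7.2f} km, dt_CFL <= {dx/c:6.1f} s")
```

At 89° latitude the admissible time step is nearly 60 times smaller than at the equator, even though the resolved flow features are no finer there.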

In effecting the transform between representations, the longitudinal coordinate *λ* permits the use of the FFT, which costs *O*(*N*_{λ} log*N*_{λ}) for *N*_{λ} nodes, but the latitudinal coordinate *ϕ* does not permit such a fast transform. The associated Legendre transform from *N*_{ϕ} latitude circles normally costs *O*(*N*^{2}_{ϕ}) for *each* of the *O*(*N*_{λ}) modes on every circle. This calculation is the most expensive in GCMs at sufficiently high global resolution. Parallel implementation is feasible but can be inefficient for large numbers of processing elements (PEs) (Foster et al. 1992; Foster and Worley 1997). Such implementations have recently been accomplished up to resolution T170L18 on 512 PEs (Worley 2002), T799L90 on 704 PEs (Hamrud et al. 2003), and T1279L96 on 5120 PEs (Shingu et al. 2002).

So there are currently at least three needs: locally increased resolution, avoiding pole problems, and efficient parallel implementation. In this paper we address all these needs.

#### 2) Historical approaches to locally increased resolution

Historically there have been many approaches to locally increased resolution. The best known and most flexible is LMR using the finite-element method (e.g., Morton and Mayers 1994). For example, adaptive LMR has been implemented by a multigrid method and applied by Stevens et al. (1999) to large-eddy simulation and boundary layer convection. Mashkovich (1994) described a dynamic grid adaptation based on a given measure of “local space variability of the meteorological field,” for example, the geopotential Laplacian. Fox-Rabinovitz et al. (2001) created a “stretched grid” regional-climate model and carefully evaluated various dependencies of the simulation quality on stretching parameters. McGregor (1999) took another approach to grid stretching, which he applied to regional-climate and tracer-transport modeling. A stretched finite-element method has been successfully applied to regional-climate modeling by Tanguay et al. (1989), albeit encumbered with spurious lateral walls. Williamson and Rosinski (2000) demonstrated a treatment of the pole problem using *reduced grids* that vary Δ*λ* with *ϕ* such that truncated spherical-harmonic series retain a specified global accuracy. Iselin et al. (2002) employed continuously moving grid points to increase local resolution and achieved 90% diffusion-error reduction for 2D linear advection of a passive tracer; the method has been extended to 3D Navier–Stokes solving (Prusa et al. 2001). Recently, promising new methods have been introduced that use wavelets (Dahmen et al. 1997; Beylkin et al. 1998; Alpert et al. 2002). All of these approaches have useful qualities, but to our knowledge few have been applied to calculations comparable in sophistication to full 3D spherical atmospheric dynamics, and of those that have been so applied, none offers all the advantages of SEM.

#### 3) General remarks about SEM

SEM was first proposed by Patera (1984). General introductions to SEM include the review by Maday and Patera (1989) and the books by Funaro (1997) and Karniadakis and Sherwin (1999). It should be noted that SEM is technically slightly less accurate than SM in terms of formal convergence estimates. However, in all our numerical experiments, including both traditional and new test cases, comparative accuracy has never indicated preferring SM over SEM. In general, differential, integral, and other operators are not diagonal in the SEM representation; rather, differential operators are block diagonal plus a global continuity constraint that amounts to a block tridiagonal of block rank one. Besides avoiding the spherical pole problem, SEM enjoys a mathematical structure that is ideal for implementation on massively parallel processing computers (see section 2c).

#### 4) Other uses of SEM

Recently SEM has been successfully applied in a number of geophysics areas (Iskandarani et al. 2002), including atmosphere (Giraldo and Rosmond 2004) and ocean (Haidvogel and Beckmann 1999) circulation and seismology (Capdeville et al. 2002; Komatitsch et al. 2002), as well as such diverse applications as flame simulation (Feng and Mavriplis 2002), mechanical vibrations (Wang and Wereley 2002), amorphous film growth (Hoppe and Nash 2002), and others. Blackburn (1998) showed that SEM may be used for large-eddy simulation. Approaches to dynamically adaptive SEM are described by Henderson (1999), Barosan et al. (2001), Fournier (2001), Feng and Mavriplis (2002), and Fournier et al. (2003). There is also a wavelet-based SEM by Canuto et al. (2000).

### b. Review of atmospheric-circulation models

#### 1) Shallow-water models

The shallow-water equations (SWEs; e.g., Pedlosky 1987) are a very useful test system for numerical methods for spherical geophysical fluid dynamics because their solutions include nonlinear effects and wave structures similar to those of the full primitive equation (PrE) system. Let units of measure be such that the earth radius, angular rotation frequency, and gas constant are all unity. Then the SWE may be written as Newton's second law:

combined with redistribution of mass geopotential Φ:

where 𝒩Φ ≡ ℱ(Φ − Φ_{s}) + *f*_{Φ} includes flux ℱΦ ≡ −∇ · Φ**u** of Φ by fluid velocity **u**, as well as effects of surface topography ∝Φ_{s}. Various standard test cases and experiments may provide physical-process parameterizations (**f**_{u}, *f*_{Φ}). Subgrid-scale processes can optionally be included by the simple turbulence closure model 𝒟 ≡ *μ*∇^{2}, although in practice we usually set the eddy diffusivity *μ* = 0. Other symbols are standard or are defined in appendix A, which also relates the usual spherical coordinates (*λ*, *ϕ*) to general nonorthogonal 2D coordinates **x** and defines differential operators.

#### 2) The primitive equations

We discretize the PrE in the normalized-pressure vertical coordinate *σ* ∈ [0, 1], following Williamson et al. (1987). This yields *N*_{σ} levels of stacked, coupled SWE, with surface-pressure tendency ∂_{t}*P* replacing ∂_{t}Φ:

where subscript *k* ∈ {1, … *N*_{σ}} denotes evaluation on surface *σ*_{k} of pressure *σ*_{k}*P* and thickness Δ*σ*_{k}. Vertical advection is discretized to level *k* by

Optional vertical hyperdiffusion uses a fourth difference with symmetrized boundaries:

The additional temperature field *T* has a tendency driven by diabatic heating *f*_{T}:

Geopotential Φ is diagnosed by hydrostasis:

For a vertically integrated column of fluid, mass conservation diagnoses the two vertical motions:

The boundary conditions are just *σ̇*_{1/2} = *σ̇*_{Nσ+1/2} = 0.

## 2. Numerical method

### a. Gnomonic mapping of sphere to cube

Figure 1 shows *gnomonic* (from the center, like a γνώμων = sun dial pointer) projections. Lines on the unit cube

are mapped to curves on the unit sphere {**r** : |**r**| = 1}. It is convenient to discuss these projections first at the level of cube faces *s* ∈ {0, … 5} (associated with maps *λ*_{s} in this section) and then at the level of quadrangular elements _{ℓ}, ℓ ∈ {1, … *N*^{g}_{h}} that uniformly tile the cube (maps *ϑ*_{ℓ}, section 2b). We refer the reader to appendixes B and C for mathematical details.

In Fig. 1, points **A**, **D**, and **G** indicate cube-face-tangent coordinates (*λ*_{s}, *ϕ*_{s}) [see appendix C, Eq. (C1)], for faces *s* = 0, 3, and 4, respectively. Unfolding the cube along the equator and Greenwich meridian maps the face indexes to

All lines of constant **x** component on a cube face are projected by *λ*_{s} [see appendix C, Eq. (C2)] to great-circle segments on the sphere. For example, in Fig. 1 the images in faces *s* = 0, 3, and 4 of the *x*^{1} axis are the arcs **AB**, **DE**, and **GH**, and of the *x*^{2} axis, the arcs **AC**, **DF**, and **GI**, respectively. The cube edges are mapped by *λ*_{s} to the Fig. 1 curves as detailed in appendix C. The cube-edge curves are punctuated by the **x**-coordinate-line cusps in the (*λ*, *ϕ*) maps (as will be seen in the left panels of Fig. 10 in section 3). Alternative maps are described by Purser and Rančić (1998) and references therein.
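
A minimal sketch of the gnomonic map in code (the face-to-axis convention below is an assumption for illustration, not the paper's exact *λ*_{s} definition from appendix C):

```python
import numpy as np

def gnomonic_to_sphere(x1, x2, s):
    """Project a point (x1, x2) in [-1, 1]^2 on cube face s onto the unit
    sphere by central (gnomonic) projection: embed the point in 3D on the
    face's tangent plane and normalize.  Face orientations are assumed."""
    face_point = {0: ( 1.0,  x1,  x2), 1: (-x1, 1.0,  x2),
                  2: (-1.0, -x1,  x2), 3: ( x1, -1.0, x2),
                  4: (-x2,  x1,  1.0), 5: ( x2,  x1, -1.0)}
    r = np.array(face_point[s])
    return r / np.linalg.norm(r)
```

Because a line of constant **x** component lies in a plane through the cube's (and sphere's) center, its image under normalization is automatically a great-circle arc, which is the property noted above.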

### b. Gauss–Lobatto–Legendre interpolation and quadrature

The next mathematical tools that must be employed to implement SEM are Gauss–Lobatto–Legendre collocation and quadrature. These are built using the nodes *ξ*_{j} and interpolating polynomials *ϕ*_{j}(*ξ*) (appendix B), shown in Fig. 3 for degree *p* = 7. Let _{j} denote the space of polynomials of degree ≤*j*; then, an important classical result is that continuum integrals of the form

may be evaluated *exactly* by the discrete sum

over the nodes, provided that the integrand satisfies the condition *u* ∈ _{2p−1}. In section 2c, this result leads to a *collocation* method, that is, an approximation of the continuum equations by equations solved only at the *ξ*_{j}. Conversely, using the interpolation operation ℐ*u* (appendix B), one estimates *u*(*ξ*) at arbitrary *ξ* ∈ [−1, 1], given only the node values *u*(*ξ*_{j}). Derivatives are evaluated at the nodes by Eq. (B2) (see appendix B)
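
This quadrature result can be checked numerically in a few lines (a sketch using NumPy's Legendre utilities; the classical closed-form weight formula is assumed):

```python
import numpy as np
from numpy.polynomial import legendre as L

p = 7  # polynomial degree; N_p = p + 1 nodes
# Interior GLL nodes are the roots of P_p'(x); the endpoints are -1 and +1.
Pp = L.Legendre.basis(p)
nodes = np.concatenate(([-1.0], Pp.deriv().roots(), [1.0]))
# Classical GLL weights: w_j = 2 / (p (p + 1) P_p(xi_j)^2)
weights = 2.0 / (p * (p + 1) * Pp(nodes) ** 2)

# The rule is exact for polynomials of degree <= 2p - 1 = 13:
for k in range(2 * p):
    exact = 0.0 if k % 2 else 2.0 / (k + 1)   # integral of x^k over [-1, 1]
    assert abs(np.dot(weights, nodes ** k) - exact) < 1e-12
```

The loop confirms exactness through degree 2*p* − 1 = 13; at degree 2*p* and above the rule is no longer exact, consistent with the stated degree-(2*p* − 1) condition on the integrand.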

In 2D, interpolation commits an error at any point **ξ**, due to the parts ∉ _{p} of *u* in both directions, given classically by

where the 1D residual operator is

and **ξ**′ and **ξ**″ are two **ξ**-dependent points in the standard element ≡ {**ξ** : *ξ*^{α} ∈ [−1, 1]} ≡ [−1, 1]^{2} (Isaacson and Keller 1994, section 6.6).

We then use maps *ϑ*_{ℓ} to the “physical” cube-face coordinates **x** from the “standard” coordinates **ξ**, as described in appendixes B and C. Each map *ϑ*_{ℓ} is defined by a quadrangular element _{ℓ}, that is, by its four corners located at **x** = **c**_{ℓ,k}, *k* ∈ {0, … 3}. The global index ℓ ∈ {1, … *N*^{g}_{h}} labels elements. Taking uniform subdivision of each cube face *s* as an example, ℓ runs over rows and columns of length *N*_{h}, the number of elements along each cube edge: ℓ = 1 + *l*_{row} + *N*_{h}*l*_{col} + *N*^{2}_{h}*s.* In Fig. 1 the points **A**, **D**, and **G** are the centers of faces *s* = 0, 3, and 4 and (since *N*_{h} = 4) are the center cross points of sets of element indexes ℓ, respectively.

### c. Integral form and collocation of the governing equations

A *variational* method consists of minimizing rmse, in the form of certain domain integrals, w.r.t. the unknown fields on which the integrands depend. Specifically, an integrand ɛ^{2}_{h,p} is the squared residual error between evaluations at true fields and at fields computed with polynomials of degree *p* in elements of size *h.* As we now show, SEM enables a variational method to be coded as a collocation method, potentially with *spectral accuracy* (rmse decaying *exponentially* with increasing *p,* i.e., faster than any power *p*^{−q}) for sufficiently smooth true solutions. For example, if ɛ_{h,p} is the SEM deviation from true solution *u* of the 1D forced Helmholtz equation with frequency Λ, then it is bounded as

for some constant *C*_{q} (e.g., Karniadakis and Sherwin 1999, section 2.3.6), assuming that the forcing *f* is bounded as ‖(*d*/*dx*)^{q−1}*f*‖_{2} < ∞. The exponent *q* measures *u* smoothness and *q* = ∞ if *u* and *f* are analytic, in which case rmse decays exponentially. De Frutos and Novo (2000) extended a similar result to the case of incompressible Navier–Stokes equations.
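
The exponential decay of error for an analytic function can be demonstrated directly (a sketch; GLL interpolation of cos *x* stands in for the Helmholtz solution, and the thresholds are illustrative):

```python
import numpy as np
from numpy.polynomial import legendre as L

def gll_nodes(p):
    # GLL nodes: endpoints plus the roots of P_p'
    return np.concatenate(([-1.0], L.Legendre.basis(p).deriv().roots(), [1.0]))

f = np.cos                         # analytic test function
fine = np.linspace(-1, 1, 1001)    # fine grid for measuring the error
errs = []
for p in (3, 7, 11):
    x = gll_nodes(p)
    coef = np.polyfit(x, f(x), p)  # degree-p interpolant through the GLL nodes
    errs.append(np.max(np.abs(np.polyval(coef, fine) - f(fine))))
```

For the analytic function cos, each increase of *p* gains several digits at once, in contrast with the fixed algebraic rate of a low-order finite-difference scheme.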

To derive the SEM collocation, the governing equations (1)–(2) or (3)–(8) are all multiplied by global *test functions* *υ*(*λ*, *ϕ*) and integrated in the horizontal, with two changes of variables defined in appendix C. The first change of variables is to (*λ*, *ϕ*) by *λ*_{s} from coordinates **x** in cube face *s.* The second is to **x** in each element _{ℓ′}, by *ϑ*_{ℓ′} from the standard coordinates **ξ**. In other words, taking the term **a** to represent any expression, then **a** and the map compositions **a** ∘ *λ*_{sℓ′} and **a** ∘ *λ*_{sℓ′} ∘ *ϑ*_{ℓ′} represent the same expression with arguments (*λ*, *ϕ*), **x**, and **ξ**, respectively, where *s*_{ℓ} denotes the cube face containing element ℓ. The integral form of any equation is partitioned into sums over six cube faces, then sums over the *N*^{g}_{h} elements:

introducing Jacobian factors into the integrands for both changes of variables.

Now let *υ* → *υ*_{n,ℓ} ∈ ^{0} be continuous over the whole sphere and vanish at every node except node *n* in element ℓ, denoted by **λ**_{n,ℓ} ≡ **λ**_{sℓ}(**x**_{n,ℓ}), and possibly at identical nodes **λ**_{n′,ℓ′} = **λ**_{n,ℓ} in abutting elements ℓ′ ≠ ℓ. To make this concrete, let us briefly consider the case of 1D intervals [*c*_{l}, *c*_{l+1}[ with maps *x* = *ϑ*_{l}(*ξ*) ≡ 2^{−1}[*c*_{l} + *c*_{l+1} + (*c*_{l+1} − *c*_{l})*ξ*]. The desired 1D test function *υ*_{j,l}(*x*) is given by *υ*_{j,l} ∘ *ϑ*_{l′} = *ϕ*_{j}*δ*_{l,l′} for *j* ∉ {0, *p*}, and since *ϕ*_{j}(*ξ*) = *ϕ*_{p−j}(−*ξ*), ensuring *υ*_{j,l} ∈ ^{0} simply amounts to prepending west-edge reflections *υ*_{0,l} ∘ *ϑ*_{l′} = *ϕ*_{p}*δ*_{l−1,l′} + *ϕ*_{0}*δ*_{l,l′}, or appending east-edge reflections *υ*_{p,l} ∘ *ϑ*_{l′} = *ϕ*_{p}*δ*_{l,l′} + *ϕ*_{0}*δ*_{l+1,l′}. To generalize this 1D reflection to the cubed sphere, there are two cases to consider: *boundary nodes* on the four element edges,

and *internal* nodes. If **x**_{n,ℓ} is an internal node then it is unique, and *υ*_{n,ℓ} ∈ ^{0} is an interpolating-polynomial tensor product inside _{ℓ} and vanishes outside _{ℓ}:

where *N*_{p} is the number of nodes along an element edge. But if **x**_{n,ℓ} is on the boundary of *N*_{b} neighboring elements with connectivity index sets _{b} (containing nodes that are shared) and _{b} (containing elements that share those nodes) as described in appendix D, then in order to ensure *υ*_{n,ℓ} ∈ ^{0}, *υ*_{n,ℓ} is also nonzero in all those neighbors,

Inserting such a test function *υ*_{n,ℓ} and evaluating the integrals by the quadrature (9) eliminates coupling between distinct nodes and so generates a diagonal mass matrix *δ*_{n,n′}*δ*_{ℓ,ℓ′}*M*_{n,ℓ} (appendix C). Then terms such as (13) reduce to *M*_{n,ℓ}**a**(*λ*_{n,ℓ}, *ϕ*_{n,ℓ}), so effectively we arrive at the original dynamics (1)–(2) or (3)–(8) *collocated* at discrete nodes *λ*_{n,ℓ}. Of course there is a quadrature-error term such as (11) due to the part ∉_{2p−1} of each integrand of (13).

Continuity is enforced at the *b*th of all the *N*_{bn} shared boundary nodes, using the element-connectivity sets _{b}, _{b}, by applying the continuity-projection operation Φ → 𝒫_{0}Φ, defined by

The connectivity is also input to the software package METIS (http://www-users.cs.umn.edu/~karypis/metis) in order to allocate spectral elements to PEs.
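
Given the diagonal mass matrix, the projection 𝒫_{0} can be sketched as a mass-weighted average over each group of coincident boundary nodes (the flat-array storage and the names below are illustrative assumptions, not SEAM's actual data structures):

```python
import numpy as np

def project_continuous(vals, mass, shared_groups):
    """Continuity projection: for each group of coincident boundary nodes
    (playing the role of the connectivity sets), overwrite all copies with
    their mass-weighted average, making the field single-valued (C^0)."""
    out = vals.copy()
    for g in shared_groups:   # g: flat indices of one physically shared node
        out[g] = np.dot(mass[g], vals[g]) / mass[g].sum()
    return out
```

Applying the operation twice changes nothing, as expected of a projection.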

### d. Explicit time stepping and spatiotemporal filtering

Referring to (2) as a paradigm for the complete dynamics, SEAM employs second-order leapfrog time stepping for the nondiffusive nonlinear forced part (𝒩) and forward Euler *if* there is horizontal or vertical diffusion (𝒟):

Here the subscript R denotes a 2Δ*t* Robert filter,

and BV denotes *τ*_{f}-periodic Boyd–Vandeven filtering [see appendix B, Eq. (B3)],

This filtering controls both instabilities due to aliasing or possibly other nonlinear effects and any spurious modes. It acts by progressively damping components of higher Legendre spectral index. See appendix B or TTI for more details.
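
The leapfrog/Robert-filter combination can be sketched on the scalar test problem d*u*/d*t* = iω*u*, whose exact solution conserves |*u*| (the filter coefficient 0.05 is an assumed illustrative value, not SEAM's setting):

```python
import numpy as np

# Leapfrog with a Robert (Asselin) time filter for du/dt = i*omega*u.
omega, dt, nu = 1.0, 0.01, 0.05   # nu: Robert filter coefficient (assumed)
u_prev = 1.0 + 0.0j               # filtered state at step n - 1
u = np.exp(1j * omega * dt)       # exact first step to start the leapfrog
for _ in range(1000):
    u_next = u_prev + 2 * dt * 1j * omega * u    # leapfrog step to n + 1
    u_prev = u + nu * (u_next - 2 * u + u_prev)  # Robert filter applied at n
    u = u_next
```

The filter weakly couples the three time levels so that leapfrog's spurious computational mode is damped, at the cost of a slight per-step damping of the physical mode, which is why |*u*| remains close to (but not exactly) 1.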

### e. Implementing LMR

SEAM offers a major advantage by enabling LMR. Another strength is that LMR entails *no* code modifications other than reconstructing elements and a one-time PE reallocation using METIS. LMR is implemented as follows. One defines *refinement* matrices 𝗥_{m,d} that transform the square-element matrix 𝗰_{ℓ} (as defined in appendix C) to several irregular quadrangles 𝗰_{ℓ}𝗥_{m,d} for *m* ∈ {1, … *L*_{d}} by subdividing each element edge *k* into *d*_{k} equal pieces. The refinement satisfies

(neglecting shared boundaries) for any (ℓ, **d**), so that every **x** is still contained exactly once. Note that *N*^{g}_{h} is increased by *L*_{d} − 1 for each element to be refined, and that locally *h* and therefore globally Δ*t*_{CFL} are reduced by roughly a factor of

typically ∈ {½, ⅓}. The overall effect resembles *picture framing,* which is what we call it. It is a simple first approach to conformal LMR, and is only *adaptive* to the extent that the user exploits a priori information about localized forcing, topography, etc. For production codes we suggest investigating more sophisticated packages such as CUBIT (http://endo.sandia.gov/cubit). For nonconformal LMR, we refer to Funaro (1997), Karniadakis and Sherwin (1999), Fournier (2001), Feng and Mavriplis (2002), Fournier et al. (2003), and references therein. The latter two documents also include discussions of *dynamic adaptivity.* Table 1 lists the 𝗥_{m,d} matrices for the refinements discussed in section 3d.
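
The corner-based refinement can be sketched for the simplest case, in which opposite element edges are subdivided equally (a simplification of the 𝗥_{m,d} bookkeeping; the function below is an illustration, not SEAM's implementation):

```python
import numpy as np

def refine_quad(corners, d1, d2):
    """Subdivide a quadrangle (4x2 array of corners, counterclockwise) into
    d1*d2 sub-quadrangles via bilinear interpolation of the corners: a sketch
    of the corner transformation c_l -> c_l R_{m,d}."""
    def bilin(s, t):  # bilinear map from the unit square to the quadrangle
        c0, c1, c2, c3 = corners
        return (1-s)*(1-t)*c0 + s*(1-t)*c1 + s*t*c2 + (1-s)*t*c3
    subs = []
    for i in range(d1):
        for j in range(d2):
            s0, s1 = i/d1, (i+1)/d1
            t0, t1 = j/d2, (j+1)/d2
            subs.append(np.array([bilin(s0, t0), bilin(s1, t0),
                                  bilin(s1, t1), bilin(s0, t1)]))
    return subs
```

The sub-quadrangles tile the parent exactly once, mirroring the condition (neglecting shared boundaries) stated above.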

### f. Performance estimates for the method

SEAM's high parallel performance is due to the connectivity structure (which simplifies PE communication) and the explicit time stepping (which requires no operator inversions). Only data on the element boundary nodes that are shared between PEs (identical physical locations mapped to elements on different PEs) need to be communicated between those PEs. This occurs in (14). All computations on individual elements are just size *N*^{2}_{p} dense matrix multiplies, such as (10) or (15), etc., that easily fit in a modern PE cache. Thus we have very low surface-to-volume (shared-to-local) data ratio, which is very efficient. Finite-difference schemes can also be very efficient but are usually low order, whereas SEM affords very high order *p* (up to *p* = 31 in TTI). We note that *compact-difference* schemes (e.g., Chang and Shirer 1985) also afford somewhat high order (typically fourth or sixth), using stencils that can improve parallel efficiency (e.g., Dixon and Tan 2003).

Upon model initialization, the Legendre-transform representations such as (10) or (15) are built into the operators that act on function values at collocation points *λ*_{n,ℓ}. Unlike the most common parallel implementations of SM models, there are no run-time transformations between point values and spectral coefficients, thus, no data transpositions between coordinates in different directions and no global communication for that purpose. This is another factor contributing to SEAM's relative efficiency.

The stability-limited time step Δ*t*_{CFL} is *O*(*N*^{−1}_{h}*N*^{−2}_{p}). This estimate includes the contributions to spatial discretization length that come from element size *h* (*N*_{h} ≡ 2*h*^{−1}) and from node clustering (*N*_{p} = *p* + 1). Computation costs *O*(*N*^{2}_{h}*N*^{3}_{p}) per time step, including *O*(*N*^{2}_{h}) to touch every element once and *O*(*N*^{3}_{p}) for matrix operations in any element. Thus, as a trend, “*h* refinement” (decreasing *h*; algebraic convergence) costs less and constrains Δ*t*_{CFL} less than does “*p* refinement” (increasing *p*; exponential convergence), referring to (12). However, as observed by an anonymous reviewer, *p* refinement can be better, depending on solution properties and desired accuracy. SWE experiments by TTI show that the greatest computational efficiency is obtained for *N*_{p} ∈ {8, … 16}, with *N*_{h} increased as needed for accuracy. In contrast, the cost for SM models with *O*(*N*_{s}) zonal and *O*(*N*_{s}) polar wavenumbers is *O*(*N*^{3}_{s}). If *N*_{p} is fixed, then at high enough resolution SEM can cost less per degree of freedom (dof) than do fully SM models. For example, taking into account a fully SM model's larger Δ*t* (apparently a factor of 10 for *N*_{p} = 8) and all prefactors, the complexity analysis by TTI shows that the SEM solver becomes more cost effective per grid point at resolutions T ≥ 169. The larger Δ*t* for semi-implicit stepping was a motive for the improvements by Thomas et al. (2000, 2001) and Thomas and Loft (2002).

## 3. Numerical experimental results

### a. High-resolution Jupiter SWE simulation

Using the SWE version of SEAM, we carried out integrations of decaying turbulence on Jupiter with very high resolution and very weak dissipation (Baer et al. 2001). Four different truncations were used, ranging from T_{e}171 to a maximum of T_{e}1067, where in order to compare with SM models, we let T_{e} denote the roughly equivalent dealiased triangular spectral resolution of SEAM. (The value T_{e} is the number 4*N*_{h}*N*_{p} of equatorial nodes divided by 3; thus, T_{e}1067 represents 6*N*^{2}_{h} = *N*^{g}_{h} = 60 000 elements with *N*^{2}_{p} = 8 × 8, that is, 3200 equatorial latitude-circle points.) SEAM ran very efficiently on the Cray T3E with 128 PEs (Table 2) and at that resolution produced multiple jets from pole to pole, as anticipated by Rhines. The intensity of the jets, as well as the developed vortices, may be seen in Fig. 4. The jet-streak number and vortex intensity increase as resolution increases. It was encouraging to find from these initial experiments that SEAM is capable of solving the SWE efficiently at exceptionally high resolution on a massively parallel computer. Indeed there are few if any reports of other global models capable of running at the high resolution reported here (Iacono et al. 1999; Yoden et al. 1999; Shingu et al. 2002).

### b. SEAM PrE trials with Held–Suarez forcing

After the SWE experiments reported by TTI and Baer et al. (2001), SEAM has also been applied to PrE, as described by TLT and Fournier et al. (2000a,b). The first step in validating a PrE solver as a prospective GCM dynamical core (dycore) is to pass the test of Held and Suarez (1994, hereafter HS), as reported for SEAM by TLT. SEAM results, which derive from final 1000-day time averages from 1200-day runs, are shown in Figs. 5 and 6. We present these diagnostics mainly to validate SEAM by showing it produces virtually identical results to those of a spherical-harmonic SM (SHSM) model. This is not surprising since the models use a similar vertical discretization, and the horizontal discretizations have been shown by TTI to agree to several digits.

Figure 5 shows the eddy kinetic energy as a function of zonal resolution for SEAM, for a gridpoint model, and for a SHSM model. For the SHSM and gridpoint models the results were taken from HS, which contains a more complete description of them. One can see that the SM model and SEAM have very similar results.

Figure 6 shows the zonal mean wind (top) and temperature eddy variance (bottom) from SEAM. Both of these plots are shown for the gridpoint model and SM model in HS. Consulting that paper shows that all the models produce similar results. We conclude that the SM model and SEAM results are almost identical and differ slightly from the results of the finite-difference model.

### c. Experimental computational performance results

One advantage of the dycore benchmark of HS over the shallow-water test cases is that it represents a complete component of a GCM. It allows us to compute meaningful performance measures since the efficiency of a dycore on this benchmark directly measures how efficient it will be when inserted into a GCM. For all these tests SEAM uses *N*_{σ} = 20 vertical levels (denoted by L20), and all cases use an *N*^{2}_{p} = 8 × 8 quadrature mesh within each element, corresponding to local spectral representation up to degree *p* = 7. For this configuration SEAM uses less than 1 kbyte of cache per dependent variable, per element, per level. The global resolution varies only with *N*^{g}_{h}.

Various supercomputer systems are described in Table 2. Parallel performance on these systems may be quantified in terms of the *efficiency* *E*_{f,I} in increasing the rate *R*(*I*) in Gflops when the number *I* of PEs is scaled up by a factor *f* > 1. We may conveniently define it as *E*_{f,I} ≡ *R*(*fI*)/[*f* *R*(*I*)].
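
As a worked example of this definition (the 16-PE rate below is an assumed illustrative value, chosen only so that the example reproduces the *E*_{64,16} ≈ 48% quoted below for T_{e}533L20; the paper does not report *R*(16) directly):

```python
def efficiency(R_I, R_fI, f):
    """Parallel efficiency E_{f,I} = R(f I) / (f R(I)): the fraction of the
    ideal f-fold rate increase retained when scaling from I to f*I PEs."""
    return R_fI / (f * R_I)

# Illustrative rates: 4.2 Gflops on 16 PEs, 130 Gflops on 1024 PEs (f = 64).
E = efficiency(4.2, 130.0, f=64)
```

Perfect scaling gives *E* = 1; here *E* ≈ 0.48, i.e., roughly half of the ideal 64-fold speedup is realized.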

In Fig. 7a, we present the total Gflops rates obtained for T_{e}171L20 (77_{e} km) or T_{e}533L20 (24_{e} km) resolution on up to 1024 PEs, where subscript e on a value denotes resolution length estimated by the average node separation:

The results of Schaffer and Suárez (2000) are inserted for comparison. At resolution T_{e}533L20, using IBM POWER3 PEs, SEAM achieves a computation rate of *R*(1024) = 130 Gflops with 17 spectral elements allocated per PE. This corresponds to 8% of peak Gflops and efficiency *E*_{64,16} = 48%. The scalability and efficiency of the method is better illustrated in Fig. 7b by *R*(*I*)/*I,* the Mflops per PE. At resolution T_{e}171L20, SEAM achieves 292 Mflops per PE (29% peak) on 32 Compaq ES40 PEs (*E*_{32,1} = 93%) with 49 elements allocated per PE. We also may infer from TLT that *R*(64) on the HP SPP2000 varied by ≤19% for resolutions 2 or 4 times coarser than T_{e}171, which is too little variance to show clearly in Fig. 7. Overall, SEAM Mflops per PE rates and model-state size are comparable to the results of Anderson et al. (1999) and Gropp et al. (2001), who use unstructured tetrahedral control volumes to implicitly solve for Euler flow.

More important than Mflops is the actual time to solution. These numbers are given in Fig. 7c. Compared to other models, SEAM at resolution 77_{e} km on the SGI Origin 2000 is slightly faster per dof than the speed reported by DAO (2000), who discuss runs at roughly 250-km resolution on the SGI Origin 2000 of the model by Lin and Rood (1999). Also on this computer, SEAM's *E*_{2,16} = 90% is about 15 points above the efficiency inferred from DAO (2000). Figure 7c shows that SEAM performs at least 13 times faster per level on thirty-two 200-MHz IBM CPUs than do the National Center for Atmospheric Research (NCAR) Community Atmosphere Model 2.0 (CAM2.0) dycores (Eulerian, semi-Lagrangian, and finite-volume) and the Global Model Europa (GME) dycore on forty 375-MHz IBM CPUs at similar high horizontal resolutions. It should be noted that the three NCAR CAM2.0 dycore runs and the GME dycore run were only compiler optimized and that their timings include initialization and input/output, and hence are only representative (C. Jablonowski 2003, personal communication).

Table 3 shows *E*_{2,I} for all the Fig. 7 SEAM data. For the T_{e}533L20 problem using IBM POWER3 PEs, note that *E*_{2,I} decreases from 98% gradually to 69% as *I* increases from 64 to 512, indicating very good performance for a fixed-size problem. For T_{e}171L20 on every system, the performance *E*_{2,I} ≥ 78%.
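As a worked example (not from the paper's code), the efficiency *E*_{m,I} used above can be reproduced from aggregate flop rates *R*(*I*); the single-PE rate below is back-solved from the reported T_{e}171L20 numbers and is therefore only illustrative:

```python
def scaling_efficiency(rate_base, rate_scaled, m):
    """E_{m,I}: efficiency of running on m-times as many PEs,
    i.e. the achieved rate speedup R(m*I)/R(I) divided by the ideal m."""
    return (rate_scaled / rate_base) / m

# Hypothetical single-PE rate back-solved from the paper's E_{32,1} = 93%
# at 292 Mflops/PE on 32 Compaq ES40 PEs; illustrative only.
R32 = 292.0 * 32          # aggregate Mflops on 32 PEs
R1 = R32 / (32 * 0.93)    # implied ~314 Mflops on 1 PE
E = scaling_efficiency(R1, R32, 32)   # recovers 0.93
```

The same function, applied to wall clock times instead of rates, gives the time-based definition T(*I*)/(m·T(m*I*)).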

Figure 8 shows the percentage of total wall clock time spent on message passing between PEs of the HP SPP2000 (Table 2), for three different horizontal resolutions: *N*^{g}_{h} = 96, 384, and 1536 elements, equivalent to spectral resolutions of T_{e}43, T_{e}85, and T_{e}171, and average grid spacing of 319_{e}, 157_{e}, and 77_{e} km, respectively. As expected, message passing takes a steadily greater proportion of time as we keep the resolution (problem size) fixed while using more and more PEs. However, message passing also takes a smaller proportion of time as we increase the problem size while using a fixed number of PEs. This is due to the smaller surface-to-volume ratio of the data to be communicated, since each PE acquires more spectral elements as resolution is increased. The higher-resolution curves are close together, suggesting that performance becomes less sensitive to *N*^{g}_{h}. As observed by an anonymous reviewer, this implies that SEAM's scalability is a simple function of *N*^{g}_{h}*N*^{2}_{p}*N*_{σ} and communication costs.
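The surface-to-volume argument can be made concrete with a back-of-the-envelope model (an illustration only, not SEAM's actual communication pattern): a PE holding a *k* × *k* patch of degree-*p* elements exchanges roughly the patch perimeter's nodes while computing on all of its nodes.

```python
def comm_to_work_ratio(k, p):
    """Illustrative surface-to-volume model: a PE holding a k x k patch
    of degree-p spectral elements exchanges ~4*k*p perimeter nodes with
    neighboring PEs while computing on (k*(p+1))**2 nodes in total."""
    perimeter_nodes = 4 * k * p          # ~p nodes per shared element edge
    total_nodes = (k * (p + 1)) ** 2
    return perimeter_nodes / total_nodes
```

The ratio falls like 1/*k*, consistent with Fig. 8: more elements per PE (higher resolution at fixed PE count) means relatively less message passing.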

### d. Static LMR in SWE

The shallow-water test cases (Williamson et al. 1992) were used to demonstrate the accuracy and efficiency of SEAM, as reported by TTI. Another success of SEAM, presented here, is its incorporation of LMR.

Figure 9a (left) shows the simulation of test case 4 (“forced nonlinear system with a translating low”) of Williamson et al. (1992) using a uniform mesh with *N*^{g}_{h} = 150 (resolution T_{e}53, 242_{e} km). One way to correct the errors (right) near the low pressure feature is to uniformly triple the resolution, yielding 1350 elements (T_{e}160, 81_{e} km), as shown in Fig. 9b. A more cost-effective way (Fig. 9c) is to use a *locally refined* mesh of *N*^{g}_{h} = 442 elements, whose resolution varies between those of the other two meshes, using successive 300% local stretching factors. As seen in Table 4, a small loss of accuracy can be tolerated to compute the solution with the same Δ*t* and a savings in effort in proportion to the number *N*^{g}_{h} of elements: 1 − 442/1350 ≈ 67% savings.

LMR is also useful when localized topography is important. In Fig. 10a (center) we show SWE flow forced by a smooth topographic feature that is 76% (by mass) spatially localized to 6% of the sphere (Williamson et al. 1992, their test case 5: “zonal flow over an isolated mountain”). Note that many finescale solution gradients have been corrupted by ringing (spectral-truncation oscillations) due to insufficient resolution. These spurious oscillations weakly persist at higher resolution (Fig. 10b). By locally refining the element mesh (Fig. 10c), a solution of nearly the same accuracy (Table 4) as the highest-resolution computation (Fig. 10d) can be obtained with only 310/(6 × 34^{2}) ≈ 1/22 the number of elements and 3/4 the time step (Fournier et al. 2000b). This implies only 1/17 the computational cost, or a 94% savings. Additionally there is no degradation of the global solution, as can be seen in Figs. 10e–h.
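The cost accounting above can be reproduced in a few lines (the numbers are taken directly from the text; cost per step is assumed proportional to the element count, and the step count to 1/Δ*t*):

```python
# Fig. 10 case: refined mesh vs uniform highest-resolution mesh.
n_lmr = 310
n_uniform = 6 * 34**2       # = 6936 elements
dt_ratio = 3 / 4            # LMR run uses 3/4 the time step -> 4/3 the steps

relative_cost = (n_lmr / n_uniform) / dt_ratio   # ~1/17
savings = 1 - relative_cost                      # ~0.94, i.e. 94% savings

# Fig. 9 case: same time step, so savings follow the element counts alone.
savings_fig9 = 1 - 442 / 1350                    # ~0.67, i.e. 67% savings
```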

Figure 9c also illustrates the use of the refinement matrices 𝗥_{m,d}. The refinement from the tropical to the extratropical zone uses a zone with *L*_{3,2,1,2} = 7 refined elements dividing each length *h* from the south (*k* = 2) by *d*_{0} = 3 going north and *d*_{1} = *d*_{3} = 2 going east and west. There follows a uniformly refined zone of *L*_{3,3,3,3} = 9 to the north. The less-resolved polar zone is alternately connected using the same *L*_{3,2,1,2} toward the cube edges, or else *L*_{3,3,2,2} = 6 refined elements toward the cube corners. Finally, in Figs. 10c and 10g we use these refinements again, in addition to a refinement by *L*_{2,2,1,1} = 3 elements.

## 4. Concluding remarks

We conclude by summarizing the main features of the spectral-element method and its application to the Spectral Element Atmosphere Model (SEAM).

- SEAM's cubed-sphere coordinates eliminate the pole problem (Fig. 1). That is, there is no mesh-point clustering, no worsening Courant–Friedrichs–Lewy stability restriction, and no coordinate singularity as latitude |*ϕ*| → *π*/2, unlike in explicit spectral-transform codes on latitude–longitude grids.

- SEAM produces ultra-high-resolution shallow-water simulations of Jupiter's atmosphere (Fig. 4).

- SEAM validates the dynamical-core tests of Held and Suarez (1994) (Figs. 5 and 6).

- The subdivision of the problem domain across elements lends itself naturally to efficient parallel computation when using explicit time stepping. SEAM uses less than 1 kbyte of processing element (PE) cache per dependent variable, per element, per level. Solving the primitive equations at resolution T171L20, SEAM achieved 292 Mflops (10^{6} floating-point operations per second) per PE (29% of peak) on 32 Compaq ES40 PEs (efficiency *E*_{32,1} = 93%) with 49 elements allocated per PE (Fig. 7b). At T533L20, SEAM achieved 130 Gflops (8% of peak) on 1024 IBM POWER3 PEs (efficiency *E*_{64,16} = 48%) with 17 spectral elements allocated per PE (Fig. 7a). SEAM compares well against several other dynamical cores with respect to wall clock time per model day (Fig. 7c).

- Local mesh refinement is practical and can afford significant computational savings, while focusing higher resolution on regions of dynamic and/or climatological interest (Figs. 9 and 10).

It is not easy to construct semi-implicit or semi-Lagrangian time-stepping algorithms that are efficient on parallel machines, but this has been recently accomplished by our collaborators Thomas and Loft (2002). Their semi-implicit scheme used a Helmholtz solver to time march 30 decoupled shallow-water layers roughly twice as quickly as did their explicit scheme, at ≈T85 on up to 384 IBM-SP PEs, although the average solver iteration count grew by ≈40% as resolution was doubled.

Most recently, H. Wang et al. (2003, unpublished manuscript) are coupling SEAM to the NCAR Community Climate System Model.

## Acknowledgments

We thank C. Jablonowski, M. J. Suárez, and P. H. Worley for providing and discussing data and two anonymous reviewers for thorough reading and many useful comments. Our thanks go to F. Baer for guidance and support and to S. Levis for etymological assistance. This research was supported by the Office of Science (BER), U.S. Department of Energy, Grants DEFG0298ER62612 and DEFG0201ER63247 to the University of Maryland. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract DE-AC03-76SF00098. Acknowledgment is made to the National Center for Atmospheric Research for computing time used in this research. The National Center for Atmospheric Research is operated by the University Corporation for Atmospheric Research under sponsorship of the National Science Foundation. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation.

## REFERENCES


### APPENDIX A

#### Curvilinear Coordinates and Operators

Spherical coordinates are longitude *λ* = *λ*^{1} ∈ [0, 2*π*[ and latitude *ϕ* = *λ*^{2} ∈ [−*π*/2, *π*/2]. The 3D eastward, northward, and vertical (spherical normal) unit vectors are

where ≐ is read “is represented by.” Two-dimensional coordinates in the standard square are given by **x** ≐ (*x*^{1}, *x*^{2})^{T} ∈ {**ξ** : *ξ*^{α} ∈ [−1, 1]} ≡ [−1, 1]^{2}. On cube face *s* the map (*λ*^{1}, *λ*^{2}) = *λ*_{s}(**x**) (appendix C) induces two biorthogonal bases tangent to the sphere:

(assuming summation convention over duplicated Greek coordinate indexes *α*, *β* ∈ {1, 2}), where ∂_{β} ≡ ∂/∂*x*^{β} and *J*_{α,β} ≡ |∇*λ*^{a}|^{−1}∂_{β}*λ*^{a} (for index *a* ∈ {1, 2}) is the rescaled Jacobian. *Biorthogonal* means **g**^{α} · **g**_{β} = *δ*^{α}_{β}. The Jacobian's inverse is *J*^{−1}_{β,α} = |∇*λ*^{a}|(∂*x*^{β}/∂*λ*^{a}), and its determinant is

The metric tensor is

with inverse

and determinant *J*^{2}. The Christoffel symbol is

The wind velocity, or any arbitrary vector tangent to the sphere, may be expressed as **u** = *u*^{λ}**e**^{1} + *u*^{ϕ}**e**^{2} = *u*_{α}**g**^{α} = *u*^{β}**g**_{β}. Here *u*_{α} ≡ **u** · **g**_{α} = *g*_{α,β}*u*^{β} is the covariant and *u*^{β} ≡ **u** · **g**^{β} = *g*^{α,β}*u*_{α} is the contravariant **u** component, with transformation law (sec*ϕ u*^{λ}, *u*^{ϕ}) = (*u*^{1}, *u*^{2})(∂/∂**x**)**λ**. Similarly *u*^{λ} ≡ **u** · **e**^{1} = *J*_{1,β}*u*^{β} is the longitudinal and *u*^{ϕ} ≡ **u** · **e**^{2} = *J*_{2,β}*u*^{β} is the latitudinal **u** component, with covariant components **u** · ∂_{λ}**e**^{3} = cos*ϕ u*^{λ} and **u** · ∂_{ϕ}**e**^{3} = *u*^{ϕ} and contravariant components **u** · ∇*λ* = sec*ϕ u*^{λ} and **u** · ∇*ϕ* = *u*^{ϕ}.

The horizontal (sphere tangent) gradient operator is **g**_{α}∇^{α} ≡ ∇ = **e**^{1} sec*ϕ*∂_{λ} + **e**^{2}∂_{ϕ}, where ∇^{α} ≡ *g*^{α,β}∂_{β}. Thus one defines the vertical (sphere normal) vorticity or **u** curl (a pseudodivergence) to be

where **e**^{3} × **u** = *J*^{−1}*u*_{β}ɛ^{β,α}**g**_{α}. Similarly one defines **u** divergence as

The final equivalent expressions are in spherical coordinates. In SEAM, the expressions in **x** coordinates, involving metric terms, are used.

### APPENDIX B

#### Polynomial Spaces and Operators

For *ξ* a standard coordinate ∈ [−1, 1], let

denote the Legendre polynomial of degree *j.* The span of L_{j} for all *j* ∈ {0, … *p*} is a polynomial space of dimension *N*_{p} ≡ *p* + 1. The quadrature formula (9) employs the Gauss–Lobatto nodes *ξ*_{j} = −*ξ*_{p−j} ∈ [−1, 1], where *ξ*_{j} is the *j*th-greatest root of (1 − *ξ*)(1 + *ξ*)(*d*/*dξ*)L_{p}, and the Gauss–Lobatto weights *w*_{j} ≡ 2(*pN*_{p}L^{2}_{p,j})^{−1} (Boyd 2000; Canuto et al. 1988). Then corresponding to *ξ*_{j}, the cardinal function

is alternately given by

where *γ*_{j} ≡ 〈L^{2}_{j}〉^{−1}_{GL}. Using (9), one constructs the interpolation operator as

The derivative matrix is

where L^{′}_{j,j′} ≡ (*d*/*dξ*)L_{j}(*ξ*_{j′}) ∈ 2^{−1}*j*(*j* + 1)[−1, 1]. The Boyd–Vandeven filter matrix of order 12 and length *N*_{p} − *j*_{f} = max(3, ⅔*N*_{p}) is given by

where *Z*_{q}(Ω) ≡ [ln(1 − Ω^{2})^{−q}]^{1/2} sgnΩ. This expression is just the representation at nodes *ξ*_{j} of the operation of attenuating Legendre coefficients of degree ≥*j*_{f} by an effective viscosity *μ*_{f} times the bracketed expression (∈[0, 1]).
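The Gauss–Lobatto nodes, weights, and derivative matrix defined above can be sketched numerically. The closed forms below are the standard GLL expressions (equivalent to, though not copied from, the paper's notation); the derivative-matrix formula is the usual one from Canuto et al. (1988):

```python
import numpy as np
from numpy.polynomial import legendre as L

def gauss_lobatto(p):
    """Gauss-Lobatto-Legendre nodes and weights for degree p.
    Interior nodes are the roots of dL_p/dxi; endpoints are +/-1.
    Weights: w_j = 2 / (p * (p+1) * L_p(xi_j)**2)."""
    cp = np.zeros(p + 1); cp[p] = 1.0          # L_p in the Legendre basis
    interior = L.legroots(L.legder(cp))        # roots of L_p'
    nodes = np.concatenate(([-1.0], np.sort(interior), [1.0]))
    weights = 2.0 / (p * (p + 1) * L.legval(nodes, cp) ** 2)
    return nodes, weights

def gll_diff_matrix(p):
    """Standard GLL differentiation matrix: (Df)(xi_i) = sum_j D[i,j] f(xi_j).
    Off-diagonal: L_p(xi_i) / (L_p(xi_j) * (xi_i - xi_j));
    corners: -/+ p*(p+1)/4; interior diagonal: 0."""
    nodes, _ = gauss_lobatto(p)
    cp = np.zeros(p + 1); cp[p] = 1.0
    Lp = L.legval(nodes, cp)
    D = np.zeros((p + 1, p + 1))
    for i in range(p + 1):
        for j in range(p + 1):
            if i != j:
                D[i, j] = Lp[i] / (Lp[j] * (nodes[i] - nodes[j]))
    D[0, 0] = -p * (p + 1) / 4.0
    D[p, p] = p * (p + 1) / 4.0
    return nodes, D
```

For *p* = 2 this yields the familiar nodes (−1, 0, 1) and weights (1/3, 4/3, 1/3), and the derivative matrix differentiates polynomials of degree ≤ *p* exactly.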

Let **ξ** ≡ (*ξ*^{1}, *ξ*^{2})^{T} be the 2D coordinate in the standard square. At the 2D node index *n* ∈ {0, … *N*^{2}_{p} − 1} one has the 2D Gauss–Lobatto node **ξ**_{n} given by **ξ**_{j+N_{p}j′} ≡ (*ξ*_{j}, *ξ*_{j′})^{T}.

### APPENDIX C

#### Construction of the Element Mesh

Each 1D cube edge is subdivided into *N*_{h} ≡ 2*h*^{−1} elements of length *h,* with lower element edges located at *c*_{l} ≡ tan[(*π*/4)(*hl* − 1)] ∈ [−1, 1[ for *l* ∈ {0, … *N*_{h} − 1}. Then for a uniform 2D mesh, element ℓ ∈ {1, … *N*^{g}_{h}} (where the global number of elements *N*^{g}_{h} = 6*N*^{2}_{h}), corner *k* ∈ {0, … 3} is located at **c**_{ℓ,k}, which constitute the columns of

Now introduce the bilinear map

from the standard square to the quadrangle with corners **C**_{k}. Generally the inverse map *ϑ*^{−1}(𝗖, **x**) is not bilinear and involves square roots. In the rectangular, linear case

may be inverted trivially. Thus one defines the ℓth element as the image of the standard square under *ϑ*_{ℓ}, where *ϑ*_{ℓ}(**ξ**) ≡ *ϑ*(𝗰_{ℓ}, **ξ**), with four boundaries

Define longitude and latitude

at which the sphere is tangent to the center of cube face *s.* Then the local 3D Cartesian basis vectors are just

Vector **e**^{ι}_{s} is parallel to cube face *s* for *ι* ≤ 2, and **e**^{3}_{s} is perpendicular. These define 3D location

on cube face *s* and distance

from the sphere center. Then the maps *λ*^{−1}_{s} and *λ*_{s} are given by

respectively, where **x**_{n,ℓ} ≡ *ϑ*_{ℓ}(*ξ*_{n}) ∈ _{ℓ} is the global node and *s*_{ℓ} denotes the cube face containing element ℓ.
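A minimal sketch of the face-to-sphere map for an equatorial face follows the construction above: form the 3D point from the local basis vectors, normalize by the distance *r* from the sphere center, and read off longitude and latitude. The basis-vector and face-numbering conventions here are assumptions for illustration, not necessarily SEAM's:

```python
import numpy as np

def face_to_sphere(x1, x2, s=0):
    """Map cube-face coordinates (x1, x2) in [-1,1]^2 to (lon, lat) in
    radians, for an equatorial face s in {0,1,2,3} assumed centered at
    longitude s*pi/2 (basis and numbering conventions are assumptions)."""
    lam0 = s * np.pi / 2
    e1 = np.array([-np.sin(lam0), np.cos(lam0), 0.0])   # eastward on face
    e2 = np.array([0.0, 0.0, 1.0])                      # northward on face
    e3 = np.array([np.cos(lam0), np.sin(lam0), 0.0])    # face-center normal
    X = x1 * e1 + x2 * e2 + e3      # 3D location on the cube face
    X /= np.linalg.norm(X)          # divide by distance r: project to sphere
    return np.arctan2(X[1], X[0]), np.arcsin(X[2])
```

For face 0, the face center (0, 0) maps to (0, 0) and the edge point (1, 0) maps to longitude *π*/4, consistent with each equatorial face spanning a quarter of the equator.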

The cube edges {**x** : *x*^{2} = ±1(|*x*^{1}| ≤ 1)} are mapped by *λ*_{s} to the curves

while the edges {**x** : *x*^{1} = ±1(|*x*^{2}| ≤ 1)} go to the curves

for *s* < 4 (“equatorial faces”) and to the curves

for *s* ≥ 4 (“polar faces”).
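The equiangular edge partition *c*_{l} = tan[(*π*/4)(*hl* − 1)] defined at the start of this appendix is easy to sketch (the upper edge *c*_{N_h} = +1 is appended here for convenience):

```python
import numpy as np

def edge_partition(N_h):
    """Element edges c_l = tan((pi/4)*(h*l - 1)), h = 2/N_h, subdividing
    a cube edge [-1, 1]; equally spaced in the gnomonic angle, hence
    'equiangular'.  Includes the upper edge c_{N_h} = +1."""
    h = 2.0 / N_h
    l = np.arange(N_h + 1)
    return np.tan(np.pi / 4 * (h * l - 1))
```

The partition is symmetric about the face center and slightly clustered there, compensating the gnomonic map's stretching toward the cube corners.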

The value of the metric tensor at **x**_{n,ℓ} is 𝗴_{n,ℓ} ≡ (𝗝_{n,ℓ})^{T}𝗝_{n,ℓ}, where

are the rescaled Jacobians at **x**_{n,ℓ}. Enforcing continuity by (14) employs the mass matrix *M*_{n,ℓ} ≡ *w*_{j}*w*_{j′}*J*_{n,ℓ}|∂*ϑ*_{ℓ}(**ξ**_{n})/∂**ξ**| for *n* = *j* + *N*_{p}*j*′, including the determinants of the mapping Jacobians.
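The bilinear map *ϑ* from the standard square to a quadrangle, used above to define each element, can be sketched with the usual four bilinear shape functions (the counterclockwise corner ordering is an assumption, not necessarily the paper's):

```python
import numpy as np

def bilinear_map(C, xi):
    """Bilinear map from the standard square [-1,1]^2 to the quadrangle
    whose corners are the columns of C (2x4).  Assumed corner ordering:
    k = 0,1,2,3 <-> (-1,-1), (1,-1), (1,1), (-1,1)."""
    x1, x2 = xi
    phi = 0.25 * np.array([(1 - x1) * (1 - x2), (1 + x1) * (1 - x2),
                           (1 + x1) * (1 + x2), (1 - x1) * (1 + x2)])
    return C @ phi   # convex-bilinear combination of the corners
```

On the standard square itself the map is the identity at the corners, and the center **ξ** = (0, 0) goes to the corner centroid, as expected.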

### APPENDIX D

#### Element-Connectivity Notation

First make a correspondence between 2D indexes *n* and nodes on the four element boundaries:

Then obtain 3D distance squared between spherical points as

We can thereby define element-connectivity index sets

(both of size *N*_{b}) by a maximum 3D distance Δ(*λ*_{n,ℓ}, *ϕ*_{n,ℓ}, *λ*_{n′,ℓ′}, *ϕ*_{n′,ℓ′}) ≤ 10^{−9} for all (*n, **n*′) ∈ ^{2}_{b}, (ℓ, ℓ′) ∈ ^{2}_{b}, and ℓ ≠ ℓ′, where the number of elements sharing the *b*th boundary node is

These relationships define a unique global boundary-node index *b* ∈ {1, … *N*_{bn}}, where the global number of possible element-boundary nodes on the cubed sphere is *N*_{bn} [≡ 2 + (2*p* − 1)*N*^{g}_{h} for uniform subdivision].
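The connectivity test above, matching boundary nodes of different elements by their 3D (chordal) distance, can be sketched as follows (a simplified illustration of the idea, not SEAM's implementation):

```python
import numpy as np

def chord2(lam1, phi1, lam2, phi2):
    """Squared 3D chordal distance between two unit-sphere points
    given as (longitude, latitude) in radians."""
    p1 = np.array([np.cos(phi1) * np.cos(lam1),
                   np.cos(phi1) * np.sin(lam1), np.sin(phi1)])
    p2 = np.array([np.cos(phi2) * np.cos(lam2),
                   np.cos(phi2) * np.sin(lam2), np.sin(phi2)])
    return float(np.sum((p1 - p2) ** 2))

def shared_nodes(bound1, bound2, tol=1e-9):
    """Pairs (n, n') of boundary-node indices, one per element, whose
    spherical positions coincide to within tol -- i.e. nodes shared
    between two adjacent elements."""
    return [(n, m) for n, (la, ph) in enumerate(bound1)
                   for m, (lb, pb) in enumerate(bound2)
                   if chord2(la, ph, lb, pb) <= tol]
```

Matching by 3D distance rather than by (λ, ϕ) differences avoids the ambiguity of longitude near the poles and across the 0/2π seam.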

## Footnotes

*Corresponding author address:* Dr. Aimé Fournier, NCAR, P.O. Box 3000, Boulder, CO 80307-3000. Email: fournier@ucar.edu

^{*} The National Center for Atmospheric Research is sponsored by the National Science Foundation.