## 1. Introduction

Numerical weather prediction (NWP) operates under a strict real-time constraint: a window of approximately 1 h is available to run a medium-range global forecast that can be delivered in time to its customers. While computational efficiency remains one of the most pressing needs of NWP, there is an open question about how to make the most efficient use of the affordable computer power that will be available over the next decades, while seeking the most accurate forecast possible.

The spectral transform method has been successfully applied at the European Centre for Medium-Range Weather Forecasts (ECMWF) for approximately 30 years, with the first spectral model introduced into operations at ECMWF in April 1983. Spectral transforms on the sphere involve discrete spherical harmonics transformations between physical (gridpoint) space and spectral (spherical harmonics) space. The spectral transform method was introduced to NWP following the work of Eliasen et al. (1970) and Orszag (1970), who pioneered the efficiency gained by partitioning the computations. One part of the computations is performed in physical space, where products of terms, the semi-Lagrangian or Eulerian advection, and the physical parameterizations are computed. The other part is solved in spectral space, where the Helmholtz equation arising from the semi-implicit time-stepping scheme can be solved easily and horizontal gradients on the (reduced) Gaussian grid are computed accurately, particularly the Laplacian operator that is so fundamental to the propagation of atmospheric waves. The success of the spectral transform method in NWP in comparison to alternative methods has been overwhelming, with many operational forecast centers having made the spectral transform their method of choice, as comprehensively reviewed in Williamson (2007).

A spherical harmonics transform is a Fourier transformation in longitude and a Legendre transformation in latitude, thus keeping a latitude–longitude structure in gridpoint space. The Fourier transform part of a spherical harmonics transform is computed numerically very efficiently by using the fast Fourier transform (FFT; Cooley and Tukey 1965), which reduces the computational complexity to ∝ *N*^{2} log*N*, where *N* symbolizes the cutoff spectral truncation wavenumber. However, the conventional Legendre transform has a computational complexity ∝ *N*^{3}, and with increasing horizontal resolution the Legendre transform will eventually become the most expensive part of the computations in terms of the number of floating-point operations and subsequently the elapsed (wall clock) time required. Because of the relative cost increase of the Legendre transforms compared to the gridpoint computations, very high-resolution spectral models are believed to become prohibitively expensive, and methods based on finite elements or finite volumes on alternative quasi-uniform grids covering the sphere are actively pursued (e.g., Staniforth and Thuburn 2012; Cotter and Shipton 2012).
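The scalings quoted above can be made concrete with a back-of-the-envelope operation count: the Legendre sums cost O(*N*) per (wavenumber, latitude) pair and therefore O(*N*^{3}) in total, while the FFT part costs only O(*N* log*N*) per latitude. In the sketch below, the constants (2 flops per sum term, roughly 5*L* log₂*L* per FFT of length *L*, *L* = 2*N* longitude points, *N* + 1 latitudes) are illustrative assumptions of ours, not measured IFS numbers.

```python
# Back-of-the-envelope operation counts for one transform at triangular
# truncation N. Constants (2 flops per sum term, ~5*L*log2(L) per FFT of
# length L, L = 2N longitude points, N + 1 latitudes) are illustrative
# assumptions, not measured IFS numbers.
import math

def legendre_flops(N, nlat=None):
    """Naive Legendre sums: for each zonal wavenumber m, a sum over
    n = m, ..., N at every Gaussian latitude (2 flops per term)."""
    nlat = (N + 1) if nlat is None else nlat
    return sum(2 * (N - m + 1) * nlat for m in range(N + 1))

def fft_flops(N, nlat=None):
    """FFTs in longitude: roughly 5*L*log2(L) flops per latitude row."""
    nlat = (N + 1) if nlat is None else nlat
    L = 2 * N
    return round(nlat * 5 * L * math.log2(L))

# Doubling N multiplies the Legendre cost by ~8 (the N^3 growth), while
# the FFT cost grows only slightly faster than 4x (the N^2 log N growth):
ratio = legendre_flops(2048) / legendre_flops(1024)
```

Under these assumptions the Legendre part dominates at high resolution, which is exactly the regime the FLT targets.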

In this article the implementation of a fast Legendre transform (FLT) into the parallel spherical harmonics transform package of ECMWF is described. A full description of the Integrated Forecast System (IFS) parallelization scheme is contained in Barros et al. (1995).

Legendre transforms involve sums of products between associated Legendre polynomials at given Gaussian latitudes and corresponding spectral coefficients of the particular field (such as temperature or vorticity at a given level). The FLT algorithm is based on the fundamental idea that for a given zonal wavenumber all the values of the associated Legendre polynomials at all the Gaussian latitudes of the model grid have similarities that may be exploited in such a way that one does not have to compute all the sums. Rather, FLTs precompute a compressed (approximate) representation of the matrices involved in the original sums and apply this compressed (reduced) representation instead of the full representation at every time step of the model simulation. The FLTs are based on the seminal work of Tygert (2008, 2010), and in particular the algorithm for the rapid evaluation of special functions described in O'Neil et al. (2010) and the efficient interpolative decomposition (ID) matrix compression technique described in Cheng et al. (2005). A more detailed description of the implementation of the FLTs in the IFS and the overall spherical harmonics transform is given in section 2, together with a discussion of their applicability on future massively parallel computing architectures.

There have been earlier attempts to accelerate the Legendre transforms for use in NWP (Suda and Takami 2002; Suda 2005). These had similar asymptotic cost projections and followed ideas quite similar to those in Tygert (2008), but earlier attempts at ECMWF based on Suda's work failed to derive a numerically stable algorithm for large truncation wavenumbers *N*.

This article is organized as follows: section 2 describes the parallel spherical harmonics transform as implemented in ECMWF's IFS transform package including the FLT acceleration. Section 3 discusses the computational performance of the FLT algorithm and shows the computational cost at different horizontal resolutions for actual NWP forecasts. Finally, section 4 discusses the use of the spectral transform method on future massively parallel computer architectures and draws some conclusions.

## 2. The fast spherical harmonics transform

### a. The spherical harmonics transform

The spectral representation of a scalar field *ζ* on a vertical level *l* at or above the surface of a sphere is given by

$$\zeta(\theta, \lambda, l) = \sum_{m=-M}^{M} \sum_{n=|m|}^{N(m)} \zeta_n^m(l)\, \overline{P}_n^m(\cos\theta)\, e^{im\lambda}, \tag{1}$$

where *θ* and *λ* denote colatitude (*φ* ≡ 90 − *θ* is the geographical latitude with 0 at the equator) and longitude, respectively; the $\zeta_n^m(l)$ are the spectral coefficients of field *ζ* at level *l*; and the $\overline{P}_n^m$ are the associated Legendre polynomials of degree *n* and order *m* as a function of latitude only. The normalization is defined by

$$\frac{1}{2}\int_{-1}^{1} \left[\overline{P}_n^m(\mu)\right]^2 d\mu = 1, \qquad \mu = \cos\theta.$$

The quantities *M* and *N*(*m*) symbolize the cutoff spectral truncation wavenumber in the spherical harmonics expansion, and the choice of *N*(*m*) specifies the truncation type. In the IFS the triangular truncation *N*(*m*) = *M* is chosen, favoring a quasi-isotropic distribution (cf. Naughton et al. 1996). Equation (1) represents the discrete inverse spherical harmonics transform that resynthesizes the gridpoint representation from the spectral coefficients.

The direct spherical harmonics transform computes the spectral coefficients from the gridpoint values: a Fourier transform in longitude yields the Fourier coefficients *ζ*_{m}(*θ*, *l*), and the subsequent Legendre transform is evaluated by Gaussian quadrature for each zonal wavenumber *m* as

$$\zeta_n^m(l) = \sum_{k=1}^{K} w_k\, \zeta_m(\theta_k, l)\, \overline{P}_n^m(\cos\theta_k), \tag{5}$$

with quadrature weights *w*_{k} and *K* = (2*N* + 1)/2 (linear grid) special quadrature points ("Gaussian latitudes") given by the roots of the ordinary Legendre polynomials.

### b. Precomputations

For ultrahigh horizontal resolution (*N* ≥ 1000) the cost and the accuracy of the precomputations also become important. In the IFS, the associated Legendre polynomials are distributed over the processors with respect to the zonal wavenumber *m*. Because of this parallel distribution, memory savings resulting from the FLT algorithm (Tygert 2008) are modest in our multiprocessor simulations (≈10% per compute node at T7999). However, at T7999 and higher resolutions this aspect may become significant on future computing architectures, and when applying spectral transforms in postprocessing and plotting applications (typically done on single processors) the memory requirements for storing the polynomials grow rapidly with *N*.

The polynomials are computed with the recurrence formula of Belousov (1962), which remains numerically stable even for large *N* provided that the initial values for the recurrence are computed sufficiently accurately. The accuracy for large *N* can be improved by computing the polynomials in extended (quadrupole) precision, but typically this bears a high computational penalty. Alternatively, Swarztrauber (2002) suggested a Fourier representation of the ordinary Legendre polynomials together with simple formulas to evaluate the corresponding coefficients. The resulting computation remains accurate for large *N* while maintaining all computations in double precision as opposed to extended precision.

However, the cost of the precomputations (over all wavenumbers *m*) is still unnecessarily high because the Belousov recurrence formula involves varying degrees and orders of the "intermediate" associated polynomials, whereas ideally one would like a formula that only involves a single *m*. While there are such formulas readily available in standard references (e.g., Abramowitz and Stegun 1972), these are numerically unstable beyond *N* ≈ 20 (Belousov 1962). For all *m* ≥ 2 we use Eq. (42) in Tygert (2008), which defines the associated Legendre polynomials^{1} in terms of the *m*th derivative of the required ordinary Legendre polynomials *P*_{l} with *l* = *m*, *m* + 1, *m* + 2, *m* + 3,

$$P_n^m(x) = (1 - x^2)^{m/2}\, \frac{d^m}{dx^m} P_n(x),$$

with the double factorial entering the starting values defined as (2*m* − 1)!! ≡ 1 × 3 × 5 × ⋯ × (2*m* − 3) × (2*m* − 1). Furthermore, using the formulas for a single *m* halves the cost by requiring only even (odd) values depending on the symmetric (antisymmetric) part. The IFS model exploits symmetry properties and the computations are split into symmetric and antisymmetric components such that a field is given by *h* = *f* + *g* with *f*(*x*) = [*h*(*x*) + *h*(−*x*)]/2 and *g*(*x*) = [*h*(*x*) − *h*(−*x*)]/2, where *f*(−*x*) = *f*(*x*) and *g*(−*x*) = −*g*(*x*) for any *x* ∈ (−1, 1). Correspondingly, *f* and *g* are each defined uniquely on the interval (0, 1) (Tygert 2008) and quadrature of the form (5) is applied to *f* and *g*. Finally, it should be noted that the formula for a single *m* readily facilitates the computation of the Legendre polynomials without the need to store them, which may be necessary on very low memory platforms.
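The symmetric/antisymmetric split can be sketched in a few lines; the grid and test field below are arbitrary illustrative choices of ours.

```python
# The even/odd split used by the IFS: any field h on (-1, 1) decomposes
# uniquely into a symmetric part f and an antisymmetric part g, so each
# Legendre sum need only run over one hemisphere. Minimal sketch.
import numpy as np

def sym_antisym(h, x):
    """f(x) = [h(x) + h(-x)]/2 and g(x) = [h(x) - h(-x)]/2 on a grid
    symmetric about the equator (x = cos(theta), mirrored by sign flip).
    Assumes x is sorted so that reversing the array negates it."""
    h_mirror = h[::-1]
    f = 0.5 * (h + h_mirror)
    g = 0.5 * (h - h_mirror)
    return f, g

x = np.linspace(-0.99, 0.99, 8)     # symmetric grid, no equator point
h = np.exp(x) + x**3                # arbitrary test field
f, g = sym_antisym(h, x)

# h = f + g, f is even, g is odd:
assert np.allclose(f + g, h)
assert np.allclose(f, f[::-1])
assert np.allclose(g, -g[::-1])
```

Since *f* and *g* are determined by their values on one hemisphere, quadrature sums over the Gaussian latitudes can be restricted to half the points, which is the factor-of-two saving mentioned above.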

### c. The fast Legendre transform

The computational cost of the conventional spectral transform method scales according to *N*^{3} as a result of the cost of evaluating all the sums involved in (1) (Tygert 2008). Since the fastest numerical methods used in geophysical fluid dynamics scale linearly with the number of grid points (i.e., proportional to *N*^{2}), the cost of the Legendre transforms would not be competitive at sufficiently high resolution. However, even at *N* = 2047 (10 km) the very high rate of floating-point operations per second (flops) achieved in matrix–matrix multiplications used within the spectral computations masks the ∝ *N*^{3} cost of this part of the IFS model. All simulations have been performed on the IBM POWER7 775 cluster installed at ECMWF. In all comparisons shown in this paper we have used the IBM cache/processor-optimized implementation of the matrix–matrix multiply routine dgemm from the IBM ESSL library, which is substantially faster than a naive implementation of the required sums and provides a stringent test for the possible speedups achieved in practice.
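To make the dgemm comparison concrete, the following sketch contrasts a naive triple-loop evaluation of the required sums with a BLAS-backed matrix product (NumPy's `@` typically dispatches to an optimized gemm, an assumption about the local NumPy build). Both compute the same result; the optimized route is the wall-clock baseline the FLT must beat. The matrix sizes are small illustrative stand-ins.

```python
# Naive evaluation of the Legendre sums vs an optimized BLAS gemm:
# identical results, very different wall-clock behavior in practice.
import numpy as np

def naive_matmul(A, B):
    """Textbook triple loop: the 'required sums' evaluated directly."""
    r, s = A.shape
    s2, c = B.shape
    assert s == s2
    C = np.zeros((r, c))
    for i in range(r):
        for j in range(c):
            for k in range(s):
                C[i, j] += A[i, k] * B[k, j]
    return C

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 30))   # stand-in for the Legendre matrix
B = rng.standard_normal((30, 20))   # stand-in for spectral coefficients

# Same sums; A @ B (BLAS gemm in typical NumPy builds) is the
# cache-optimized path analogous to the ESSL dgemm baseline.
assert np.allclose(naive_matmul(A, B), A @ B)
```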

If $\boldsymbol{\zeta}_m$ denotes the matrix of spectral coefficients of a field *ζ* represented on *l*_{tot} vertical levels, we may write the inverse Legendre transform for each zonal wavenumber *m* as a matrix–matrix multiply problem of the following form:

$$\mathbf{S}_m = \mathbf{P}_m\, \boldsymbol{\zeta}_m, \tag{13}$$

where the left-hand side $\mathbf{S}_m$ contains the Fourier coefficients of the field for wavenumber *m* at all levels and at all Gaussian latitudes. The right-hand side represents the matrix of the associated Legendre polynomial coefficients of order *m* for all total wavenumbers *n* and at all Gaussian latitudes, symbolically denoted by $\mathbf{P}_m$, multiplied by the matrix $\boldsymbol{\zeta}_m$ of spectral coefficients of order *m* at all levels and for all total wavenumbers *n*. Notably, (13) exposes the parallelism of the problem, since all *m*, *l* are independent of each other and may be suitably distributed over the processors.

Following Tygert (2010) the FLT algorithm is based on the fundamental idea that for a given zonal wavenumber *m* the matrix $\mathbf{P}_m$ contains numerically rank-deficient blocks, each of which can be compressed with an interpolative decomposition (ID)

$$\mathbf{P}_{r \times s} \approx \mathbf{A}_{r \times k}\, \mathbf{B}_{k \times s},$$

where $\mathbf{A}_{r \times k}$ constitutes a subset of the columns of submatrix $\mathbf{P}_{r \times s}$ and where matrix $\mathbf{B}_{k \times s}$ contains a *k* × *k* identity matrix, with *k* being called the *ε* rank of submatrix $\mathbf{P}_{r \times s}$ (Cheng et al. 2005; Martinsson and Rokhlin 2007). An important point to make is that the application of the ID compression directly on the full original matrix would be too costly; the hierarchical subdivision of the butterfly scheme described below is what keeps both the precomputation and the application of the compressed representation affordable.
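As an illustration of the ID idea, the following sketch computes an interpolative decomposition of a numerically rank-deficient matrix via a column-pivoted QR factorization. This is a simple deterministic construction, not the (often randomized) algorithms of Cheng et al. (2005) used in practice; the function name and the Gaussian-kernel test matrix are our own choices.

```python
# Interpolative decomposition (ID): A ~= A[:, cols] @ B, where B contains
# a k x k identity block on the selected "skeleton" columns. Built here
# from a column-pivoted QR as a deterministic sketch.
import numpy as np
from scipy.linalg import qr, solve_triangular

def interp_decomp(A, eps):
    """Return (cols, B) with A ~= A[:, cols] @ B and k = estimated eps-rank."""
    Q, R, piv = qr(A, mode="economic", pivoting=True)
    d = np.abs(np.diag(R))
    k = int(np.sum(d > eps * d[0]))                 # eps-rank estimate
    T = solve_triangular(R[:k, :k], R[:k, k:])      # R11^{-1} R12
    B = np.zeros((k, A.shape[1]))
    B[:, piv[:k]] = np.eye(k)                       # identity on skeleton cols
    B[:, piv[k:]] = T                               # interpolation coefficients
    return piv[:k], B

# A numerically rank-deficient matrix (values of a smooth kernel):
x = np.linspace(0.0, 1.0, 60)
A = np.exp(-np.subtract.outer(x, x) ** 2)
cols, B = interp_decomp(A, 1e-7)
err = np.linalg.norm(A - A[:, cols] @ B) / np.linalg.norm(A)
# err is small (comparable to the requested eps); len(cols) << 60.
```

The key property mirrored from the text: the factor $\mathbf{A}_{r\times k}$ is literally a subset of the original columns, and $\mathbf{B}$ carries an identity block on those columns, so applying the compressed representation only requires the skeleton columns plus a small interpolation matrix.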

In Tygert (2008) the FLT has been shown to have an asymptotic cost ∝ *N*^{2} log*N*. However, as pointed out in Tygert (2010), there is no such formal proof for the butterfly algorithm. For Wavemoth (Seljebotn 2012), an implementation of the butterfly algorithm for astrophysical applications, an asymptotic scaling of ∝ *N*^{2} log^{2}*N* is suggested. Our results confirm a substantial reduction of the cost in actual NWP simulations. While the horizontal truncations in our simulations are at the lower end of the projections in Tygert (2008), they are very high in the context of NWP and climate, and we find up to resolutions of T7999 that the spectral transforms scale according to *N*^{2} log^{3}*N*.

The “butterfly method” is described in Tygert (2010) and further illustrated in Seljebotn (2012). The subdivisions into smaller parts, doubling the columns and halving the rows with each level, are illustrated in Fig. 1 (see also O'Neil et al. 2010). The main purpose of the subdivisions is to create rank-deficient blocks that can be compressed. At level 0 the submatrices are compressed and the residual is stored. At the next level in the tree the (compressed) left and right neighbors of the previous level form the submatrices of this level and are subsequently compressed. A schematic (pseudocode) illustration of the butterfly algorithm based on O'Neil et al. (2010) is given in the appendix. From our point of view, a heuristic explanation of why the butterfly algorithm works well for applications on the sphere lies in the structure of the spherical harmonics functions. Figure 2 illustrates two important aspects: first, the amplitude of the spherical harmonics for large *m* is negligibly small toward the poles at many Gaussian latitudes (rows of the matrix), in particular for large *N* (Fig. 2a). Second, spherical harmonics at neighboring total wavenumbers *n* (represented by neighboring columns of the matrix) are very similar except near the equator (Fig. 2b), so that blocks of neighboring columns are nearly linearly dependent and hence compressible.
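This near-linear dependence of neighboring columns can be checked numerically. The sketch below builds a polar-latitude block of associated Legendre polynomial values (unnormalized, scipy convention) and estimates its numerical *ε*-rank; the sizes (m = 20, 64 neighboring degrees, the 64 polar-most of 512 Gaussian latitudes) are illustrative choices of ours, not IFS settings.

```python
# A polar-latitude block of the Legendre matrix is numerically rank
# deficient: toward the poles, neighboring columns (degrees n) are
# nearly linearly dependent, which is what the butterfly exploits.
import numpy as np
from numpy.polynomial.legendre import leggauss
from scipy.special import lpmv

m = 20
degrees = np.arange(m, m + 64)          # neighboring columns n = m, ..., m+63
mu, _ = leggauss(512)                   # Gaussian latitudes mu = cos(theta)
band = mu[-64:]                         # polar band: the 64 largest mu values

# Unnormalized associated Legendre values P_n^m(mu), scipy convention:
P = np.array([[lpmv(m, n, x) for n in degrees] for x in band])
P /= np.linalg.norm(P, axis=0)          # column scaling only

s = np.linalg.svd(P, compute_uv=False)
eps_rank = int(np.sum(s > 1e-7 * s[0])) # numerical eps-rank of the block
# eps_rank < 64 here: the block is compressible by an ID.
```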

(a) Comparison of spherical harmonics functions at *n* = 3999 (solid) and *n* = 3998 (dashed) along a single meridian. (b) Zoom in of the near-equatorial region for one hemisphere.

Citation: Monthly Weather Review 141, 10; 10.1175/MWR-D-13-00016.1


The butterfly FLT algorithm fits very well into the existing parallel transform package of the IFS. The only modification required in the precomputations is an additional step to set up and store the compressed symmetric and antisymmetric representations of the matrix of associated Legendre polynomial coefficients for each zonal wavenumber.

In the current implementation we do not make use of the “depth-first traversal” advocated in Tygert (2010), and due to the parallel distribution over the zonal wavenumbers *m* and over the levels/fields, the memory usage per MPI task is sufficiently low, eliminating the need to recompute the coefficients every time step or otherwise change the programmatic flow. Nevertheless, the suggested optimizations in Seljebotn (2012) regarding the minimization of communications per time step through reorganization of the work flow may be explored in the future.

## 3. Results

There are essentially three tuning parameters that control the behavior of our implementation of the FLT algorithm. The first parameter is the number of columns *I*_{MAXCOLS} in each submatrix at level 0 (cf. the appendix). The IFS simulations appear not to be very sensitive to this parameter, so *I*_{MAXCOLS} = 128 is chosen for all horizontal resolutions greater than T159 to minimize wall-clock execution time. However, if the aim were to minimize the number of floating-point operations, *I*_{MAXCOLS} = 64 is found to be the better choice, and the choice can indeed be important on different computer architectures. The second parameter is a threshold value *I*_{thresh}. If (e.g., for the symmetric part) (*N* − *m* + 2)/2 ≤ *I*_{thresh}, the standard matrix–matrix multiplication dgemm is more efficient (in wall-clock time) and is applied instead of the FLT. The thresholds used are (256, 512, 600, 900) for the resolutions (799, 1279, 2047, ≥3999), respectively. The corresponding butterfly precomputations are also omitted below the given threshold. The third parameter, which is the only one that introduces an approximation to the algorithm, is the accuracy *ε* required in the compression part of the algorithm. We find that with *ε* = 10^{−7}, equivalent meteorological accuracy is obtained, as measured in terms of hemispheric root-mean-square error (rmse) and anomaly correlation of 500-hPa geopotential height and other parameters (not shown), the scores typically used to verify technical model changes. A stronger compression does reduce the computational cost further, but it impacts the meteorological forecast results negatively. To illustrate the meteorological impact of the lossy compression of the ID algorithm (i.e., a loss of accuracy compared to not doing any compression), several T2047 10-day forecast simulations have been run using *ε* = 10^{−10}, *ε* = 10^{−7}, *ε* = 10^{−3}, and *ε* = 10^{−2}, together with a control simulation using dgemm.
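The threshold logic can be sketched as a simple dispatch per zonal wavenumber. The threshold table is taken from the text; the helper name `use_flt` is our own.

```python
# Hybrid dispatch per zonal wavenumber m: at or below a
# resolution-dependent threshold the plain dgemm path wins in wall-clock
# time; above it the butterfly FLT is used. Thresholds from the text.
THRESHOLDS = {799: 256, 1279: 512, 2047: 600, 3999: 900, 7999: 900}

def use_flt(N, m):
    """True if the butterfly FLT is applied for wavenumber m at truncation N
    (quoted criterion for the symmetric part: (N - m + 2)/2 <= I_thresh -> dgemm)."""
    return (N - m + 2) / 2 > THRESHOLDS[N]

# At T3999 the FLT covers roughly the wavenumbers m < 2200 quoted in the
# text; dgemm handles the short sums near m = N.
flt_count = sum(use_flt(3999, m) for m in range(3999 + 1))
```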
Figure 3 illustrates that with *ε* = 10^{−7} the accuracy (with respect to rms and anomaly correlation of the 500-hPa geopotential height surface) is very close to the control, while still offering substantial computational savings at higher resolutions as discussed below. Figure 4 illustrates how the ID algorithm alters global prognostic variables by retaining the energetically dominant part (as determined by the lower wavenumbers of the spectrum) up to a selected accuracy. Notably, the ID is closely related to low-rank matrix approximations via singular value decomposition (SVD), where only the largest singular values are retained. In contrast to spectral truncation, the remainder (the smallest scales or largest wavenumbers in the spectrum), which may be viewed as less predictable, is perturbed but not set to zero. The effect of lossy compression in the transform part of the model can be identified in the noise visible in the resulting kinetic energy spectra for different values of *ε* (Fig. 4). Notably, the simulation with *ε* = 10^{−2} becomes numerically unstable after approximately 5 simulation days.

Comparison of (a) 500-hPa geopotential height rmse and (b) anomaly correlation in the Northern Hemisphere for several T2047 10-day forecast simulations with selected values of the compression accuracy *ε*. The control forecast (no FLT, solid, open triangles) is compared with FLT simulations using *ε* = 10^{−10} (dotted, squares), *ε* = 10^{−7} (dash–dotted, open circles), and *ε* = 10^{−3} (dashed, full circles).

Citation: Monthly Weather Review 141, 10; 10.1175/MWR-D-13-00016.1


Comparison of kinetic energy spectra at T2047 resolution for different values of *ε*. The value *ε* = 10^{−2} (solid line) is clearly extreme but for *ε* = 10^{−3} (dotted line) only the small-scale part of the spectrum of kinetic energy is perturbed.

Citation: Monthly Weather Review 141, 10; 10.1175/MWR-D-13-00016.1


The simulations using the new FLTs are stable with *ε* ≤ 10^{−3}, and the FLT results from actual global NWP simulations up to T7999 resolution shown below have been obtained with *ε* = 10^{−7}. Meteorological results from these ultrahigh-resolution simulations for NWP will be reported elsewhere. However, to illustrate the capabilities and the numerical stability of the simulations, Fig. 5a shows the global kinetic energy spectra of the IFS simulations using the FLTs at different resolutions, averaged every 12 h over the first 5 forecast days. The T7999 spectrum obtained from the FLT simulation is an instantaneous spectrum after 6 h of simulation (the end of that simulation). The kinetic energy spectra of equivalent runs using the standard matrix–matrix multiplication dgemm are shown in Fig. 5b, showing identical spectra at all resolutions up to T3999. Notably, we did not obtain a spectrum for the T7999 simulation without FLTs, since postprocessing had to be disabled to save sufficient memory for that simulation to succeed; it therefore produced only (ASCII) timing information used in the quantitative comparisons.

Comparison of global kinetic energy spectra averaged over the first 5 forecast days at 500-hPa height. (a) The simulation results with FLTs and (b) the simulations using dgemm. T1279 (≈16 km) (dotted) is the current operational resolution. For comparison the resolutions T799 (≈25 km, dash–dotted), T2047 (≈10 km, dashed), T3999 (≈5 km, solid), and T7999 (≈2.5 km, long dash) are shown. Note, that all simulations were initialized with a T1279 analysis. The T7999 represents an instantaneous spectrum after a 6-h forecast.

Citation: Monthly Weather Review 141, 10; 10.1175/MWR-D-13-00016.1


Figure 6 shows the measured number of floating-point operations used for the inverse or direct spectral transform of a single global field at different horizontal resolutions, using either the FLTs or the dgemm routine. The numbers have been scaled with *N*^{2} log^{3}*N*. At T7999 resolution, approximately 4 times fewer floating-point operations are required using the FLTs, and the scaling implies an asymptotic increase of the number of operations ∝ *N*^{2} log^{3}*N* with the FLTs and ∝ *N*^{3} with the dgemm routine. At lower resolutions there are savings in the number of floating-point operations but not in wall-clock time, since the standard dgemm matrix–matrix multiplications are highly optimized and use the memory caches efficiently. When applied to large matrices, these dgemm calls in the IFS run at much higher computational rates (4–7 gigaflops sustained) than other parts of the IFS model (e.g., physical parameterizations such as the vertical diffusion scheme, which run at 0.7–1 gigaflops or less). Figure 7 illustrates that the savings in the number of floating-point operations are not evenly distributed across the wavenumber space, with the largest savings for large *N* − *m* values. Therefore the threshold value *I*_{thresh} has been introduced, below which the standard matrix–matrix multiplication dgemm is used for maximum efficiency in wall-clock time. Notably, all smaller-sized submatrices within the FLT tree level structure also use dgemm.

Number of floating-point operations used for the inverse or direct spectral transform of a single global field at different horizontal resolutions using the FLTs (right bars) compared to using dgemm (left bars). Numbers are scaled with *N*^{2} log^{3}*N*.

Citation: Monthly Weather Review 141, 10; 10.1175/MWR-D-13-00016.1


Number of floating-point operations (×10^{6}) required for the different zonal wavenumbers (*m*). In this example taken from a T3999 simulation, (*N* − *m* + 2)/2 ≤ *I*_{thresh} = 900, which means that for all *m* < 2200 the butterfly algorithm is applied. The simulation has been optimized for wall-clock time, resulting in a sharp drop in the number of floating-point operations used when switching to the FLT algorithm (solid line). The small steps in the curve relate to the levels of subdivisions in the butterfly algorithm.

Citation: Monthly Weather Review 141, 10; 10.1175/MWR-D-13-00016.1


Figure 8 shows the average total wall-clock time computational cost (in milliseconds) per spectral transform during actual NWP simulations at different horizontal resolutions, comparing simulations with the new FLTs and simulations using dgemm. Both show a significant cost increase when increasing the horizontal resolution globally from T3999 (≈5-km horizontal grid length) to T7999 (≈2.5 km). In Fig. 9 the cost of 10^{7} transforms (derived from the same simulations) is scaled by *N*^{2} log^{3}*N*. The results imply an asymptotic increase that matches the results for the number of floating-point operations (i.e., ∝ *N*^{2} log^{3}*N*) when using the FLTs, up to the largest tested resolution T7999. Comparing Figs. 9 and 6, it is clear that the relative reduction in floating-point operations translates into wall-clock savings only with a delay, because of the improved efficiency of dgemm for larger matrices. Notably, for the optimized dgemm the wall-clock time increase with increased resolution scales much better than the ∝ *N*^{3} increase of the floating-point operations.

Average total wall-clock time computational cost (in milliseconds) per transform as obtained from actual NWP simulations with 91 vertical levels and different horizontal resolutions for simulations with FLTs (right bars) and with the standard dgemm (left bars), respectively.

Citation: Monthly Weather Review 141, 10; 10.1175/MWR-D-13-00016.1


Average total wall-clock time computational cost (in milliseconds) of 10^{7} transforms for the same simulations as in Fig. 8, but scaled with *N*^{2} log^{3}*N*, illustrating the approximate scaling found when using the FLTs (right bars) up to the largest tested T7999 horizontal resolution. Notably, with the optimized dgemm routine a scaling in wall-clock time that is much better than *N*^{3} is obtained in the tested range.

Citation: Monthly Weather Review 141, 10; 10.1175/MWR-D-13-00016.1


There is also a long history in improving the efficiency of the spectral transform method by reducing the number of transforms actually required (cf. Temperton 1991). In particular, the introduction of the semi-Lagrangian advection removed the requirement to explicitly evaluate the nonlinear advective derivatives and thus eliminated the need to spectrally transform the moist prognostic variables and the chemical tracers, representing a substantial cost saving. The typical average number of spectral transforms per time step in the nonhydrostatic IFS, *N*_{TFs}, can be calculated as *N*_{TFs} = *I* × [*l*_{tot} × (*N*_{prog3D} + *N*_{der3D} + *N*_{filt}) + *N*_{prog2D} + *N*_{der2D}], where *N*_{prog3D} and *N*_{prog2D} denote the minimum number of three-dimensional and two-dimensional prognostic variables required. The sum of *N*_{der3D} and *N*_{der2D} denote the number of horizontal derivatives requiring extra transforms, and *N*_{filt} ≡ 2 denotes extra transforms related to a dealiasing filter applied in the IFS. All these numbers are different for inverse and direct Legendre transforms and are also different for the Fourier transforms, since for example for the zonal and meridional winds (*u*, *υ*) meridional derivatives can be derived from zonal derivatives and thus only require two extra Fourier transforms (Temperton 1991). The iteration counter *I* indicates the number of iterations in the iterative-centered-implicit (ICI) time-stepping scheme (Bénard 2003; Bénard et al. 2010). The primary reason why the hydrostatic model is approximately half the overall cost of the nonhydrostatic IFS model is the substantially reduced number of required spectral transforms in the former. 
Specifically, *I* = 1 in the hydrostatic model with (*N*_{prog3D}, *N*_{prog2D}, *N*_{der3D}, *N*_{der2D}) = (4, 2, 1, 2) for the inverse Legendre transforms, and *I* = 2 with (7, 2, 3, 2) in the nonhydrostatic case. For the inverse Fourier transforms the number of three-dimensional derivative transforms increases by two (*N*_{der3D} + 2), whereas there are no derivatives in the direct Fourier and Legendre transforms and the number of three-dimensional prognostic transforms decreases by two (*N*_{prog3D} − 2). All simulations in the comparison have been run with the nonhydrostatic IFS model with *l*_{tot} = 91 vertical levels for each three-dimensional prognostic variable. It should be noted, however, that simulations up to T2047 are usually run with the hydrostatic option, with indiscernible meteorological results. The above numbers do not include additional transforms at initial time and for postprocessing purposes.
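As an illustration, the transform-count formula above can be evaluated for the two configurations quoted in the text (the counts for the inverse Legendre transforms; the helper function name is ours):

```python
def n_transforms(I, l_tot, n_prog3d, n_der3d, n_filt, n_prog2d, n_der2d):
    """Average number of spectral transforms per time step:
    N_TFs = I * [l_tot * (N_prog3D + N_der3D + N_filt) + N_prog2D + N_der2D]."""
    return I * (l_tot * (n_prog3d + n_der3d + n_filt) + n_prog2d + n_der2d)

# Hydrostatic IFS: I = 1, (N_prog3D, N_prog2D, N_der3D, N_der2D) = (4, 2, 1, 2)
hyd = n_transforms(1, 91, 4, 1, 2, 2, 2)   # 1 * (91 * 7 + 4) = 641
# Nonhydrostatic IFS: I = 2, (7, 2, 3, 2)
nh = n_transforms(2, 91, 7, 3, 2, 2, 2)    # 2 * (91 * 12 + 4) = 2192
print(hyd, nh)  # 641 2192
```

The roughly 3.4× larger inverse Legendre transform count in the nonhydrostatic case illustrates why reducing the number of required transforms dominates the cost difference between the two model versions.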

## 4. Discussion and conclusions

The European Centre for Medium-Range Weather Forecasts (ECMWF) plans to implement a global horizontal resolution of approximately 10 km by 2015 for its assimilation and high-resolution forecasts, and approximately 20 km for the ensemble forecasts. The scales resolved at these resolutions are still hydrostatic, and the efficiency and accuracy of the hydrostatic, semi-Lagrangian, semi-implicit solution procedure using the spectral transform method are shown to remain a relevant benchmark for NWP applications at this resolution. The computational efficiency of the hydrostatic IFS model has been further illustrated in the Athena project, an NSF-funded initiative to determine the feasibility of using dedicated supercomputing resources to rapidly accelerate progress in modeling climate variability out to decadal and longer time scales (Kinter et al. 2013; Jung et al. 2012). Over a 6-month period, IFS was run for 3 × 47 yr at T159 resolution, 1 × 47 yr at T511, 3 × 47 yr at T1279, and an additional 19 yr at T2047 resolution. Given the results in section 3 and the frequent need to perform ensembles at medium-range, monthly, seasonal, or climate-scale lengths, the semi-implicit, semi-Lagrangian, spectral transform model still appears to be a competitive method in the medium term, at least as long as only hydrostatic scales are resolved.

Moreover, the results presented in this paper suggest that the concern about the disproportionally growing computational cost of the Legendre transforms with increasing resolution has been mitigated. At T2047 (≈10 km), T3999 (≈5 km), and T7999 (≈2.5 km) horizontal resolutions in actual NWP simulations, we find that the spectral transform computations scale ∝ *N*^{2} log^{3}*N*. The computational cost of all the spectral computations (the spectral transforms and the spectral computations associated with solving the Helmholtz equation arising from the semi-implicit solution procedure) relative to the total computational cost of the model is 33% at T3999 resolution with 91 vertical levels for the nonhydrostatic model with *I* = 2. Here the MPI parallelization over the wavenumbers *m* efficiently distributes the work to reduce the overall wall-clock time and the memory footprint.
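The practical difference between the conventional ∝ *N*^{3} scaling and the observed ∝ *N*^{2} log^{3}*N* scaling can be made concrete with a small back-of-the-envelope sketch (the cost models and function names are ours; the base of the logarithm cancels in the growth ratio):

```python
import math

def cost_conventional(N):
    """Conventional Legendre transform cost model, proportional to N**3."""
    return N**3

def cost_flt(N):
    """Fast Legendre transform (FLT) cost model, proportional to N**2 * log(N)**3."""
    return N**2 * math.log(N)**3

# Cost growth factors when increasing resolution from T2047 to T7999:
g_conv = cost_conventional(7999) / cost_conventional(2047)  # ~60x
g_flt = cost_flt(7999) / cost_flt(2047)                     # ~25x
print(f"conventional: {g_conv:.1f}x, FLT: {g_flt:.1f}x")
```

Under these assumed cost models, roughly quadrupling the truncation wavenumber makes the conventional transform about 60 times more expensive but the FLT only about 25 times, consistent with the mitigated growth reported above.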

The (energy) cost and latency of the parallel communications within the spectral transforms and the communications within the spectral computations, however, have not been addressed in this paper and remain a concern. Spectral-to-gridpoint transformations require data-rich (and energy-inefficient) global communications at every time step that may become too expensive on future massively parallel computers. The latter aspect is investigated as part of the Collaborative Research into Exascale Systemware, Tools and Applications (CRESTA) project, and preliminary results suggest a way forward in the medium term through the use of modern computer language concepts (e.g., PGAS, coarray Fortran) overlapping computations and communications. Comparing the relative computational cost of the spherical harmonics transforms plus all the spectral computations (i.e., solving the Helmholtz equation and other spectral computations) as a percentage of the overall model cost, it is found that at all resolutions and configurations tested the overhead due to the required MPI communications is ≈12% on the IBM cluster, as can be deduced by comparing the left and right columns in Fig. 10 for each horizontal resolution. Each bar in Fig. 10 has been derived considering all gridpoint computations (e.g., related to semi-Lagrangian advection) and physics computations (including the wave model) but without considering I/O, synchronization costs (barriers, in particular in the transpositions), and other ancillary costs. All runs used the nonhydrostatic version of the IFS with 91 vertical levels. The relatively small overhead indicates a good potential for “hiding” the communications behind the computations. However, communication cost may increase substantially with increased numbers of compute threads, and only up to T2047 do the tested configurations reflect the operational use situation of ≥240 simulated forecast days per day.
With the current implementation there is still a small overall relative growth of the total spectral computations cost with increasing resolution. However, it should be noted that (i) the maximum overall cost of all aspects associated with the spectral part of the model is less than 50% of the overall model cost, and (ii) all simulations were done with the existing nonhydrostatic model. For the hydrostatic model the relative cost of all aspects associated with the spectral part is approximately half of the values shown for each horizontal resolution in Fig. 10, that is, less than 25% of the overall model cost. Given the large number of transforms required in the existing nonhydrostatic model, further research in the medium term will thus also focus on removing iterations that involve a duplication of all spectral transforms.

Relative computational cost comparison in percent of the total execution time for the respective IFS simulations. The right bars at every resolution indicate the total cost of the spectral part of the code, including communications related to the spectral computations, the spherical harmonics transforms, and all other spectral computations. The left bars indicate only the computational cost of the transforms and the spectral computations without communication cost. See the text for more details.

Citation: Monthly Weather Review 141, 10; 10.1175/MWR-D-13-00016.1


Moreover, there is some evidence that a substantial further reduction in the wall-clock time computational cost of the spectral transforms may be achieved through applying low-energy general-purpose graphics processing units (GPGPUs). In this context it will be of particular interest to explore the additional parallelism offered by the described butterfly tree-level structure.

Finally, we note the potential opportunities offered by lossy compression in the FLTs as a paradigm for trading numerical accuracy at the smallest scales for energy efficiency in ensemble-based uncertainty estimation.

## Acknowledgments

We thank M. Tygert for his suggestion to implement the butterfly algorithm instead of the Cuppen-based FLT. We also gratefully acknowledge the implementation of a very elegant recursive Cuppen algorithm by Mike Fisher used in an earlier attempt to implement a FLT. Moreover we thank Agathe Untch, Erland Källén, Peter Dueben, and two anonymous reviewers for their comments and suggestions. We are also grateful to Rob Hine for his help with preparing the figures. The author Nils Wedi benefited from a stay at the Newton Institute, Cambridge, during the program on “Multiscale Numerics for the Atmosphere and Ocean.”

## APPENDIX

### Sketch of the Butterfly Algorithm

The matrices arising in the butterfly compression are partitioned into blocks of *N*_{MAXCOLS} columns (except the last one, in the general case). In our implementation, each block is prepared in a compressed matrix representation of the form **A**_{r×s} = **B**_{r×k} **P**_{k×s}, computed to precision *α**ε* with block-dependent factors *α*_{L,j}, and the accuracy of the compressed representation is checked by applying it to an arbitrary vector.
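A compressed representation of the form **A**_{r×s} ≈ **B**_{r×k} **P**_{k×s} is an interpolative decomposition (cf. Cheng et al. 2005; O'Neil et al. 2010). The following NumPy/SciPy sketch builds one via column-pivoted QR; it illustrates the idea only and is not the (Fortran) IFS implementation — the function name and the rank-revealing strategy are ours:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def interpolative_decomposition(A, eps):
    """Compress A (r x s) into B (r x k) @ P (k x s), where B holds k
    actual ("skeleton") columns of A and P contains a k x k identity."""
    # Column-pivoted QR: A[:, perm] = Q @ R, with |diag(R)| non-increasing.
    Q, R, perm = qr(A, mode='economic', pivoting=True)
    diag = np.abs(np.diag(R))
    # Numerical rank k at relative precision eps.
    k = int(np.sum(diag > eps * diag[0]))
    B = A[:, perm[:k]]  # skeleton columns taken directly from A
    # Coefficients expressing the remaining columns in the skeleton basis:
    # A[:, perm[k:]] ~ B @ T with T = R[:k, :k]^{-1} R[:k, k:].
    T = solve_triangular(R[:k, :k], R[:k, k:])
    P = np.zeros((k, A.shape[1]))
    P[:, perm[:k]] = np.eye(k)
    P[:, perm[k:]] = T
    return B, P

# A numerically rank-8 test matrix (60 x 40).
rng = np.random.default_rng(0)
A = rng.standard_normal((60, 8)) @ rng.standard_normal((8, 40))
B, P = interpolative_decomposition(A, 1e-10)
rel_err = np.linalg.norm(A - B @ P) / np.linalg.norm(A)
print(B.shape, rel_err)  # stores 60x8 plus 8x40 instead of 60x40
```

Because **B** consists of actual columns of **A**, applying the compressed operator costs O(k(r + s)) instead of O(rs) per vector, which is the source of the butterfly scheme's savings when the blocks are numerically low rank.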

## REFERENCES

Abramowitz, M., and I. Stegun, Eds., 1972: *Handbook of Mathematical Functions.* Dover Publications, 332–333.

Barros, S. R. M., D. Dent, L. Isaksen, G. Robinson, G. Modzdzynski, and F. Wollenweber, 1995: The IFS model: A parallel production weather code. *Parallel Comput.,* **21**, 1621–1638.

Belousov, S. L., 1962: *Tables of Normalized Associated Legendre Polynomials.* Pergamon Press, 379 pp.

Bénard, P., 2003: Stability of semi-implicit and iterative centered-implicit time discretizations for various equation systems used in NWP. *Mon. Wea. Rev.,* **131**, 2479–2491.

Bénard, P., J. Vivoda, J. Mašek, P. Smolíková, K. Yessad, C. Smith, R. Brožková, and J.-F. Geleyn, 2010: Dynamical kernel of the Aladin-NH spectral limited-area model: Revised formulation and sensitivity experiments. *Quart. J. Roy. Meteor. Soc.,* **136**, 155–169.

Cheng, H., Z. Gimbutas, P. Martinsson, and V. Rokhlin, 2005: On the compression of low rank matrices. *SIAM J. Sci. Comput.,* **26** (4), 1389–1404.

Cooley, J. W., and J. W. Tukey, 1965: An algorithm for the machine calculation of complex Fourier series. *Math. Comput.,* **19**, 297–301.

Cotter, C. J., and J. Shipton, 2012: Mixed finite elements for numerical weather prediction. *J. Comput. Phys.,* **231** (21), 7076–7091.

Egersdörfer, R., and L. Egersdörfer, 1936: Formeln und Tabellen der zugeordneten Kugelfunktionen 1. Art von n=1 bis n=20, Teil 1: Formeln. Wissenschaftliche Abhandlungen Bd 1 (6), Deutsches Reichsamt für Wetterdienst, Springer, Berlin, 67 pp.

Eliasen, E., B. Machenhauer, and E. Rasmussen, 1970: On a numerical method for integration of the hydrodynamical equations with a spectral representation of the horizontal fields. Rep. 2, Institut for Teoretisk Meteorologi, University of Copenhagen, 37 pp.

Jablonowski, C., R. C. Oehmke, and Q. F. Stout, 2009: Block-structured adaptive meshes and reduced grids for atmospheric general circulation models. *Philos. Trans. Roy. Soc. London,* **367A**, 4497–4522.

Jung, T., and Coauthors, 2012: High-resolution global climate simulations with the ECMWF model in project Athena: Experimental design, model climate, and seasonal forecast skill. *J. Climate,* **25**, 3155–3172.

Kinter, J., and Coauthors, 2013: Revolutionizing climate modeling project Athena: A multi-institutional, international collaboration. *Bull. Amer. Meteor. Soc.,* **94**, 231–245.

Martinsson, P. G., and V. Rokhlin, 2007: An accelerated kernel-independent fast multipole method in one dimension. *SIAM J. Sci. Comput.,* **29**, 1160–1178.

Naughton, M., P. Courtier, and W. Bourke, 1996: Representation errors in various grid and spectral truncations for a symmetric feature on the sphere. *Quart. J. Roy. Meteor. Soc.,* **122**, 253–265.

O'Neil, M., F. Woolfe, and V. Rokhlin, 2010: An algorithm for the rapid evaluation of special function transforms. *Appl. Comput. Harmon. Anal.,* **28** (2), 203–226.

Orszag, S. A., 1970: Transform method for calculation of vector coupled sums: Application to the spectral form of the vorticity equation. *J. Atmos. Sci.,* **27**, 890–895.

Robin, L., 1957: *Fonctions Sphériques de Legendre et Fonctions Sphéroidales.* Vol. I. Gauthier-Villars, 874 pp.

Schwarztrauber, P. N., 2002: On computing the points and weights for Gauss-Legendre quadrature. *SIAM J. Sci. Comput.,* **24**, 945–954.

Seljebotn, D. S., 2012: Wavemoth–Fast spherical harmonic transforms by butterfly matrix compression. *Astrophys. J.,* **199** (5), 1–12, doi:10.1088/0067-0049/199/1/5.

Staniforth, A., and J. Thuburn, 2012: Horizontal grids for global weather and climate prediction models: A review. *Quart. J. Roy. Meteor. Soc.,* **138**, 1–26.

Suda, R., 2005: Fast spherical harmonic transform routine FLTSS applied to the shallow water test set. *Mon. Wea. Rev.,* **133**, 634–648.

Suda, R., and M. Takami, 2002: A fast spherical harmonics transform algorithm. *Math. Comput.,* **71**, 703–715.

Temperton, C., 1983: Self-sorting mixed-radix fast Fourier transforms. *J. Comput. Phys.,* **52**, 1–23.

Temperton, C., 1991: On scalar and vector transform methods for global spectral models. *Mon. Wea. Rev.,* **119**, 1303–1307.

Tygert, M., 2008: Fast algorithms for spherical harmonic expansions, II. *J. Comput. Phys.,* **227**, 4260–4279.

Tygert, M., 2010: Fast algorithms for spherical harmonic expansions, III. *J. Comput. Phys.,* **229**, 6181–6192.

Williamson, D. L., 2007: The evolution of dynamical cores for global atmospheric models. *J. Meteor. Soc. Japan,* **85B**, 241–269.

^{1} See also p. 100, Eq. (47′) in Robin (1957), and normalize the polynomials according to (11).