## Abstract

This study describes an application of the double Fourier series (DFS) spectral method developed by Cheong as an alternative dynamical option in a model system that was ported into the Global/Regional Integrated Model System (GRIMs). A message passing interface (MPI) for a massive parallel-processor cluster computer devised for the DFS dynamical core is also presented. The new dynamical core with full physics was evaluated against a conventional spherical harmonics (SPH) dynamical core in terms of short-range forecast capability for a heavy rainfall event and seasonal simulation framework. Comparison of the two dynamical cores demonstrates that the new DFS dynamical core exhibits performance comparable to the SPH in terms of simulated climatology accuracy and the forecast of a heavy rainfall event. Most importantly, the DFS algorithm guarantees improved computational efficiency in the cluster computer as the model resolution increases, which is consistent with theoretical values computed from a dry primitive equation model framework. The current study shows that, at higher resolutions, the DFS approach can be a competitive dynamical core because the DFS algorithm provides the advantages of both the spectral method for high numerical accuracy and the gridpoint method for high performance computing in the aspect of computational cost.

## 1. Introduction

The spectral method that incorporates spherical harmonics (SPH) as orthogonal basis functions has been widely used in global atmospheric models for weather forecasts and climate prediction (Hoskins and Simmons 1975; Sela 1980; Kanamitsu 1989). It is prevalent because it provides high numerical accuracy relative to the gridpoint method. However, as pointed out by many researchers (Randall et al. 2000; Stuhne and Peltier 2006), the dynamical core with SPH becomes less efficient for high performance computing as the model resolution increases. This is due not only to low scalability by nonlocal calculation, but even more to the fact that the Legendre transform in the SPH dynamical core requires *O*[*N*^{3}] operations, with *N* being the truncation wavenumber, over other parts of the computation (Cheong 2000a,b; Tomita and Satoh 2004).

To alleviate these computational inefficiencies in the classic SPH method, diverse efforts have been made in the area of global modeling to improve the computational aspect of the SPH algorithm. For example, an efficient implementation technique for commodity clusters was developed (Rivier et al. 2002). This preserves the SPH dynamical core and the corresponding grid definitions. Another solution to improve the performance is to apply a reduced spectral transformation (Juang 2004). However, these methods do not resolve inherent SPH core defects in terms of computational complexity (*O*[*N*^{3}]) at high resolutions. Some researchers have developed a fast computational algorithm for the Legendre transform (Suda and Takami 2002; Tygert 2008, 2010), in which operation counts are formally represented with inefficient *O*[*N*^{2} log*N*] for *N*^{2} complexity. These algorithms in the literature were found to be faster than using the direct transform of SPHs. High efficiency, however, is not achieved until the computational complexity is far larger than that in current models. Furthermore, their performance in terms of computational efficiency was found to be dependent on the prescribed criteria of numerical errors, which is usually larger than the machine round-off.

To address these issues, the gridpoint method with different grid configurations has recently been proposed by some researchers and used in several institutions (Ringer et al. 2000; Majewski et al. 2002; Tomita and Satoh 2004). However, a robust evaluation of these gridpoint models at high resolutions over the spectral dynamical core has not yet been conducted, even though a core developed by Tomita and Satoh was found to successfully capture multiscale tropical convective systems in a 3.5-km mesh global experiment (Satoh et al. 2008). Another alternative is to adopt the spectral element method (Taylor et al. 1997) since it can provide efficient local mesh refinement on a spherical surface. Various types of grid systems are also available (Ullrich et al. 2012). Refer to Staniforth and Thuburn (2012) for an extensive review of the horizontal grid system of global atmospheric models.

Meanwhile, a dynamical core with the spectral method using the double Fourier series (DFS) as basis functions (Cheong 2000a,b) has been proposed. The suggested DFS dynamical core retains most of the advantages available for the traditional spectral method and provides an efficient and simple high-order filtering method that is essential for stable time stepping. The high-order filter was shown to provide selective damping of small-scale turbulences without affecting the physics of the phenomenon (Cheong et al. 2004). Cheong (2006, hereafter C06) further extended the DFS spectral method to a three-dimensional (3D) primitive equation model with hydrostatic approximation.

The purpose of this study is to implement the DFS spectral method in a global atmospheric model with full physics. While the research in C06 demonstrated that the DFS method outperforms the SPH method in terms of representing the dynamical aspects of a 3D atmosphere, implementation of the DFS method into a full atmospheric forecast system is neither trivial nor straightforward. It should also be noted that the 3D model in C06 is a dry primitive equation model without topography. In this study, a message passing interface (MPI) facility for a massive parallel-processor cluster computer is built for the DFS dynamical core. We discuss the preliminary evaluation results from an identical model system with two different cores for a heavy rainfall event over South Korea and for seasonal precipitation over the globe.

## 2. Implementation of the DFS spectral method

### a. DFS implementation

The global model program (GMP) used in this study is the global atmospheric model component of the Global/Regional Integrated Model System (GRIMs; Hong et al. 2013), which was created for use in numerical weather prediction (NWP), seasonal simulations, and climate research projects, on scales from global to regional. The system also has multiple options in each physics parameterization for physics development, as well as mechanism studies. The history and configuration of GRIMs and the capabilities in NWP and climate simulations were demonstrated in Hong et al. (2013). The basic model is described with a set of primitive equations that include the vorticity *ζ*, divergence , the logarithm of surface pressure (), specific humidity , and perturbation of virtual temperature as dependent variables. The same dependent variables are employed in both the DFS and SPH dynamical cores. In the DFS, the vertical layer-dependent viscosity and diffusion, which are given in the form of a fourth-order spherical Laplacian (), are applied to the vorticity, divergence, virtual temperature, and specific humidity fields. Horizontal diffusion is discussed in detail in section 2c.

The basis function of the DFS dynamical core is the sine or cosine series. Like the SPH, the DFS requires a wave-grid-wave transformation. The spectral coefficients for transforming values in the grid space, , to those in the wave space in the north–south direction for given zonal wavenumbers *m* can be written as

where *b* = 1 for *n* = 0 and *b* = 2 for *n* > 0, *c* = 1 for *n* = *J* and *c* = 2 for *n* < *J*, and with *+* latitude. In Eq. (1), *J*, *n*, *m*, and *j* denote the total number of grids in a meridian, the wavenumber in the north–south direction, the wavenumber in the east–west direction, and the grid index of the north–south direction, respectively.

The corresponding transformation from wave space to gridpoint space can be obtained by

where *n* and *m* denote the wavenumber in the meridional and zonal directions, respectively.

The DFS grid system does not include the North and South Poles to avoid singularity at the poles. The wavenumber domain is rectangular in the zonal and meridional directions, which differs from triangular truncation in the SPH option. The vertical distribution of the sigma layer is defined as half of two adjacent interface levels in the DFS option as the SPH's vertical discretization method. In addition, we added specific humidity as a prognostic variable, which is absent in the original DFS dynamical core of C06.

To take advantage of the flexibility of having two different spectral methods, a configuration file to control the source code for either the SPH or DFS dynamical core is utilized. In other words, besides the spectral transformation and related grid definitions, all source codes and run-time scripts have been modified to include the option for the DFS dynamical core so that the dynamical core can be selected in the same source code using the C preprocessor. The model initial and restart files for the atmospheric data are stored as spectral coefficients for a given transformation. Thus, atmospheric data in spectral form can be used for both dynamical cores. All boundary conditions and initial fields are reconstructed according to the different physics of the grid system and the wavenumber domain.

### b. MPI development

The massive parallel-processor cluster computer is the mainstream version of the super computer for numerical modeling in a multiparallel environment. The MPI is the most commonly used package for massively parallel processing (MPP). Implementation of the MPI into the DFS dynamical core is a crucial step in the development of the model. The DFS code was parallelized using the MPI library so that it could be executed on a distributed memory machine system. The grid and wave variable domains are decomposed in two directions. For fast Fourier transform (FFT) in the horizontal *x* or *y* directions, gridpoint values of the entire domain are needed for each transform direction. In this case, the model domain should not be sliced in transform directions; instead, the *z* direction can be sliced. The consequence of this scenario is that communication among all processors is needed so as to arrange data in the memory for whole components of the zonal (meridional) direction and partial components for the meridional (zonal) and vertical directions. The proposed MPI scenario is to rearrange data in memory so as to have access to the whole components according to the calculations that need to be done (Barros et al. 1995), but a more sophisticated scenario should be taken into account for the DFS dynamical core as described below.

A schematic of the transformation procedure and the MPI between the wave and gridpoint spaces are shown in Fig. 1. In this case, two-dimensional (2D) decomposition can be used because a single direction slice alone is not feasible because of the transformation; one of the horizontal directions and the vertical direction can be sliced. In Fig. 1, an example is shown of the entire FFT process with 12 tasks using 2D decomposition with four columns and three rows. The direction of the computing process from gridpoint space to spectral space is denoted with black arrows, and a step-by-step description is as follows. First, the input data at the gridpoint space is read in and distributed, as shown in the cube in the bottom-left corner of Fig. 1. An FFT is then applied to each processor, from the cube in the bottom-left corner to the cube in the bottom-right corner. Prior to performing the Fourier transform in the *y* direction, the bottom-right cube is transposed into the cube in the top-right corner. After transposing, a Fourier transform in the *y* direction can be performed; this proceeds from the top-right cube to the top-middle cube. Thus, it completes the transformation from gridpoint space (bottom-left cube) to spectral space (top-middle cube). The procedure to go from spectral space to gridpoint space follows the dashed arrow in the reverse direction.

Unlike the traditional SPH with an eigenvalue for solving the Laplacian [i.e., , where *l* is the total wavenumber], a tridiagonal matrix that requires nonlocal calculation in the meridional direction is necessary for the Laplacian operator in the DFS method (Cheong et al. 2002, 2004) to calculate horizontal diffusion and wind components from the vorticity and divergence. Since transpose or all-to-all communication aggravates scalability in parallel computing, data on spectral space are archived in the form of the top-middle cube. For semi-implicit time integration, however, the top-middle cube is changed to the top-left cube, as shown in Fig. 1.

### c. Horizontal diffusion

The nonlinear evolution equations generate an energy cascade from lower to higher wavenumber modes, and in underresolved calculations the energy builds up at the truncation limit, creating a spectrum with an erroneously high tail—the so-called spectral blocking problem or aliasing (Gelb and Gleeson 2001). The traditional approach to dealing with these difficulties in the spectral method is to add an artificial diffusion term (called the “horizontal diffusion” term in numerical models) to the equations of motion, a hyper-order derivative of the space variables. Excluding the dynamical and parameterized tendency, the horizontal diffusion tendency can be expressed in the form of a hyper-Laplacian as

where *χ* is any prognostic variable (e.g., wind, vorticity/divergence, and temperature), *q* (= 1, 2, 3, 4, …) denotes the order, and *K _{q}* is the diffusion coefficient (i.e., diffusivity or viscosity) for a given order. The diffusion coefficient can be readily determined by specifying the damping rate of a prescribed scale (e.g., Jablonowski and Williamson 2011) as shown below:

where *τ* is the *e*-folding damping time, is the earth's radius, *N _{p}* is a prescribed scale of the wavenumber, and is an additional tuning parameter. Therefore, the diffusion coefficient depends on the horizontal resolution for a given damping time (i.e., horizontal scale-dependent diffusion). For the GRIMs-SPH dynamical core, the diffusion coefficient is based on coefficients obtained from the National Centers for Environmental Prediction (NCEP) Global Forecast System (GFS), adopting an eighth-order diffusion if the truncated wavenumber

*N*is over 170, whereas unique biharmonic (i.e., second order) diffusion, as described by Leith (1971), is applied when

*N*is under 170 (see http://www.emc.ncep.noaa.gov/GFS).

The DFS dynamical core uses eighth-order diffusion, as well. The damping time is 0.1 day, and the tuning parameter is simply defined as

where is the tuned factor (between 0 and 1) for the DFS dynamical core. This implies that the prescribed wavenumber is defined as the product of and for the *e*-folding damping time of 0.1 day. Compared to the SPH, the prescribed scale is smaller, whereas damping time is longer, resulting in comparable diffusivity between the two dynamical cores. However, horizontal diffusion is not dependent on the model layer in the DFS dynamical core of C06, whereas strength of diffusion damping increases higher up within the SPH dynamical core.

As discussed in section 2b, the DFS Laplacian operator requires a tridiagonal matrix that depends on horizontal diffusion coefficients. Since solving a tridiagonal matrix equation is a time-consuming process with large memory allocation, it would be inefficient to apply different diffusion coefficients over all vertical layers, in terms of computational cost. Therefore, the tuned factor can be defined as falling into three categories, as follows:

which was empirically determined through comparison with the SPH dynamical core.

## 3. Discussion

The scalability using multiple processors should be checked to identify how well the DFS algorithm is parallelized. Assuming the time spent in communication is negligible, speedup is calculated as follows:

where *T*_{1} is the time spent by a single-processor run, is the time spent by *n* processor runs, *s* is the time spent by the sequential code in the computing, and *p* is the total time spent by the parallel code in computing with *n* processors (Juang et al. 2003). Figure 2 shows the speedup of the SPH and DFS dynamical cores with full physics on T254 resolution (768 grids in the zonal direction and 384 grids in the meridional direction). As shown, both models are reasonably parallelized in the theoretical speedup range of 95%–99%, and the parallelized implementation of the DFS dynamical core shows efficiency comparable to the SPH.

### a. Computational efficiency and dynamical characteristics of the DFS dynamical core

In this section, the computational efficiency of the DFS dynamical core is discussed and compared with that of the SPH dynamical core. The results at horizontal resolutions ranging from T62 (~200 km) to T510 (~25 km) are described. A case of heavy rainfall over Korea in July 2001 is selected to compare the computational efficiencies of the two dynamical cores. The initial conditions are taken from NCEP–Department of Energy (DOE) reanalysis data (Kanamitsu et al. 2002). The model results for a 5-day forecast period are verified against analyzed data. To alleviate mechanical problems such as unbalanced communication between processors, each profiling was repeated 3 times and then averaged.

The parallel run performance for 254 wave truncations for the DFS is shown in Fig. 3. It is clear that the DFS dynamical core is more efficient in terms of total profiling time (Fig. 3a). Computational efficiency is more distinct in computing the dynamics and MPI modules by a factor of about 2. The computation time for the physics modules is about the same as expected. In Fig. 3b, the MPI wall-clock time indicates the total computational cost including not only parallel computing but also serial processes such as I/O. Although improvement of the computational cost is inconspicuous at relatively low resolutions, the efficiency remarkably increases as the model resolution increases (approximately 30% at T510 resolution).

For the SPH, the total kinetic energy (KE) spectrum per unit mass can be calculated as (Koshyk and Hamilton 2001)

where *m* and *n* are the zonal and total wavenumbers, respectively, and and *D* are the vorticity and divergence on a spectral space, respectively. Here, *a* is the earth's radius, and is the vertical sigma level. For the DFS, the vorticity and divergence on a spectral space are transformed onto a grid space and retransformed onto a SPH spectral space to represent the KE spectrum in terms of the total wavenumber.

The KE spectra computed from the experiments with the DFS and SPH dynamical cores for various horizontal resolutions are shown in Fig. 4. It is clear that the distribution of the energy spectra for both dynamical cores are in line with previous observations of −3 (−) slope at large (meso) scales (Boer and Shepherd 1983; Nastrom et al. 1984), except at high wavenumbers. At low resolutions (i.e., T62 and T126), differences in the spectra for small scales become visible with a weakened magnitude in the case of the DFS dynamical core. A close inspection reveals that the DFS run experiences a collapse of energy at high wavenumbers, whereas the amount of energy remains significant in the case of the SPH dynamical core. The behavior of the SPH dynamical core at high wavenumbers is a consequence of incorporating the unique diffusion formula of the SPH, which was deliberately designed to retain the small-scale phenomena, following the study of Leith (1971). For high resolutions (i.e., T254 and T510), the KE distribution is quite comparable between the two dynamical cores because they have the same eighth-order horizontal diffusion, as described in section 2c.

### b. Simulated results

The 24-h predicted sea level pressure (SLP) and precipitation are compared in Fig. 5. It is evident that the simulated results for the dynamical cores are similar. The shapes of the contour for both dynamical cores' runs exhibit finer-scale features at higher resolutions. The root-mean-square error (RMSE) and pattern correlation coefficients of the simulated precipitation from the experiments with each core are comparable to the Tropical Rainfall Measuring Mission (TRMM) Multisatellite Precipitation Analysis (TMPA) data (Huffman et al. 2007) over the East Asian region (Table 1). The tabulated numbers confirm a decrease in the maximum rainfall as well as in the domain-averaged amount in the case of the DFS run. For the T62 runs (Figs. 5a,b), the reduction of rainfall amount is directly related to the collapse of the energy spectra at high wavenumbers (see Fig. 4). Despite the similar energy spectra for both core runs in the case of the T254 runs, smoothed contours of the SLP and reduction of rainfall amount are observed (cf. Figs. 5c,d). This can be attributed to a weakened damping near the surface in the SPH dynamical core. It is not straightforward to explain the differences in the simulated results by the characteristics of the energy spectra. The basis functions of the SPH model are isotropic on the sphere, but this is not the case for the DFS model (Cheong et al. 2004; C06). As is commonly done, we obtain the minimum coefficient and/or the order of the Laplacian-type diffusion operator in the DFS model through empirical adjustment at a given resolution.

The seasonal simulations are performed for a boreal summer in June–August (JJA) in 1996. The summer of 1996 is regarded as being normal in terms of tropical sea surface temperatures. To estimate and filter out the unpredictable part of the flow, five-member ensemble runs are performed. The ensemble runs are initialized at 0000 UTC 1, 2, 3, 4, and 5 May. The model employs a resolution of T126 (~100 km). A detailed model setup is available in Byun and Hong (2007).

Figure 6 compares the JJA precipitation from the Climate Prediction Center (CPC) Merged Analysis of Precipitation (CMAP) data (Xie and Arkin 1997) with simulations from the SPH and DFS runs. It is clear that the model reproduces the tropical rainfall satisfactorily for both runs. The intertropical convergence zone (ITCZ) over the Pacific and Atlantic is well simulated and shows a good agreement between the two dynamical cores. Precipitation in relation to Asian monsoons is well depicted. A visible difference in the two core runs appears in the amount of precipitation. The global average for JJA is 105.2 and 102.8 mm for SPH and DFS runs, respectively. The pattern correlation coefficients (PCCs) of precipitation are 0.74 and 0.73 for the SPH and DFS runs, respectively.

The 500-hPa height eddies from the two dynamical cores are compared with the corresponding analysis in Fig. 7. A standing eddy, also called stationary eddy, is defined as the time-averaged departure of the value from zonal mean at the same latitude, and it is used to validate the model's general ability (Hong et al. 2013). Overall, the spatial distributions of the simulated eddy from the SPH and DFS dynamical cores are in good agreement with the reanalysis data. The negative height eddy, stretching from the Indian subcontinent to Japan, is correctly located, but slightly underestimated. The wave over North America is well reproduced in location and intensity, but the positive eddy in the northwestern region is too strong. In the Southern Hemisphere, simulated height patterns show general agreement with the corresponding observations. The averaged PCCs of the ensemble mean over the globe are 0.71 and 0.73 for the SPH and DFS runs, respectively.

The evaluation of the upper-level structures (e.g., pressure–latitude cross sections of the zonal-averaged zonal wind, temperature, and humidity) showed a realistic climatology without a discernible difference for both SPH and DFS runs (not shown). This resemblance of the two simulated climatologies is indicative of the fact that the DFS dynamical core was successfully implemented into a full physics model, and is computationally stable in long-term integration test bed.

## 4. Conclusions

Comparison of the DFS and SPH dynamical cores reveals that the new (DFS) dynamical core does exhibit no discernible deficiencies in terms of the accuracy of the simulated climatology and the forecast of a heavy rainfall event over Korea. Most importantly, the DFS algorithm guarantees computational efficiency in the cluster computer by reducing computational complexity from *O*[*N*^{3}] to *O*[*N*^{2} log_{2}*N*], which provides considerable savings of computing time at high resolutions. With distributed-memory architecture, however, the DFS model still suffers from the inherent scalability problem of the spectral model; namely, the transpose operation (or all-to-all communication) required for transforming wave space to grid space or vice versa.

The overall results of this study imply that, at higher resolutions, the DFS dynamical core can be an alternative choice along with other newly developed gridpoint dynamical cores based on the icosahedral grid configuration (e.g., Ullrich et al. 2012) or the spectral element dynamical core (Gaberšek et al. 2012), because the DFS algorithm has the advantages of both the spectral method for high numerical accuracy and the gridpoint method for high performance computing.

## Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (2012-0000158). The authors would like to acknowledge Jihyeon Jang for plotting some of the figures.

## REFERENCES

*Numerical Techniques for Global Atmospheric Models,*P. H. Lauritzen et al., Eds., Lecture Notes in Computational Science and Engineering, Vol. 80, Springer, 381–493.

*. General Circulation Model Development,*D. A. Randall, Ed., Vol. 70, Academic Press, 509–538.