Reducing the computational cost of weather and climate simulations would lower their electric energy consumption. From this standpoint, the use of reduced-precision arithmetic has become an active area of research. Here, the impact of using single-precision arithmetic on simulation accuracy is examined by conducting Jablonowski and Williamson’s baroclinic wave tests with the dynamical core of a global fully compressible nonhydrostatic model. The model employs a finite-volume method discretized on an icosahedral grid system, and its mesh size is set to 220, 56, 14, and 3.5 km. When double-precision arithmetic is fully replaced by single-precision arithmetic, a spurious wavenumber-5 structure becomes dominant in both hemispheres, rather than the expected baroclinic wave growth only in the Northern Hemisphere. It was found that this spurious wave growth comes from errors in the calculation of gridcell geometrics. Therefore, an additional simulation was conducted using double precision only for calculations that need to be performed during model setup, including the calculation of gridcell geometrics, and single precision everywhere else, so that all calculations performed at every time step used single precision. In this case, the model successfully simulated the growth of the baroclinic wave, with only small errors and a 46% reduction in runtime. These results suggest that the use of single-precision arithmetic will allow a significant reduction of computational cost in next-generation weather and climate simulations using a fully compressible nonhydrostatic global model with the finite-volume method.
Advances in computer performance enable us to develop global numerical weather prediction (NWP) models of high accuracy by increasing resolution and implementing sophisticated physics schemes. For example, operational forecast models in the early 1990s had a horizontal resolution of about 100 km, whereas today’s state-of-the-art operational models have a horizontal resolution of about 10 km (e.g., ECMWF 2016). These advances in global NWP models enable us to accurately predict the Madden–Julian oscillation (Miyakawa et al. 2014; Vitart 2014; Kim et al. 2014), a major source of predictability on subseasonal (2 weeks–2 months) time scales, and its related tropical cyclogenesis (Nakano et al. 2015, 2017; Xiang et al. 2015).
Presently, massively parallel supercomputers require a huge amount of electric power. For example, the K computer (Miyazaki et al. 2012), a 10-petaFLOPS (floating-point operations per second) supercomputer in Japan, requires approximately 13 MW of power for full-system operation, which corresponds to the energy demand of 30 000 houses in Japan. Therefore, approaches that improve computational and energy efficiency are desired for sustainable advances in high-performance computers and thus in NWP models. The development of new hardware, such as general-purpose computing on graphics processing units, many-integrated-core architectures, and field-programmable gate arrays, rather than conventional central processing units (CPUs), facilitates improvement in the number of floating-point operations per unit of energy (e.g., Düben et al. 2015; Govett et al. 2017).
Because CPUs often wait for data from relatively slow devices (e.g., the main memory, a network, or a disk), minimizing data transfer between a CPU and a slow device reduces total computational cost. In particular, the use of 32-bit (single) precision arithmetic reduces the size of the data transferred between a CPU and main memory and also reduces communication between computational nodes, thereby reducing runtime (Düben et al. 2014; Düben and Palmer 2014; Düben et al. 2015). However, there have been few attempts at running an NWP model with single precision (Váňa et al. 2017). Váňa et al. (2017) examined the impact of single-precision arithmetic using the Integrated Forecast System (IFS) of the European Centre for Medium-Range Weather Forecasts (ECMWF). The IFS has a horizontal resolution of about 50 km and solves the hydrostatic equations discretized by a spectral method. A global nonhydrostatic model with the finite-volume method is one of the promising tools for achieving simulations with a mesh size of a few kilometers (or even less; e.g., Miyamoto et al. 2013). Using a global nonhydrostatic model with an icosahedral grid system, Yashiro et al. (2016) proposed a new framework for high-resolution large-ensemble data assimilation. In this framework, the simulated grid data are relocated so that they are suitable for ensemble data assimilation; owing to this relocation, each computational node can collect all the ensemble members for a geographical region before entering the data assimilation cycle. The framework can markedly decrease file input and output and improve the total throughput of the cycle. Kodama et al. (2014) developed a message passing interface (MPI) rank-mapping algorithm to minimize the number of hops for exchanging point-to-point data on a 3D torus network topology, thereby reducing the communication cost. Govett et al. (2017) showed good performance of the Nonhydrostatic Icosahedral Model (NIM) using single-precision arithmetic.
This study examines the differences between using single and double precision in different parts of a global Nonhydrostatic Icosahedral Grid Atmospheric Model (NICAM; Satoh et al. 2014) and demonstrates that some setup calculations of NICAM need to be done using double precision. Using a horizontal resolution of up to 3.5 km, we demonstrate, using Jablonowski and Williamson’s baroclinic wave benchmark test (Jablonowski and Williamson 2006, hereafter JW06), that (i) the model with only single-precision arithmetic cannot reproduce the results of the model with double-precision arithmetic, and (ii) that use of mixed-precision arithmetic (where single-precision arithmetic is used in most parts but double-precision arithmetic is used in some parts) can effectively reduce total runtime with little loss of accuracy.
2. Experimental design for JW06 benchmark test
The JW06 benchmark test examines the time evolution of baroclinic waves, which are initiated by a zonal wind perturbation in the Northern Hemisphere superposed on equatorially symmetric balanced fields. The test has been widely used to examine the performance of dynamical cores (e.g., Skamarock et al. 2012). We performed the JW06 test using the dynamical core of NICAM (NICAM-DC, https://scale.aics.riken.jp/nicamdc/), which solves the fully compressible Navier–Stokes equations using a finite-volume method. The model has a globally quasi-uniform horizontal grid spacing. We employed GL05 (220 km), GL07 (56 km), GL09 (14 km), and GL11 (3.5 km) horizontal resolutions. The model employs a time-splitting scheme (Klemp and Wilhelmson 1978), in which the slow (fast) mode is integrated with a large (small) time step. The three-step Runge–Kutta scheme (Wicker and Skamarock 2002) was used for the large-time-step integration, and a horizontally explicit, vertically implicit scheme (Satoh 2002) was used for the small-time-step integration. The JW06 test defines the geopotential height at the initial time [see their Eqs. (7)–(9)]. Since the model used here employs a terrain-following, height-based coordinate in the vertical direction, we imposed a surface terrain so that the geopotential height is equivalent to that of a pressure-based coordinate model. We set the number of vertical layers to 45, with the model top located at 45-km height. All numerical simulations were conducted on the K computer operated by RIKEN Advanced Institute for Computational Science. Each computational node has a CPU [SPARC64 VIIIfx, 2 GHz (128 GFLOPS)] that consists of eight cores. Computational nodes are connected by the Tofu interconnect, a six-dimensional mesh-torus topology. The model code was compiled with the option that enables vectorization using single instruction, multiple data (SIMD) instructions.
It should be noted that the computational performance of floating-point operations was the same for single- and double-precision arithmetic on this system, even when SIMD instructions were enabled, because of the specifications of the system.
In this study, we performed three series of experiments, termed DBL, SGL, and MIX. In the DBL experiments, we used double-precision arithmetic for all variables and operations, as in the conventional NWP models. In the SGL experiments, all the variables were declared as single precision so that all the calculations were conducted in single precision. As we will show in the next section, the SGL experiments failed to reproduce the DBL results because of inaccurate treatment of control volume geometrics. Motivated by this, in the MIX experiments, the time integration loop was performed by single-precision arithmetic but some model setup procedures were conducted by double-precision arithmetic.
Figure 1 shows a flowchart of NICAM-DC. The grid setup defines the position vectors from Earth’s center to the gravity centers, vertices, and arc midpoints of the control volumes, together with the vertical coordinates. The geometrics setup defines the length and the tangential and normal vectors of each arc of a control volume, together with the area of the control volume. The operator setup calculates the coefficients used in the divergence and gradient operators.
The following code modification was performed in the MIX experiments:
Calculations in the grid setup and geometrics setup were performed by double precision.
Calculations of the coefficients defined in the operator setup were performed by double precision but the coefficients themselves were kept as single-precision numbers.
MPI communication of single- and double-precision variables was performed in single and double precision, respectively.
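The second modification can be illustrated with a short sketch: the coefficients are accumulated entirely in 64-bit arithmetic, and only the final values are cast down to 32 bits for use in the time integration loop. The following NumPy sketch is illustrative only; the function and variable names are our own, not NICAM’s actual code.

```python
import numpy as np

def divergence_coefficients(arc_lengths, normals, cell_area):
    """Illustrative setup of finite-volume divergence coefficients.

    All intermediate arithmetic is carried out in float64 (double
    precision); only the final coefficients are stored as float32
    (single precision), mirroring the MIX strategy described in the text.
    """
    # Promote inputs to double precision for the setup calculation.
    arc_lengths = np.asarray(arc_lengths, dtype=np.float64)
    normals = np.asarray(normals, dtype=np.float64)
    area = np.float64(cell_area)

    # Coefficient for each edge of the control volume:
    # (edge length / cell area) * edge normal vector.
    coeffs = (arc_lengths[:, None] * normals) / area

    # Store the result in single precision for the time integration loop.
    return coeffs.astype(np.float32)
```

The point of the pattern is that rounding happens once, at the end, rather than compounding through every intermediate operation of the setup.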
Váňa et al. (2017) also demonstrated that the setup for Legendre transformation and integral operators of the vertical finite-element scheme should be performed in double precision. These modifications enabled us to obtain almost the same result as simulated in the DBL experiments, as shown in the following section.
Figure 2 shows the surface pressure patterns at day 9, as simulated at the GL05 (220 km) and GL07 (56 km) resolutions. The baroclinic waves grew in the DBL simulations, as shown in JW06 (Figs. 2a and 2d). In the SGL simulation at GL05 resolution (Fig. 2c), the development of the baroclinic wave was again simulated, but a spurious wavenumber-5 structure was superimposed in both hemispheres. These spurious waves became dominant in the SGL simulation at GL07 resolution (Fig. 2f). On the other hand, the MIX simulations (Figs. 2b and 2e) successfully reproduced a pattern similar to that of the DBL simulations (Figs. 2a and 2d).
Detailed analysis showed that the calculation of angles in the geometrics setup is critical for a successful simulation. The arc length of a control volume (hexagonal in shape) is calculated from the central angle between the position vectors of its vertices, which is derived from the inner product of the position vectors. In the high-resolution model, the calculation of the central angle suffers from cancellation of significant digits. The area of a control volume is calculated by summing the areas of six triangles, which are derived using spherical trigonometry. In this procedure, the interior angles of the triangles on the sphere must be calculated, and this calculation also suffers from cancellation of significant digits. These geometric values are used in the operator setup (e.g., in calculating the coefficients for the divergence and gradient operators); thus, they affect the time integration loop. Because NIM uses a local coordinate system that maps the surface of the Earth to a plane (Lee and MacDonald 2009), it can avoid the need for double precision in these calculations.
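This cancellation is easy to reproduce. In the following NumPy sketch (illustrative only, not NICAM code), the central angle between two grid points separated by about 3.5 km on an Earth-sized sphere is recovered from the inner product of their position vectors: in double precision the angle is accurate, while in single precision the normalized dot product is so close to 1 that acos loses most of its significant digits.

```python
import numpy as np

def central_angle(p1, p2, dtype):
    """Central angle between two position vectors via the inner product,
    with all arithmetic carried out in the given precision."""
    p1 = np.asarray(p1, dtype=dtype)
    p2 = np.asarray(p2, dtype=dtype)
    # For nearby points the normalized dot product is very close to 1,
    # so arccos suffers from cancellation of significant digits.
    d = np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))
    return float(np.arccos(np.clip(d, -1.0, 1.0)))

# Two unit vectors separated by ~3.5 km on a sphere of Earth's radius.
theta = 3.5 / 6371.0                         # true angle, ~5.5e-4 rad
a = [1.0, 0.0, 0.0]
b = [np.cos(theta), np.sin(theta), 0.0]

angle_dbl = central_angle(a, b, np.float64)  # accurate to many digits
angle_sgl = central_angle(a, b, np.float32)  # only a digit or so correct
```

Besides restoring double precision for these setup calculations, as the MIX experiments do, a common general remedy for this class of problem is an atan2-based formulation that avoids subtracting quantities near 1; we note it only as an alternative, not as what NICAM-DC implements.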
We also performed finer-resolution experiments to examine the performance of the MIX simulations. Figure 3 shows the surface pressure simulated in the DBL experiments and the difference between the MIX and DBL experiments up to GL11 (3.5 km) resolution. Although the small-scale differences are large in the GL11 (3.5 km) simulation, the magnitude of the difference is still O(0.01) hPa, which is negligible for practical use even at GL11 (3.5 km) resolution. These findings indicate that the MIX experiments successfully maintained the quality of simulation of the conventional double-precision icosahedral global grid model.
Figure 4 shows the differences in total air mass from day 1. Because the model employs a mass-conservative scheme (Satoh 2002), the differences occur due to round-off errors. The differences are O(10⁻⁹%) and O(10⁻⁴%) for the DBL and MIX experiments, respectively. As the model resolution increases, the differences increase in the DBL experiments, whereas they oscillate in the MIX experiments. Because no systematic drift is observed in the MIX experiments, the impact of using single precision is acceptable for practical use.
To quantify the model error, we examined the time evolution of the l2 difference norm between the MIX and DBL experiments for each resolution (Fig. 5). The l2 difference norm was calculated following Eq. (16) of JW06:

l2(t) = [ (1 / 4πR²) Σ_n {Ps,MIX(λ_n, ϕ_n, t) − Ps,DBL(λ_n, ϕ_n, t)}² A(n) ]^(1/2),

where Ps, λ, ϕ, t, and R are surface pressure, longitude, latitude, time, and Earth’s radius, respectively. Here n is the identifier of a control volume and A is the area of a control volume. JW06 compared the l2 difference norm between different global models with a horizontal resolution of 30–40 km and showed that the norm was about 3 × 10⁻² hPa up to day 6 before growing exponentially to about 1 hPa at day 11 (see their Fig. 10). Therefore, whether the norm is smaller than that shown by JW06 is a metric for whether the MIX experiments have essentially the same precision as the DBL experiments. It should be noted that the l2 difference norm becomes intrinsically large in high-resolution models because such models can resolve small-scale perturbations. The l2 difference norm at GL05–09 remained below 7 × 10⁻³ hPa until day 6 and then grew exponentially. The norm at day 11 was about 10⁻² hPa, which is two orders of magnitude smaller than the 1-hPa value shown by JW06. The norm at GL11 had a larger growth rate than those at lower resolutions but was approximately 10⁻¹ hPa at day 11. This larger value is caused by the small-scale differences seen in Fig. 3d. However, the magnitude of the l2 difference norm is still smaller than that in JW06 by an order of magnitude, making it adequately acceptable for practical use.
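Given the symbols above, the discrete norm amounts to an area-weighted root-mean-square difference over the control volumes. A minimal NumPy sketch (the function name is our own, for illustration):

```python
import numpy as np

def l2_difference_norm(ps_mix, ps_dbl, cell_area):
    """Area-weighted l2 difference norm of surface pressure, in the
    spirit of Eq. (16) of JW06, on an unstructured (icosahedral) grid.

    ps_mix, ps_dbl : surface pressure at each control volume [hPa]
    cell_area      : area A(n) of each control volume
    """
    ps_mix = np.asarray(ps_mix, dtype=np.float64)
    ps_dbl = np.asarray(ps_dbl, dtype=np.float64)
    area = np.asarray(cell_area, dtype=np.float64)
    # Weight each squared difference by its cell area and normalize by
    # the total area; on the full sphere, sum(area) equals 4*pi*R**2.
    return float(np.sqrt(np.sum((ps_mix - ps_dbl) ** 2 * area) / np.sum(area)))
```

For a uniform pressure difference the norm simply returns that difference, which is a convenient sanity check.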
Table 1 compares the elapsed times for the 11-day integration on the K computer. The number of grid points per computational node is the same in the experiments at resolutions finer than GL07 (56 km). The MIX experiments reduced the elapsed time by about 46% relative to the DBL experiments (a 1.8 times faster calculation). This speedup comes from the fully single-precision operation of the time integration loop in the MIX configuration (the coefficients used in the differential operators were calculated in double precision but were kept in single precision for the time integration loop). As a result, the amount of data transferred between the CPU and main memory, as well as between the nodes in the main loop, is approximately halved; this reduction in data size produced the speedup. The speedup ratio was smaller at GL05 because the problem size per computational node was smaller than in the other experiments.
In this study, we examined the impact of using 32-bit (single) precision arithmetic by conducting Jablonowski and Williamson’s baroclinic wave tests with the dynamical core of the global Nonhydrostatic Icosahedral Atmospheric Model (NICAM-DC). Fully single-precision experiments with a fine mesh size (56 km or finer) completely failed to reproduce the baroclinic wave growth seen in fully 64-bit (double) precision experiments. When double precision was restored in some model setup procedures (e.g., calculating the geometrics of control volumes), we obtained the same quality of simulation (even at 3.5-km horizontal resolution), and the calculation was 1.8 times faster than in the conventional double-precision model.
In the future, we aim to examine the impact of single-precision arithmetic using the full model, which includes physics schemes. A realistic test case is also necessary. NICAM performs well in simulating the Madden–Julian oscillation and tropical cyclogenesis (Miyakawa et al. 2014; Nakano et al. 2015). Because this performance may depend on the representation of clouds, examining how the representation of clouds is affected by the use of single-precision arithmetic would be helpful for maintaining model performance. Examination at higher resolutions is also required: in such models, the pressure gradient calculation may suffer from the cancellation of significant digits. Longer integrations, such as the Held–Suarez test (Held and Suarez 1994), are also needed to assess the applicability of single-precision arithmetic to climate models, since the cancellation of significant digits may violate mass and energy conservation.
This study provides valuable information for future weather and climate modeling. Owing to advances in high-performance computers, weather and climate models have continuously increased in resolution and model complexity. Such computers may become faster in terms of “peak” performance. However, considering the current trend for the ratio of memory bandwidth (bytes per second) to FLOPS (B/F) to decrease, advances in weather and climate models, whose performance is typically bound by memory bandwidth rather than peak performance, will depend on how well the models adapt to small-B/F computers. If this adaptation is poor, advances in weather and climate modeling may slow down. The results of this study demonstrate that the use of single-precision arithmetic is a potential approach for facilitating this adaptation to small-B/F computers in the future.
The authors are grateful to the editors and anonymous reviewers. This study was undertaken as part of the FLAGSHIP 2020 Project supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan. All numerical experiments were conducted on the K computer (Proposals hp150287, hp160230, and hp170234) of RIKEN Advanced Institute for Computational Science.
© 2018 American Meteorological Society.