1. Introduction
Advances in computer performance enable us to develop global numerical weather prediction (NWP) models of high accuracy by increasing resolution and implementing sophisticated physics schemes. For example, operational forecast models in the early 1990s had a horizontal resolution of about 100 km, whereas today's state-of-the-art operational models have a horizontal resolution of about 10 km (e.g., ECMWF 2016). These advances in global NWP models enable us to accurately predict the Madden–Julian oscillation (Miyakawa et al. 2014; Vitart 2014; Kim et al. 2014), a major source of predictability on subseasonal (2 weeks–2 months) time scales, and the tropical cyclogenesis related to it (Nakano et al. 2015, 2017; Xiang et al. 2015).
Presently, massively parallel supercomputers require a huge amount of electric power. For example, the K computer (Miyazaki et al. 2012), a 10-PFLOPS (peta floating-point operations per second) supercomputer in Japan, requires approximately 13 MW of power for full-system operation, which corresponds to the energy demand of 30 000 households in Japan. Therefore, approaches that improve computational and energy efficiency are desired for sustainable advances in high-performance computers and thus in NWP models. The development of new hardware, such as general-purpose computing on graphics processing units, many-integrated-core architectures, and field-programmable gate arrays, rather than conventional central processing units (CPUs), facilitates improvement in the number of floating-point operations per unit of energy (e.g., Düben et al. 2015; Govett et al. 2017).
Because CPUs often wait for data from relatively slow devices (e.g., the main memory, a network, or a disk), minimizing data transfer between a CPU and a slow device reduces total computational cost. In particular, the use of 32-bit (single) precision arithmetic reduces the volume of data transferred between a CPU and main memory, as well as the communication between computational nodes, thereby reducing runtime (Düben et al. 2014; Düben and Palmer 2014; Düben et al. 2015). However, there have been few attempts at running an NWP model with single precision (Váňa et al. 2017). Váňa et al. (2017) examined the impact of single-precision arithmetic using the Integrated Forecast System (IFS) of the European Centre for Medium-Range Weather Forecasts (ECMWF). The IFS has a horizontal resolution of about 50 km and solves the hydrostatic equations discretized by a spectral method. A global nonhydrostatic model with the finite-volume method is one of the promising tools for achieving simulations with a mesh of a few kilometers (or even less; e.g., Miyamoto et al. 2013). Using a global nonhydrostatic model with an icosahedral grid system, Yashiro et al. (2016) proposed a new framework for high-resolution large-ensemble data assimilation. In that framework, the simulated grid data are relocated so that each computational node can collect all the ensemble members for a geographical region before entering the data assimilation cycle; this relocation markedly decreases file input and output and improves the total throughput of the cycle. Kodama et al. (2014) developed a message passing interface (MPI) rank-mapping algorithm that minimizes the number of hops for exchanging point-to-point data on a 3D torus network topology, thereby reducing communication cost. Govett et al. (2017) showed good performance of the Nonhydrostatic Icosahedral Model (NIM) using single-precision arithmetic.
This study examines the differences between using single and double precision in different parts of the global Nonhydrostatic Icosahedral Atmospheric Model (NICAM; Satoh et al. 2014) and demonstrates that some setup calculations of NICAM must be performed in double precision. Using horizontal resolutions of up to 3.5 km, we demonstrate, using Jablonowski and Williamson's baroclinic wave test (Jablonowski and Williamson 2006, hereafter JW06), that (i) the model with only single-precision arithmetic cannot reproduce the results of the model with double-precision arithmetic, and (ii) the use of mixed-precision arithmetic (where single-precision arithmetic is used in most parts but double-precision arithmetic in some parts) can effectively reduce total runtime with little loss of accuracy.
Section 2 describes the experimental design and explains which parts are calculated in single or double precision and why. Section 3 discusses the results of the JW06 test. Section 4 presents the conclusions and future work.
2. Experimental design for JW06 benchmark test
The JW06 benchmark test examines the time evolution of baroclinic waves, which are initiated by a zonal wind perturbation in the Northern Hemisphere in equatorially symmetric balanced fields. The test has been widely used to examine the performance of dynamical cores (e.g., Skamarock et al. 2012). We performed the JW06 test using the dynamical core of NICAM (NICAM-DC, https://scale.aics.riken.jp/nicamdc/), which solves the fully compressible Navier–Stokes equations using a finite-volume method. The model has a globally quasi-uniform horizontal grid spacing. We employed GL05 (220 km), GL07 (56 km), GL09 (14 km), and GL11 (3.5 km) horizontal resolutions. The model employs a time-splitting scheme (Klemp and Wilhelmson 1978), in which the slow (fast) mode is integrated with a large (small) time step. The three-step Runge–Kutta scheme (Wicker and Skamarock 2002) was used for the large-time-step integration, and the horizontally explicit, vertically implicit scheme (Satoh 2002) was used for the small-time-step integration. The JW06 test defines the geopotential height at the initial time [see their Eqs. (7)–(9)]. Since the model used here employs a terrain-following, height-based coordinate in the vertical direction, we imposed a terrain that sets the geopotential height equivalent to that of a pressure-based coordinate model. We set the number of vertical layers to 45, and the model top was located at 45-km height. All numerical simulations were conducted on the K computer operated by the RIKEN Advanced Institute for Computational Science. Each computational node has a CPU [SPARC64 VIIIfx, 2 GHz (128 GFLOPS)] that consists of eight cores. Computational nodes are connected by the Tofu interconnect, a six-dimensional mesh-torus topology. The model code was compiled with the option that enables vectorization using single-instruction multiple-data (SIMD) instructions.
It should be noted that, because of the specifications of this system, the floating-point performance of single- and double-precision arithmetic was the same even when SIMD instructions were enabled.
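The horizontal resolutions quoted above follow from the icosahedral grid-division level: a grid at glevel l has 10 × 4^l + 2 points, so the mean grid interval can be estimated as the square root of the mean control-volume area. The following Python sketch (an illustrative estimate, not the model's internal definition) reproduces the approximate spacings:

```python
import math

EARTH_RADIUS_KM = 6371.0

def mean_grid_interval_km(glevel):
    """Approximate mean grid interval of an icosahedral grid at a
    given glevel, estimated from the mean control-volume area."""
    n_points = 10 * 4 ** glevel + 2
    mean_area_km2 = 4.0 * math.pi * EARTH_RADIUS_KM ** 2 / n_points
    return math.sqrt(mean_area_km2)

for glevel in (5, 7, 9, 11):
    print(f"GL{glevel:02d}: ~{mean_grid_interval_km(glevel):.1f} km")
```

This yields roughly 223, 56, 14, and 3.5 km for GL05, GL07, GL09, and GL11, consistent with the values quoted above.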
In this study, we performed three series of experiments, termed DBL, SGL, and MIX. In the DBL experiments, we used double-precision arithmetic for all variables and operations, as in conventional NWP models. In the SGL experiments, all variables were declared as single precision, so that all calculations were conducted in single precision. As we will show in the next section, the SGL experiments failed to reproduce the DBL results because of inaccurate treatment of the control volume geometrics. Motivated by this, in the MIX experiments, the time integration loop was performed in single-precision arithmetic but some model setup procedures were conducted in double precision.
Figure 1 shows a flowchart of the NICAM-DC. The grid setup defines the position of vectors from Earth’s center to gravity centers, vertices, and the middle points of arcs of control volumes, together with the vertical coordinates. The geometrics setup defines the length and tangential and normal vectors of each arc of a control volume, together with the area of the control volume. The operator setup calculates coefficients to be used for calculation of divergence and gradient.

Fig. 1. Schematic diagram of the code structure of NICAM-DC.
Citation: Monthly Weather Review 146, 2; 10.1175/MWR-D-17-0257.1

The following code modification was performed in the MIX experiments:
Calculations in the grid setup and geometrics setup were performed by double precision.
Calculations of the coefficients defined in the operator setup were performed by double precision but the coefficients themselves were kept as single-precision numbers.
MPI communication of single- and double-precision variables was performed in single and double precision, respectively.
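The second modification can be sketched as follows (a hypothetical Python illustration with invented names; the actual NICAM-DC code is Fortran): a coefficient for a differential operator is computed entirely in double precision from the geometric quantities, and only the final result is demoted to single precision for use in the time integration loop.

```python
import numpy as np

def setup_divergence_coefficient(arc_length_m, cell_area_m2):
    """Hypothetical operator-setup step: all intermediates are kept in
    double precision; only the stored coefficient is single precision."""
    length = np.float64(arc_length_m)
    area = np.float64(cell_area_m2)
    coeff = length / area          # computed in double precision
    return np.float32(coeff)       # demoted once, at the very end

coeff32 = setup_divergence_coefficient(3.5e3, 1.06e7)
print(coeff32, coeff32.dtype)
```

Rounding only once, after the double-precision computation, avoids compounding the rounding of every intermediate geometric quantity.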
3. Results
Figure 2 shows surface pressure patterns at day 9, as simulated at the GL05 (220 km) and GL07 (56 km) resolutions. The baroclinic waves grew in the DBL simulations, as shown in JW06 (Figs. 2a and 2d). In the SGL simulation at GL05 resolution (Fig. 2c), the development of the baroclinic wave was again simulated, but a spurious wavenumber-5 structure was superimposed in both hemispheres. These spurious waves became dominant in the SGL simulation at GL07 resolution (Fig. 2f). On the other hand, the MIX simulations (Figs. 2b and 2e) successfully reproduced patterns similar to those of the DBL simulations (Figs. 2a and 2d).

Fig. 2. Simulated surface pressure (hPa) at day 9 in the (a),(d) DBL; (b),(e) MIX; and (c),(f) SGL experiments with a horizontal resolution of (a)–(c) GL05 (220 km) and (d)–(f) GL07 (56 km).
Citation: Monthly Weather Review 146, 2; 10.1175/MWR-D-17-0257.1

Detailed analysis showed that the calculation of angles in the geometrics setup is critical for successful simulations. The arc length of a control volume (hexagonal in shape) is calculated using the central angle between the position vectors of its vertices, which is derived from the inner product of the position vectors. In the high-resolution model, the calculation of the central angle suffered from cancellation of significant digits. The area of a control volume was calculated as the sum of the areas of six triangles, which were derived using spherical trigonometry. This procedure requires the interior angles of the triangles on a sphere, and their calculation also suffered from cancellation of significant digits. These geometric values are used in the operator setup (e.g., in calculating the coefficients used for divergence and gradient operations) and thus affect the time integration loop. Because NIM uses a local coordinate system that maps the surface of the Earth to a plane (Lee and MacDonald 2009), it can avoid the need for double precision in these calculations.
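The cancellation can be illustrated with a toy calculation (the angle and vectors are illustrative values, not taken from the model). For two unit position vectors separated by a small central angle θ, the inner product is cos θ ≈ 1 − θ²/2; once the departure from 1 falls below single-precision machine epsilon (≈1.2 × 10⁻⁷), the angle is lost entirely:

```python
import numpy as np

def central_angle(p1, p2, dtype):
    """Central angle between two unit position vectors via the inner
    product, as in the geometrics setup."""
    a = np.asarray(p1, dtype=dtype)
    b = np.asarray(p2, dtype=dtype)
    dot = np.clip(a @ b, -1.0, 1.0)  # guard against round-off beyond 1
    return np.arccos(dot)

theta = 1.0e-4  # radians; a hypothetical small vertex separation
p1 = (1.0, 0.0, 0.0)
p2 = (np.cos(theta), np.sin(theta), 0.0)

# Double precision recovers the angle; in single precision cos(theta)
# rounds to exactly 1.0 and the angle collapses to zero.
print(central_angle(p1, p2, np.float64))  # ~1.0e-4
print(central_angle(p1, p2, np.float32))  # 0.0
```

The arc length and control-volume area derived from such an angle would then vanish or be badly corrupted, which is why the geometrics setup is kept in double precision in the MIX experiments.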
We also performed finer-resolution experiments to examine the performance of the MIX simulations. Figure 3 shows the surface pressure simulated in the DBL experiments and the difference between the MIX and DBL experiments up to GL11 (3.5 km) resolution. Although the small-scale differences grow in the GL11 (3.5 km) simulation, the magnitude of the difference is still O(0.01) hPa, which is negligible for practical use even at GL11 (3.5 km) resolution. These findings indicate that the MIX experiments maintained the quality of simulation of the conventional double-precision icosahedral global grid model.

Fig. 3. Simulated surface pressure at day 9 in the DBL experiments (contours with an interval of 20 hPa) and the difference between the MIX and DBL experiments (shading).
Citation: Monthly Weather Review 146, 2; 10.1175/MWR-D-17-0257.1

Figure 4 shows the differences in total air mass from day 1. Because the model employs a mass-conservative scheme (Satoh 2002), the differences arise from round-off errors. The differences are O(10⁻⁹%) and O(10⁻⁴%) for the DBL and MIX experiments, respectively. As the model resolution increases, the differences increase in the DBL experiments, whereas they oscillate in the MIX experiments. Because no systematic drift is observed in the MIX experiments, the impact of using single precision is acceptable for practical use.
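The contrast in drift magnitude is consistent with the machine epsilons of the two formats (≈2.2 × 10⁻¹⁶ for double, ≈1.2 × 10⁻⁷ for single). As a toy illustration (synthetic cell masses, not model output, and a deliberately naive reduction rather than the model's actual summation), sequential accumulation of a global sum shows round-off of roughly these relative orders:

```python
import numpy as np

rng = np.random.default_rng(0)
cell_mass = rng.uniform(0.5, 1.5, size=100_000)  # synthetic cell masses

def naive_global_sum(values, dtype):
    """Sequential accumulation, mimicking a simple global reduction."""
    total = dtype(0.0)
    for v in values.astype(dtype):
        total = dtype(total + v)
    return float(total)

ref = float(np.sum(cell_mass))  # pairwise summation: accurate reference
for dtype in (np.float64, np.float32):
    rel_err_pct = abs(naive_global_sum(cell_mass, dtype) - ref) / ref * 100
    print(dtype.__name__, rel_err_pct)
```

The double-precision error stays many orders of magnitude below the single-precision one, mirroring the separation between the DBL and MIX curves in Fig. 4.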

Fig. 4. Time evolution of the differences in the total air mass from day 1 for the (a) DBL and (b) MIX experiments. The black, red, blue, and green curves represent GL05 (220 km), GL07 (56 km), GL09 (14 km), and GL11 (3.5 km), respectively. The SGL experiments show differences of the same order as MIX.
Citation: Monthly Weather Review 146, 2; 10.1175/MWR-D-17-0257.1




Fig. 5. Time evolution of the global l2 difference norm of simulated surface pressure between the DBL and MIX experiments. The black, red, blue, and green curves represent GL05 (220 km), GL07 (56 km), GL09 (14 km), and GL11 (3.5 km), respectively. The l2 difference norm between the DBL and SGL experiments at day 11 was 2 and 18 hPa for GL05 and GL07, respectively.
Citation: Monthly Weather Review 146, 2; 10.1175/MWR-D-17-0257.1

Table 1 compares the elapsed time for the 11-day integration on the K computer. The number of grid points per computational node is the same in the experiments with resolution finer than GL07 (56 km). The MIX experiments reduced the elapsed time by about 46% relative to the DBL experiments (a speedup of about 1.8 times). This speedup comes from the fully single-precision operation of the time integration loop (the coefficients used in the differential operators were calculated in double precision but stored and used in the time integration loop in single precision). As a result, the amount of data transferred between the CPU and main memory, as well as between nodes, is approximately halved in the main loop; this reduction in data size contributed to the speedup. The speedup ratio was smaller in GL05 because the problem size per computational node was smaller than in the other experiments.
Table 1. Elapsed time for an 11-day integration.
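The halving of transferred data can be made concrete with a back-of-the-envelope estimate (illustrative only; actual model arrays also include halo regions and many prognostic variables): a single three-dimensional field at GL11 with 45 vertical layers occupies about 15 GB in double precision and exactly half that in single precision.

```python
import numpy as np

n_cells = 10 * 4 ** 11 + 2    # horizontal grid points at GL11
n_layers = 45                 # vertical layers used in this study

for dtype in (np.float64, np.float32):
    nbytes = n_cells * n_layers * np.dtype(dtype).itemsize
    print(f"{np.dtype(dtype).name}: {nbytes / 1e9:.1f} GB per 3D field")
```

On a memory-bandwidth-bound code, halving the bytes moved per field translates almost directly into the observed runtime reduction.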


4. Conclusions
In this study, we examined the impact of the use of 32-bit (single) precision arithmetic by conducting Jablonowski and Williamson's baroclinic wave tests using the dynamical core of a global Nonhydrostatic Icosahedral Atmospheric Model (NICAM-DC). Fully single-precision experiments with a fine mesh size (<56 km) completely failed to reproduce the baroclinic wave growth seen in fully 64-bit (double) precision experiments. By reverting some model setup procedures (e.g., the calculation of the geometrics of control volumes) to double precision, we obtained the same quality of simulation (even at 3.5-km horizontal resolution), and the calculation was performed 1.8 times faster than in the conventional double-precision model.
In the future, we aim to examine the impact of single-precision arithmetic using a full model that includes physics schemes. A realistic test case is also necessary. NICAM has good performance in simulating the Madden–Julian oscillation and tropical cyclogenesis (Miyakawa et al. 2014; Nakano et al. 2015). Because this high performance may be impacted by the representation of clouds, examining how the representation of clouds is impacted by the use of single-precision arithmetic would be helpful for maintaining model performance. The examination of higher resolutions is also required. In such models, the pressure gradient may suffer from the cancellation of significant digits. Longer integration, such as in the Held–Suarez test (Held and Suarez 1994), is also needed to assess the applicability of the single-precision arithmetic to a climate model since the cancellation of significant digits may violate mass and energy conservation.
This study provides valuable information for future weather and climate modeling. Owing to advances in high-performance computers, weather and climate models have continuously increased in resolution and model complexity. Such computers may become faster in terms of "peak" performance. However, considering the current trend for the ratio of memory bandwidth (bytes per second) to floating-point performance (FLOPS), known as B/F, to become smaller, advances in weather and climate models, whose performance is typically bound by memory bandwidth rather than peak performance, will depend on how well the models adapt to small-B/F computers. If this adaptation is poor, advances in weather and climate modeling may slow down. The results of this study demonstrate that the use of single-precision arithmetic is a promising approach for facilitating this adaptation to small-B/F computers.
Acknowledgments
The authors are grateful to the editors and anonymous reviewers. This study was undertaken as part of the FLAGSHIP 2020 Project supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan. All numerical experiments were conducted on the K computer (Proposals hp150287, hp160230, and hp170234) of the RIKEN Advanced Institute for Computational Science.
REFERENCES
Düben, P. D., and T. N. Palmer, 2014: Benchmark tests for numerical weather forecasts on inexact hardware. Mon. Wea. Rev., 142, 3809–3829, https://doi.org/10.1175/MWR-D-14-00110.1.
Düben, P. D., H. McNamara, and T. N. Palmer, 2014: The use of imprecise processing to improve accuracy in weather & climate prediction. J. Comput. Phys., 271, 2–18, https://doi.org/10.1016/j.jcp.2013.10.042.
Düben, P. D., F. P. Russell, X. Niu, W. Luk, and T. N. Palmer, 2015: On the use of programmable hardware and reduced numerical precision in earth-system modeling. J. Adv. Model. Earth Syst., 7, 1393–1408, https://doi.org/10.1002/2015MS000494.
ECMWF, 2016: IFS documentation—Cy43r1, Part III: Dynamics and numerical procedures. ECMWF, 31 pp., https://www.ecmwf.int/sites/default/files/elibrary/2016/17116-part-iii-dynamics-and-numerical-procedures.pdf.
Govett, M., and Coauthors, 2017: Parallelization and performance of the NIM weather model on CPU, GPU, and MIC processors. Bull. Amer. Meteor. Soc., 98, 2201–2213, https://doi.org/10.1175/BAMS-D-15-00278.1.
Held, I. M., and M. J. Suarez, 1994: A proposal for the intercomparison of the dynamical cores of atmospheric general circulation models. Bull. Amer. Meteor. Soc., 75, 1825–1830, https://doi.org/10.1175/1520-0477(1994)075<1825:APFTIO>2.0.CO;2.
Jablonowski, C., and D. L. Williamson, 2006: A baroclinic instability test case for atmospheric model dynamical cores. Quart. J. Roy. Meteor. Soc., 132, 2943–2975, https://doi.org/10.1256/qj.06.12.
Kim, H., P. J. Webster, V. E. Toma, and D. Kim, 2014: Predictability and prediction skill of the MJO in two operational forecasting systems. J. Climate, 27, 5364–5378, https://doi.org/10.1175/JCLI-D-13-00480.1.
Klemp, J. B., and R. B. Wilhelmson, 1978: The simulation of three-dimensional convective storm dynamics. J. Atmos. Sci., 35, 1070–1096, https://doi.org/10.1175/1520-0469(1978)035<1070:TSOTDC>2.0.CO;2.
Kodama, C., and Coauthors, 2014: Scalable rank-mapping algorithm for an icosahedral grid system on the massive parallel computer with a 3-D torus network. Parallel Comput., 40, 362–373, https://doi.org/10.1016/j.parco.2014.06.002.
Lee, J., and A. E. MacDonald, 2009: A finite-volume icosahedral shallow-water model on a local coordinate. Mon. Wea. Rev., 137, 1422–1437, https://doi.org/10.1175/2008MWR2639.1.
Miyakawa, T., and Coauthors, 2014: Madden–Julian Oscillation prediction skill of a new-generation global model demonstrated using a supercomputer. Nat. Commun., 5, 3769, https://doi.org/10.1038/ncomms4769.
Miyamoto, Y., Y. Kajikawa, R. Yoshida, T. Yamaura, H. Yashiro, and H. Tomita, 2013: Deep moist atmospheric convection in a subkilometer global simulation. Geophys. Res. Lett., 40, 4922–4926, https://doi.org/10.1002/grl.50944.
Miyazaki, H., Y. Kusano, N. Shinjou, F. Shoji, M. Yokokawa, and T. Watanabe, 2012: Overview of the K computer system. Fujitsu Sci. Tech. J., 48, 255–265.
Nakano, M., M. Sawada, T. Nasuno, and M. Satoh, 2015: Intraseasonal variability and tropical cyclogenesis in the western North Pacific simulated by a global nonhydrostatic atmospheric model. Geophys. Res. Lett., 42, 565–571, https://doi.org/10.1002/2014GL062479.
Nakano, M., H. Kubota, T. Miyakawa, T. Nasuno, and M. Satoh, 2017: Genesis of Super Cyclone Pam (2015): Modulation of low-frequency large-scale circulations and the Madden–Julian Oscillation by sea surface temperature anomalies. Mon. Wea. Rev., 145, 3143–3159, https://doi.org/10.1175/MWR-D-16-0208.1.
Satoh, M., 2002: Conservative scheme for the compressible nonhydrostatic models with the horizontally explicit and vertically implicit time integration scheme. Mon. Wea. Rev., 130, 1227–1245, https://doi.org/10.1175/1520-0493(2002)130<1227:CSFTCN>2.0.CO;2.
Satoh, M., and Coauthors, 2014: The Non-hydrostatic Icosahedral Atmospheric Model: Description and development. Prog. Earth Planet. Sci., 1, 18, https://doi.org/10.1186/s40645-014-0018-1.
Skamarock, W. C., J. B. Klemp, M. G. Duda, L. D. Fowler, S. Park, and T. D. Ringler, 2012: A multiscale nonhydrostatic atmospheric model using centroidal Voronoi tesselations and C-grid staggering. Mon. Wea. Rev., 140, 3090–3105, https://doi.org/10.1175/MWR-D-11-00215.1.
Váňa, F., P. Düben, S. Lang, T. Palmer, M. Leutbecher, D. Salmond, and G. Carver, 2017: Single precision in weather forecasting models: An evaluation with the IFS. Mon. Wea. Rev., 145, 495–502, https://doi.org/10.1175/MWR-D-16-0228.1.
Vitart, F., 2014: Evolution of ECMWF sub-seasonal forecast skill scores. Quart. J. Roy. Meteor. Soc., 140, 1889–1899, https://doi.org/10.1002/qj.2256.
Wicker, L. J., and W. C. Skamarock, 2002: Time-splitting methods for elastic models using forward time schemes. Mon. Wea. Rev., 130, 2088–2097, https://doi.org/10.1175/1520-0493(2002)130<2088:TSMFEM>2.0.CO;2.
Xiang, B., and Coauthors, 2015: Beyond weather time-scale prediction for Hurricane Sandy and Super Typhoon Haiyan in a global climate model. Mon. Wea. Rev., 143, 524–535, https://doi.org/10.1175/MWR-D-14-00227.1.
Yashiro, H., K. Terasaki, T. Miyoshi, and H. Tomita, 2016: Performance evaluation of a throughput-aware framework for ensemble data assimilation: The case of NICAM-LETKF. Geosci. Model Dev., 9, 2293–2300, https://doi.org/10.5194/gmd-9-2293-2016.