## 1. Introduction

Numerical models for the atmosphere and ocean have become the fundamental building blocks for today’s weather forecasts. Although global weather models are run on some of the fastest high-performance computing (HPC) systems in the world, the spatial resolution of such models, which is limited by the computational power, is still far from the scale at which many types of weather (deep convective systems for example) actually occur (Shapiro et al. 2010; Shukla et al. 2010).

Nowadays, the steady increase in flop rates, and hence in the performance, of state-of-the-art supercomputers results mainly from the use of an increasing number of processors operating in parallel, and relatively little from the use of faster processors. This presents a key challenge for geophysical modeling: for a weather or climate model to exploit the full potential of a next-generation supercomputer, at a resolution as high as possible, it needs to be able to run on many hundreds of thousands of parallel processors with a performance that does not degrade as the number of processors grows. This challenge forces model developers to be aware of the properties of the hardware on which any future model will run, and to “parallelize” the code accordingly.

For hardware development, the large increase in the numbers of processors used in supercomputers presents other types of challenges, including excessive energy demand, or a high probability of hardware failures within a simulation (see, e.g., Bergman et al. 2008). It now appears doubtful that the energy demand of HPC systems can be reduced to an extent that would allow the development of an exascale supercomputer within the time frame predicted by Moore’s law.

The use of alternative hardware to standard CPUs might allow an increase in the performance of weather and climate models and a reduction of computational costs. The most prominent examples of alternative hardware are graphics processing units (GPUs). Several groups are working on enabling at least parts of weather or climate models to run on GPU architectures (see, e.g., Michalakes and Vachharajani 2008; Lapillonne and Fuhrer 2013). The speedup has been found to be significant, but the use of GPUs typically involves rewriting the code in an appropriate language such as OpenCL (an open standard for programming CPUs, GPUs, digital signal processors, and field-programmable gate arrays, FPGAs) or CUDA (a parallel computing architecture developed by NVIDIA). In recent years, developments such as the Intel Xeon Phi have made a highly parallel structure, similar to a GPU, available to the high-performance computing community without requiring a rewrite of code written for CPUs; however, significant changes to the code will still be necessary for most applications to obtain peak performance.

The use of FPGAs allows a representation of real and integer values at customized precision levels for all variables (see, e.g., Detrey and de Dinechin 2007). FPGAs are integrated circuits that can be configured (programmed) by the user. Since the hardware can be customized to the application, the use of FPGAs promises large speedups. However, it involves a large amount of technical work to run a model on this hardware; in particular, a complete rewrite of the model code is necessary. Recent papers have shown the great potential for the use of so-called data flow engines, which are based on FPGAs, in atmospheric modeling [see Oriato et al. (2012) for the application of data flow engines to a limited area model and Gan et al. (2013) for a global shallow water model]. In a data flow engine, data are streamed from memory through the engine, where thousands of operations can be performed within one computing cycle at a relatively low clock frequency.

In this paper, we investigate a recent approach to alternative hardware that is still fairly unknown, despite its great potential: to sacrifice some of the exactness of numerical methods for significant savings in power consumption and/or an increase in performance (see, e.g., Palem 2003, 2005; Sartori et al. 2011; Palem and Lingamneni 2013). The trade-off between power consumption and precision is possible using different types of hardware designs. A first approach investigated so-called stochastic processors. The most intuitive way to develop a stochastic processor is by voltage overscaling, in which the voltage applied to some piece of hardware, such as an integrated circuit, is reduced to save energy. A voltage reduction below a critical voltage (the origin of the term overscaling) will cause timing errors (see, e.g., Narayanan et al. 2010). If the architecture of the hardware is changed, the fault rates of stochastic processors can be reduced substantially (Kahng et al. 2010). The use of stochastic processors would sacrifice bit reproducibility. This could become a challenge for weather forecast centers, since it would not be possible to reproduce and study failures in the operational forecast system. However, it is already under discussion, at least for some applications of weather and climate models, whether bit reproducibility should be sacrificed to gain performance by using, for example, more aggressive optimization during compilation. The pressure on bit reproducibility will increase with an increasing number of processors.

A second approach to inexact but efficient hardware is bit-width truncation or probabilistic pruning (Lingamneni et al. 2011), or simply pruning for short. Here, the size of a hardware component is reduced by removing parts of the hardware that are either hardly used, or do not have a strong influence on significant bits. This can be done at several levels. One level is simple bit-width truncation. This truncates a given hardware component to a specific accuracy for the bits of the output of an operation. A second level is logic pruning, which uses the statistics of a specific application to minimize hardware errors [see Düben et al. (2014a) for tests with the Lorenz ’96 model].

A third approach to inexact computing is that of mixed-precision arithmetic, which uses different levels of precision beyond that of single or double precision. The number of bits needed to represent floating-point numbers in a numerical program is reduced to a minimal level, and the amount of data can be reduced correspondingly [the program is changed to be “information efficient,” see Palem (2014)]. Reduced precision implies that more data fit into the memory and cache. It is therefore likely that memory bandwidth, which is often the major bottleneck for performance in weather and climate models, can be reduced. We expect the error pattern for bit-width truncation and reduced-precision arithmetic to be similar, since both methods limit accuracy to a given level.

To our best knowledge, no real “hardware” is available yet to perform simulations with inexact hardware, except for the use of single precision or FPGAs. If the code is not vectorized, a change from double- to single-precision floating-point numbers does not necessarily increase the peak flop rate on standard CPUs, which are optimized for 64-bit processing. However, the amount of data is reduced by half, and more data fit into memory and cache. If the speed of an application is limited by the available cache or memory, rather than by the available flop rate, the use of single-precision arithmetic can speed up the computation by a factor of up to 2. The amount of memory needed by a weather or climate model depends on the numerical method and resolution used; the amount of data transport needed depends on the number of computing nodes used in a calculation. Therefore, single precision can speed up some setups for numerical models significantly, while there is hardly any benefit for others. There will always be a significant increase in performance if hardware is used that can run single-precision operations at higher flop rates than double-precision operations.
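As a simple illustration of the data-volume argument (our own sketch; the array size is arbitrary and not taken from any of the models discussed here), the memory footprint of a field is halved when stored in single rather than double precision:

```python
import numpy as np

# A hypothetical model state of 10^7 grid values.
n = 10_000_000
state64 = np.zeros(n, dtype=np.float64)   # double precision: 8 bytes/value
state32 = state64.astype(np.float32)      # single precision: 4 bytes/value

print(state64.nbytes // 2**20, "MiB")     # 76 MiB
print(state32.nbytes // 2**20, "MiB")     # 38 MiB
```

Whether this halving translates into an actual factor-of-2 speedup depends, as noted above, on whether the application is limited by memory and cache rather than by the flop rate.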

It is very difficult to predict possible savings in power consumption and performance for inexact hardware. The savings will obviously depend on the parts of the hardware that are transformed into being inexact, for example, the floating point unit, the memory, the cache, the storage, etc. The savings will also depend on the application that is being considered, for example, in terms of memory bandwidth and data volume. However, first estimates for energy savings are very promising; see for example Lingamneni et al. (2011), who show the possibility to reduce the area-delay-energy product by a factor of up to 15 on idealized architectures with an expected relative error magnitude of 10% or less for an inexact adder, or see Düben et al. (2014a) for a simulation of the Lorenz ’96 model with emulated pruned hardware that would lead to 54%, 49%, and 16% savings in area, power, and delay for the floating point adder–subtractor block and 76%, 75%, and 22% savings in area, power, and delay for the floating point multiplier block, without serious penalties.^{1}

It is also difficult to predict the impact of inexact hardware on forecast accuracy, especially since the atmosphere is a chaotic system. In a previous paper (Düben et al. 2014b), the application of stochastic processors and reduced floating point precision was investigated in a series of T42 low-resolution climate simulations in the spectral dynamical core of the atmosphere model known as the Intermediate Global Climate Model (IGCM) or Reading Spectral Model.^{2} Climatic quantities such as the mean zonal velocity, the transient eddy momentum, and the energy spectra of simulations made over a period of 10 000 days were studied based on the so-called Held–Suarez configuration (Held and Suarez 1994). The results show that there is great potential for the use of either stochastic processors or heavily reduced precision in large parts of the model, as long as the most sensitive parts are calculated on exact hardware. It was shown that scale separation is necessary in order to isolate these most-sensitive parts. In particular, it was necessary that the level of exactness increased with decreasing wavenumber and, thus, with larger spatial scales. This behavior was predicted in Palmer (2012) on the basis that the small-scale dynamics are most strongly affected by the subgrid parameterizations whose functional forms are necessarily approximate.

In this paper, work with the IGCM is continued, performing weather forecasts at higher resolutions. A detailed analysis of the sensitivity of the different parts of the IGCM to different levels of inexactness is performed. We also present benchmark simulations in which as many parts of the model as possible are simulated on emulated inexact hardware, without a serious degradation of the results. Since the basic approach of our research is to use inexact hardware to achieve a reduction of computational cost, thus allowing higher resolution for a given energy resource, we compare the forecast quality of the benchmark simulations with simulations on exact hardware at lower resolution. This provides an estimate of the total savings that would be necessary for inexact hardware to become competitive with exact hardware.

Since there is no inexact “hardware” available to run the IGCM in its current form, the inexact hardware is emulated within the simulations. We use two different configurations for hardware emulation that are designed to cover a wide range of types of inexact hardware. The first setup induces bit flips, with prescribed probability, on all bits of the significand of the result of floating-point operations, to mimic the use of stochastic processors. The second configuration rounds the result of floating-point operations to a specific number of bits. The resulting error pattern will be similar to that associated with bit-width truncation or pruning and mixed-precision arithmetic. The emulator can only provide an estimate of the performance on real stochastic processors or mixed-precision hardware, but we argue that our results will be of interest to the communities developing hardware on the one hand, and numerical weather prediction models on the other.

The last part of the paper presents results for numerical simulations of a full weather forecast model with single-precision floating point numbers, using the Open Integrated Forecast System (OpenIFS) developed at the European Centre for Medium-Range Weather Forecasts (ECMWF). We give an overview of the changes that are necessary to move from double to single precision and compare the accuracy of double- and single-precision forecasts. We also discuss the possible speedup with single precision on CPU clusters. We are not the first to run a full weather model in single precision. The Navy Operational Global Atmospheric Prediction System (NOGAPS) is a global weather model based on spectral methods that works in single precision (Hogan and Rosmond 1991), and it was shown recently that single precision can reduce computing time with no strong influence on model quality for the Consortium for Small-Scale Modeling system (COSMO; Rüdisühli et al. 2013), to name just two examples. However, we present simulations with OpenIFS in single precision to demonstrate that computational cost could be reduced in many cases without a strong influence on the accuracy of many weather and climate models, since numerical precision is often overengineered.

Section 2 shows the results for emulated inexact hardware in IGCM. Section 3 presents the results for simulations with OpenIFS in single precision. Section 4 provides a discussion of the results.

## 2. Simulations with IGCM on emulated inexact hardware

This section gives a short introduction to the model used, and to the emulator of inexact hardware. Afterward, the setup for the numerical simulations and the results are presented.

### a. The Intermediate Global Climate Model (IGCM)

The IGCM is a dynamical core of a global atmosphere model (Hoskins and Simmons 1975; Simmons and Burridge 1981; Blackburn 1985; James and Dodd 1993). It simulates the primitive equations on the sphere. The horizontal direction is discretized using the so-called spectral discretization scheme for which the physical fields are represented in a space of global, orthogonal basis functions, the so-called spherical harmonics. In the vertical direction, *σ* coordinates are used.

To calculate the nonlinear parts of the prognostic equations, it is necessary to transform the physical fields into grid point space (i.e., onto a Gaussian grid). To do this, a Legendre transformation and a Fourier transformation need to be performed in succession. The Fourier transformation is calculated using the fast Fourier transformation (FFT) algorithm. To transform the calculated contributions back to the space of spherical harmonics, the FFT and the Legendre transform are performed in reversed order. Both transformations contribute a significant amount of the computational workload in a spectral atmosphere model. Most importantly, the computational cost of the Legendre transform becomes a serious challenge for high-resolution simulations, since it scales with *O*(*MN*^{2}), where *M* and *N* are the number of longitudes and latitudes, respectively.
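The structure of the inverse transform can be sketched in a few lines of Python (a toy illustration, not the IGCM code; array shapes and names are our own, and negative zonal wavenumbers are omitted for brevity). The dense sum over total wavenumbers in the Legendre step is the source of the *O*(*MN*^{2}) cost:

```python
import numpy as np

def inverse_transform(spec, legendre, n_lon):
    """Toy inverse spectral transform.

    spec:     complex coefficients, shape (n_zonal, n_total)
    legendre: associated Legendre function values at the Gaussian
              latitudes, shape (n_zonal, n_lat, n_total)
    Returns a gridpoint field of shape (n_lat, n_lon).
    """
    n_m, n_lat, n_n = legendre.shape
    # Legendre transform: a dense sum over total wavenumber n for each
    # zonal wavenumber m -- the O(M N^2) bottleneck mentioned in the text.
    fourier = np.einsum('mln,mn->lm', legendre, spec)
    # Fourier transform: FFT along each latitude circle.
    coeffs = np.zeros((n_lat, n_lon), dtype=complex)
    coeffs[:, :n_m] = fourier
    return np.fft.ifft(coeffs, axis=1).real * n_lon
```

The direct transform performs the same two steps in reverse order, as described above.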

### b. An emulator for inexact hardware

The aim of this work is to make a detailed analysis of the sensitivity of the different parts of IGCM to different kinds of inexact hardware. To this end, an emulator that mimics the use of inexact hardware at the software level is used within the numerical simulations. This study is limited to the processing unit and particularly to floating point numbers [see a short introduction to floating point numbers according to the Institute of Electrical and Electronics Engineers (IEEE) standard 754 in appendix A]. Since Fortran is based on the IEEE standard and since it is likely that some of the approaches to inexact hardware will maintain many features of the IEEE standard (e.g., pruning and stochastic processors), the investigation in this paper will be based on the IEEE standard as well. However, floating point numbers of different shape, as well as fixed point representations, should be kept in mind for future research on inexact hardware.

Floating point arithmetic can be assumed to be the most expensive part of an atmospheric model, since physical quantities are represented as floating point numbers and the most common operations are floating point additions, subtractions, and multiplications.

The emulator operates in two different configurations that represent different types of inexact hardware. The first configuration considers hardware faults that potentially influence the full range of bits of the significand. Such hardware faults would be expected in the use of stochastic processors (see Sloan et al. 2010). The second configuration emulates the use of reduced precision or bit-width truncated architectures, reducing the accuracy of floating point operations to a fixed number of bits in the significand.

To study the use of stochastic processors, a fault rate, *p* (0 ≤ *p* ≤ 1), is fixed. The emulator will randomly draw a number from a uniform distribution between 0 and 1, for each floating point operation that is calculated (such as multiplications, sums, but also cosines, sines, etc.). If the random number is smaller than the fault rate *p*, a fault will be introduced into the operation by flipping one of the 52 bits of the significand. The specific bit will be selected randomly with equal probability for all bits. Similar emulators were used in Kong (2008), Sloan et al. (2010), and Düben et al. (2014b).
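A minimal Python sketch of this configuration (our own illustration of the mechanism, not the actual emulator code) is:

```python
import random
import struct

rng = random.Random(42)   # seeded for reproducibility of the example

def stochastic_op(result, p):
    """With probability p, flip one randomly chosen bit of the 52-bit
    significand of a float64 result, mimicking a stochastic processor."""
    if rng.random() < p:
        # Reinterpret the float64 as a 64-bit integer to reach its bits.
        bits = struct.unpack('<Q', struct.pack('<d', result))[0]
        k = rng.randrange(52)          # significand occupies bits 0..51
        result = struct.unpack('<d', struct.pack('<Q', bits ^ (1 << k)))[0]
    return result

# Emulate an inexact addition with a 10% fault rate:
x = stochastic_op(1.0 + 2.0, p=0.1)
```

Since only significand bits are flipped, a faulty result keeps its sign and exponent, but any of its 52 fraction bits may change.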

To emulate the use of reduced precision arithmetic units and bit-width truncation, the results of all floating point operations are rounded to the closest representation possible with a significand of a given reduced number of bits. The results of the emulator differ slightly from the rounding that would happen in real reduced-precision hardware, in which rounding would also occur at intermediate steps of a floating point operation. However, the rounding errors caused by real hardware can be assumed to be similar, since guard digits keep the rounding errors of intermediate steps of floating point operations as small as possible. The emulator is only a rough benchmark for simulations with bit-width truncated or pruned hardware, since the actual error pattern for pruning will be influenced by the specific algorithm with which the pruning method is performed (see, e.g., Düben et al. 2014a). However, the assessment criterion for pruning can be a specific level of significance that corresponds to a specific number of bits in the significand.
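This rounding can be sketched in Python as follows (an illustration rather than the actual emulator; here `bits` counts the total number of significand bits retained, which may differ by one from the hardware convention depending on how the implicit leading bit is counted):

```python
import math

def reduce_precision(x, bits):
    """Round x to the nearest value representable with `bits`
    significand bits."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return math.ldexp(round(m * scale) / scale, e)

# With 8 significand bits, pi keeps only two to three decimal digits:
print(reduce_precision(math.pi, 8))   # 3.140625
```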

For both configurations, the emulator works with two different levels of inexactness: a strong reduction of exactness, using a fault rate of *p* = 0.1 in the first configuration and rounding to only 6 bits in the significand in the second, and a weaker reduction, using a fault rate of *p* = 0.01 in the first configuration and rounding to 8 bits in the significand in the second. We use these values since initial tests revealed that a fault rate of more than *p* = 0.1 or fewer than 6 bits in the significand leads to simulation crashes in many parts of the model. In contrast, most parts of the model seem to work with no problems if emulated hardware with a fault rate of *p* = 0.01 or 8 bits in the significand is used.

### c. Simulations and results

To evaluate the IGCM, it is run in the Held–Suarez test case configuration. This involves the relaxation of temperature against a prescribed, zonally symmetric field (Held and Suarez 1994). We perform a control simulation (“truth run”) with a horizontal resolution of T159 and 20 vertical levels over 1500 days. The first 1000 days are treated as spinup and are considered no further. The prognostic fields of the last 500 days of the simulation are used to initialize simulations (“forecast runs”) at lower horizontal resolutions (T42, T63, T73, T77, and T85, with 20 vertical levels) that are either performed on exact, or on emulated inexact, hardware.

To initialize a lower-resolution model, the physical fields of the control simulation are simply truncated at the coarser resolution. Since we have not found any significant increases to the divergence field as a result of this truncation, we assume that no further filtering schemes are required. For each lower-resolution configuration, twenty 15-day forecasts are made from different initial dates. These initial dates are taken from the control integration; pairs of initial dates are separated by 25 simulated days. Since the T159 control simulation is taken as truth, it is also used to verify the lower-resolution forecasts. To allow a fair comparison between the “forecasts” at different resolutions, we truncate the physical fields at T42 before calculating forecast errors.
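The truncation used for initialization amounts to zeroing all spectral coefficients above the target total wavenumber. A toy sketch (with a hypothetical rectangular coefficient layout, not the IGCM's actual storage scheme):

```python
import numpy as np

def truncate(spec, cutoff):
    """Zero all coefficients with total wavenumber n > cutoff.
    spec is a complex array indexed [m, n]."""
    out = spec.copy()
    n = np.arange(spec.shape[1])
    out[:, n > cutoff] = 0.0
    return out

# e.g., initialize a T42 forecast from (toy) T159 coefficients:
spec_t159 = np.random.default_rng(0).normal(size=(160, 160)) + 0j
spec_t42 = truncate(spec_t159, cutoff=42)
```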

The subsequent subsections present detailed studies of the sensitivities of different parts of the model to the use of stochastic processors or reduced precision hardware. To this end, section 1 investigates the time-stepping scheme, section 2 investigates calculations on the Gaussian grid and for the FFT, and section 3 studies the Legendre transformations. Finally, in section 4 we make benchmark simulations in which we calculate as many parts of a simulation as possible on inexact hardware while keeping the quality of the simulations as high as possible.

The most important diagnostics evaluated in the following are the mean error of geopotential height at the 10th vertical level (approximately 500 hPa) plotted against time, and snapshots of surface pressure after 15 days of integration. The plots of the mean error of geopotential height show exponential growth for small forecast times, since two slightly perturbed chaotic systems diverge exponentially in time. The error levels off and converges toward a mean difference of around 70 m for long forecast times, which represents the mean difference between two uncorrelated states of the system. We also discuss estimates for the computational cost associated with different parts of the model at T85 resolution. To this end, we present ratios of the execution time spent in the different parts of the model integration for a T85 control simulation on an Intel i7 CPU. The numbers are calculated with the GNU profiler gprof. The ratios can only serve as a best guess for the real computational cost, since they do not capture the exact energy consumption and depend on the compiler, the architecture, and the resolution.
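The mean-error diagnostic can be sketched as an area-weighted global mean absolute error per forecast time (an illustrative implementation with our own variable names; the exact weighting used in the paper is an assumption):

```python
import numpy as np

def mean_error(forecast, truth, lat):
    """Area-weighted global mean absolute error on a lat-lon grid.

    forecast, truth: arrays of shape (n_time, n_lat, n_lon)
    lat:             latitudes in degrees, shape (n_lat,)
    Returns one error value per forecast time.
    """
    w = np.cos(np.deg2rad(lat))[None, :, None]   # area weights
    err = np.abs(forecast - truth)
    return (err * w).sum(axis=(1, 2)) / (w.sum() * forecast.shape[2])
```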

#### 1) Time-stepping scheme

In terms of execution time, about 8% of the numerical cost is associated with the semi-implicit time-stepping scheme. Here, we count as part of the time-stepping scheme all calculations in spectral space that are performed between the two Legendre transforms, namely the adiabatic time step, the calculation of the spectral tendencies for restoration and biharmonic diffusion, and the diabatic part of the time step. If the full time-stepping scheme is simulated with the stochastic emulator, the model simulations crash immediately, or within a couple of days. To obtain stable simulations, it is necessary to perform a scale separation: the large-scale dynamics are calculated on exact hardware, while the small-scale dynamics, for spherical harmonics with a global wavenumber greater than either 20 or 30, are calculated on emulated inexact hardware. We denote these simulations as WN 21 and WN 31 in the following.
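The scale separation can be sketched as applying a reduced-precision emulator only to coefficients above the cutoff total wavenumber (a self-contained toy illustration of the WN 21/WN 31 idea, not the IGCM code):

```python
import math
import numpy as np

def round_bits(x, bits):
    # Round x to `bits` significand bits (simple rounding emulator).
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)
    s = 2.0 ** bits
    return math.ldexp(round(m * s) / s, e)

def scale_separated(spec, cutoff, bits):
    """Keep coefficients with total wavenumber n <= cutoff in full double
    precision; round all smaller-scale coefficients to `bits` bits."""
    out = np.array(spec, dtype=complex)
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            if n > cutoff:
                out[m, n] = complex(round_bits(out[m, n].real, bits),
                                    round_bits(out[m, n].imag, bits))
    return out
```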

Figures 1 and 2 show the forecast error of geopotential height and the surface pressure field, respectively, for simulations with emulated stochastic processors and with reduced floating point accuracy. Due to the chaotic nature of the atmosphere, we do not expect the surface pressure fields to be identical after 15 days for different model setups. While the simulations with a fault rate of *p* = 0.1 and with an emulated 6-bit significand show a large forecast error and strong perturbations in the surface pressure field, the simulations with a fault rate of *p* = 0.01 and with an 8-bit significand do not show any anomalous behavior compared with the T85 control simulation in double precision. For the simulations with strong perturbations (*p* = 0.1 or a 6-bit significand), it is evident that scale separation has a notable impact: there is a strong reduction of the forecast error if the minimal wavenumber that is perturbed is increased from 21 to 31, while the computational cost of calculating the wavenumbers between T21 and T31 is small (cf. WN 21 *p* = 0.1 and WN 31 *p* = 0.1, or WN 21 with 6 bits and WN 31 with 6 bits, in Fig. 1). The surface pressure fields with an emulated 6-bit significand for the WN 21 and WN 31 simulations show clear spurious structures aligned with the longitudes. This is surprising, since isotropic perturbations would be expected if the representation of the spherical harmonics is perturbed upward of a particular global wavenumber for the triangular truncation method used. It is hard to identify the reason for the nonisotropic pattern, but we argue that it can be explained by the nature of the error patterns of reduced precision hardware. If all relevant terms are calculated without rounding errors in significant digits, errors remain small. However, if a relevant part of the calculation has significant digits in the range of rounding errors, significant errors develop. This seems to be the case for spherical harmonics with fast variations in the zonal direction in the tropics in the simulations with six emulated bits in the significand, while the patterns are gone for eight emulated bits. If zonal and meridional flows contribute terms of different magnitudes, which can be assumed to be true for the strong zonal flows in the tropics, the error pattern might show nonisotropic structures. However, this remains speculation.

Time-stepping scheme. Snapshot of surface pressure after 15 days of simulation.

Citation: Monthly Weather Review 142, 10; 10.1175/MWR-D-14-00110.1


#### 2) Gaussian grid and fast Fourier transformation

About 9% of the numerical cost in terms of execution time is associated with calculations on the Gaussian grid and 20% with the FFT. Neither for the calculations on the Gaussian grid, nor for the calculations of the FFT, can we easily introduce a scale separation: the operations on the Gaussian grid are not performed in spectral space, and the FFT exploits combinations of large and small wavenumbers to obtain a speedup. Therefore, we must either perform all or none of these operations with the emulator of inexact hardware.

Figures 3 and 4 show the forecast error for geopotential height and the surface pressure field for the different simulations. There is no visible increase in mean error for geopotential height if the emulated inexact hardware is applied to the calculations on the Gaussian grid with a fault rate of *p* = 0.1 or a 6-bit significand or to the FFT with a fault rate of *p* = 0.01 or an 8-bit significand. However, there is a strong impact on the simulations if the FFT is calculated with a fault rate of *p* = 0.1 or with a 6-bit significand. The error pattern that is visible in the surface pressure field for the “FFT *p* = 0.1” simulation appears to be fairly isotropic.

Gaussian grid and FFT. Mean error in geopotential height for simulations with (left) an emulated stochastic processor and (right) reduced floating point accuracy.


Gaussian grid and FFT. Snapshot of surface pressure after 15 days of simulation in which all floating point operations on the Gaussian grid or for the FFT are calculated with emulated inexact hardware. Perturbations are visible for the simulations FFT *p* = 0.01 and 0.1.


#### 3) Legendre transformation

The Legendre transformation is the most expensive part of the model. While the so-called inverse transformations from spectral to gridpoint space are associated with about 38% of the computational cost, the “direct transformations” from grid point to spectral space are associated with about 23% of the computational cost. Since the Legendre transformations account for more than half of the overall numerical cost at T85 resolution, we perform a thorough investigation of this part of the model. Figure 5 shows the forecast error for geopotential height for all simulations performed and Fig. 6 shows the surface pressure field for the most significant simulations. In the first set of simulations, we put either the whole Legendre transformation (Full), or the floating point operations for spherical harmonics with a total wavenumber greater than 20 or 30 (WN 21 and WN 31), onto the emulator. While errors are small for the simulations with *p* = 0.01 and with an 8-bit significand, errors are significantly larger for *p* = 0.1 and for a 6-bit significand, compared with the forecast errors of the T85 control simulation.

Legendre transformation. Mean error in geopotential height for simulations with an emulated stochastic processor with a fault rate of (top left) *p* = 0.01 and (top right) *p* = 0.1 and reduced floating point accuracy of the significand of (bottom left) 8 and (bottom right) 6 bits. Inexact hardware was emulated for the calculation of spherical harmonics with global wavenumbers greater than 20 (WN 21), 30 (WN 31), and 64 (WN 65); additional simulations were performed in which only the inverse or the direct Legendre transformation was perturbed, or in which only the calculation of spherical harmonics with a zonal wavenumber greater than 30 was perturbed. It is surprising that the forecast error of the zonal-scale separation is still smaller when compared to the WN 65 run for simulations with an emulated stochastic processor.


Legendre transformation. Snapshot of surface pressure after 15 days of simulation.


Figure 5 also shows the results of simulations in which either the inverse or the direct Legendre transformation is performed on the emulator. It is perhaps surprising that there is a larger error when calculating the inverse than when calculating the direct Legendre transformation with inexact hardware (cf. the forecast errors of WN 31 inverse and WN 31 direct). The opposite had been expected, since a recent paper on the fast Legendre transformation (an algorithm that speeds up the Legendre transformation using an inexact calculation with small error) has shown that the inverse Legendre transformation is less sensitive to inexactness than the direct Legendre transformation for a full weather forecasting model at very high resolution (T1279 and T2047), including physics (Wedi 2014). Those simulations were performed with the operational Integrated Forecast System (IFS) of the ECMWF. However, given that the types of inexactness differ, that there is a huge difference in model resolution between the two model configurations, and that the IFS includes physics, different results are not very surprising.

Additional simulations were performed in which emulated inexact hardware was used for spherical harmonics with a *zonal* wavenumber larger than 30 (WN 31 zonal). It is not surprising that the error for these simulations is smaller than for simulations in which the scale separation is performed according to the global wavenumber (WN 31), since there are more spherical harmonics with a global than with a zonal wavenumber greater than 30. However, it is surprising that, for the simulations with an emulated stochastic processor, the forecast error of the zonal scale separation is still smaller than that of a simulation that puts the spherical harmonics with a global wavenumber larger than 64 onto the emulator (WN 65). Such a simulation involves fewer operations on the emulator than the zonal case. For the reduced precision tests, both setups (WN 65 and WN 31 zonal) show a forecast error very similar to that of the control simulation.

The zonal truncation perturbs spherical harmonics with high wavenumbers in the zonal direction, while the dynamics of spherical harmonics with high wavenumbers in the meridional direction are calculated with high precision. While a global truncation is isotropic in the two horizontal directions, a zonal truncation is not. A possible explanation for the relatively low forecast error in the WN 31 zonal *p* = 0.1 simulation is the different nature of the dynamics in the meridional and zonal directions. The mean zonal velocity is much larger than the mean meridional velocity over most of the planet. Therefore, the meridional dynamics might be more sensitive to hardware errors at high wavenumbers. The surface pressure field of the WN 31 inverse *p* = 0.1 simulation in Fig. 6 supports this hypothesis, since it shows small-scale perturbations in the meridional direction. This result might revitalize the discussion of whether the isotropic, triangular truncation is the optimal truncation to use in spectral models, or whether, for example, a nonisotropic rhomboidal truncation might be advantageous (see, e.g., Simmons and Hoskins 1975), given that the dynamics are fundamentally nonisotropic. However, the effect might be exaggerated in the given Held–Suarez setup, since there is no annual cycle, the zonal symmetry is not perturbed by topography, and we use only relatively low resolution. The effect would need to be confirmed in simulations at much higher resolution, for which differences between meridional and zonal velocity become smaller, before any meaningful conclusions can be drawn.

#### 4) Benchmark simulations

Having discussed the sensitivity of the different parts of the model to inexact hardware, we can now set up benchmark simulations in which as many parts of the model as possible are integrated using the inexact emulator, while trying to keep the resulting errors as small as possible. We perform simulations in which we put different parts of the model on an emulator that emulates either strong (*p* = 0.1 or a 6-bit significand) or weak (*p* = 0.01 or an 8-bit significand) perturbations. The approach of studying combinations of computing hardware at different accuracies is useful since a single processor might be able to work at several levels of accuracy, with several different levels of power consumption,^{3} and it might also be possible to distribute the workload to different processors at different accuracies that run in parallel. We perform the following three simulations with both emulated stochastic processors and emulated reduced precision:

Run 1—Calculations on the Gaussian grid and for spherical harmonics with zonal wavenumber greater than 30 for the inverse Legendre transform and global wavenumber greater than 30 for the direct Legendre transformation are performed with strong inexact perturbations. The rest of the Legendre transformation, the time stepping for the spherical harmonics with total wavenumber greater than 20, and the FFT are calculated with weak inexact perturbations.

Run 2—Similar to run 1, but with the FFT calculated with no hardware errors.

Run 3—Calculations on the Gaussian grid and for spherical harmonics with zonal wavenumber greater than 30 for the inverse and global wavenumber greater than 30 for the direct Legendre transformation are performed with strong perturbations. All other parts are calculated with exact arithmetic.
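The scale separation used in these runs can be sketched as follows. This is an illustrative emulation, not the emulator used in the paper: `truncate_significand` and `perturb_spectral` are hypothetical helper names, and low significand bits are simply truncated toward zero rather than properly rounded.

```python
import struct

def truncate_significand(x: float, bits: int) -> float:
    """Keep only `bits` fraction bits of a float64 (an IEEE-754 double
    has 52 explicit fraction bits); truncation toward zero for simplicity."""
    (i,) = struct.unpack(">Q", struct.pack(">d", x))
    i &= ~((1 << (52 - bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack(">d", struct.pack(">Q", i))[0]

def perturb_spectral(coeffs, bits=8, threshold=30):
    """Scale separation: apply reduced precision only to spectral
    coefficients whose total wavenumber n exceeds the threshold."""
    return {(n, m): truncate_significand(c, bits) if n > threshold else c
            for (n, m), c in coeffs.items()}

# Large scales (n <= 30) keep full precision; small scales are degraded.
coeffs = {(10, 5): 3.141592653589793, (40, 5): 3.141592653589793}
print(perturb_spectral(coeffs))  # {(10, 5): 3.141592653589793, (40, 5): 3.140625}
```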

It is currently impossible to make realistic estimates of the actual savings that could be realized using inexact hardware, since the hardware has not yet been produced, and specific estimates can only be made for certain parts, such as the adder–subtractor and multiplier blocks of a floating point unit (see Düben et al. 2014a). However, we can evaluate how large the cost reduction for inexact hardware would need to be for it to be competitive with exact hardware, given that a decrease in computational cost would allow higher-resolution simulations to be run. To this end, we compare the forecast quality of simulations with the emulated inexactness at T85 resolution to the forecast quality of control simulations with double precision at T85 resolution, and also to simulations made with double precision at lower resolution. As a rough approximation, we expect the total computational cost to increase proportionally to the cube of the total wavenumber, since we have two horizontal dimensions and the time dimension, while the vertical resolution stays constant at 20 levels. Therefore, we expect T42, T63, T73, and T77 simulations to be about 8.3, 2.5, 1.6, and 1.3 times less expensive than a T85 simulation. However, this crude approximation breaks down since different parts of the model show different scalings of the computational cost with the maximal total wavenumber. We therefore performed additional measurements of the computing time, which show that simulations at T42, T63, T73, and T77 resolution require approximately 15.3, 2.8, 1.7, and 1.3 times less time on a single CPU than a simulation at T85 resolution. The expected decrease of computational cost is very similar for the T73 and T77 simulations in both approximations. It should be noted that T73 and T77 are not ideal wavenumbers for transformations between the space of the spherical harmonics and gridpoint space; this is a disadvantage in the timing test.
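The quoted cost ratios follow directly from the cubic scaling assumption and can be checked in a few lines (a back-of-the-envelope sketch; the function name is our own):

```python
# Expected relative cost of a lower-resolution TN run compared to T85,
# assuming total cost scales with the cube of the maximal total
# wavenumber (two horizontal dimensions plus the time dimension;
# the 20 vertical levels are held fixed).
def cost_factor(n_max, ref=85):
    return (ref / n_max) ** 3

for n in (42, 63, 73, 77):
    print(f"T{n}: about {cost_factor(n):.1f} times cheaper than T85")
# T42: about 8.3 times cheaper than T85
# T63: about 2.5 times cheaper than T85
# T73: about 1.6 times cheaper than T85
# T77: about 1.3 times cheaper than T85
```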

For runs 1–3, approximately 45% of the computational cost, estimated in terms of execution time of the double-precision control simulation, is put on the emulator with strong inexact perturbations. Additionally, 53% or 34% of the computational cost is put on the emulator with weak perturbations for runs 1 and 2, respectively. To calculate these numbers, the fraction of execution time spent in the different subroutines was evaluated; within a specific subroutine, the ratio of floating point operations on exact and inexact hardware was used to obtain a more detailed breakdown. These numbers are therefore not exact, but they serve as the best available estimates.

Figures 7 and 8 show the forecast error for geopotential height and the surface pressure fields for the benchmark simulations. The forecast errors of the three simulations with the emulated stochastic processor stay below the error curve of the unperturbed simulation at T73 resolution for the first 10 days. All three runs with the emulated stochastic processors show slight perturbations in the surface pressure fields. The forecast errors of the three simulations with the emulated reduced precision stay below the error curve of the unperturbed simulation at T77 resolution, and no perturbations are visible in the surface pressure fields.

Benchmark simulations. Mean error in geopotential height for simulations with an (left) emulated stochastic processor with *p* = 0.01 and *p* = 0.1 and (right) emulated reduced floating point accuracy in the significand to 8 and 6 bits.


Benchmark simulations. Snapshot of surface pressure after 15 days of simulation. The simulations with emulated stochastic processors show perturbations, while the simulations with emulated reduced precision do not.


Figures 9 and 10 show the energy spectra and the zonal velocity fields for the different benchmark simulations. The energy spectra should appear linear in the logarithmic plots, due to the turbulent cascade. For high wavenumbers, the linear shape breaks down since viscosity removes energy at small scales. Differences between the spectra at different resolutions are mainly caused by changes in viscosity, which can be reduced by increasing the resolution. It was already shown in Düben et al. (2014b) that the use of inexact hardware can lead to an imbalance, diagnosed by an increase in the contribution of divergent wind to the kinetic energy spectrum. This is again visible in the benchmark simulations. Runs 1 and 2 with emulated stochastic processors show a clearly unacceptable increase in the contribution of divergence, and the full spectra of kinetic energy show an increase at high wavenumbers as well. The increase in the spectra of the contribution of divergence is still visible in run 3, and it is also apparent if the fault rate in run 3 is reduced to *p* = 0.05. The perturbations for simulations with emulated stochastic processors are also visible in the zonal velocity fields. The results with emulated reduced precision perform much better in all diagnostics: the overall kinetic energy spectra are hardly changed, induced imbalances are comparably small, as is visible in the contribution of divergence to the kinetic energy spectra, and there are no visible perturbations in the zonal velocity fields.

Benchmark simulations. (top) Energy spectra and (bottom) spectra of the contribution of divergence to the kinetic energy spectra for (from left to right) the control simulations, the simulations with emulated stochastic processors, and the simulations with emulated reduced floating point precision after 15 days of simulation.


Benchmark simulations. Snapshot of zonal velocity at the 10th vertical level (approximately 500 hPa) after 15 days of simulation.


For the simulations with the emulated stochastic processors, it is difficult to identify the source of the relatively large error; the simulations were set up in such a way that we did not expect the errors to be this significant. Tests in which we perturbed the vorticity field, by adding a random perturbation after every time step to each coefficient for the vorticity of the spherical harmonics with a global wavenumber larger than 30, showed a similar error pattern for the given spectra and fields (not shown here). Since a perturbation in vorticity can cause a similar error in the spectra of divergence in Fig. 9, we expect that the error is not solely caused by hardware errors in the calculation of the divergence field.

## 3. Simulations with OpenIFS in single precision

This section presents the results of simulations with the OpenIFS using single-precision floating point numbers. The first subsection gives a short introduction to OpenIFS and the changes that are necessary to allow single-precision simulations. The next subsection presents results on the quality of simulations in single precision. In the final subsection, information on the performance of single-precision simulations in comparison with double-precision simulations is provided.

### a. OpenIFS and necessary changes to allow simulations in single precision

OpenIFS is a portable version of the full forecast model of the Integrated Forecast System of the ECMWF, based on Cycle 38R1 of IFS (Carver et al. 2013; Carver and Vana 2014). It provides the forecast capability of IFS, but no data assimilation. OpenIFS works with a hydrostatic dynamical core and uses the physical parameterization and land surface schemes of IFS.

OpenIFS has about 500 000 lines of code in more than 2500 files (Carver and Vana 2014), mainly written in Fortran. Appendix B provides an overview of the steps that are necessary to allow the use of single precision in most parts of the model integration. As the list indicates, the modifications needed to switch between double and single precision in a model the size of OpenIFS represent a serious effort. However, a rewrite of large parts of the model code is not necessary. It is likely that a version of OpenIFS could be built, with little additional effort, in which single or double precision is selected by a flag in the namelist. We tested only a small subset of possible diagnostics, but these did not reveal problems with the model setup. However, we do not claim that the developed single-precision version yet works perfectly in all situations. It is possible that epsilon values need to be added to the code in several places to make model simulations stable for all possible initial conditions, or that model simulations will "crash" at higher model resolutions. We experienced technical problems when we tried to run the single-precision version of the model with OpenMP, but we believe that these problems can be solved in a more detailed analysis that goes beyond the scope of this proof-of-concept paper.
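The need for such epsilon adjustments can be illustrated with a small sketch (ours, not taken from the OpenIFS code): a guard term that is comfortably resolved in double precision can round away entirely in single precision.

```python
import struct

def to_f32(x: float) -> float:
    """Round-trip a value through IEEE-754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

# In double precision a guard epsilon of 1e-8 is still resolved:
print(1.0 + 1e-8 != 1.0)          # True
# In single precision (machine epsilon ~1.2e-7) it vanishes, so a
# hard-coded epsilon tuned for double precision can silently become
# a no-op, e.g. when protecting a division against zero:
print(to_f32(1.0 + 1e-8) == 1.0)  # True
```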

### b. Simulations and results

To test the single-precision setup, we performed several simulations at different horizontal resolutions (T21, T159, T255, T511, and T639 with approximately 950-, 125-, 80-, 39-, and 32-km resolution). We present results for the forecast with the highest resolution, T639, which is the resolution of the operational ensemble prediction system at ECMWF. The initial date of the forecast is 1 November 2012 at 0000 UTC.

Figures 11 and 12 show the results for geopotential at 500 hPa and temperature at 850 hPa after 1, 5, and 10 days of simulation for the single-precision setup, the double-precision control simulation of the ensemble forecast performed with IFS, and one ensemble member. To compare the results of OpenIFS and IFS, the model output of OpenIFS and the ensemble data are mapped onto the same regular grid at 2.5° resolution using bilinear interpolation. All of the fields look reasonable, including those of the single-precision simulation. Figures 13 and 14 show the difference between either the single-precision simulation or one ensemble member of the ensemble forecast and the double-precision control simulation for geopotential at 500 hPa and temperature at 850 hPa, as well as the standard deviation of the ensemble forecast. The differences between the single-precision and the control simulation are always smaller than the differences between the control simulation and one ensemble member. The differences for the single-precision run are reasonably small compared to the ensemble standard deviation. This is promising since the ensemble spread is tuned to represent the forecast uncertainty. As long as the single-precision simulation stays within the "envelope" of the ensemble system, we can assume that the accuracy of the single-precision forecast is approximately equal to that of the double-precision forecast. However, further tests and a statistical analysis would be needed to prove that single precision always produces an acceptable error range for all possible flow regimes; this is beyond the scope of this paper. It is possible that some of the differences between the single-precision and the control simulation are caused by the use of different model versions (IFS and OpenIFS) and by the mapping procedure used to allow a comparison of the results.

OpenIFS in single precision. Geopotential at 500 hPa for the (left) double-precision control simulation, (middle) single-precision simulation, and (right) one ensemble member of the operational ensemble system after (from top to bottom) 1, 5, and 10 days of simulations. All simulations use T639 resolution.


OpenIFS in single precision. Temperature at 850 hPa for the (left) double-precision control simulation, (middle) single-precision simulation, and (right) one ensemble member of the operational ensemble system after (from top to bottom) 1, 5, and 10 days of simulations.


OpenIFS in single precision. Absolute difference in geopotential at 500 hPa between the double-precision control simulation and either (left) the single-precision simulation or (middle) one ensemble member of the operational ensemble system, and (right) standard deviation of the ensemble after 1, 5, and 10 days of simulation. Note the different color bars on the different lines in the plot.


OpenIFS in single precision. Absolute difference in temperature at 850 hPa between the double-precision control simulation and either (left) the single-precision simulation or (middle) one ensemble member of the operational ensemble system, and (right) standard deviation of the ensemble after 1, 5, and 10 days of simulation. Note the different color bars on the different lines in the plot.


### c. Model performance of the single-precision version

All simulations were performed on 64-bit CPU architecture. If a single-precision simulation is run on 64-bit architecture and if no vectorization is used, the flop rate is not necessarily increased compared to the flop rate at double precision. A speedup with single precision can only result from the reduced amount of data that needs to be stored, transported, or fitted into memory and cache. However, if the speed of a simulation is limited by data transport or memory, we expect an increase in performance by a factor of up to 2. For OpenIFS, the measured speedup was heavily dependent on the computing architecture used and the amount of parallelization. We obtain hardly any speedup for a T159 or a T255 simulation on a desktop computer with four message passing interface (MPI) tasks,^{4} while we see reductions in computing time of 22%, 27%, and 25% when running T159, T255, and T511 models on one computing node of a CPU cluster^{5} using eight MPI tasks. We could run a T511 single-precision simulation on the desktop computer and a T639 single-precision simulation on the computing node while this was not possible in double precision, due to the limited memory. Work is in progress to obtain meaningful comparisons within an operational environment. The use of single precision could also increase the flop rate, if 32-bit architecture were to be used.
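The factor-of-2 upper bound comes simply from the halved word size; a minimal illustration (variable names our own):

```python
from array import array

n = 1_000_000  # one million gridpoint values, initialized to zero
double = array("d", bytes(8 * n))  # 64-bit IEEE floats
single = array("f", bytes(4 * n))  # 32-bit IEEE floats

# The single-precision field occupies half the memory, so for code
# limited by memory bandwidth or cache capacity, up to twice as much
# data fits through the same bottleneck per unit time.
print(n * double.itemsize)  # 8000000 bytes
print(n * single.itemsize)  # 4000000 bytes
```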

## 4. Discussion

This paper is a natural extension of the work in Düben et al. (2014b), which investigated the impact of the use of inexact hardware on the climatology of the spectral dynamical core of an atmosphere model using hardware emulators. Here, the focus is on the skill of short- and medium-range forecasts; this enables us to work at higher resolution [T85 compared to T42 in Düben et al. (2014b)]. However, we believe that this study is also relevant for longer-range predictions (monthly, seasonal, and decadal). We provide a detailed analysis of the sensitivities of the different parts of the model to the use of inexact hardware for which either all of the bits in the significand of floating point operations are perturbed (emulating stochastic processors) or only the less significant bits (emulating reduced precision arithmetic or bit-width-truncated floating point units). The results show that large parts of the model allow the use of inexact hardware with high fault rates (*p* = 0.1), or a surprisingly large reduction of meaningful bits in the significand to effectively only 6. However, the different sensitivities of the different parts of the model need to be respected (e.g., by scale separation).

We present results for benchmark simulations for which we try to put as many parts of the model as possible on emulated inexact hardware without a serious degradation of the results. The first benchmark simulation has approximately 45% of the computational cost in terms of execution time calculated on hardware with one bit flip in the significand of 10% of all floating point operations, and additionally about 53% of the computational cost on hardware with one bit flip in 1% of all floating point operations. Within the first 10 days of the forecast, the forecast error in geopotential height is smaller than that of a control simulation at coarser resolution that is approximately 40% cheaper than the T85 control simulation.

The second benchmark simulation has approximately 45% of the computational cost in terms of computing time calculated on hardware with an emulated significand of only 6 bits and additionally about 53% of the computational cost on hardware with an emulated significand of 8 bits. The forecast error in geopotential height is smaller than the forecast error of a control simulation with coarser resolution that is approximately 25% cheaper than the T85 control simulation.

To be competitive with exact hardware, inexact hardware would need to deliver savings of the same magnitude, via speedups or comparable reductions of computing cost, for the T85 simulations. Obviously, a number of problems need to be solved before such a comparison is fair, such as the necessity to provide exact hardware for the sensitive parts of the model, working in parallel with the inexact hardware. If the rest of the computing architecture is unchanged, savings in power consumption and delay for the floating point unit need to be considerably larger to compensate for the power consumption and delay of, for example, the memory. However, savings of 25% or 40% appear to be possible for the different approaches to inexact hardware. Furthermore, it is likely that a larger decrease in computational precision is possible if simulations are perturbed not between global wavenumbers 20 and 85, but between wavenumbers such as 1000 and 8000 in high-resolution simulations.

The benchmark simulations with emulated stochastic processors show problems with perturbations of the model balances and visible perturbations in the physical fields. These problems need to be addressed. The simulations with reduced floating point precision perform well in all diagnostics. Therefore, approaches that influence only the less significant bits (such as reduced precision arithmetic, bit-width truncation, and pruning) appear currently to be more promising than approaches that allow errors in all bits of the significand (such as stochastic processors).

The different nature of the errors from stochastic processors and from reduced-precision hardware leads to different advantages and disadvantages for the two approaches. Since stochastic processors influence the full range of bits in the significand, they can cause imbalances in the model simulations, and even very small error rates can reduce the quality of numerical simulations significantly. However, they only add perturbations to a numerical simulation; the signal is typically not destroyed completely. On the other hand, the error pattern caused by reduced precision and numerical rounding affects only the less significant parts of the calculations. If a relevant signal falls into the same significance level as the numerical rounding, the quality of the simulation is reduced significantly (see, e.g., the WN 21 6-bit simulation in Fig. 2) and the numerical signal can be destroyed completely. If, however, the gap between the rounding errors and the significant bits is large enough, the numerical simulations are hardly disturbed.
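This asymmetry can be made concrete with a small sketch (an illustration of the two error models, not the emulator used in the paper): a flip of a high-order significand bit of a float64 produces a large relative error, while truncating to 8 significand bits gives a relative error bounded by 2^{-8}.

```python
import struct

def flip_significand_bit(x: float, bit: int) -> float:
    """Flip one of the 52 fraction bits of a float64 (bit 51 is the
    most significant fraction bit) -- a stochastic-processor-style fault."""
    (i,) = struct.unpack(">Q", struct.pack(">d", x))
    i ^= 1 << bit
    return struct.unpack(">d", struct.pack(">Q", i))[0]

def truncate_significand(x: float, bits: int = 8) -> float:
    """Discard all but `bits` fraction bits -- a reduced-precision-style
    error (truncation toward zero, for simplicity)."""
    (i,) = struct.unpack(">Q", struct.pack(">d", x))
    i &= ~((1 << (52 - bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack(">d", struct.pack(">Q", i))[0]

x = 1.2345678901234567
print(abs(flip_significand_bit(x, 51) - x))  # 0.5 -> ~40% relative error
print(abs(flip_significand_bit(x, 0) - x))   # one ulp, ~2.2e-16
print(abs(truncate_significand(x) - x) / x < 2 ** -8)  # True: bounded
```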

The simulations with reduced precision suggest that about 98% of the computational workload of the double-precision simulation could be performed with only 8 bits in the significand. In additional tests, we found that the parameter range of all floating point numbers within a typical model integration at T85 resolution could be represented with 6 bits in the exponent with no visible degradation of the model results. A reduction of the representation of floating point numbers to only 15 bits in nearly all parts of the model therefore seems to be possible. Such a vast reduction in the number of bits would allow new approaches to hardware development. If the number of bits in the significand is reduced to only 8 bits, there are 2^{16} possible combinations of the 16 input bits to the multiplier block that computes the new significand in a floating point multiplication. All possible results could be stored in 2^{16} × 8 = 524 288 bits, which is only 64 kilobytes. Even less storage (2^{12} × 6 = 24 576 bits) is necessary for the part of the floating point unit that computes the new exponent. Such a vast reduction in the number of possible combinations, compared to double- or single-precision floating point numbers, might allow a replacement of the processing unit by lookup tables that store all results of the floating point arithmetic. Obviously, the example above for a floating point multiplication does not yet consider the change in the exponent caused by the normalization of the significand.
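As a back-of-the-envelope illustration (our sketch, not a hardware design from this work), the following Python builds the complete lookup table for the significand part of such an 8-bit multiplication, together with the normalization carry into the exponent that the text notes is not yet covered:

```python
# Precompute every possible result of an 8-bit significand multiplication.
# The 16 input bits are (a << 8) | b, where a and b are the explicit
# fraction bits of the two significands (1 + a/256) and (1 + b/256).
sig_table = bytearray(1 << 16)   # 2**16 entries x 8 bits = 524 288 bits
carry = bytearray(1 << 16)       # normalization carry into the exponent

for a in range(256):
    for b in range(256):
        p = (1 + a / 256) * (1 + b / 256)     # exact product, in [1, 4)
        c = 1 if p >= 2.0 else 0              # normalize into [1, 2)
        f = round((p / (1 << c) - 1) * 256)   # round to 8 fraction bits
        if f == 256:                          # rounded up to 2.0: renormalize
            f, c = 0, c + 1
        sig_table[(a << 8) | b] = f
        carry[(a << 8) | b] = c

def mul_significand(a, b):
    """Product of the significands (1 + a/256) and (1 + b/256) via lookup."""
    i = (a << 8) | b
    return (1 + sig_table[i] / 256) * 2 ** carry[i]

print(mul_significand(128, 64))   # 1.5 * 1.25 -> 1.875, exactly
```

Every lookup is then a single memory access in place of a multiplier circuit; the rounding error of the tabulated result never exceeds half a unit in the last of the 8 fraction bits.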

The simulations performed with single precision serve as a proof of concept that trading exactness for reduced computational cost is worthwhile already on existing hardware, and beyond the example of the plain dynamical core at relatively low resolution. We show that a full weather forecasting model based on spectral discretization methods can run in single precision without a serious degradation of the results at the resolution of the operational ensemble forecast system (T639 ≈ 32 km) at ECMWF. The differences between the single- and double-precision simulations are smaller than the differences between ensemble members of the ensemble forecast and the control simulation. Since the spread of the forecast ensemble is tuned to represent the forecast uncertainty, it can be assumed that the accuracy of the single-precision forecast is not reduced. We already see a moderate speedup on a standard 64-bit architecture (up to 25% reduced computing time). Future studies will assess the impact on speedup in an operational computing environment.

## Acknowledgments

Special thanks to Glenn Carver and the ECMWF team for essential help and patience when setting up the single-precision OpenIFS model. We thank Mike Blackburn, Fenwick Cooper, Jaume Joven, Avinash Lingamneni, Mio Matsueda, Hugh McNamara, Krishna Palem, and Nils Wedi, as well as the two anonymous reviewers, for helpful discussions and useful suggestions. Special thanks also to Oliver Fuhrer for useful insight into the potential of single-precision simulations. The position of Peter Düben is funded by an ERC grant (Towards the Prototype Probabilistic Earth-System Model for Climate Prediction, Project Reference 291406).

## APPENDIX A

### Floating Point Numbers

A double-precision floating point number consists of 1 bit for the sign *s*; 52 bits for the so-called significand or mantissa *b*_{−1}, *b*_{−2}, …, *b*_{−52}; and 11 bits for the exponent *c*_{10}, *c*_{9}, …, *c*_{0}. A floating point number *x* is given by

*x* = (−1)^{*s*} × (1 + Σ_{*i*=1}^{52} *b*_{−*i*} 2^{−*i*}) × 2^{*e*−1023}, with *e* = Σ_{*i*=0}^{10} *c*_{*i*} 2^{*i*}.

Therefore, a bit flip in the significand can lead to a change of the absolute value of a floating point number by no more than 50%.
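The double-precision bit layout described in this appendix can be checked with a short Python sketch (ours, for illustration; it covers normal numbers only, not subnormals or special values):

```python
import struct

def fields(x):
    """Split a double into its sign bit s, the exponent value
    e = sum_i c_i 2**i, and the 52 significand bits b_{-1}..b_{-52}
    packed into one integer (normal numbers only)."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    s = bits >> 63
    e = (bits >> 52) & 0x7FF
    b = bits & ((1 << 52) - 1)
    return s, e, b

def value(s, e, b):
    """Recompute x = (-1)**s * (1 + sum_i b_{-i} 2**-i) * 2**(e - 1023)."""
    return (-1) ** s * (1 + b / 2 ** 52) * 2.0 ** (e - 1023)

# The round trip is exact for normal numbers:
print(value(*fields(-0.75)))   # -> -0.75
```

Flipping the leading significand bit *b*_{−1} of 1.0 yields 1.5, which realizes the worst-case 50% change stated above.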

## APPENDIX B

### Changes in OpenIFS to Run in Single Precision

The following list offers an overview of steps that are necessary to allow the use of single precision in most parts of the model integration:

- The variable that adjusts precision in the declaration of standard floating point numbers, JPRB, is changed to the single-precision value.

- A parser is used to replace double-precision declarations for floating point numbers with single-precision declarations, while keeping the declarations for the instrumentation tool "Dr. Hook" in double precision. Changing Dr. Hook to single precision would be very difficult due to its frequent interactions with C code.

- The timing scheme that measures the computing time spent in the different parts of the model needs to be changed back to double precision, due to frequent interactions with the system clock.

- Compiler flags need to be changed to set the default real type to single precision.

- On several occasions, floating point numbers overshoot the range of the exponent of single-precision floating point numbers. Most often, these problems can be solved by a simple reordering of the code, for example by changing *x*^{3}/*y*^{3} into (*x*/*y*)^{3} if *x* and *y* are very different in magnitude. Sometimes, local declarations need to be changed back to double precision.

- Routines that organize the communication of the Message Passing Interface (MPI) need to be adjusted to communicate single-precision floating point numbers.

- Calls to subroutines from the LAPACK library need to be changed from double to single precision [e.g., Double-Precision General Matrix Multiplication (DGEMM) is replaced by Single-Precision General Matrix Multiplication (SGEMM)].

- Model initialization needs to be changed to allow the initialization of single-precision fields from standard double-precision input files.

- Problems can appear when a division by a very small floating point number is performed. In single precision, there is a risk that the small number becomes indistinguishable from zero and the division turns into a division by zero. To avoid this, a tiny number needs to be added to the denominator in several files; to generate the "tiny number," we use the "epsilon( )" intrinsic Fortran function.

- Large parts of the local variables of the subroutine srtm_reftra, the modules srfwexc_vg_mod and suleg_mod, and the modules and subroutines called by suleg_mod had to be changed back to double precision to allow proper simulations. These routines compute the reflectivity and transmissivity of a clear and a cloudy layer, calculate the fluxes between the soil layers and the right-hand side of the soil water equations, and initialize the Legendre polynomials. A few single-precision values are set back to double precision locally in a couple of other modules. Most of these restored double-precision declarations could be avoided by local changes to the code; however, we wanted to keep the changes simple in this first approach to single-precision simulations and did not optimize the code.
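Two of these fixes, the reordering of *x*^{3}/*y*^{3} and the epsilon guard for small denominators, can be reproduced with a minimal Python sketch that emulates single precision by round-tripping through a 4-byte float (the helper `f32` and the chosen magnitudes are ours; OpenIFS itself is Fortran):

```python
import struct

F32_EPS = 2.0 ** -23   # machine epsilon of a default (single-precision) real,
                       # as returned by the Fortran epsilon() intrinsic

def f32(v):
    """Round a Python double to the nearest single-precision value."""
    return struct.unpack('<f', struct.pack('<f', v))[0]

x, y = 1.0e20, 1.0e15

# x**3 = 1e60 overshoots the single-precision exponent range (max ~3.4e38),
# so the straightforward evaluation fails ...
try:
    ratio = f32(f32(x) ** 3) / f32(f32(y) ** 3)
except OverflowError:
    ratio = float('inf')

# ... while the reordered form (x/y)**3 = 1e15 stays comfortably in range.
ratio_reordered = f32(f32(x / y) ** 3)

# Guarding a division: 1e-60 underflows to zero in single precision, so a
# tiny number is added to the denominator before dividing.
den = f32(1.0e-60)
safe = 1.0 / (den + F32_EPS)
```

Both intermediate results overflow in the first form even though the final ratio is small, which is why the reordering, rather than a wider intermediate type, is usually the cheapest fix.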

## REFERENCES

Bergman, K., and Coauthors, 2008: Exascale computing study: Technology challenges in achieving exascale systems. AFRL Contract FA 8-650-07-C-7724, 278 pp. [Available online at http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.]

Blackburn, M., 1985: Program description for the multi-level global spectral model. Atmospheric Modelling Group, Department of Meteorology, University of Reading, 36 pp. [Available online at http://www.met.reading.ac.uk/~mike/dyn_models/igcm/doc/rsgup3_doc1985.pdf.]

Carver, G., and F. Vana, cited 2014: OpenIFS home. ECMWF. [Available online at https://software.ecmwf.int/wiki/display/OIFS/OpenIFS+Home.]

Carver, G., and Coauthors, 2013: The ECMWF OpenIFS model. European Geosciences Union General Assembly 2013, Vienna, Austria, EGU. [Available online at http://meetingorganizer.copernicus.org/EGU2013/EGU2013-4678.pdf.]

Detrey, J., and F. de Dinechin, 2007: Parameterized floating point logarithm and exponential functions for FPGAs. *Microprocess. Microsyst.*, **31**, 537–545, doi:10.1016/j.micpro.2006.02.008.

Düben, P. D., J. Joven, A. Lingamneni, H. McNamara, G. De Micheli, K. V. Palem, and T. N. Palmer, 2014a: On the use of inexact, pruned hardware in atmospheric modelling. *Philos. Trans. Roy. Soc. London*, **372A**, doi:10.1098/rsta.2013.0276.

Düben, P. D., H. McNamara, and T. Palmer, 2014b: The use of imprecise processing to improve accuracy in weather and climate prediction. *J. Comput. Phys.*, **271**, 2–18, doi:10.1016/j.jcp.2013.10.042.

Gan, L., H. Fu, W. Luk, C. Yang, W. Xue, X. Huang, Y. Zhang, and G. Yang, 2013: Accelerating solvers for global atmospheric equations through mixed-precision data flow engine. *Proc. 23rd Int. Conf. on Field Programmable Logic and Applications*, Porto, Portugal, IEEE, doi:10.1109/FPL.2013.6645508.

Held, I. M., and M. J. Suarez, 1994: A proposal for the intercomparison of the dynamical cores of atmospheric general circulation models. *Bull. Amer. Meteor. Soc.*, **75**, 1825–1830, doi:10.1175/1520-0477(1994)075<1825:APFTIO>2.0.CO;2.

Hogan, T. F., and T. E. Rosmond, 1991: The description of the Navy Operational Global Atmospheric Prediction System's spectral forecast model. *Mon. Wea. Rev.*, **119**, 1786–1815, doi:10.1175/1520-0493(1991)119<1786:TDOTNO>2.0.CO;2.

Hoskins, B. J., and A. J. Simmons, 1975: A multi-layer spectral model and the semi-implicit method. *Quart. J. Roy. Meteor. Soc.*, **101**, 637–655, doi:10.1002/qj.49710142918.

James, I. N., and J. P. Dodd, 1993: A simplified global circulation model. User's manual. Department of Meteorology, University of Reading, 42 pp. [Available online at http://www.met.reading.ac.uk/~mike/dyn_models/igcm/doc/sgcm_inj_doc.pdf.]

Kahng, A., S. Kang, R. Kumar, and J. Sartori, 2010: Slack redistribution for graceful degradation under voltage overscaling. *Proc. 15th Asia and South Pacific Design Automation Conf.*, Taipei, Taiwan, IEEE, 825–831, doi:10.1109/ASPDAC.2010.5419690.

Kong, C. T., 2008: Study of voltage and process variations impact on the path delays of arithmetic units. M.S. thesis, University of Illinois at Urbana–Champaign, 105 pp.

Lapillonne, X., and O. Fuhrer, 2013: Using compiler directives to port large scientific applications to GPUs: An example from atmospheric science. *Parallel Process. Lett.*, **24**, 1450003, doi:10.1142/S0129626414500030.

Lingamneni, A., C. Enz, J. L. Nagel, K. Palem, and C. Piguet, 2011: Energy parsimonious circuit design through probabilistic pruning. *Proc. Design, Automation and Test in Europe Conf. and Exhibition (DATE)*, Grenoble, France, IEEE, doi:10.1109/DATE.2011.5763130.

Michalakes, J., and M. Vachharajani, 2008: GPU acceleration of numerical weather prediction. *Proc. Int. Symp. on Parallel and Distributed Processing*, Miami, FL, IEEE, doi:10.1109/IPDPS.2008.4536351.

Narayanan, S., J. Sartori, R. Kumar, and D. L. Jones, 2010: Scalable stochastic processors. *Proc. Design, Automation and Test in Europe Conf. and Exhibition (DATE)*, Dresden, Germany, IEEE, 335–338, doi:10.1109/DATE.2010.5457181.

Oriato, D., S. Tilbury, M. Marrocu, and G. Pusceddu, 2012: Acceleration of a meteorological limited area model with dataflow engines. *Proc. Symp. on Application Accelerators in High Performance Computing*, Chicago, IL, IEEE, 129–132, doi:10.1109/SAAHPC.2012.8.

Palem, K. V., 2003: Energy aware algorithm design via probabilistic computing: From algorithms and models to Moore's law and novel (semiconductor) devices. *Proc. CASES: Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems*, San Jose, CA, IEEE, 113–116, doi:10.1145/951710.951712.

Palem, K. V., 2005: Energy aware computing through probabilistic switching: A study of limits. *IEEE Trans. Comput.*, **54**, 1123–1137, doi:10.1109/TC.2005.145.

Palem, K. V., 2014: Inexactness and a future of computing. *Philos. Trans. Roy. Soc. London*, **372A**, 2018, doi:10.1098/rsta.2013.0281.

Palem, K. V., and A. Lingamneni, 2013: Ten years of building broken chips: The physics and engineering of inexact computing. *ACM Trans. Embedded Comput. Syst.*, **12**, 87, doi:10.1145/2465787.2465789.

Palmer, T. N., 2012: Towards the probabilistic earth-system simulator: A vision for the future of climate and weather prediction. *Quart. J. Roy. Meteor. Soc.*, **138**, 841–861, doi:10.1002/qj.1923.

Rüdisühli, S., A. Walser, and O. Fuhrer, 2013: COSMO in single precision. *COSMO Newsletter*, No. 14, 70–87. [Available online at http://www.cosmo-model.org/content/model/documentation/newsLetters/newsLetter14/cnl14_09.pdf.]

Sartori, J., J. Sloan, and R. Kumar, 2011: Stochastic computing: Embracing errors in architecture and design of processors and applications. *Proc. CASES: Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems*, Taipei, Taiwan, IEEE, 135–144.

Shapiro, M., and Coauthors, 2010: An earth-system prediction initiative for the twenty-first century. *Bull. Amer. Meteor. Soc.*, **91**, 1377–1388, doi:10.1175/2010BAMS2944.1.

Shukla, J., T. Palmer, R. Hagedorn, B. Hoskins, J. Kinter, J. Marotzke, M. Miller, and J. Slingo, 2010: Toward a new generation of world climate research and computing facilities. *Bull. Amer. Meteor. Soc.*, **91**, 1407–1412, doi:10.1175/2010BAMS2900.1.

Simmons, A. J., and B. J. Hoskins, 1975: A comparison of spectral and finite-difference simulations of a growing baroclinic wave. *Quart. J. Roy. Meteor. Soc.*, **101**, 551–565, doi:10.1002/qj.49710142912.

Simmons, A. J., and D. M. Burridge, 1981: An energy and angular-momentum conserving vertical finite-difference scheme and hybrid vertical coordinates. *Mon. Wea. Rev.*, **109**, 758–766, doi:10.1175/1520-0493(1981)109<0758:AEAAMC>2.0.CO;2.

Sloan, J., D. Kesler, R. Kumar, and A. Rahimi, 2010: A numerical optimization-based methodology for application robustification: Transforming applications for error tolerance. *Proc. Int. Conf. on Dependable Systems and Networks*, Chicago, IL, IEEE, 161–170, doi:10.1109/DSN.2010.5544923.

Wedi, N. P., 2014: Increasing horizontal resolution in numerical weather prediction and climate simulations: Illusion or panacea? *Philos. Trans. Roy. Soc. London*, **372A**, 2018, doi:10.1098/rsta.2013.0289.

^{1} Area, power, and delay are important quantities for chip development. Higher power leads to a higher energy demand, a larger area causes longer distances for communication, and a longer delay means decreased performance.

^{2} Resolutions of atmosphere models based on spectral discretization methods are identified by a capital T followed by the wavenumber at which the spectral series of the spherical harmonics is truncated. The T represents triangular truncation (see Hoskins and Simmons 1975).

^{3} It might be possible to realize inexact floating point units that are able to work on different levels of inexactness and power savings, for example when using pruning techniques (K. Palem 2014, personal communication).

^{4} Intel Core i7-3770 CPU at 3.40 GHz × 8 with 15.6-GB memory.

^{5} Intel Xeon CPU E5630 at 2.53 GHz with 50-GB memory.