Speeding Up the Computation of WRF Double-Moment 6-Class Microphysics Scheme with GPU

  • 1 Space Science and Engineering Center, University of Wisconsin–Madison, Madison, Wisconsin
  • 2 NOAA/NESDIS/Center for Satellite Applications and Research, College Park, Maryland

Abstract

The Weather Research and Forecasting model (WRF) double-moment 6-class microphysics scheme (WDM6) implements a double-moment bulk microphysical parameterization of clouds and precipitation and is applicable in mesoscale and general circulation models. WDM6 extends the WRF single-moment 6-class microphysics scheme (WSM6) by incorporating the number concentrations for cloud and rainwater along with a prognostic variable of cloud condensation nuclei (CCN) number concentration. Moreover, it predicts the mixing ratios of six water species (water vapor, cloud droplets, cloud ice, snow, rain, and graupel), similar to WSM6. This paper describes improving the computational performance of WDM6 by exploiting its inherent fine-grained parallelism using the NVIDIA graphics processing unit (GPU). Compared to the single-threaded CPU, a single GPU implementation of WDM6 obtains a speedup of 150× with the input/output (I/O) transfer and 206× without the I/O transfer. Using four GPUs, the speedup reaches 347× and 715×, respectively.

Corresponding author address: Jarno Mielikainen, Space Science and Engineering Center, University of Wisconsin–Madison, 1225 W. Dayton St., Madison, WI 53706. E-mail: mielikai@gmail.com


1. Introduction

The Weather Research and Forecasting model (WRF; Skamarock et al. 2005) is a community numerical weather prediction (NWP) model that has been designed for operational forecasting and atmospheric research. The WRF finds application in a variety of areas, including air quality modeling, tropical storm prediction, wildfire simulation, regional climate prediction, and storm-scale research. It is applicable to weather phenomena at scales ranging from meters to several thousand kilometers. The WRF offers multiple schemes for every physical component. Specifically, for water and ice particles that have formed in the atmosphere (hydrometeors), such as clouds and precipitation, the WRF single-moment microphysics schemes (WSMs), such as the WSM 3-class (WSM3), WSM5, and WSM6 schemes, are widely used. However, such single-moment schemes predict only the mixing ratios of the hydrometeors, representing the hydrometeor sizes with a prescribed distribution function for each class. A double-moment scheme such as WDM6 predicts the mixing ratios of the hydrometeors along with their number concentrations (Cohard and Pinty 2000).

WRF exhibits fine-grained parallelism that cannot be efficiently exploited by CPUs owing to their limited memory bandwidth. Hence, in order to attain performance scaling, graphics processing units (GPUs) are used to harness the data parallelism in WRF modules. GPUs provide the large memory bandwidth needed to achieve high floating-point compute rates. Furthermore, their highly parallel structure makes them more effective than general-purpose CPUs for algorithms in which large blocks of data have to be processed in parallel. Therefore, GPUs have emerged as a low-cost, low-power (watts per flop), high-memory-bandwidth, and high-performance alternative to conventional microprocessors (Hwu 2011). Current GPUs offer high-end performance with more than 100 billion floating-point operations per second and as much as 100 GB s−1 of streaming memory bandwidth.

CPUs are designed to provide high performance for a single task. However, this high performance comes at the cost of increased die area and power consumption, so the power and thermal envelopes allow packing only a few processing cores on the same CPU die. GPUs, in contrast, are designed to trade off single-thread performance for increased parallel processing throughput. This approach works well when ample data parallelism is available (Lee et al. 2010). In this paper, we study the suitability of GPU computation for a WRF microphysics module.

GPU-based processing of atmospheric and remote sensing data has become increasingly popular owing to the computational capabilities of GPUs. Several efforts related to GPU-based acceleration of computationally intensive algorithms can be found in the literature. Horn (2012) presented a three-dimensional compressible moist atmospheric model whose calculations are performed on GPUs (ASAMgpu). Hanappe et al. (2011) optimized the atmospheric radiation algorithm of the Fast Met Office/UK Universities Simulator (FAMOUS) climate model on several hardware platforms, including a GPU, using the Open Computing Language (OpenCL). A GPU-accelerated Monte Carlo simulation of the 2D and 3D Ising model was proposed by Preis et al. (2009) for describing ferromagnetism. Setoain et al. (2007) described a GPU-based implementation of an automated morphological algorithm for extracting spectrally pure signatures (end members) from remotely sensed hyperspectral data. A parallel scheme using GPUs to accelerate the radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI) on board the first European meteorological polar-orbiting satellite was proposed by Mielikainen et al. (2011) and Huang et al. (2011). Plaza et al. (2011) reported a speedup of 15× for the hyperspectral pixel purity index end-member extraction algorithm.

In this paper, we implement the WRF double-moment 6-class microphysics scheme (WDM6) on NVIDIA GPUs using the Compute Unified Device Architecture (CUDA). CUDA is a parallel computing architecture developed by NVIDIA that provides a C-like abstraction for executing code on the GPU (Sanders and Kandrot 2011). It gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. A CUDA program comprises two parts: 1) a serial program running on the CPU and 2) a parallel program running on the GPUs. Our GPU implementation of WDM6 was tested on a low-cost personal supercomputer with two NVIDIA GTX 590 dual-GPU cards (2048 cores in total). We then compare the obtained results with a native CPU implementation of the same scheme to show the amount of speedup achieved by computing WDM6 in a highly data-parallel GPU-based system.

The rest of the paper is organized as follows. Section 2 describes WDM6 in detail. Section 3 explains the CUDA computing engine based on NVIDIA GPUs and gives the detailed implementation and results of the GPU-accelerated WDM6. Section 4 concludes the paper.

2. WDM6

WDM6 is an extension of WSM6 (Hong and Lim 2006) that predicts not only the mixing ratios of hydrometeors but also their number concentrations. It also adds a prognostic variable for the number concentration of cloud condensation nuclei (CCN). The prognostic water substance variables are water vapor, cloud water, rain, ice, snow, and graupel for both WDM6 and WSM6, as shown in Fig. 1. Table 1 provides a key to Fig. 1. The number concentrations of cloud water and rainwater are predicted by WDM6, whereas the number concentrations of the ice species (graupel, snow, and ice) are diagnosed following the ice-phase microphysics of Hong et al. (2004). This assumption is made because the prediction of ice-phase number concentrations has a lesser impact on the results than the prediction of warm-phase concentrations in deep convective cases (Lim and Hong 2010).

Fig. 1. Microphysics processes for the prediction of mixing ratios of different classes of hydrometeors in WDM6.

Table 1. Key to Fig. 1.

WDM6 is useful in applications such as the investigation of aerosol effects on clouds, where predictions of both the mass of cloud droplets and their number concentration are required. In WDM6, more flexible particle size distributions of the various hydrometeor types are used than in the single-moment microphysics schemes.

WDM6 can efficiently represent the major characteristics of a double-moment microphysics scheme in terms of radar reflectivity and rain number concentrations. Figure 2 illustrates the processes for the prediction of the number concentrations in WDM6, and Table 2 gives the key to Fig. 2.

Table 2. Key to Fig. 2.

Fig. 2. Microphysics processes for the prediction of the number concentrations in WDM6.

The conceptualization of warm rain processes such as autoconversion and accretion in WDM6 is based on the study of Cohard and Pinty (2000). For the remaining source and sink terms in the warm rain processes, the principles of WSM6 are used, as described in Hong and Lim (2006).

3. GPU implementation

For this research, we work on a low-cost NVIDIA Fermi personal supercomputer with an Intel Core i7 970 CPU (six cores at 2.20 GHz), equipped with two NVIDIA GeForce GTX 590 dual-GPU cards and running a 64-bit Linux operating system. The specifications of the GeForce GTX 590 are given in Table 3.

Table 3. Specifications of the NVIDIA GeForce GTX 590 GPU.

The original WRF WDM6 module was written in FORTRAN and compiled with GFortran using the -O3, -ftree-vectorize, -ftree-loop-linear, and -funroll-loops options. We first converted this FORTRAN code into a C version before writing the CUDA code. In the C version of the WDM6 scheme, it is easy to apply some early optimizations. For example, we can simplify the algorithm flow by reducing the number of loops in the code: several loops can be merged into a single loop, so that some temporary arrays can be replaced with scalar variables. Because of the data parallelism, each thread must have its own local copy of the temporary data, which can increase memory usage considerably. However, array scalarization after loop fusion reduces memory usage back to the level of the single-threaded CPU code. Furthermore, the remaining scalar variables can be kept in the faster register memory instead of the slower global memory that the larger arrays require. As shown in the code snippet in Fig. 3, loop fusion enables using scalars for a temporary value instead of a 3D array. On a GPU, this kind of memory access optimization is essential for good performance; on a CPU it was found to be counterproductive, as memory access is, relatively speaking, not as expensive on a CPU as on a GPU.

Fig. 3. Loop fusion of the for-loops into one for-loop allows scalarization of the temporary variable wi into two scalars: wi_curr and wi_next.
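A minimal sketch of the loop fusion and scalarization illustrated in Fig. 3, with hypothetical array names and a simplified loop body (the actual WDM6 loops differ):

    /* Hypothetical illustration of loop fusion plus scalarization in C.
       Before fusion, one would write two k-loops coupled by a temporary
       array wi[kte]:
           for (k = 0; k < kte; k++)      wi[k] = a[k] * b[k];
           for (k = 0; k < kte - 1; k++)  c[k]  = wi[k] + wi[k + 1];
       which requires a per-thread temporary array when run in parallel.  */
    void fused_loop_example(const float *a, const float *b, float *c, int kte)
    {
        if (kte < 2) return;
        float wi_curr = a[0] * b[0];              /* replaces wi[k]          */
        for (int k = 0; k < kte - 1; k++) {
            float wi_next = a[k + 1] * b[k + 1];  /* replaces wi[k + 1]      */
            c[k] = wi_curr + wi_next;
            wi_curr = wi_next;                    /* slide the scalar window */
        }
    }

The two scalars stay in registers, so the fused version needs no per-thread temporary array in global memory.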

We use the standard GNU Compiler Collection (GCC) to compile the C version. With standard compiler options, GCC and GFortran produce the same output for the C and FORTRAN codes. We then write the CUDA version of WDM6 based on the optimized C version. The speedup is measured by comparing the execution time of the CUDA version against that of the FORTRAN version.

To test WDM6, we used a continental United States (CONUS) benchmark test dataset for a 12-km resolution domain for 24 October 2001. A WRF domain is a geographical region of interest discretized into a two-dimensional grid parallel to the ground. Each grid point has multiple levels that correspond to various heights in the atmosphere. The size of the CONUS 12-km domain is 433 × 308 horizontal grid points with 35 vertical levels. As shown in Fig. 4, each domain point is mapped to a different thread on a GPU. On a GPU, the computational problem is divided into a grid of thread blocks. Each thread block consists of a number of threads that are executed in a multiprocessor (MP). This process is illustrated in Fig. 5.

Fig. 4. Parallelization of the computational domain. The domain is divided into tiles, each handled by a thread block with a 64 × 1 × 1 dimension. Each thread executes the WDM6 computations for its respective node.

Fig. 5. Automatic assignment of thread blocks to MPs on a GPU.

CUDA takes a bottom-up view of parallelism, in which a thread is the atomic unit of parallelism. Threads are organized into a three-level hierarchy. The highest level is a grid, which consists of thread blocks. Thread blocks implement coarse-grained, scalable data parallelism and are executed independently, which allows them to be scheduled in any order across any number of MPs. This allows CUDA code to scale with the number of processors. The total number of CUDA threads is the product of the grid size and the thread block size. In the example in Fig. 5, the grid size is N (i.e., there are N thread blocks) and the thread block size is M, so the total number of threads is M × N. Not all thread blocks necessarily run at the same time because of hardware resource limitations; once one thread block finishes on an MP, the next block is assigned to that MP.
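As an illustration of this thread organization, the following sketch shows how a host routine might size the grid for the 433 × 308 horizontal domain with 64 × 1 × 1 thread blocks. The mapping of one thread to one horizontal grid point with an inner loop over vertical levels, as well as the kernel name and arguments, are assumptions made for illustration only:

    #include <cuda_runtime.h>

    /* Hypothetical WDM6 kernel: one thread per horizontal grid point. */
    __global__ void wdm6_kernel(float *field, int nx, int ny, int nz)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* west-east index   */
        int j = blockIdx.y;                              /* south-north index */
        if (i >= nx || j >= ny) return;                  /* skip padded lanes */
        for (int k = 0; k < nz; k++) {
            /* Placeholder update; the real kernel advances mixing ratios,
               number concentrations, etc. for grid point (i, j, k).       */
            field[((size_t)k * ny + j) * nx + i] += 0.0f;
        }
    }

    void launch_wdm6(float *d_field, int nx, int ny, int nz)
    {
        dim3 block(64, 1, 1);                            /* 64 x 1 x 1 block   */
        dim3 grid((nx + block.x - 1) / block.x, ny, 1);  /* one tile per block */
        wdm6_kernel<<<grid, block>>>(d_field, nx, ny, nz);
    }

For the CONUS domain (nx = 433, ny = 308, nz = 35), this launches ceil(433/64) × 308 = 7 × 308 = 2156 thread blocks, which the hardware distributes over the available MPs as in Fig. 5.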

a. Coalesced memory access

To gain better performance in kernel execution, global memory accesses should be coalesced. Coalescing aligns memory accesses in global memory to achieve a high memory data rate. We applied data padding in global memory using a CUDA library function, as illustrated in Fig. 6. The padding increases the width of the arrays from 433 to 448 elements; the padded width is determined by the CUDA library function. We refer readers to Sanders and Kandrot (2011) for the details of coalesced global memory access.

Fig. 6. Padded data and threads are applied to satisfy coalesced global memory access.
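The text does not name the CUDA routine used for the padding; one common way to obtain such padded (pitched) allocations is cudaMallocPitch, sketched below with a hypothetical field name. The pitch returned by the runtime is device dependent; the padded width of 448 elements quoted above is the value reported for this work.

    #include <cuda_runtime.h>

    /* Allocate one 2D horizontal slice (433 x 308) with padded rows so that
       consecutive threads in a warp read consecutive, aligned addresses.    */
    float *alloc_padded_slice(size_t *pitch_bytes)
    {
        float *d_slice = NULL;
        cudaMallocPitch((void **)&d_slice, pitch_bytes,
                        433 * sizeof(float),   /* requested row width (bytes) */
                        308);                  /* number of rows              */
        /* Row j then starts at (float *)((char *)d_slice + j * *pitch_bytes). */
        return d_slice;
    }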

Table 4 shows the performance without coalesced memory access, and Table 5 shows the increased performance with coalesced memory access: using coalesced memory access, the speedup increases from 147× to 165×. Unfortunately, the data transfer time between the CPU and GPU is longer when coalesced memory is used, as can be seen in the results in Table 6. This is an issue with the NVIDIA GPU system that is expected to be resolved in later revisions of NVIDIA software and hardware.

Table 4. GPU performance without coalesced memory access.

Table 5. GPU performance with coalesced memory access.

Table 6. I/O data transfer time from CPU to GPU and GPU to CPU for both noncoalesced and coalesced arrays.

b. Asynchronous copy

For the GPU implementation with input/output (I/O) transfer, we can manage asynchronous data transfers through streams. A stream is a sequence of commands that executes in order. Asynchronous data transfers can be overlapped with kernel execution, which may increase performance, since the data transfer time is hidden behind the kernel execution time.

We divide the computational domain into several groups and create two streams, where each stream handles its groups in sequence, as illustrated in Fig. 7. Figure 8 illustrates the execution timeline of WDM6. Once stream 0 finishes transferring a group of data to the GPU, it executes the CUDA kernel over its group region. At the same time, stream 1 is transferring another group of data to the GPU; at this point, the data transfer time is hidden behind the kernel execution of stream 0. After stream 0 completes its kernel execution, stream 1 starts executing the CUDA kernel over its group region while stream 0 transfers its output data back to the CPU; at this point, the transfer time is hidden behind the kernel execution of stream 1. This process repeats until the whole domain has been processed.

Fig. 7. The computational domain is divided into several groups and handled by two streams.

Fig. 8. Execution timeline of WDM6, where data transfer overlaps the computation for increased speed.
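A sketch of the two-stream pattern described above, with hypothetical group sizes, kernel name, and arguments (the host buffers are assumed to be page-locked, e.g., allocated with cudaMallocHost, so that the asynchronous copies can actually overlap kernel execution):

    #include <cuda_runtime.h>

    __global__ void wdm6_group_kernel(float *in, float *out, int group)
    {
        /* Placeholder for the WDM6 kernel restricted to one group's region. */
    }

    void run_wdm6_in_groups(const float *h_in, float *h_out,
                            float *d_in, float *d_out,
                            int n_groups, size_t group_bytes, int group_points)
    {
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (int g = 0; g < n_groups; g++) {
            cudaStream_t st = s[g % 2];              /* alternate the two streams */
            size_t off = (size_t)g * group_bytes;
            cudaMemcpyAsync((char *)d_in + off, (const char *)h_in + off,
                            group_bytes, cudaMemcpyHostToDevice, st);
            wdm6_group_kernel<<<(group_points + 63) / 64, 64, 0, st>>>(d_in, d_out, g);
            cudaMemcpyAsync((char *)h_out + off, (const char *)d_out + off,
                            group_bytes, cudaMemcpyDeviceToHost, st);
        }
        cudaStreamSynchronize(s[0]);                 /* wait for both pipelines   */
        cudaStreamSynchronize(s[1]);
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
    }

Within one stream, the copy in, kernel, and copy out for a group execute in order, while the other stream's transfers proceed concurrently, reproducing the overlap shown in Fig. 8.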

Table 7 shows the updated performance of our GPU implementation with asynchronous memory copies. The CUDA kernel executes faster when coalesced memory access is used, but the memory transfers are slower for coalesced arrays. Thus, the combination of noncoalesced memory access and asynchronous copy gives the best performance. The scenarios that give the best performance for the GPU implementations with and without the I/O transfer are summarized in Table 8.

Table 7. GPU performance with asynchronous memory copy for noncoalesced and coalesced memory access.

Table 8. The best GPU performance for a single GPU.

c. Cache optimization

As illustrated in Fig. 9, the NVIDIA GPU architecture has two groups of memories: on-chip memory and dynamic random access memory (DRAM). The on-chip memory consists of the registers, the level 1 (L1) cache, the shared memory, and the level 2 (L2) cache. The DRAM holds the global, texture, and constant memory. In this section, we explain the improvements obtained by optimizing the use of the caches.

Fig. 9. Memory hierarchy in a GPU.

The L1 cache is used to cache both local and global memory. In the on-chip memory, the L1 cache and the shared memory use the same 64-kB hardware resource, which is split into a 48-kB and a 16-kB partition. By default there is no preference, but we are able to set the preference for the allocation between L1 and shared memory. Since we do not use shared memory in our work, we set the preference to L1, which makes the L1 cache larger (48 kB) than the shared memory (16 kB). The selection is done by passing the cudaFuncCachePreferL1 argument to the cudaDeviceSetCacheConfig() function.
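For example, a minimal sketch of this selection, applied once on the host before the WDM6 kernels are launched:

    #include <cuda_runtime.h>

    void prefer_larger_l1(void)
    {
        /* Choose the 48-kB L1 / 16-kB shared memory split, since the WDM6
           kernels here do not use shared memory.                           */
        cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
    }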

On a 64-bit machine, the CUDA compiler driver, nvcc, by default compiles the CUDA code for a 64-bit target architecture. This is useful for gaining access to more than 4 GB of system memory, but it carries a performance penalty on the GPU because the pointers occupy extra register space. Since we are not working with a large system memory, we set nvcc to compile the CUDA code for a 32-bit target architecture. This is done by setting -m32 as a compiler option.

Global memory is cached in both the L1 and L2 caches. At the same time, the faster L1 cache also caches local memory. Most of the global memory accesses going through the L1 cache are not served by it, even when it is configured to the larger 48-kB size, because the part of memory the code uses (its working dataset) is too large to fit in the L1 cache. Therefore, it is advantageous to disable the L1 cache: the unnecessary lookups in the L1 cache are skipped, which increases the speed of the algorithm. This can be done by setting -Xptxas dlcm=cg as a compiler option. With the L1 cache disabled, only the slightly slower but larger L2 cache is used to cache global memory accesses. In summary, the larger L1 cache turned out to be better than the smaller one, but disabling the L1 cache altogether is the best option for WDM6. The improved performance with the L1 cache disabled and 32-bit code generation enabled can be seen in Table 9. The slight improvement from using 32-bit instead of 64-bit code generation is explained by the fact that 32-bit memory pointers are only half the size of their 64-bit counterparts.

Table 9. Final GPU performance both with and without I/O.

d. Multi-GPU implementation

We also extended our CUDA version of WDM6 to multiple GPUs. We have two NVIDIA GTX 590 dual-GPU cards, which means we have four GPUs with a total of 2048 cores. For the multi-GPU implementation, we split the computational domain into four larger groups, as illustrated in Fig. 10. For the asynchronous scenario, each larger group contains smaller groups, which are handled by streams. Table 10 shows the performance obtained with the multi-GPU implementation: without the I/O transfer the speedup reaches 715×, and with the I/O transfer we obtain a speedup of 347×.

Fig. 10. Illustration of the multi-GPU implementation.
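A sketch of how the host might drive the four GPUs, one quarter of the domain each; the per-device work function and the row-based decomposition are assumptions for illustration, since the text only specifies that the domain is split into four larger groups:

    #include <cuda_runtime.h>

    /* Placeholder: allocate, transfer, and run WDM6 for rows [j0, j1) of the
       433 x 308 horizontal grid on the currently selected device.           */
    static void run_wdm6_subdomain(int j0, int j1)
    {
        (void)j0; (void)j1;
    }

    void run_wdm6_multi_gpu(void)
    {
        int n_gpus = 0, ny = 308;
        cudaGetDeviceCount(&n_gpus);            /* four GPUs on two GTX 590 cards */
        for (int d = 0; d < n_gpus; d++) {
            cudaSetDevice(d);                   /* subsequent calls go to GPU d   */
            run_wdm6_subdomain(d * ny / n_gpus, (d + 1) * ny / n_gpus);
        }
        for (int d = 0; d < n_gpus; d++) {      /* wait for every device          */
            cudaSetDevice(d);
            cudaDeviceSynchronize();
        }
    }

For the asynchronous scenario, run_wdm6_subdomain would itself split its quarter of the domain into the smaller groups handled by streams, as in section 3b.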

Table 10. Multi-GPU performance both with and without I/O.

4. Conclusions

In this paper, we accelerated the WRF double-moment 6-class microphysics scheme (WDM6) in a highly parallel environment using many-core NVIDIA GPUs. In our experiments, two groups of two GPUs were used, with each GPU group sharing a peripheral component interconnect (PCI) express bus. An overall speedup of 150× including the CPU-to-GPU I/O was achieved when noncoalesced memory access and asynchronous data transfer were used; the multi-GPU implementation obtains a speedup of 347× with GPU I/O. However, WDM6 is only an intermediate module of the entire WRF, so in practice this GPU I/O should not occur: the input data would already be available in GPU global memory from previous modules, and the output data would remain in GPU global memory for later use by other modules. Without the data transfer time, the speedup is 206× when coalesced memory access is used, and the multi-GPU implementation reaches 715×. It is clear that a large fraction of the performance is lost to data transfer overhead. Much of this should be amortizable as more weather model kernels are adapted to run on the GPU and reuse the model state data without moving it back and forth from the CPU. The GPU-based implementation of WDM6 thus provides a low-cost and effective solution for accelerating microphysics modules in WRF. Future work will involve rewriting the other WRF modules for GPU execution, including several physics modules such as radiation and the planetary boundary layer. Furthermore, the most time-consuming WRF component, the dynamics, will also be converted to run on a GPU. After that, the individual WRF modules will be combined so that the whole WRF model can be run on a GPU for increased operational performance.

Acknowledgments

This work is supported by the National Oceanic and Atmospheric Administration (NOAA) under Grant NA10NES4400013.

REFERENCES

  • Cohard, J.-M., and J.-P. Pinty, 2000: A comprehensive two-moment warm microphysical bulk scheme. I: Description and tests. Quart. J. Roy. Meteor. Soc., 126, 1815–1842.
  • Hanappe, P., and Coauthors, 2011: FAMOUS, faster: Using parallel computing techniques to accelerate the FAMOUS/HadCM3 climate model with a focus on the radiative transfer algorithm. Geosci. Model Dev., 4, 835–844.
  • Hong, S.-Y., and J. O. J. Lim, 2006: The WRF single-moment 6-class microphysics scheme (WSM6). J. Korean Meteor. Soc., 42, 129–151.
  • Hong, S.-Y., J. Dudhia, and S.-H. Chen, 2004: A revised approach to ice-microphysical processes for the bulk parameterization of cloud and precipitation. Mon. Wea. Rev., 132, 103–120.
  • Horn, S., 2012: ASAMgpu V1.0—A moist fully compressible atmospheric model using graphics processing units (GPUs). Geosci. Model Dev., 5, 345–353.
  • Huang, B., J. Mielikainen, H. Oh, and H.-L. Huang, 2011: Development of a GPU-based high-performance radiative transfer model for the infrared atmospheric sounding interferometer (IASI). J. Comput. Phys., 230, 2207–2221.
  • Hwu, W.-M. W., Ed., 2011: GPU Computing Gems. Applications of GPU Computing Series, Vol. 1, Emerald ed., Morgan Kaufmann, 886 pp.
  • Lee, V., and Coauthors, 2010: Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. Proc. 37th Annual Int. Symp. on Computer Architecture, Saint-Malo, France, Association for Computing Machinery, 451–460, doi:10.1145/1815961.1816021.
  • Lim, K.-S. S., and S.-Y. Hong, 2010: Development of an effective double-moment cloud microphysics scheme with prognostic cloud condensation nuclei (CCN) for weather and climate models. Mon. Wea. Rev., 138, 1587–1612.
  • Mielikainen, J., B. Huang, and H.-L. A. Huang, 2011: GPU-accelerated multi-profile radiative transfer model for the infrared atmospheric sounding interferometer. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 4, 691–700.
  • Plaza, A., J. Plaza, and H. Vegas, 2011: Improving the performance of hyperspectral image and signal processing algorithms using parallel, distributed and specialized hardware-based systems. J. Signal Process. Syst., 61, 293–315.
  • Preis, T., P. Virnau, W. Paul, and J. Schneider, 2009: GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model. J. Comput. Phys., 228, 4468–4477.
  • Sanders, J., and E. Kandrot, 2011: CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley, 312 pp.
  • Setoain, J., M. Prieto, C. Tenllado, A. Plaza, and F. Tirado, 2007: Parallel morphological endmember extraction using commodity graphics hardware. IEEE Geosci. Remote Sens. Lett., 4, 441–445.
  • Skamarock, W. C., J. B. Klemp, J. Dudhia, D. O. Gill, D. M. Barker, W. Wang, and J. G. Powers, 2005: A description of the Advanced Research WRF version 2. NCAR Tech. Note NCAR/TN-468+STR, 88 pp.
