
A Multiprocessor Ocean General Circulation Model Using Message Passing

David J. Webb, Andrew C. Coward, Beverly A. de Cuevas, and Catherine S. Gwilliam

James Rennell Division, Southampton Oceanography Centre, Southampton, United Kingdom

Abstract

Numerical models of the ocean are widely used to study the physics, chemistry, and biology of the ocean and its effect on climate. Unfortunately, limits in available computer power often restrict the length of model runs and the amount of detail that can be included in the models. For this reason there is interest in developing a code that can be used either with clusters of workstations or with the new generation of array-processor computers.

This paper reports on a new ocean general circulation model code that can be used on either a cluster of workstations or an array-processor computer. The model consists of one master control process and a number of slave oceanic processes, each of the latter being responsible for one subregion of the full ocean. The shapes of the subregions are variable, allowing the computation and message passing load to be shared equally among processes.

The code has also been designed so that message passing between processes is asynchronous. This allows the message passing and computation to be overlapped and helps to prevent the development of bottlenecks. Finally, the code includes fully functioning archive, restart, snapshot, meteorological field updating, and progress reporting facilities.

The model code has been tested on a cluster of Unix workstations and on a Cray T3D. On the workstation cluster, message passing delays affect performance, but on the Cray T3D a very high level of parallelism is achieved.

Corresponding author address: Dr. David J. Webb, James Rennell Division, Southampton Oceanography Centre, Empress Dock, Southampton SO14 3ZH, United Kingdom.


1. Introduction

The Bryan–Cox–Semtner model is one of the most widely used general circulation models of the ocean. The model code was first developed in the late 1960s (Bryan 1969) and since then has undergone a series of changes designed to make better use of vector processing computers and to incorporate better ocean physics (Semtner 1974; Cox 1984). Versions of the model have been used for studies of the large-scale circulation of the ocean (Cox 1975; Holland and Bryan 1989), for climate studies (Manabe et al. 1991; Mitchell and Murphy 1995; Washington and Meehl 1989), and to study the biological productivity of the ocean (Sarmiento et al. 1993). The code is now most widely available as the GFDL Modular Ocean Model (MOM; Pacanowski et al. 1990; Pacanowski 1995).

Although the code itself is highly developed, oceanographers’ ability to make use of it has always been restricted by the limits on available computer power. The problem arises in part because the scale of key ocean features is very small (the Gulf Stream is only 50 km wide) while the oceans themselves are large. Thus, many grid points are needed to represent the ocean. In addition, the time step used by the model is usually much less than an hour, and runs of many model years are needed to study the gradual development of the ocean circulation. Thus, model runs use a very large number of time steps.

In recent years, the introduction of fast vector processing computers has enabled high-resolution models of the ocean to be developed. These have led to a significant improvement in our ability to represent and understand the ocean circulation. However, for climate research even more power is required, and for this purpose there has been a lot of interest in the use of the new generation of array-processor computers. These have the promise of very high power at a reasonable cost.

A code that worked with such a system might also be used with the arrays of UNIX workstations that are available to many oceanographic groups. This would allow such groups to increase the amount of detail in their regional ocean models.

The present paper reports on a new development¹ of the Bryan–Cox–Semtner model code that can be run on array processors and groups of workstations. The new code is based on the moma code (Webb 1993, 1996), which itself was a development of the GFDL MOM code.

Moma is designed as a generic array-processor code. It is structured so that it can be readily converted for use with either explicit message passing calls, which transfer data between processors, or with a high-level compiler, which inserts the calls automatically. The array structure for the tracer and velocity fields has also been modified so that the most rapidly varying index corresponds to the vertical direction. This simplifies the horizontal partitioning of the ocean, and it has advantages for cache-based processors used in workstations and the Cray T3D. However, it may be less efficient than the standard MOM2 code in systems that require long vector lengths.

The new code, moma.pvm, is a message passing implementation of moma that uses the PVM message passing library (Geist et al. 1993). The advantage of using explicit message passing is that our knowledge of the model structure can be used to make the model very efficient. For example, it is possible with the new code to balance the computational and message passing load between processors, to overlap computation and message passing, and, by running the model asynchronously, to reduce the peak loadings on the message passing system. Eventually it may be possible to build the intelligence required to do this into compilers, but for the moment such compilers are not available.

The model consists of a master control process, which handles control functions and input/output, and a number of slave oceanic processes, each of which is responsible for calculations within a limited region of ocean. The shape of each slave’s region of responsibility is variable and may be irregular. This allows the computation and message passing load to be shared equally among processes.

Within each slave’s region, the model distinguishes between the core region, where calculations can proceed independently of any neighbors, and the inner and outer halo regions, for which message passing is required. Calculations continue while the message passing is under way, and, since the message passing is asynchronous, the timing of the individual processes is not held rigid. This helps to reduce bottlenecks in the message passing system.

The model, through the control process, includes fully functioning archive, restart, snapshot, meteorological field updating, and progress reporting facilities.

The program is designed to run with each process running on a separate processor and with the control process running on a processor with good communications and storage. With an array processor, the latter may be a front-end machine. However, this arrangement is not essential and the control process, which has only a light computational load, may share a processor with another process.

Section 2 of the paper is a brief introduction to the overall structure of the model. Section 3 is concerned with how the ocean is partitioned between processes and how each oceanic process organizes its work to overlap computations and message passing. Section 4 is concerned with the message passing logic that enables the processes to operate asynchronously, and section 5 discusses the supervisory and input/output functions of the control process. Tests of the model performance are presented in section 6.³

2. The overall structure of the model

The Bryan–Cox–Semtner model starts with a set of equations describing the evolution of the ocean. The changes in the horizontal momentum and the temperature and salinity fields are given by
$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla_h)\mathbf{u} + w\frac{\partial \mathbf{u}}{\partial z} + f\,\hat{\mathbf{z}}\times\mathbf{u} = -\frac{1}{\rho_0}\nabla_h p + D_u + F_u,$$
$$\frac{\partial T}{\partial t} + (\mathbf{u}\cdot\nabla_h)T + w\frac{\partial T}{\partial z} = D_T + F_T, \qquad \frac{\partial S}{\partial t} + (\mathbf{u}\cdot\nabla_h)S + w\frac{\partial S}{\partial z} = D_S + F_S.$$
The model also uses equations describing the density, pressure gradient, and incompressibility of the model ocean:
$$\rho = \rho(T, S, p), \qquad \frac{\partial p}{\partial z} = -\rho g, \qquad \nabla_h\cdot\mathbf{u} + \frac{\partial w}{\partial z} = 0.$$

The main prognostic variables are u the horizontal velocity, T the potential temperature, and S the salinity. The other variables, p the pressure, ρ the density, and w the vertical velocity, can be calculated from the prognostic variables.

In these equations, t is time and f is the Coriolis term [equal to 2Ω sin(θ), where Ω is the earth’s rotation rate and θ is the latitude]. The term D represents diffusion and F the forcing; ρ₀ is a constant reference density and g is the acceleration due to gravity.

The horizontal velocity u is zero on solid sidewall boundaries. The gradients of potential temperature and salinity normal to all solid boundaries (including the bottom) are also zero. The upper surface boundary conditions are used to specify the exchange of heat and fresh water across the air–sea interface and the stress acting on the ocean due to the wind.

The model specifies the values of temperature and salinity at the nodes of a rectangular grid laid out along lines of constant latitude, longitude, and depth. The velocity points are arranged in a similar way but with an offset given by the Arakawa B scheme (Mesinger and Arakawa 1976). The B grid is chosen because of its superiority in propagating oceanic Rossby waves.

The equations are then cast into finite-difference form and the finite-difference equations used to step the model temperature, salinity, and velocity fields forward in time. The simplest way to do this is to use an explicit time-stepping scheme in which all the equations use the same time step. Unfortunately, because of the very fast speed of tidal waves in the ocean, the time step has to be very short and so the model is computationally expensive to run.

A better scheme is to split off the tidal wave part of the solution, the barotropic mode, from the rest of the solution, the baroclinic mode. This can be done in a number of ways, but in the present version of the model it is done by solving the free surface equation explicitly using a short time step. The baroclinic equations are solved using a long time step, at the end of which the two modes are combined.

The organization of the program is shown schematically in Fig. 1. After initialization, the model enters the main baroclinic time-stepping loop. Within this outer loop it first enters a loop over the model’s latitude and longitude indices, where it updates the temperature, salinity, and baroclinic velocity variables for each vertical column of grid points.

The code then enters the barotropic time-step loop, within which there is another loop over the latitude and longitude indices, this time updating at each horizontal grid point the two free surface velocity variables and the single free surface height variable.

After completing a series of barotropic time steps, the model enters a final loop over the horizontal indices in which the two horizontal velocities are combined. It is then ready to start another baroclinic time step.
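
This control flow can be pictured with a short skeleton. It is only a sketch: the callables and the list of ocean columns are hypothetical stand-ins for the model's actual routines, not code taken from the model.

```python
# Minimal sketch of the split time-stepping structure described above.
# The callables and ocean_points are hypothetical placeholders.

def run_baroclinic_step(update_column, update_free_surface, combine_modes,
                        ocean_points, n_barotropic):
    # 1. Update temperature, salinity, and baroclinic velocity, column by column.
    for (i, j) in ocean_points:
        update_column(i, j)

    # 2. Many short barotropic (free surface) steps within one baroclinic step.
    for _ in range(n_barotropic):
        for (i, j) in ocean_points:
            update_free_surface(i, j)

    # 3. Combine the barotropic and baroclinic velocity contributions.
    for (i, j) in ocean_points:
        combine_modes(i, j)
```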

3. Partitioning of the ocean

To get the best performance out of an array processor, it is necessary to organize the calculation so that each processor is fully utilized. In particular, the amount of time spent waiting for data to be transferred from neighboring processors needs to be kept to a minimum. In finite-difference models, calculations at one grid point involve information only from the neighboring grid points, so one way that message passing can be kept to a minimum is to give each processor responsibility for a compact region of ocean. This is because, for a given total number of grid points per processor, the number of boundary points needing information exchange is then minimized.

The present model only allows horizontal partitioning of the ocean. The number of grid points in the vertical is relatively small so this arrangement helps reduce the number of interprocess data transfers. It also simplifies the model code because each process is responsible for the same region of ocean when updating both the two-dimensional free surface model fields and the three-dimensional temperature, salinity, and baroclinic velocity fields.

Optimal schemes for partitioning the ocean are still under development, so instead of using any particular scheme, the present model requires the user to construct beforehand an array describing which process is responsible for each horizontal grid point. If a box ocean were being studied, then a simple rectangular tiling would be effective both at balancing the load on each process and at minimizing the number of boundary points.
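
For the box-ocean case just mentioned, such a process map could be generated by a simple rectangular tiling. The sketch below is a hypothetical helper (the function name and the convention of numbering processes from 1 are assumptions), not part of the model code.

```python
import numpy as np

def rectangular_process_map(imt, jmt, nx, ny):
    """Assign each horizontal grid point to one of nx * ny processes using a
    regular rectangular tiling.  Hypothetical helper for a box ocean with no
    land points; realistic maps are constructed by the user beforehand."""
    pmap = np.zeros((jmt, imt), dtype=int)
    for j in range(jmt):
        for i in range(imt):
            tile_i = min(i * nx // imt, nx - 1)    # tile column for this i
            tile_j = min(j * ny // jmt, ny - 1)    # tile row for this j
            pmap[j, i] = tile_j * nx + tile_i + 1  # processes numbered from 1
    return pmap

# Example: a 64 x 64 box ocean shared among 16 (4 x 4) slave processes.
pmap = rectangular_process_map(64, 64, 4, 4)
```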

In a more realistic global model such a tiling would not be suitable. Some processes would be responsible for regions within continents and so have no work to do. Elsewhere, processes allocated regions of deep ocean would have more work than those allocated shallow regions of ocean.

To help prevent this from happening, the model is designed so that the regions of responsibility may have complex shapes. This allows tilings of the type shown in Fig. 2. The only limitations are those that arise from the limits on processor memory. The storage within each slave oceanic process is still organized on a longitude, latitude, and depth grid, so there has to be enough memory in each processor to store all the variables in the bounding rectangular box.

The process map, which defines how the ocean is partitioned, is read in by the master control process while it is initializing the model. For each oceanic process, the control process identifies the minimum bounding box, which includes the oceanic process’s region of responsibility and the required boundary points from neighboring processes. It then sends the oceanic process a message giving the absolute coordinates of the origin of the bounding box, its size, and copies of the process map and model depth array for the region of the bounding box.
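
A minimal sketch of how such a bounding box might be extracted from the process map is given below. The helper name, the one-point halo width, and the use of 0 for land (as in Fig. 3) are assumptions rather than details taken from the model code.

```python
import numpy as np

def bounding_box(pmap, proc, halo=1):
    """Return the origin and size of the minimum bounding box containing
    process `proc`'s points plus a halo of `halo` points on each side.
    `pmap` is a 2D array of process numbers, with 0 denoting land."""
    rows, cols = np.nonzero(pmap == proc)
    if rows.size == 0:
        raise ValueError(f"process {proc} owns no ocean points")
    j0 = max(rows.min() - halo, 0)
    j1 = min(rows.max() + halo, pmap.shape[0] - 1)
    i0 = max(cols.min() - halo, 0)
    i1 = min(cols.max() + halo, pmap.shape[1] - 1)
    return (i0, j0), (i1 - i0 + 1, j1 - j0 + 1)   # origin, (width, height)
```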

On receiving the information, the oceanic process first checks that the bounding box can fit within its predefined arrays. The sizes of these are specified at compile time but, although the option is not yet implemented, PVM allows oceanic processes with predefined arrays of different sizes to be used together.

The oceanic process then reads in the rest of the information and completes its own initialization. This includes the setting up of pointer arrays for the revised loop structure and the message passing discussed in the next sections.

An example of an individual process’s copy of the process map is shown in Fig. 3. Three special regions are identified: the inner core region and the inner and outer haloes. The two halo regions represent points involved in the transfer of data between processes. The inner core region is the part of the process’s area of responsibility not involved in data transfers. At the beginning of each time step, calculations can be carried out in the core region without waiting for data from the neighboring processes.

The inner halo region is also part of the area of responsibility of the process, but time stepping cannot start for points in this region until data for nearby points in the outer halo have been received from the neighboring processes. At the end of each time step, model variables from the inner halo region have to be sent to the neighboring processes for use during the next time step.
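
The three regions can be derived from the process map itself. The sketch below assumes a one-point halo (five-point stencil) and is only an illustration of the classification, not the model's own bookkeeping.

```python
import numpy as np

def classify_points(pmap, proc):
    """Split the points seen by process `proc` into core, inner halo, and
    outer halo, assuming a one-point (five-point stencil) halo.  `pmap`
    holds process numbers, with 0 denoting land."""
    mine = (pmap == proc)
    other = (pmap != proc) & (pmap != 0)       # ocean points of other processes

    def neighbour_any(mask):
        """True where at least one of the four neighbours satisfies `mask`."""
        out = np.zeros_like(mask)
        out[1:, :] |= mask[:-1, :]
        out[:-1, :] |= mask[1:, :]
        out[:, 1:] |= mask[:, :-1]
        out[:, :-1] |= mask[:, 1:]
        return out

    inner_halo = mine & neighbour_any(other)   # my points needed by neighbours
    core = mine & ~inner_halo                  # my points needing no exchange
    outer_halo = other & neighbour_any(mine)   # neighbours' points I must receive
    return core, inner_halo, outer_halo
```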

The new model takes advantage of this structure by replacing the loops over latitude and longitude indices by a single loop that starts with all the points in the inner core region and then finishes with the points in the inner halo region. A check is inserted before any of the inner halo points are processed to ensure that all the outer halo data has arrived. This is illustrated in Fig. 1.

The revised loop structure allows a considerable period of time, while the inner core calculations are under way, for the boundary data to be passed between processes. If the computational load on the processes is equally balanced, this should mean that all the outer halo data arrives before the check is reached. In practice, the load balancing will not be perfect, so some processes will spend time idling at these points waiting for data.
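
The reordered loop can be sketched as follows; the point lists, the update routine, and the two message-system hooks are hypothetical stand-ins for the quantities described here and in section 4.

```python
def step_region(core_points, inner_halo_points, update_point,
                halo_received, poll_messages):
    """Advance one time step over a process's region of responsibility:
    core points first, then, once the outer-halo data has arrived, the
    inner-halo points."""
    for (i, j) in core_points:          # needs no data from neighbours
        update_point(i, j)

    while not halo_received():          # idle only if neighbours are late
        poll_messages()                 # let the message system make progress

    for (i, j) in inner_halo_points:    # safe now: outer-halo data present
        update_point(i, j)
```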

4. Data transfer

The code uses two types of arrays for sending and receiving the halo data. For sending, an ordered list of halo points is constructed during program initiation. This contains the latitude and longitude index of the grid point and the process to which it is to be sent. Data from some of the halo points have more than one entry as they have to be sent to more than one process.

For receiving data a similar array is generated with the longitude and latitude index and the process from which it is to come. At the beginning of a receive this is used to set a count of the amount of data expected from each process. It is also used to set up an array in which outer halo points are set to the number of the process responsible for them and all other points are set to zero. Different arrays and counters are used for the baroclinic and barotropic loops. These are used to check that the data are received correctly.

As data from each horizontal grid point are received, the value of the corresponding array element is checked. If it is zero, an error is flagged. Otherwise it is set to zero and the corresponding process counter reduced by one. When all the counters reach zero, a flag is set indicating that all data has been received.

Each oceanic process checks the flag, and if necessary waits, before starting calculations involving the outer halo points. As soon as the data has been received, the check arrays are reinitialized and a message is sent to the neighboring processes giving permission to send data for the next time step. The main calculation then continues.
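
The receive-side bookkeeping described above can be illustrated with a small class. The names are hypothetical, and, as noted in the text, the model keeps separate counters and check arrays for the baroclinic and barotropic loops.

```python
import numpy as np
from collections import Counter

class HaloReceiveTracker:
    """Check array plus per-process counters for incoming outer-halo data."""

    def __init__(self, expected, shape):
        # `expected` lists (i, j, source_process) for every outer-halo point;
        # `shape` is (jmt, imt) for this process's arrays.
        self.expected = expected
        self.shape = shape
        self.reset()

    def reset(self):
        """(Re)initialize the check array and counters for a new time step."""
        self.check = np.zeros(self.shape, dtype=int)
        self.pending = Counter()
        for i, j, src in self.expected:
            self.check[j, i] = src          # who should supply this point
            self.pending[src] += 1          # how many points from each source

    def record(self, i, j, src):
        """Called as each grid point's data arrives."""
        if self.check[j, i] == 0:
            raise RuntimeError(f"unexpected data for point ({i}, {j})")
        self.check[j, i] = 0
        self.pending[src] -= 1

    def all_received(self):
        """True once every counter has reached zero."""
        return all(count == 0 for count in self.pending.values())
```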

Similar logic is used at the end of each time-stepping loop before sending data. Each oceanic process checks to see if permission to send messages has been received from all of its neighbors. If necessary it waits, but otherwise it sends the boundary data and continues.

The start-up time needed to initialize the message path between processors can be relatively large. Because of this, the code keeps the number of data transfers to a minimum by using only one buffer for each message. One result of this is that the baroclinic messages, containing the three-dimensional tracer and velocity fields, can be very large. The barotropic messages are much smaller, but as there are of order one hundred barotropic time steps during each baroclinic one, it is these that are most likely to limit model performance.

Within the control and oceanic programs, the sending of messages is carried out by a single low-level routine, which itself calls the PVM message handling routines. Messages are received using a similar low-level routine. The latter is designed to be driven by interrupts, but with a system like PVM, where interrupts are not readily implemented, the alternative is to call it at regular intervals from within the program. The receive routine checks that all incoming messages are legal and updates the counters and flags as appropriate. If data for the master are waiting to be sent (see below) and the relevant flags have been received, then it also calls the low-level send routine to forward the data before exiting.
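
A rough sketch of this polled, pseudo-interrupt design is shown below. All names are hypothetical, and the actual PVM calls are hidden behind the probe and send stand-ins.

```python
def poll_messages(probe, handle_message, outbox, permission_granted, send):
    """Polled stand-in for the interrupt-driven receive routine: drain any
    waiting messages, then forward queued data whose permission flag has
    already been received."""
    msg = probe()                        # next waiting message, or None
    while msg is not None:
        handle_message(msg)              # validity checks, counters, flags
        msg = probe()

    still_waiting = []
    for dest, data in outbox:            # data queued for the master (or others)
        if permission_granted(dest):
            send(dest, data)
        else:
            still_waiting.append((dest, data))
    outbox[:] = still_waiting            # keep what could not yet be sent
```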

5. Control and archiving

The master control process is used to monitor progress of the model run and to shut it down if an error occurs. It also acts as the primary input–output interface, handling the input and distribution of the meteorological fields during the model run and the output of model archive and summary datasets.

The control process is the first process started, and in a normal PVM implementation it spawns the oceanic processes. It then initializes the model by sending copies of the standard input file and the relevant sections of the process map, depth array, surface forcing fields, and restart dataset to each of the oceanic processes. At this point there is a checkpoint at which each process sends a message giving the position it has reached and enters a wait loop. When all the oceanic processes have reported in, the control process issues permission to continue.
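
The checkpoint amounts to a simple barrier, sketched below from the control process's point of view; the two message routines are hypothetical stand-ins for the PVM-based send and receive calls.

```python
def startup_checkpoint(receive_report, send_continue, n_slaves):
    """Wait until every oceanic process has reported the position it has
    reached, then give all of them permission to continue."""
    reported = {}
    while len(reported) < n_slaves:
        slave, position = receive_report()   # blocks until one report arrives
        reported[slave] = position
        print(f"slave {slave} reached checkpoint at {position}")

    for slave in reported:
        send_continue(slave)
```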

Elsewhere in the program the oceanic processes use a similar message to report their position. This information is kept in a table and printed to standard output by the control process at regular intervals. For debugging the code, further position reporting points and checkpoints can be readily added. The oceanic processes can also send a time-stamped error message to standard output via the control process. A second option, which is also implemented, is for each process to open its own error file. This has the advantage that it is simpler to handle a set of small error files than a single large standard output file.

After initialization, the control process enters a time-stepping loop equivalent to the baroclinic loop of the oceanic processes. However, the only work done within the loop is to test which of four input–output operations are to be carried out for the current time step. The four operations are the reading in of new meteorological data and the writing out of time-step summary information, model snapshot fields, and a full model archive dataset.

For the time-step summary, a request for information is sent to each process, and once all replies have been received a combined summary is printed. In each oceanic process, when the same time step is reached, the summary information is calculated and stored, and a flag is set. The low-level pseudo-interrupt routine forwards the information as soon as the local flag is set and permission from the control process has been received.

A check at the end of the barotropic time-stepping loop ensures that the oceanic process does not enter a new time step requiring summary information until all previous messages have been sent to the control. A similar check in the control process ensures that the program has printed the combined summary before continuing. As all the message transfers are handled by the low-level pseudo-interrupt routines, a high level of overlapping between the different message requests and computation can be achieved.

The snapshot and archive requests work in a similar manner. When a snapshot time step is reached, the oceanic processes copy the snapshot data into temporary storage. Usually this contains the free surface model height and velocity fields and the temperature, salinity, and velocities from two levels in the ocean. When the control process requests one of these fields, data are sent only for the oceanic process’s region of responsibility. The control process combines these data to create a two-dimensional array covering the whole model and writes this to disk.

At archive time steps, the oceanic processes copy the model data for a single model time step into a second area of temporary storage. The free surface datasets are then sent to the control using the same protocol as for the snapshot data. The remaining three-dimensional datasets are too large to be handled as a single unit by the control process. Instead it makes requests for a particular variable from a limited range of latitudes. When all the oceanic processes involved have replied, the control process writes the data to disk as a series of latitude slabs. The reverse procedure is used when restarting the model.
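
The latitude-slab gather might look like the following sketch from the control side. The request and reply routines, and the assumption that each reply carries the global origin of the slave's contribution, are hypothetical.

```python
import numpy as np

def gather_latitude_slab(variable, j_start, j_end, imt, km,
                         slaves_in_range, request, receive_reply):
    """Ask every slave whose region overlaps rows j_start..j_end for its part
    of one three-dimensional variable and assemble the replies into a single
    slab ready to be written to disk."""
    slab = np.zeros((j_end - j_start + 1, imt, km))

    for slave in slaves_in_range:
        request(slave, variable, j_start, j_end)

    for _ in slaves_in_range:
        i0, j0, sub = receive_reply()        # global origin and (nj, ni, km) block
        nj, ni, _ = sub.shape
        slab[j0 - j_start:j0 - j_start + nj, i0:i0 + ni, :] = sub

    return slab
```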

Finally the control process is responsible for the reading and distribution of the meteorological surface forcing data. With many datasets this occurs at the middle of each model month, interpolation between months being carried out by the oceanic processes. When the required time step is reached, the control process reads the new data from disk and sends a message to the oceanic processes stating that the new data can now be requested. On reaching the same time step, the oceanic processes request the data and receive it as a series of two-dimensional fields. The control process ensures that the current meteorological data has been sent to all of the oceanic processes before reading in the next month’s data.

6. Validation and performance tests

Validation of the new code was carried out by comparing results with those from the moma ocean model. The latter was validated earlier against the MOM code (Webb 1996). With both codes unoptimized, the end-of-run archive datasets agreed exactly. The intermediate statistics (like the mean kinetic energy), which depend on the order of summation, agreed to within the expected rounding error.

Two sets of performance tests were carried out on a Cray T3D to measure the effect of the number of processors on the efficiency of the model. In both sets of tests a cyclic channel ocean was used with a width, in the north–south direction, of 64 grid boxes (66 including the northern and southern walls), and a depth of 15 grid boxes.

In the first set of tests, the work allocated to each oceanic process was kept constant. Each process was allocated a section of channel 64 grid boxes long, so as the number of processors increased, the length of the complete model channel also increased.⁴

In the second set of tests the total amount of work was kept constant. The total length of the channel was fixed at 64 grid boxes, and as the number of processors increased, the length of channel allocated to each one correspondingly decreased.

Tests were carried out using between 1 and 64 oceanic processors. One extra processor was used to run the control program. Model performance was measured in terms of the mean wall-clock time required to carry out one leapfrog barotropic time step. For comparison, moma, the single-processor non-message-passing version of the code, was also timed for a similar channel with a length of 64 grid boxes.

The results are tabulated in Tables 1 and 2. These show the time taken for each test and the power and efficiency relative to the single-processor moma code. If N is the number of slave processors in use, M the total number of model grid points, and t the time taken per time step, and if primed terms correspond to the stand-alone moma code, then the power P and efficiency E are defined as
$$P = \frac{M/t}{M'/t'}, \qquad E = \frac{P}{N}.$$
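
A small helper makes the definitions explicit (hypothetical code; the example uses made-up numbers, not values from the tables).

```python
def power_and_efficiency(n_proc, m_points, t_step, m_ref, t_ref):
    """Power P and efficiency E relative to the single-processor moma run:
    P is the ratio of grid points processed per unit time to that of the
    reference run, and E is P divided by the number of slave processors."""
    power = (m_points / t_step) / (m_ref / t_ref)
    efficiency = power / n_proc
    return power, efficiency

# Example with made-up numbers: 16 slaves, 16 times the reference grid,
# and a time step 10% longer than the reference run's.
P, E = power_and_efficiency(16, 16 * 64 * 66 * 15, 1.1, 64 * 66 * 15, 1.0)
print(P, E)   # roughly 14.5 and 0.91
```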

The power and efficiency are plotted in Figs. 4 and 5 for the first set of tests and Figs. 6 and 7 for the second set of tests.

It is informative to first compare the time taken by a single slave oceanic process with that of the moma code. With the single slave, there are no slave-to-slave messages and only a few messages being sent to the master control process. Thus, the difference in times is a measure of the cost of reorganizing the code for message passing. The results show that this is relatively small.

The two-slave case introduces the full set of slave-to-slave messages. Table 1 shows that when the work per processor is kept constant, these introduce a small time penalty but the efficiency remains over 90%. As further processors are added, there are additional small time penalties, presumably due to contention on the message passing network, but even with 64 processors the system is still 84% efficient.

In the second set of tests, the computational work per processor is reduced. In principle, this should reduce the time taken for each test, but any delays introduced by the message passing system will also become more apparent. As shown in Table 2, when only a few processors are involved, the efficiency of the new code remains high. However, once more than 16 processors are involved, efficiency drops to below 50%.

In the 16-processor test each slave is responsible for a horizontal region of 16 × 64 grid boxes. This was approximately 6% of the maximum allowed by the system and so it represented a very light loading of each processor. As long as the processors are kept busy by being made responsible for regions larger than this, the efficiency of the code remains good.

Tests of the code have also been carried out in a Unix workstation environment using a group of nine computers connected by a 10 Mbit s⁻¹ Ethernet network. Seven of the workstations had similar processor speeds and memory (32 Mbyte). The eighth, only used in the largest test case, was more powerful. A ninth workstation was used to run the master program.

Because other programs may be run on the workstations at any time and there are other heavy users of the network, each test was repeated a number of times. The results are summarized in Table 3 in terms of the minimum and maximum time per baroclinic time step, the minimum usually corresponding to tests carried out in the early morning when the network load was small.

The tests were made for the case where the work per processor was kept constant. Comparison of the single oceanic processor case with moma shows again that the single slave code is efficient. However, adding a second oceanic processor almost doubles the time taken, indicating a very high message passing overhead. As more slaves are added, the efficiency drops even further.

Efficiency can be improved by increasing the size of each slave region. This increases the work done by each slave so that the proportion of time spent message passing is reduced. However, in the test case, with only a small memory available on each processor, this also increases the probability of memory pages being swapped out to disk. This introduces additional delays, which again reduce the efficiency of the model.

In conclusion, the tests on the Cray T3D show that the efficiency of the present code is high, with efficiencies of more than 80% being possible with 64 processors. In the workstation environment, the model performance is limited by the relatively slow speed of the message passing system. However, the new code allows model development in such an environment, and it allows the user to run models that are too large for a single workstation.

7. Summary

A PVM version of the Bryan–Cox–Semtner model has been described that can be run on clusters of workstations or on array processor computers such as the Cray T3D. The program partitions the ocean into horizontally compact domains whose size and shape can be chosen to balance the workload between processors. One processor is used for control, its main tasks being to initialize or restart the program and to collate archive and diagnostic information.

The code has been tested on Sun and Silicon Graphics Unix workstations and on a Cray T3D array processor computer. On the Cray T3D the code is very efficient, one of the tests with 64 processors showing an efficiency of over 80%.

REFERENCES

  • Bryan, K., 1969: A numerical method for the study of the circulation of the world ocean. J. Comput. Phys.,4, 347–376.

  • Cox, M. D., 1975: A baroclinic numerical model of the world ocean: Preliminary results. Numerical Models of Ocean Circulation, R. O. Reid, A. R. Roberson, and K. Bryan, et al., Eds., National Academy of Sciences, 107–120.

  • ——, 1984: A primitive equation 3-dimensional model of the ocean. GFDL Ocean Group Tech. Rep. 1, 141 pp. [Available from Geophysical Fluid Dynamics Laboratory/NOAA, Princeton University, Princeton, NJ 08542.].

  • Dukowicz, J. K., R. D. Smith, and R. C. Malone, 1993: A reformulation and implementation of the Bryan–Cox–Semtner ocean model in the Connection Machine. J. Atmos. Oceanic Technol.,10, 195–208.

  • Geist, A., A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, 1993: PVM3 user’s guide and reference manual. ORNL/TM-12187, 114 pp. [Available from Engineering Physics and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831.].

  • Holland, W. R., and F. O. Bryan, 1989: A high resolution simulation of the wind- and thermohaline-driven circulation in the North Atlantic Ocean. Parameterization of Small-scale Processes, Proc. ‘Aha Huliko’a 89, Hawaii Institute of Geophysics, Honolulu, HI, 99–116.

  • Manabe, S., R. J. Stouffer, M. J. Spelman, and K. Bryan, 1991: Transient responses of a coupled ocean–atmosphere model to gradual changes in atmospheric CO2. Part I: Annual mean response. J. Climate,4, 785–818.

  • Mesinger, F., and A. Arakawa, 1976: Numerical methods used in atmospheric models. GARP Publication Series No. 17, World Meteorological Organization, 64 pp. [Available from World Meteor. Org., Case Postale No. 5, CH-1211 Geneva 20, Switzerland.].

  • Mitchell, J. F. B., and J. M. Murphy, 1995: Transient response of the Hadley Centre coupled ocean–atmosphere model to increasing carbon dioxide. Part II: Spatial and temporal structure of response. J. Climate,8, 57–80.

  • Pacanowski, R. C., 1995: MOM 2 documentation, user’s guide and reference manual. GFDL Ocean Group Tech. Rep. 3, 232 pp. [Available from Geophysical Fluid Dynamics Laboratory/NOAA, Princeton University, Princeton, NJ 08542.].

  • ——, K. Dixon, and A. Rosati, 1990: The GFDL Modular Ocean Model users guide, version 1.0. GFDL Ocean Group Tech. Rep. 2, 18 pp. [Available from Geophysical Fluid Dynamics Laboratory/NOAA, Princeton University, Princeton, NJ 08542.].

  • Sarmiento, J. L., R. D. Slater, M. J. R. Fasham, H. W. Ducklow, J. R. Toggweiler, and G. T. Evans, 1993: A seasonal three-dimensional ecosystem model of nitrogen cycling in the North Atlantic euphotic zone. Global Biogeochem. Cycles, 7, 417–450.

  • Semtner, A. J., 1974: A general circulation model for the World Ocean. Department of Meteorology, University of California, Los Angeles, Tech. Rep. 9, 99 pp.

  • Washington, W. M., and G. A. Meehl, 1989: Climate sensitivity due to increased CO2: Experiments with a coupled atmosphere and ocean general circulation model. Climate Dyn.,4, 1–38.

  • Webb, D. J., 1993: An ocean model code for array processor computers. Internal Document No. 324, 21 pp. Institute of Oceanographic Sciences, Godalming, United Kingdom. [Available from National Oceanographic Library, Southampton Oceanography Centre, Empress Dock, Southampton SO14 3ZH, United Kingdom.].

  • ——, 1996: An ocean model for array processor computers. Comput. Geosci.,22, 569–578.


Fig. 1. Organization of the slave oceanic program and subroutine step. Message passing between slaves is handled in subroutine step. The inner pair of message arrows refer to the free surface model fields, the outer pair to the remaining fields. Messages to other slaves are sent only at the positions shown but may be received at a number of additional points in the program. The organization of the moma generic program is similar but without the message passing.

Fig. 2. An example of how a model ocean may be partitioned among 16 slave processes. The shape of the regions should be chosen to balance the workload and to minimize message passing.

Fig. 3. Example, taken from a different ocean model, of the slave core and halo regions for process 4, showing its area of responsibility made up of its core and inner halo regions. The outer halo includes points belonging to other processes whose data is needed by process 4. Zeros are used to denote land. IMT_S and JMT_S denote the maximum size of the slave arrays. IMT and JMT denote the region actually used.

Fig. 4. Plot of power against the number of processors when the work per processor is kept constant.

Fig. 5. Processor efficiency as a function of the number of processors when the work per processor is kept constant.

Fig. 6. Plot of power against the number of processors when the total problem size is kept constant.

Fig. 7. Processor efficiency as a function of the number of processors when the total problem size is kept constant.

Table 1. Model performance on the Cray T3D when the work per processor per time step is kept constant. Tests were carried out using interactive job queues, except for the 32- and 64-processor tests, which used batch queues.

Table 2. Model performance on the Cray T3D when the problem size is fixed. The length of channel allocated to each slave oceanic processor is 64 grid boxes divided by the number of processors. Tests were carried out using interactive job queues, except for the 32- and 64-processor tests, which used batch queues.

Table 3. Model performance on a cluster of workstations when the work per processor per time step was kept constant. Minimum times occurred when there were few other users of the workstation system. Maximum times occurred at times of heavy usage.

* A laboratory of the Natural Environment Research Council.

1. An alternative approach, designed specifically for use on the Connection Machine, has been described by Dukowicz et al. (1993).

2. At present moma and moma.pvm contain only a limited number of the options available with MOM. However, the core codes have many similarities, so it is usually straightforward to transfer an option from one code to the other. The only known exception is the MOM Fourier filtering option, which would need extra message passing in moma.pvm.

3. The code can be obtained by anonymous ftp from internet site socnet.soc.soton.ac.uk, in file pubread/occam/moma.pvm_vl.9.tar.Z. The Unix commands “uncompress” and “tar” are needed to uncompress the file and to split it into its separate components.

4. On the Cray T3D used for the tests, in which each processor had 64 Mbyte of memory, this represented approximately 25% of the maximum region of ocean that could be handled by the processors.
