## Abstract

A regional spectral model (RSM) has been developed at the Taiwan Central Weather Bureau (CWB). It is based on the model structure, dynamics, and physics of the CWB global spectral model (GSM) and on the perturbation concept of the National Centers for Environmental Prediction (NCEP) RSM for lateral boundary treatment. The advantages of this new regional model include minimizing possible inconsistencies between the GSM and the RSM that arise through lateral boundary influence, and reducing the resources needed to manage and maintain the model.

One-dimensional decomposition is used to slice the model domain into subdomains for running on a massively parallel processor machine. The Message-Passing Interface (MPI) is adopted for communication among the subdomains. Computational dependencies, such as the summation in the spectral transformation, constrain the decomposition so that results are reproducible with different numbers of processors. The performance in terms of wall-clock time follows the theoretical curve of parallelization. It reaches about 95% parallelization on a “homemade” PC Linux cluster and about 90% on the CWB Fujitsu VPP5000.

One case is selected for 2-month integrations in a simulation mode and a forecast mode. The results show a reasonable monsoon frontal evolution compared with the analysis, and a root-mean-square error (rmse) similar to or smaller than that of the CWB GSM. The same run with the NCEP RSM nested into the CWB GSM shows a larger rmse than the CWB RSM, which demonstrates the advantage of sharing the same model structure, dynamics, and physics between the CWB GSM and the CWB RSM.

## 1. Introduction

Global atmospheric numerical models have been used by climate researchers for years to simulate and forecast global climate. Because of limited computer resources, global atmospheric models have been run at coarse horizontal resolution, on the order of several hundred kilometers, to forecast large-scale climate phenomena. Nevertheless, mesoscale features, such as mesoscale precipitation induced by complex terrain, have gradually gained the attention of the climate community; thus, there is a trend toward using regional models for climate studies or forecasts, as in Giorgi and Bates (1989), Dickinson et al. (1989), Giorgi (1990), Kida et al. (1991), Jones et al. (1995), Sasaki et al. (1995), Ji and Vernekar (1997), Caya and Laprise (1999), and others. Several national and international climate prediction centers and institutes have also followed this trend.

The National Centers for Environmental Prediction (NCEP) regional spectral model (RSM) developed by Juang and Kanamitsu (1994) has been applied over different regional domains for regional climate simulations, as in Hong and Leetmaa (1999) and Hong et al. (1999). Although NCEP RSM is suitable for regional climate simulation, introducing a completely new regional model at the Central Weather Bureau (CWB) may not be practical, because extensive human resources may be required for its maintenance. Instead of adopting NCEP RSM at CWB, the concept of the NCEP RSM can be adopted to develop a CWB RSM.

The concept of NCEP RSM comprises two elements—one is the same model structure, dynamics, and model physics as its own global spectral model (GSM), the other is the perturbation spectral computation. The former can lead to a reduction in the cost for maintaining CWB RSM. Only the lateral boundary and perturbation treatment of the CWB RSM present new maintenance concerns. The second aspect of the concept provides wave selection to the CWB RSM. Since there is no wave longer than the regional domain in perturbation, the large-scale field from CWB GSM will be kept throughout the entire integration. Thus, the CWB RSM should be suitable for long-term integration, possibly with minimal large-scale drift (Juang and Hong 2001).

For a model developed for operational purposes, performance is the major concern. Performance has two major aspects: consistent, well-behaved prediction and the speed of the integration. Both are equally important. For example, no matter how well the model predicts the future, it is of no use if it cannot finish its integration in time for forecasters to use, and a fast model is of little use if its predictions are poor. In recent years, most operational models have taken advantage of massively parallel processor (MPP) machines to improve their computation speed. Thus, it is necessary to implement MPP capabilities for the operational model. Since MPP is distributed computation, the model domain must be sliced into several subdomains for the multiple processors, and data must be exchanged among the subdomains. The most common library with a FORTRAN binding for this data exchange is the message-passing interface (MPI), described in Gropp et al. (1999).

The implementation of CWB RSM from CWB GSM and NCEP RSM, and descriptions of the preprocessor and postprocessor are discussed in section 2. The model decomposition and development of interface routines to link between model and MPI library are discussed in section 3. The performance of the MPI implementation is shown in section 4. Meteorological results from one case are illustrated in section 5. The conclusions and planned future work are described in section 6. Note that, even though the model dynamics, physics, and nesting strategy in section 2 are published in previous literature (Juang and Hong 2001), the concept, implementation, and performance of MPI code in sections 3 and 4 are original results. Furthermore, the results of monthly integration in section 5 provide additional evidence of successful spectral nesting with the perturbation method advocated in Juang and Hong (2001).

## 2. Implementation of the CWB RSM

### a. Model structure, dynamics, and physics of CWB RSM

The CWB RSM is not a modified version of the NCEP RSM. Instead, it is based on the model structure, model dynamics, and model physics of the CWB GSM (Liou et al. 1997), which is the forecast model of the second-generation global forecast system (GFS) at CWB. First, all computational procedures in the CWB GSM model code are adopted in the CWB RSM. The names of all variables used in CWB GSM are carried over to CWB RSM, and subroutine names in CWB GSM are used in the CWB RSM with the prefix *r*. The prognostic equations used in CWB GSM (see Eqs. (4)–(8) in Liou et al. 1997) are for divergence, vorticity, virtual potential temperature, specific humidity, and surface pressure, written in flux form for a primitive hydrostatic system on sigma coordinates. To reduce the number of variables handled in the spectral transform, the divergence and vorticity equations are replaced by horizontal momentum equations, and the flux form is replaced by the advection form. Additionally, a map factor is included because of the map projection used for limited-area domains. Thus, the prognostic equations in sigma coordinates used here are

and the diagnostic equations used are

where $u^{*} = u/m$ and $\upsilon^{*} = \upsilon/m$ are the pseudo wind components, $m = \cos\phi_{0}/\cos\phi$ is the map factor of the Mercator projection, $s^{*2} = u^{*2} + \upsilon^{*2}$, $\pi = (p/p_{0})^{\kappa} = \pi_{s}\sigma^{\kappa}$, $\kappa = R_{d}/C_{p}$, $\pi_{s}$ is $\pi$ at the surface, $\phi$ is $gz$ with respect to mean sea level (the geopotential, as distinct from the latitude $\phi$ in the map factor), $f$ is the Coriolis parameter, $\theta_{\upsilon}$ is the virtual potential temperature, $q$ is the specific humidity, and $\dot{\sigma}$ is the vertical coordinate velocity. Though the form of the equations differs from that in CWB GSM, the model dynamics, as a hydrostatic system on sigma coordinates, are the same. The horizontal diffusion, time filtering, and semi-implicit time integration scheme used in CWB GSM are also used in CWB RSM.
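The map factor and pseudo wind defined above can be illustrated with a minimal sketch. The function names are hypothetical, and the default true latitude of 23°N is taken from the experiment setup described in section 5; nothing here is taken from the CWB RSM code itself.

```python
import math

def mercator_map_factor(lat_deg, true_lat_deg=23.0):
    """Map factor m = cos(phi0) / cos(phi) of a Mercator projection.

    true_lat_deg is phi0 (assumed here to be 23 deg N, as in section 5).
    """
    return math.cos(math.radians(true_lat_deg)) / math.cos(math.radians(lat_deg))

def pseudo_wind(u, v, lat_deg, true_lat_deg=23.0):
    """Pseudo wind components u* = u/m and v* = v/m at a given latitude."""
    m = mercator_map_factor(lat_deg, true_lat_deg)
    return u / m, v / m
```

At the true latitude the map factor is 1, so the pseudo wind equals the physical wind; poleward of it the map factor exceeds 1 and the pseudo wind is smaller than the physical wind.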

The model physics in CWB RSM are all adopted from CWB GSM without adjustment. A detailed description of the GSM model physics can be found in Liou et al. (1997); only a summary is given here. The physics include parameterization schemes for two-layer soil with Newtonian cooling from deep soil, Monin–Obukhov surface fluxes, K-theory vertical turbulent mixing, Harshvardhan-type shortwave and longwave radiative transfer, relaxed Arakawa–Schubert deep cumulus convection, Tiedtke-type nonprecipitating shallow convection, grid-scale condensation with a relative humidity threshold of 98%, and Palmer-type gravity wave drag.

### b. Perturbation spectral nesting adopted from NCEP RSM

The computational techniques of the perturbation nesting for limited-area spectral modeling are adopted from NCEP RSM. A detailed description of NCEP RSM can be found in Juang and Kanamitsu (1994) and Juang et al. (1997). Detailed computational techniques of the perturbation spectral nesting are described in Juang and Hong (2001); a brief description follows. For the model preparation, regional data and global data are read in, and the regional domain perturbation (or deviation) is obtained by

$$A^{p} = A^{R} - A^{G},$$

where *A* is any variable at a grid point on any given *σ* surface, for the perturbation (superscript *p*), the regional value (superscript *R*), and the global value (superscript *G*). Note that the perturbation (or deviation) is taken between two models on the same *σ* surface, not at the same physical height. Even though *A*^{R} and *A*^{G} are at different physical heights, the deviation between them is consistent on the same sigma coordinates and satisfies the equation set mathematically. The equation set can be separated into a perturbation system and a base field system on the same coordinate without any approximation. A two-dimensional sine or cosine grid-to-spectral transformation converts the perturbation from gridpoint space to spectral space on all model sigma surfaces.
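The preparation step can be sketched in one dimension as follows. This is an illustrative toy, not the model's transform: it assumes a perturbation that vanishes at both lateral boundaries and uses a plain (slow) discrete sine transform, whereas the text describes two-dimensional sine or cosine transforms. All function names are hypothetical.

```python
import math

def perturbation(regional, base):
    """A^p = A^R - A^G, point by point on the same sigma surface."""
    return [r - g for r, g in zip(regional, base)]

def sine_transform(values):
    """Sine-series coefficients of grid values on points j = 0..n.

    Assumes values[0] and values[n] are zero (perturbation relaxed to zero
    at the lateral boundaries); only interior points enter the sums.
    """
    n = len(values) - 1
    return [
        2.0 / n * sum(values[j] * math.sin(math.pi * k * j / n) for j in range(1, n))
        for k in range(1, n)
    ]

def inverse_sine_transform(coeffs):
    """Rebuild grid values on points j = 0..n from the sine coefficients."""
    n = len(coeffs) + 1
    return [
        sum(coeffs[k - 1] * math.sin(math.pi * k * j / n) for k in range(1, n))
        for j in range(n + 1)
    ]
```

Because no wave in the sine series is longer than the domain, the base field carries all scales larger than the regional domain, which is the wave selection property mentioned in the introduction.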

After the preparation, all base fields are in gridpoint space and all perturbations are in spectral space. The dynamic integration is conducted as follows. All derivatives of the base field are computed with third-order finite differences in gridpoint space, and the derivatives of the perturbation are computed in spectral space. The perturbations and their derivatives are then transformed from spectral space to gridpoint space. After summation of the base field and perturbation in gridpoint space, the full-field tendency is obtained by the nonlinear computation [see Eqs. (1)–(5)], and the difference between the full tendency and the base field tendency is computed to obtain the perturbation tendency as

$$\frac{\partial A^{p}}{\partial t} = F - \frac{\partial A^{G}}{\partial t},$$

where *F* represents the total forcing on the right-hand side of Eqs. (1)–(5). To satisfy the lateral boundary condition and to reduce lateral boundary noise due to the transformation, lateral boundary relaxation (Juang and Kanamitsu 1994) is applied to relax the lateral boundary values toward the base field; it is added to the perturbation tendency with an *e*-folding time of 1 h. Last, the perturbation tendency is transformed from gridpoint space to spectral space.
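A rough one-dimensional sketch of this step is given below. The perturbation tendency is the full tendency minus the base field tendency, as stated in the text. The relaxation term is only an assumption for illustration: it uses an explicit Newtonian damping of the perturbation with the 1-h *e*-folding time and a hypothetical linear ramp weight across a boundary zone; the actual scheme of Juang and Kanamitsu (1994) may differ in form and weighting.

```python
TAU_SECONDS = 3600.0  # e-folding time of 1 h quoted in the text

def boundary_weight(i, n_points, zone_width):
    """Hypothetical linear ramp: 1 at a lateral boundary, 0 in the interior."""
    distance = min(i, n_points - 1 - i)  # grid points to the nearest boundary
    return max(0.0, 1.0 - distance / zone_width)

def perturbation_tendency(full_tendency, base_tendency, pert, zone_width=3):
    """Full tendency minus base tendency, plus an assumed relaxation term.

    The relaxation damps the perturbation toward zero (that is, the full
    field toward the base field) inside the boundary zone only.
    """
    n = len(pert)
    result = []
    for i in range(n):
        relax = -boundary_weight(i, n, zone_width) * pert[i] / TAU_SECONDS
        result.append(full_tendency[i] - base_tendency[i] + relax)
    return result
```

In the interior the weight is zero, so the result reduces to the difference between the full and base field tendencies.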

After the dynamical integration, the updated perturbation is transformed from spectral space to gridpoint space and added to the updated base field in gridpoint space before it is passed to the physics computation adopted from CWB GSM. Once the physics forcing is computed, the change due to the model physics is transformed from gridpoint space to spectral space and added to the perturbation. This amounts to a time-splitting method for model dynamics and model physics. The aforementioned computation is repeated at each time step until the end of the forecast; the perturbation is then transformed from spectral space to gridpoint space and added to the base fields before they are written to an output file.

### c. Preprocessor and postprocessor

The initial input for the regional model is prepared through a regional preprocessor. A regional preprocessor comprises three steps: 1) the definition of the regional domain; 2) the preparation of the regional file of surface fields, especially for creating regional mountain; and 3) the preparation of the regional file of atmospheric fields.

Definition of the regional horizontal domain involves selecting the map projection, the reference location in terms of latitude and longitude, the grid resolution, and the number of grid points in each horizontal direction. In the *x* direction, the number of grid intervals must have no prime factors other than 2, 3, and 5, as required by the fast Fourier transform (FFT). In the *y* direction, the number of grid points must be even for the Fourier transform because of symmetry and antisymmetry. FFT is applied only in the *x* direction because of the slab design of the model computation.
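These grid-size constraints are easy to check before a domain is configured. The sketch below is a hypothetical helper, not part of the preprocessor; it assumes that the number of grid intervals in *x* equals the number of grid points minus one.

```python
def has_only_factors_2_3_5(n):
    """True if n >= 1 has no prime factors other than 2, 3, and 5."""
    if n < 1:
        return False
    for prime in (2, 3, 5):
        while n % prime == 0:
            n //= prime
    return n == 1

def domain_is_valid(nx_points, ny_points):
    """Check the x (FFT-friendly intervals) and y (even point count) rules.

    Assumes intervals in x = nx_points - 1.
    """
    return has_only_factors_2_3_5(nx_points - 1) and ny_points % 2 == 0
```

Under that assumption, the grids quoted later for the 45-km and 20-km experiments (73 by 72 and 163 by 158 points) both pass, since 72 = 2^3 · 3^2 and 162 = 2 · 3^4.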

All the regional fields are obtained by interpolation from global data to the regional domain, except terrain-related fields. Bilinear interpolation is applied to surface fields, and bicubic interpolation is applied to obtain atmospheric fields from global data. The regional terrain is obtained by bilinear interpolation from 5 min by 5 min global topographic data. Since the model terrain has higher resolution than the global model terrain, a cubic-spline interpolation with tension 10 is applied to interpolate data vertically from global sigma surfaces to regional sigma surfaces. The initial treatment of the terrain blending between the regional model and a coarse-grid or global model (Hong and Juang 1998) ensures smooth lateral boundary fields as well as better mass conservation.
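As an illustration of the horizontal step for surface fields, a minimal bilinear interpolation on a regular source grid might look like the following. The function name and the use of fractional grid indices are assumptions made for the sketch; the actual preprocessor works on latitude and longitude and is not reproduced here.

```python
def bilinear(field, x, y):
    """Bilinear interpolation of field[row][col] at fractional indices.

    x is the fractional column index and y the fractional row index; both
    must lie inside the source grid, which must be at least 2 x 2.
    """
    n_rows, n_cols = len(field), len(field[0])
    if not (0.0 <= x <= n_cols - 1 and 0.0 <= y <= n_rows - 1):
        raise ValueError("point lies outside the source grid")
    i0 = min(int(x), n_cols - 2)  # left column of the enclosing cell
    j0 = min(int(y), n_rows - 2)  # lower row of the enclosing cell
    fx, fy = x - i0, y - j0
    row_a = (1.0 - fx) * field[j0][i0] + fx * field[j0][i0 + 1]
    row_b = (1.0 - fx) * field[j0 + 1][i0] + fx * field[j0 + 1][i0 + 1]
    return (1.0 - fy) * row_a + fy * row_b
```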

A postprocessor is written to read all output files, including the surface file and the sigma file. The surface file contains most of the physics results and surface fields, such as surface temperature, soil moisture and temperature, and rainfall. The sigma file contains atmospheric fields on model sigma surfaces. The postprocessor translates atmospheric fields from sigma surfaces to pressure surfaces, and it also computes mean sea level pressure, geopotential height from the hydrostatic relation, and other simple diagnostic fields.

## 3. Multiparallel implementation with message-passing interface (MPI)

Massively parallel processor (MPP) cluster computers are the mainstream of current supercomputers for numerical modeling, and the message-passing interface (MPI) is the most commonly used package for MPP. Implementing MPI in CWB RSM is a crucial step in model development for operational purposes. Furthermore, because computer hardware evolves quickly (for example, from vector to multithread, from multithread to multinode, and to multinode with multithread), MPI is implemented here as add-on subroutines to the original model; the model structure and computational sequence are not changed. The C preprocessor is used to ensure that the model can be run with different numbers of processors, including a single processor.

There are several ways to slice the entire model domain into pieces to take advantage of an MPP cluster computer. However, reproducibility, in the sense of binary-identical results, has to be ensured regardless of the number of processors used. To satisfy reproducibility, any summation, such as that in the spectral transformation or vertical integration, must be carried out within a single group that holds all the necessary grid points. Because of computational roundoff in summation, all the model grid points along at least one dimension must therefore reside on the same processor. For a three-dimensional model, this leaves two dimensions that can be used for slicing. If the model is sliced in two dimensions, it is called 2D decomposition; if it is sliced in one dimension, it is called 1D decomposition. 2D decomposition can use more processors than 1D because there are more grid points to slice across two dimensions. However, 1D decomposition can use any number of processors, whereas 2D decomposition cannot be used with a prime number of processors.
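The roundoff argument can be made concrete with a small sketch. The helper names are hypothetical and the "processors" are simulated with plain Python lists; the point is only that splitting the summed dimension changes the order of floating-point additions, whereas keeping each summed line on one processor does not.

```python
def slab_bounds(n_items, n_procs):
    """Split n_items into n_procs contiguous, nearly equal (start, stop) slabs."""
    base, extra = divmod(n_items, n_procs)
    bounds, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

def ordered_sum(values):
    """Left-to-right floating-point summation."""
    total = 0.0
    for value in values:
        total += value
    return total

def split_sum(values, n_procs):
    """Counter-example: slice the summed dimension, then add partial sums."""
    partials = [ordered_sum(values[a:b]) for a, b in slab_bounds(len(values), n_procs)]
    return ordered_sum(partials)

def row_sums_by_slab(field, n_procs):
    """1D decomposition over rows: each row is summed whole on one 'processor'."""
    sums = []
    for a, b in slab_bounds(len(field), n_procs):
        sums.extend(ordered_sum(row) for row in field[a:b])
    return sums
```

With the values `[1e16, 1.0, -1e16, 1.0]`, `split_sum` gives different answers on one and two simulated processors, while `row_sums_by_slab` returns bit-identical row sums for any processor count because the order of additions within each row never changes.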

The test results from the global spectral model with 1D and 2D decomposition on the European Centre for Medium-Range Weather Forecasts (ECMWF) Fujitsu VPP5000 by Juang and Kanamitsu (2001) indicate that 1D decomposition is faster than 2D because the array length is much longer in 1D than in 2D, and the longer the array, the faster the computation on vector-processor machines. Though 2D has the advantage mentioned earlier, it cannot take full advantage of a vector processor such as the VPP5000 at CWB; thus 1D decomposition is used here. Note that, since vector computation is faster than scalar computation, the smaller number of processors imposed by 1D decomposition is acceptable.

Figure 1 shows the schematic diagram of the 1D decomposition used here, with an example of six processors. The direction of the computing process from spectral space to gridpoint space is marked by thin arrows, and the direction from gridpoint space to spectral space is marked by thick arrows. A step-by-step description of the path from gridpoint space to spectral space in Fig. 1 follows. First, the input data in gridpoint space are read in and scattered into six pieces for the six processors by slicing in the *y* direction, as shown in the cube at the bottom-left corner of Fig. 1, and a fast Fourier transform in the *x* direction is applied on each processor, going from the bottom-left cube to the bottom-right cube. Before the Fourier transform in the *y* direction is performed, the bottom-right cube is transposed into the cube at the top-right corner. The way the bottom-right cube is transposed into the top-right cube is shown in Fig. 2. Each processor, separated by solid lines, slices its own data into six pieces along the dashed lines, keeps the piece belonging to itself, sends the remaining five pieces to the other processors, and receives five pieces from the other processors. After the transpose, the Fourier transform in the *y* direction can be performed, going from the top-right cube to the top-left cube. This completes the transformation from gridpoint space (bottom-left cube) to spectral space (top-left cube). The transformation from spectral space to gridpoint space follows the thin arrows along the reverse path.
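The transpose in Fig. 2 can be mimicked on a single machine with nested lists. This is a toy simulation under stated assumptions: a two-dimensional field stands in for one model level, "processors" are list entries, and the helper names are hypothetical. In an MPI code the exchange would be carried out with message passing (an all-to-all style exchange is one common choice), but the text does not say which MPI routines the model uses.

```python
def slab_bounds(n_items, n_procs):
    """Split n_items into n_procs contiguous, nearly equal (start, stop) slabs."""
    base, extra = divmod(n_items, n_procs)
    bounds, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

def scatter_rows(field, n_procs):
    """Slice the field in y: each processor gets whole rows (all x points)."""
    return [field[a:b] for a, b in slab_bounds(len(field), n_procs)]

def transpose_slabs(row_slabs, n_cols):
    """Re-slice from y slabs to x slabs, as in the Fig. 2 exchange.

    Each source cuts its rows into one column piece per destination, keeps
    its own piece, and 'sends' the rest; each destination stacks the pieces
    in source order so that it ends up holding every row for its columns.
    """
    n_procs = len(row_slabs)
    col_bounds = slab_bounds(n_cols, n_procs)
    col_slabs = []
    for dest in range(n_procs):
        c0, c1 = col_bounds[dest]
        stacked = []
        for src in range(n_procs):
            stacked.extend(row[c0:c1] for row in row_slabs[src])
        col_slabs.append(stacked)
    return col_slabs
```

After the transpose, every simulated processor holds complete lines in *y* for its own range of *x* indices, so the *y*-direction transform and its summations can be done locally, which is what the reproducibility requirement demands.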

The disadvantage of the 1D decomposition here is that the Fourier transform has to be separated into two parts. If the 2D Fourier transform had to be performed in one step, the decomposition would have to be over vertical layers. Since the number of grid points in the horizontal directions is always larger than that in the vertical direction, more processors can be used when slicing in the horizontal directions. This is why slicing is applied in the *x* and *y* directions instead of *z*.

After the data are read in and transformed to spectral space, the following four steps are performed at each time step: 1) transform from spectral to gridpoint space (thin arrows), 2) compute all nonlinear model dynamics and model physics, 3) transform from gridpoint to spectral space (thick arrows), and 4) perform linear computations, such as the semi-implicit scheme and time filter, in spectral space.

## 4. The speedup performances and wall-clock time saving

In this section, the performance of the model with MPI is shown in terms of speedup. The speedup, *S*_{p}(*n*), is defined as the time spent by a single-processor run, *T*_{1}, divided by the time spent by an *n*-processor run, *T*_{n}. It can be expressed as

$$S_{p}(n) = \frac{T_{1}}{T_{n}}.$$

For perfect speedup, *S*_{p}(*n*) = *n*, the entire code is parallelized and no time is wasted in communication. In practice, however, some sequential code cannot be parallelized and/or time is spent on communication; thus, *S*_{p}(*n*) is generally less than *n*. Let *s* be the time spent in the sequential code and *p* be the total time spent in the parallel code, summed over the *n* processors, so that *T*_{1} = *s* + *p*. Assuming the time spent in communication is negligible, *T*_{n} can be expressed as

$$T_{n} = s + \frac{p}{n}.$$

Substituting *T*_{1} = *s* + *p* and *T*_{n} = *s* + *p*/*n* into *S*_{p}(*n*), we obtain

$$S_{p}(n) = \frac{s + p}{s + p/n}.$$

As *n* approaches infinity, *S*_{p} approaches 1 + *p*/*s*. For example, if 80% of the computing time can be parallelized, the speedup has an asymptotic maximum of 5, and if 90% can be parallelized, the maximum speedup is 10. These two theoretical speedup curves are used in the figures of this section to evaluate the implementation of the MPI parallelized code. If the model results are close to the theoretical values, the time consumed for communication is negligible. This condition can be called linearized parallelization and represents a successful MPI implementation.
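The theoretical curves used in the figures follow directly from these definitions. A minimal sketch, with hypothetical function names and with *s* + *p* normalized to 1 so that the parallel fraction is *p*:

```python
def speedup(n_procs, parallel_fraction):
    """Theoretical speedup (s + p) / (s + p/n) with s + p normalized to 1.

    Communication time is neglected, as assumed in the text.
    """
    p = parallel_fraction
    s = 1.0 - p
    return (s + p) / (s + p / n_procs)

def max_speedup(parallel_fraction):
    """Asymptotic speedup 1 + p/s as the number of processors grows."""
    p = parallel_fraction
    s = 1.0 - p
    return 1.0 + p / s
```

For instance, `max_speedup(0.8)` and `max_speedup(0.9)` reproduce the asymptotic values of 5 and 10 quoted above, while `speedup(10, 0.9)` is only about 5.3, which shows how slowly the asymptote is approached.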

### a. Results from “homemade” PC Linux cluster

The components of the homemade PC Linux cluster used here are described in the appendix. Figure 3a shows the results from the homemade PC Linux cluster running CWB RSM with dimensions of 73, 72, and 18 in the *x*, *y*, and *z* directions, respectively, at a resolution of 45 km with a time step of 240 s (see experiment 1 in Table 1). The *x* axis is the number of processors used; the *y* axis is the speedup, *S*_{p}(*n*). The symbols +, ×, □, and ⋄ indicate the results from the periods of 0–12-, 12–24-, 24–36-, and 36–48-h integrations. The thick solid curve is the result averaged over the four periods. The three thin solid curves indicate the theoretical results for 100%, 90%, and 80% parallelization. The mixed performance among the periods could be attributed to machine-related load imbalance between processors caused by the operating system. Nevertheless, the averaged performance indicates that more than 92% of the computation is parallelized, and the worst individual run is 90% parallelized, which includes communication waste.

Figure 3b is the same as Fig. 3a, but for a model resolution of 20 km with dimensions of 163, 158, and 18 in the *x*, *y*, and *z* directions, respectively. For 20-km resolution, the time step is 90 s (see experiment 3 in Table 1). Since the dimensions are larger and the time step is smaller than in the previous 45-km runs, more computation is expected. The speedup should be higher if the sequential portion is about the same. The results indicate that the averaged performance is more than 95% parallelization, and 91% for the worst individual case.

The averaged performance in Fig. 3 is between 93% and 96% parallelization, but parallel performance appears higher with fewer processors. For example, Figs. 3a,b show superlinear performance with two processors, which then drops to about 93% for more than two processors. This may be due to the PC configuration and/or communication overhead.

### b. Results from CWB Fujitsu VPP5000

The next test is performed on the CWB VPP5000. Figure 4 shows the speedup results with the same setup as in Fig. 3, but with up to 10 processors. Figure 4a shows that 80% or more of the computational cost can be parallelized for 45-km resolution with 73, 72, and 18 grid points (experiment 1 in Table 1). Figure 4b shows that slightly more than 90% of the computation is parallelized for 20-km resolution with 163, 158, and 18 grid points (experiment 3 in Table 1). Since VPP is a vector machine and the parallel portion of the model code is well suited to vector computation, the ratio of the sequential portion to the parallel portion, in terms of computational wall-clock time, is larger than on a scalar machine when communication waste is negligible. Consequently, the speedup score is lower, but the wall-clock time is considerably shorter because of vector computing. We can claim that the communication cost is negligible because the speedup curve is close to the theoretical curve [Eq. (12)].

There are some differences between Figs. 3 and 4. The first difference is that the performances of all periods coincide in Fig. 4, in contrast to those in Fig. 3. This indicates that the tests on VPP were conducted in a very stable machine environment, in particular with no other jobs running. The second difference is that the speedup scores lie along the theoretical thin solid curves, with a slightly higher speedup at larger numbers of processors in Fig. 4. This indicates that a larger number of processors can be used without losing parallel efficiency. Furthermore, the low speedup score indicates that it will be cost-effective to optimize the sequential portion of the computation for VPP5000, but not for the homemade Linux PC cluster.

The speedup plot shows the quality of the model performance relative to the theoretical performance in terms of percentage of parallelization, but it does not reflect how fast the model runs in terms of wall-clock time. In general, VPP, a vector machine, is faster than the homemade Linux PC cluster. For example, for a 12-h integration in the 45-km cases, VPP spent 300 s with 1 processor, 100 s with 5 processors, and 75 s with 10 processors, whereas the PC cluster spent about 1200 s with 1 processor and 300 s with 5 processors. For the 20-km cases and a 12-h integration, VPP spent about 4200 s with 1 processor, 1200 s with 5 processors, and 760 s with 10 processors, whereas the PC cluster spent about 24 000 s with 1 processor and 5600 s with 5 processors. These are significant savings for climate integration, especially at higher resolutions with large numbers of grid points and a small time step. For example, at 20 km on VPP, a 1-month integration requires about 70 h (about 3 days) with 1 processor, but only about 12 h 40 min (about half a day) with 10 processors. This is a significant wall-clock time saving, and it can be improved further by optimizing the sequential portions, as mentioned.
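The quoted timings can be cross-checked against the parallelization percentages by inverting the theoretical speedup formula for the parallel fraction. The sketch below uses only the approximate wall-clock times given in the paragraph above; the function names are hypothetical, communication time is neglected, and a 30-day month is assumed for the monthly estimate.

```python
def parallel_fraction(t1_seconds, tn_seconds, n_procs):
    """Parallel fraction implied by a measured speedup S = T1 / Tn.

    From S = 1 / ((1 - f) + f/n) it follows that f = (1 - 1/S) / (1 - 1/n).
    """
    measured_speedup = t1_seconds / tn_seconds
    return (1.0 - 1.0 / measured_speedup) / (1.0 - 1.0 / n_procs)

def integration_hours(seconds_per_12h, days=30):
    """Wall-clock hours for a run of the given length (30 days assumed)."""
    return seconds_per_12h * 2 * days / 3600.0
```

With these numbers, the 45-km VPP runs imply a parallel fraction of about 0.83 with both 5 and 10 processors, the 20-km VPP run with 10 processors about 0.91, and the 45-km and 20-km PC runs with 5 processors about 0.94 and 0.96, in line with the percentages read from Figs. 3 and 4. The monthly estimates of about 70 h and 12 h 40 min follow from 4200 s and 760 s per 12 h of integration.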

One important result of the multiprocessor runs is that, on any given platform, the binary results are identical for all numbers of processors. In other words, the binary results from different numbers of processors are the same on the Linux PC cluster, and the same holds for the results from the VPP5000.

## 5. The preliminary results from model integration

Two sets of integration results from the Fujitsu VPP5000 at CWB are discussed here. The first is in a simulation mode and the second is in a forecast mode, for long-range integrations of up to 2 months with a resolution of 30 km (experiment 2 in Table 1). These runs aim to evaluate how well the perturbation behaves and how well the model preserves the base field in long-range integration.

The initial condition is given by the CWB GSM analysis and is chosen arbitrarily as 0000 UTC 1 May 1998 for all runs. The base fields at the lateral boundary, updated every 12 h, are from the CWB GSM analysis for the simulation mode or from the GSM forecast for the forecast mode. The resolution of the GSM analysis or forecast is T170 with 18 layers. In the forecast mode, CWB GSM is first integrated for 2 months with output at 12-h intervals; the initial field and the 12-h interval output of CWB GSM are then used to drive CWB RSM. The analyzed sea surface temperature is provided daily for both the simulation and forecast modes. The figures for CWB RSM shown after Fig. 5 cover the entire model domain without hiding the lateral boundary zones. Mercator projection is used for all cases with the true latitude at 23°N.

### a. The monthly integration in a simulation mode

It is known that the predictability of monthly or seasonal forecasts is very low; thus an ensemble forecast is commonly used. Nevertheless, a well-predicted ensemble forecast depends on a reasonable prediction by each member. Furthermore, though it is debatable whether a regional model should keep the same large-scale patterns as those of the driving model, doing so is a good starting point for regional long-range integration; otherwise, it is difficult to distinguish whether large-scale drift comes from the lateral boundary or from the interior. Based on these arguments, we examine the forecast with one member to check its bias and drift from the large-scale driving field. Thus, ensemble forecasts with different initial conditions are not considered here and should be examined in the future.

Figures 5a–e show the first-month mean (May 1998) of mean sea level pressure (MSLP), 850-hPa temperature, 700-hPa relative humidity, 500-hPa geopotential height, and 250-hPa wind speed from experiment 2 of CWB RSM in the simulation mode. The large-scale patterns in the driving fields from the CWB GSM are shown in Fig. 6. Generally speaking, the large-scale patterns of Figs. 5 and 6 agree well, but there are additional mesoscale features in Fig. 5. In addition, there are changes in magnitude, or biases, in CWB RSM compared with CWB GSM; for example, the invert frontal system is stronger in MSLP, the temperature is warmer at 850 hPa, the relative humidity is lower at 700 hPa, the height is higher at 500 hPa, and the wind is stronger at 250 hPa. The results for June are similar to those for May.

Next, 10-day means of the observations and forecasts are compared. Figures 7 and 8 show the 10-day means for May and June of Global Precipitation Climatology Project (GPCP) rainfall in mm day^{−1} and of analyzed wind fields at 850 hPa. They are compared with the 2-month model integration results from CWB RSM in Figs. 9 and 10. The GPCP heavy rainfall follows the evolution of a monsoon system, moving from the ocean north of Taiwan during early May to the Philippines during late May, and then back to north of Taiwan the next month. The corresponding 850-hPa analyzed jet follows the same path as the heavy rainfall. Nevertheless, the wind is stronger during June than during May, and it increases during the last 10-day mean of May.

Figure 9 shows the 10-day mean precipitation in mm day^{−1} from CWB RSM, which should be compared with GPCP in Fig. 7. Even though there are erroneous heavy rains along the lateral boundaries and over the Philippines, the major rainfall patterns are close to the observations in Fig. 7. CWB RSM also shows a trend of increased rainfall starting in the last 10-day mean of May. The evolution of the 850-hPa wind from CWB RSM shown in Fig. 10 clearly indicates a well-simulated monsoon evolution compared with the observations in Fig. 8. Even though the wind speed is stronger in CWB RSM, the evolution of the jet and the increase in wind speed during the last 10 days of May are almost in phase with the observations in Fig. 8. In summary, for the 2-month integration over this regional domain, CWB RSM is able to reproduce the observed features of the large-scale monsoon frontal evolution with enhanced mesoscale features.

### b. RMSE from 2-month integration in a forecast mode

In the simulation mode, CWB RSM shows a reasonable performance by following the monsoon evolution of the analysis. Since it is to be used with CWB GSM for monthly and seasonal forecasts, it is reasonable to check its performance in forecast mode. Again, the ensemble forecast is not the main theme in this paper, as mentioned. Instead, preliminary results from single-member forecast are presented to demonstrate the model performance in the forecast mode.

Figures 11a–d show the root-mean-square error (rmse) of CWB RSM and CWB GSM relative to the CWB analysis for mean sea level pressure in hPa, temperature in K at 850 hPa, height at 500 hPa, and height at 250 hPa in meters for a 2-month integration. The similar evolution of the rmse of CWB RSM and CWB GSM indicates that their large-scale features are close. Except for 500-hPa heights, CWB RSM has a smaller rmse than CWB GSM on average for mean sea level pressure, 850-hPa temperature, and 250-hPa height. Thus, in this case CWB RSM provides a better forecast of the large-scale pattern than CWB GSM for these fields, in addition to its intrinsic mesoscale features due to its higher resolution.
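For reference, the rmse of a forecast field against an analysis field can be sketched as below. This is the plain, unweighted definition over a flat list of grid values; whether the scores in Figs. 11 and 12 use area weighting, or over which part of the domain they are computed, is not stated in the text, so both are left out as an assumption.

```python
import math

def rmse(forecast, analysis):
    """Unweighted root-mean-square error between two fields on the same grid."""
    if len(forecast) != len(analysis) or not forecast:
        raise ValueError("fields must be non-empty and of equal length")
    squared = sum((f - a) ** 2 for f, a in zip(forecast, analysis))
    return math.sqrt(squared / len(forecast))
```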

As mentioned in the previous sections, one of the main advantages of having a regional model whose elements are consistent with those of CWB GSM is that it avoids lateral boundary error and reduces the effort of maintaining a new model. Improvements to CWB GSM translate automatically to CWB RSM, so the effort needed for model development and improvement is minimal. To demonstrate this advantage, NCEP RSM is used to perform the same experiment as CWB RSM, using the CWB GSM analysis as its initial field and the CWB GSM forecast at 12-h intervals as its base fields. Figures 12a–d show the rmse of CWB RSM (solid curves) and NCEP RSM (dotted curves) with respect to CWB GSM for mean sea level pressure in hPa, 850-hPa temperature in K, 500-hPa height in meters, and 250-hPa wind speed in m s^{−1}. Except for mean sea level pressure and upper-level wind speed, CWB RSM has a relatively smaller rmse than NCEP RSM with respect to CWB GSM. This indicates that CWB RSM performs better than NCEP RSM in terms of consistency with CWB GSM.

## 6. Conclusions and discussion

The CWB RSM has been developed based on the model structure, dynamics, and physics of CWB GSM and on the concept of spectral perturbation nesting of NCEP RSM. The coding style and the variable and subroutine names used in CWB GSM are adopted in CWB RSM. The computational steps and the model physics in CWB RSM are also adopted from CWB GSM. This should improve the readability of the code and reduce the human resources needed to maintain an additional model code. In addition, the existing NCEP RSM source code for the spectral perturbation computation accelerated the early development of CWB RSM. The distinctive features of NCEP RSM, which has been tested and found suitable for climate integration, are carried over into CWB RSM, which eases concerns about further development and about future problems in seasonal integration.

MPI has been implemented in CWB RSM to take advantage of the massively parallel processor (MPP) supercomputer at CWB. To keep the same model structure, the transposing method used in CWB GSM for parallelization with MPI has been adopted. Experience with implementing and running MPI on the ECMWF VPP5000 led us to implement 1D decomposition for a vector cluster machine, such as the VPP5000 used at CWB. The performance of 1D decomposition on the homemade PC Linux cluster indicates that more than 90% parallelization is possible on average. With a parallelization of 90%, the theoretical speedup approaches 10 times that of a single node as the number of processors becomes large, and it exceeds this limit if the parallelization is above 90%. Nevertheless, the results from the VPP5000 indicate that it has much more consistent performance and better speedup at a large number of processors than the PC Linux cluster.
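The relation between the degree of parallelization and the attainable speedup can be illustrated with Amdahl's law. The sketch below assumes that the "theoretical curve of parallelization" referred to in this paper has this form, which the text does not state explicitly. The function names and the processor counts are hypothetical, and no measured timings are reproduced here.

```python
import math


def amdahl_speedup(parallel_fraction, n_procs):
    """Speedup over one processor under Amdahl's law.

    parallel_fraction: share of the single-processor run time that can be
    parallelized (0.90 for 90% parallelization).
    n_procs: number of processors, at least 1.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


def speedup_limit(parallel_fraction):
    """Upper bound on the speedup as the number of processors grows."""
    serial_fraction = 1.0 - parallel_fraction
    return math.inf if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    for fraction in (0.90, 0.95):
        for n in (1, 5, 16, 64):
            print(f"p={fraction:.2f} N={n:3d} "
                  f"speedup={amdahl_speedup(fraction, n):6.2f}")
        print(f"p={fraction:.2f} limit={speedup_limit(fraction):.1f}")
```

Under this assumption, a 90% parallelization bounds the speedup at 10, and a 95% parallelization bounds it at 20, which is why raising the parallel fraction matters more than adding processors once the serial part dominates.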

The result of the 2-month integration in the simulation mode indicates that the model is able to reproduce the large-scale features in terms of frontal movement. In terms of rmse, the results of the 2-month integration in the forecast mode indicate that the CWB RSM forecast is as good as or better than that of CWB GSM. Nesting CWB RSM into CWB GSM produces much less drift than nesting NCEP RSM into CWB GSM. These results suggest that the implementation of CWB RSM was successful. The better results obtained with consistent model physics between CWB GSM and CWB RSM support the spectral perturbation nesting strategy advocated by Juang and Hong (2001).

These preliminary results indicate that the implementation of CWB RSM is complete, but some fine-tuning is still needed to improve forecast performance, and the effort to raise parallelization on the VPP5000 above 95% should continue for operational use. Merging CWB RSM into CWB GSM to form a CWB NSM (nested spectral model) may be the next step, along with fine-tuning, so that GSM and RSM can be integrated together operationally to save computational resources and avoid the complication of running two models separately. Furthermore, CWB NSM would leave only a single model to improve and maintain for regional and global seasonal forecasts.

## Acknowledgments

This work was initiated by the third author while the first author visited the Central Weather Bureau during the summer of 1998. The consistent and persistent effort of the second author made this work possible. We thank NCEP for allowing the CWB visit of about one month, and Mr. Chan-Hwa Lee, who performed the speedup test in the dedicated mode of the CWB Fujitsu VPP5000. We also thank Drs. Cheng-Hsuan Lu and Everette Joseph for their proofreading, and two anonymous journal reviewers for their constructive and instructive suggestions.

### APPENDIX

#### The Components of a “Homemade” PC Linux Cluster

For convenience, the initial development of this system used a homemade cluster of personal computers (PCs) running the Red Hat Linux operating system. This PC Linux cluster comprises five Toshiba 8000D PCs, each with an 800-MHz Pentium III, 512 MB of memory, 20 GB of storage, and a 100 Mb s^{−1} network interface card. A D-Link eight-port switch with 100 Mb s^{−1} Ethernet connects all five PCs as a local intranet. The Fujitsu FORTRAN 95 compiler is used with MPICH (Gropp et al. 1999) as the library binding for MPI calls. The remote shell (rsh) has to be enabled under a user account so that data communication among the PCs is allowed. All of these components are summarized in Table A1. (Note that MPICH is described in appendix B of Gropp et al. (1999), is freely available online at http://ftp.mcs.anl.gov/pub/mpi, and has been tested on heterogeneous platforms.)

In general, the PCs should have equivalent capability, and a higher-speed connection is preferable. The software components can be of any kind as long as they work together, for example, a UNIX operating system and FORTRAN with an MPI library, using rsh or ssh (secure shell) to communicate among the PCs. Finally, the executable code has to be accessible to all PCs, either by copying it to every PC or by exporting the disk to all PCs.

## Footnotes

*Corresponding author address:* Dr. Hann-Ming Henry Juang, W/NP, Room 201, WWBG, NOAA, 5200 Auth Road, Camp Springs, MD 20746-4304. Email: henry.juang@noaa.gov