The dynamical core of the Mesoscale Compressible Community (MC2) model is described. Ensemble forecast techniques for high-resolution mesoscale simulations are applied to assess the impact of floating point optimization, mathematics libraries, and processor configuration on forecast accuracy. It is shown that the iterative solver in the dynamical core is most sensitive to processor configuration, but it also shows weak sensitivity to the usage of fast mathematics libraries and floating point instruction reordering. Semi-implicit pressure solver errors are amplified in the physical parameterization package, which is sensitive to small pressure differences and feeds back to the dynamical solution. In this case, local rms spreads are around 1°C in temperature by the end of a 42-h forecast. It is concluded that careful validation is required when changing computing platforms or introducing fast mathematics libraries.
In recent years there has been a rapid shift in the weather forecasting and research communities away from the use of centralized forecast products toward running mesoscale models on a regional or even local basis. For example, both the University of Washington (UW) and the University of British Columbia (UBC) have set up real-time forecasting systems. The Pennsylvania State University–National Center for Atmospheric Research (Penn State–NCAR) fifth-generation Mesoscale Model (MM5) has been run in production mode at many sites (Mass and Kuo 1998), including UW, whereas the Mesoscale Compressible Community model (MC2) and Wisconsin Nonhydrostatic Modeling System are run at UBC. In both cases the model production runs are carried out on microprocessor-based shared or distributed-memory clusters (Baillie et al. 1997). To sustain a significant percentage of peak performance, code restructuring in combination with compiler optimizations and fast mathematics (“math”) libraries are required. The most aggressive optimizations and libraries must be applied with care because they may seriously degrade the accuracy of a fluid flow simulation if the model code or the compiler is not robust. It is for this reason that we believe ensemble analysis techniques can be useful in assessing the impact of optimization strategies on atmospheric models with complex physical parameterization packages.
In this note we apply ensemble forecast techniques to assess the effect of floating point optimizations, fast math libraries, and processor configuration on model error, which have been shown to dominate in the first few hours of forecast time (Stensrud et al. 2000; Orrell et al. 2002). Our study is based on the nonhydrostatic fully compressible MC2 limited-area atmospheric model. The MC2 is widely used in Canadian universities and at Environment Canada for mesoscale and microscale atmospheric research. A detailed description of the adiabatic kernel and numerical formulation of the MC2 model with open boundaries is given in Thomas et al. (1998). In particular, the model employs a semi-implicit semi-Lagrangian time discretization scheme. The use of a terrain-following height coordinate (Gal-Chen and Sommerville 1975) results in a nonsymmetric linear system to solve every time step for the pressure. An iterative minimal residual Krylov solver has thus been implemented, and the computational efficiency of the model using line relaxation preconditioners designed for both hydrostatic and nonhydrostatic flow regimes is presented. The use of an iterative solver on a parallel computer leads to spread in the ensemble errors, and this spread is amplified by the physical parameterization package in combination with fast math libraries and changing processor configurations.
Atmospheric models based on explicit time stepping schemes such as the Penn State–NCAR MM5 would not be subject to the same forecast error spreads because the pressure is bit-reproducible and does not depend on the processor configuration. Indeed, physical parameterizations are strictly column-based and do not require interprocessor communication for distributed-memory computation. However, several operational centers including Canada (Côté et al. 1998) use semi-implicit models based on iterative solvers, and our results indicate that model validation should also account for the variation in the pressure with processor configuration. To establish that the error spread in the physics is in fact due to the iterative pressure solver, a sensitivity study was performed using one processor by varying the iterative solver convergence tolerance.
2. MC2 model formulation
The MC2 is a nonhydrostatic fully compressible limited-area model formulated with respect to a horizontally homogeneous, time-invariant, and hydrostatically balanced base state reference atmosphere (subscript 0). Thermodynamic variables, temperature T, and normalized log pressure q = ln(p/p0) are decomposed into a sum of the base state and perturbations:
and the equation of state is p = ρRT. The gas constant for dry air is R, γ = cp/cυ and cp − cυ = R, where cp and cυ are specific heats at constant pressure and volume, ρ is density, z is geometric height, and g is gravitational acceleration. After introduction of the base state, the governing equations are given by
where v is the velocity, f is the Coriolis force, B = gT′/T0 is the buoyancy, and c0 is approximately the speed of sound.
The semi-implicit scheme results in an elliptic problem to solve every time step for a log pressure perturbation q′ about a stationary isothermal hydrostatic basic state:
This nonsymmetric system of equations resulting from a terrain-following height coordinate system is solved using the generalized minimal residual (GMRES) algorithm of Saad and Schultz (1986). Skamarock et al. (1997) use the mathematically equivalent generalized conjugate residuals (GCR) algorithm of Eisenstat et al. (1983) in a semi-implicit formulation of a compressible model. The solver convergence criteria are based on the rms divergence. Computational efficiency (overall wall-clock time) is improved by finding a suitable preconditioner, which is often problem dependent. The discretized elliptic operator in a nonhydrostatic pressure solver will be dominated by the vertical terms when the aspect ratio ΔX/ΔZ is large. Therefore, an effective preconditioning strategy is to invert the vertical components of the elliptic operator, and Skamarock et al. (1997) apply a vertical alternating direction implicit (ADI) line relaxation preconditioner.
A vertical ADI line relaxation preconditioner for the n × n linear system 𝗮x = b is based on the splitting 𝗮 = 𝗵 + 𝘃, where the 𝗵 and 𝘃 represent the horizontal and vertical components of the discrete elliptic operator based on centered second-order finite differences. The ADI iteration is derived from the pseudo–time integration of the heat equation ut = Δu + r to steady state, where the matrix 𝗮 represents the discrete Laplacian and
The largest possible pseudo–time step β is chosen so that the above integration scheme remains stable. A slightly more implicit scheme can be constructed using a line Jacobi relaxation scheme. For nonhydrostatic problems on isotropic grids, a fully 3D ADI preconditioner is implemented as in Skamarock et al. (1997). The solution of tridiagonal linear systems of equations implies global data dependencies; thus, a parallel data transposition strategy has been adopted in a distributed-memory implementation of the 3D ADI preconditioner.
For a distributed-memory model of parallel computation, the Ni × Nj × Nk computational grid is partitioned across a PX × PY logical processor mesh. A domain decomposition in the horizontal direction is employed because of the strong vertical coupling in physical parameterization packages and since the number of grid points in the vertical direction is typically one order of magnitude less than in the horizontal. Each processor therefore contains Ni/PX × Nj/PY × Nk points. For both algorithms the communication overhead associated with boundary data exchanges between subdomains is minimal when compared with computations. The 1D vertical ADI and Jacobi line relaxation preconditioners are also well suited to a horizontal decomposition, since the only global data dependency is in the vertical direction within tridiagonal solvers. However, the 3D ADI preconditioner requires global data in each of the three coordinate directions in order to solve tridiagonal linear systems of equations during each ADI sweep. Thus, the right-hand side b and solution xk must be remapped to perform line relaxations in each coordinate direction in turn. Such a remapping takes the form of a data transposition algorithm requiring collective all-to-all communication of O(N3) grid points. Vertical sweeps in the Z direction are performed using the domain decomposition described above. Each processor contains Ni × Nj/PY × Nk/PX grid points for sweeps in the X direction and Ni/PY × Nj × Nk/PX points in the Y direction. ADI sweeps progress from left to right and then right to left as indicated below with arrows representing communication steps:
To compare the computational efficiency of the model using the parallel 1D Jacobi and 3D ADI line relaxation schemes, a quasi-hydrostatic test case was run on a Silicon Graphics, Inc., (SGI) Origin computer with 16 processors. A 120 × 120 × 35 grid at 2.5-km horizontal resolution with model lid set at 23 km was employed in a 30-h mesoscale forecast over the British Columbia lower mainland. The integration consisted of 1800 time steps of length Δt = 60 s using both 1D and 3D preconditioners. The results are summarized in Table 1 using optimization level O3 (described in section 3). Despite the fact that the 3D ADI preconditioner results in a faster solver convergence rate, model execution times are close.
3. Ensemble analysis
The effect of compiler optimizations along with the use of fast math libraries on the accuracy of high-resolution mesoscale forecasts is an unexplored subject. One possible way of studying these effects is to generate an ensemble of forecast runs, regarding optimization and processor (PE) configuration as sources of model error. Given three optimization levels and four processors, two simple ensembles are generated. The same meteorological test case is run 12 times, without including physical parameterization schemes, but varying optimization levels and processor configurations. The 12 runs compose an ensemble, and standard analysis techniques can reveal error growth and forecast spread as a function of optimization, library usage, and floating point instruction reordering. A second ensemble is generated by running the MC2 with the external physics package to determine the relative contribution of model error from the dynamical core as compared with the physics. The use of different PE configurations and optimization levels are considered perturbations to the model, although they may be systematic rather than stochastic.
Three optimization levels were identified for the 195-MHz R10000 processor on an SGI Origin 2000 (in this paragraph, compiler option codes are given in parentheses). The first level of optimization (O1) allows some minimal code changes by the compiler. Interprocedural analysis (-IPA) is applied at the second level of optimization (O2) to improve both instruction and data cache usage through code movement. The most aggressive level of optimization (O3) permits floating point instruction reorderings that could adversely affect the precision of certain computations such as division. Fast math libraries (-lfastm) are also employed and precision may be reduced in exchange for speed. To evaluate these optimizations in the context of a real-time mesoscale forecasting system, the computational domain and related run-time parameters were taken from daily runs of the UBC ensemble forecasting system (http://spirit.geog.ubc.ca/wxfcst). Horizontal resolution is 10 km at 60°N with the model lid at 19 km. The grid contains 120 × 70 points on 35 terrain-following levels, implying the run is quasi-hydrostatic with ΔX/ΔZ = 20. Vertical line Jacobi preconditioning is therefore applied in all our tests.
Initial and boundary conditions are obtained from a coarse grid (ΔX = 30 km) MC2 run starting at 0000 UTC 27 November 1997. A 42-h integration requires 1260 time steps of length Δt = 120 s. For the ensemble that includes physics, the full Recherche en Prévision Numérique (RPN) physics package described in Mailhot et al. (1997) is used. A forecast initialized at 0600 UTC 27 November 1997 is chosen as a benchmark since it is a representative mesoscale case for the British Columbia lower mainland. In particular, it exhibits typical autumn and winter characteristics, including moist, southwesterly flow at low levels; a persistent upper-level trough offshore; and dynamically forced precipitation enhanced by steep topography. The dynamic forcing came from a weak surface cold frontal passage supported by a short-wave trough aloft. Upstream boundary wind speeds were on the order of 10 m s−1 at the surface. Wall-clock execution times (including input/output) are summarized in Tables 2 and 3. Note in particular that single processor performance does not exceed 65 million floating point operations per second (Mflops) or 15% of the rated peak performance of the microprocessor.
Forecast spread suggests model error resulting from choices in processor configuration and optimization level. Unlike many ensemble studies, a control run to characterize deterministic model error is not available. In ensemble forecasting, a control run is often the categorical forecast, initialized with a “best guess” analysis. A single processor run without any optimization could be taken as the control run since it may be what the model developers intended. Given codes of such complexity this is difficult to quantify. Alternately, the control can be any member of the ensemble, chosen randomly. However, one case study cannot robustly determine the statistical properties of the ensemble and a randomly chosen member has no significance. Ensemble forecasters often look for forecast spread and a reduced ensemble-averaged error. The performance of the ensemble mean per se is not of interest here and sensitivity to initial conditions is not an issue. Because this is a single case without a true control run, it is unwise to select a top performer in this experiment or to average the results looking for an improvement. However, significant deviation of one group of forecasts from the others would indicate the presence of model error.
Each ensemble member is initialized identically and common boundary conditions are applied throughout the 42-h integration period. The 0-h forecasts may deviate nominally from each other, because grid staggering and destaggering require calculations that may be handled differently. Boundary value determination is subject to the same errors and by examining the spatial distribution of forecast spread such effects can be quantified. Forecast spread is defined quadratically as
where ψ is any forecast variable, ψ is the ensemble mean, and N is the number of ensemble members.
This study evaluates temperature T and vertical velocity W, because those two parameters largely determine the state of the model. The mass distribution is mostly described by T, while W is unique for a given wind solution. The vertical structure of the spread and the impact of optimization and processor configuration on the RPN physics package are shown in Fig. 1. It is clear for both vertical velocity and temperature that the spread in the ensemble is greater when the model is run with the full-physics package (dashed line). The dynamics do not use the functions available in the fast math library, whereas the physics do in many places. When physics are included, the spread increases with height. Without physics the spread is greater in the middle of the troposphere. In all cases after 36 h of forecast time, the spread of the physics ensemble is at least 2 times as large as the spread of the no-physics ensemble, showing that the library and processor configuration differences affect the physics more than the dynamics. Spreads in Fig. 1 increase rapidly at the beginning of the forecasts and then grow slowly or even decrease in some cases. This behavior is consistent with observed model error that follows a square-root curve (Orrell et al. 2002). Later in the forecast, we would expect sensitivity to initial conditions to take over, giving exponential-like error growth (e.g., Errico and Baumhefner 1987). The small domain here will restrict such error growth, explaining the weak growth in Fig. 1.
Keeping track of how each ensemble member deviates from the ensemble mean gives us an idea of specifically how the fast math library and processor configuration affect the forecast. Figure 2 is a plot of domain-averaged absolute deviations for each member in the ensembles. Lower values indicate less contribution to the spread. Most of the error growth is contained within the first few hours of the forecast. Because of the large spread near the top of the troposphere, that level was chosen for demonstration, and only temperature deviations are shown. The ensemble with physics is on the left (Fig. 2a) and the dynamics-only ensemble is on the right (Fig. 2b). Note the different vertical axis scales and that the deviations are an order of magnitude greater when physics are included. Each configuration is plotted with a different line style, and all 12 members are plotted. Different optimization levels and libraries have nearly identical deviations from the mean, since the curves appear superimposed in Fig. 2. Thus, the processor configuration is almost completely responsible for the ensemble spread. In a global model, error growth is bounded by the climatological variance (Errico and Baumhefner 1987). Common boundary conditions in this ensemble run on a small, limited-area domain further restrict error growth. This results in an error saturation that is well below where it might occur on a global domain. Because the error growth is faster for the ensemble with physics included, saturation appears to be reached near where the curves are flat and the spread may have reached its limit as early as 24 h into the forecast (Fig. 2a). Conversely, many of the curves in Fig. 2b tend upward at the end of the forecast period, suggesting that the slower-growing spread in the dynamics-only ensemble has not yet reached its limit.
Plotting the spread for one forecast time without averaging shows that a variety of calculations performed in the MC2 are susceptible to processor configuration, but the physics are more sensitive. Figure 3 shows the temperature spread approximately 5 m above the surface (first model level) and 19 000 m above the surface for both ensembles, valid near the peak spreads in the physics ensemble (Fig. 2) at forecast hour 30. It is again clear that the spread of the ensemble with physics (Figs. 3a and 3c) is an order of magnitude larger than the ensemble without physics (Figs. 3b and 3d). The 5-m spreads (Figs. 3a and 3b) are smaller, corresponding with Fig. 1, and do not appear to be correlated with topography as might be expected. The mountain ranges in this domain run parallel to the coastline, while the spread pattern for the no-physics ensemble is oriented perpendicular to the coastline, and the physics-included ensemble spread does not have an organized pattern at all. Figure 3c shows the importance of boundary condition contamination, but the large upper-tropospheric spread patterns are also oriented with the topography. Most of these results are for temperature because it is a sensible weather parameter that is easy to understand. The vertical velocity behaves similarly (Fig. 1). Because precipitation processes are closely (and nonlinearly) tied to vertical velocity, these results imply that a large spread in forecast precipitation events could result in some cases, especially when it is weakly forced.
To estimate the relationship between topography and the spread of the ensembles, correlation coefficients between the spread of W and topography are calculated (Fig. 4). Topographically forced vertical velocity is handled entirely in the dynamical core of the MC2, regardless of the external physical parameterizations. Vertical velocities themselves are well correlated with topography, especially with cross-barrier flow as in this case. A strong correlation between the spread of vertical velocity and the topography would indicate that the numerics associated with vertical advection may introduce errors with different processor configurations. Figure 4 shows a positive correlation through most of the forecast, although the correlation is never particularly strong. As might be expected from Fig. 3, the correlation in the upper troposphere for the no-physics ensemble is the greatest and the no-physics correlations are in general higher than the correlations for the physics ensemble. The overall weak correlations suggest that the topographically forced vertical velocities in the model dynamics are not prone to large errors from processor configuration. The correlation reduction in the physics ensemble near the end of the forecast time also indicates that the spread in W is more closely connected with errors resulting from the physical parameterizations, which are decoupled from the topography.
The results presented here suggest that PE configuration is the primary source of error in the iterative pressure solution. The errors by themselves may be small. When a full physical parameterization package is included, the pressure errors propagate through the physics and positively feed back to degrade the full model solution. The RPN physics package treats each atmospheric column separately, so processor configuration should not directly affect the physics output. To further test this hypothesis, runs were made with different elliptic solver convergence tolerances. The runs were made on one PE to eliminate processor configuration as a source of error, varying the iterative solver convergence tolerances from ε = 10−4 to ε = 10−5. Both of these result in converged solutions, but some small differences will result. Again, runs were made both with and without physics, giving a total of four runs. The results are shown in Fig. 5, which can be compared with Fig. 1. Although the overall magnitude of the spreads is smaller, the fact that the physics results in an order-of-magnitude increase in spread over the dynamics-only case is similar.
Perhaps the best way to evaluate the impact of code optimization strategies for microprocessors on atmospheric flow simulations is by using theoretical test cases with analytic or converged solutions. In such cases, a loss of accuracy can be isolated and easily corrected. With a fully configured atmospheric model interfaced to complex physical parameterization packages, the task is more difficult. One possible approach is to view aggressive compiler optimizations, fast math library usage, and parallel-processor topology as possible sources of floating point errors or perturbations of the model and boundary conditions. Ensemble forecast techniques can then be applied to determine the nature of the error growth that may result from these configuration options. If we interpret ensemble spread as the potential for model error from compiler optimization, fast math libraries, and processor configuration, the feedback between the iterative pressure solution and the physics can generate significant errors in the forecasts. Spread from an ensemble with only dynamics is an order of magnitude smaller than a run with the full RPN physics package. The deviations between ensemble members that use the fast math libraries and those that do not are nominal, showing that the spread is primarily a result of processor configuration. This is to be expected for the dynamics-only case.
The iterative solver is the main source of differences, because the pressure is not bit-reproducible because of a global summation when computing inner products in the iterative solver. However, changing processor configurations while using the physics package has the potential to degrade the forecast further. For a realistic forecast, the physics must be included, so it is up to the forecaster to determine whether the errors are within acceptable bounds. In this case, the implied model errors averaged over the domain were relatively small and on the order of 0.5°C in temperature. Additional study is required to quantify the effects that the implied vertical velocity errors can have on sensible weather parameters such as precipitation. Errors of this magnitude may only be important in select cases.
Other published model verification studies can help to put these errors into context. Stensrud et al. (1999) evaluated Eta Model forecasts produced at the National Centers for Environmental Prediction (NCEP), run on a regional domain with ΔX = 29 km. For 12-h forecasts, they found mean absolute errors (MAEs) in temperature that ranged from 0.93° to 1.66°C between 85.0 and 10.0 kPa. At 36 h, the range of errors was from 1.16° to 1.73°C. Hou et al. (2001) verified an ensemble designed to approximate forecast error resulting from initial-condition error and found mean 85.0-kPa temperature errors between 1.0° and 2.0°C at 36 h. Both of those studies used larger domains than this study. With more recognition that model error is large even at short ranges (Stensrud et al. 2000, Orrell et al. 2002), it is important to understand its sources, including those related to processor topology and/or floating point performance. The results here suggest that processor topology can produce errors that are 10%–25% of the magnitude of total initial-condition and model error. These results are conservative because of the small domain. For a larger domain, it is expected that the errors could grow to larger values and they are not limited by the boundary conditions as they are here. Also, the suggested feedback between the pressure solution and the physics solution means that other sources of model error, such as incorrect physical parameters, may result in greater model error than different PE configurations.
It must be noted that experiments performed with an earlier math library and operating system resulted in errors on the order of 5°C, which are large enough to be important most of the time. Those differences were caused primarily by the use of the fast math library. In other words, these results do not extend to all computing platforms. Our results also show that the influence of topography on spread is weak. This was examined by looking at the spread in vertical velocity, because vertical velocity itself is expected to be highly correlated with topography. The no-physics ensemble is generally better correlated with topography. The physics and no-physics ensembles behave differently, suggesting that other factors are at least as important.
These results have implications for investigators trying to verify a numerical weather prediction model over one or more seasons. It is recognized that a proper verification will cover a period during which the model is static. That is to say the physics and the dynamical core do not change over the verification period. If this is important, the number of processors must remain static as well. Within this context, there is no way to tell which forecast is the most accurate because we have not included any information about model biases, observations, or observation errors. Regardless, the results show that processor configuration, and possibly the use of fast math libraries, can lead to forecast error that should be considered when a new compiler, operating system, or fast math libraries are installed.
We thank SGI, NCAR, and the Canadian Meteorological Centre for providing dedicated computer time. The comments of three anonymous reviewers have greatly improved the manuscript.
Corresponding author address: Dr. Stephen J. Thomas, Scientific Computing Division, National Center for Atmospheric Research, 1850 Table Mesa Dr., Boulder, CO 80303. Email: email@example.com
* The National Center for Atmospheric Research is sponsored by the National Science Foundation.