## 1. Introduction

The observations from the global positioning system (GPS) radio occultation (RO) limb sounding technique have proven to be a valuable source of atmospheric data for numerical weather prediction (NWP) and climate research (Kuo et al. 1998; Zou et al. 1999, 2000; Liu and Zou 2003; Healy et al. 2005; Huang et al. 2005; Cucurull et al. 2006; Healy and Thepaut 2006; Cucurull and Derber 2008). GPS RO data have several advantages, such as requiring no calibration, being unaffected by clouds and precipitation, and providing uniform global coverage. In the middle to upper troposphere, the GPS RO refractivity profiles have accuracy comparable with or better than that of radiosondes (Kuo et al. 2005). Since the launch of the Constellation Observing System for Meteorology, Ionosphere and Climate (COSMIC) mission in 2006, approximately 1500–2500 globally distributed GPS RO soundings have been provided per day in near–real time. The COSMIC GPS RO soundings are currently being used at several global operational NWP centers, including the National Centers for Environmental Prediction (NCEP; Cucurull and Derber 2008), the European Centre for Medium-Range Weather Forecasts (ECMWF; Healy 2008), the Met Office (UKMO; Rennie 2010), and Météo-France (Poli et al. 2009).

Because of the success of COSMIC, U.S. agencies and Taiwan have decided to move forward with a follow-up RO mission [called Formosa Satellite Mission 7 (FORMOSAT-7)/COSMIC-2] that will launch six satellites into low-inclination orbits in early 2016, and six satellites into high-inclination orbits in early 2018. The COSMIC-2 mission will provide nearly an order of magnitude more RO atmospheric soundings that will greatly benefit the research and operational communities (http://www.cosmic.ucar.edu/cosmic2/).

Depending on the level of data processing, various variables can be retrieved from GPS RO observations for use in data assimilation, such as bending angles, refractivities, and retrieved moisture/temperature profiles (see Kuo et al. 2000, 2004). To account for variations in the atmospheric state along the GPS ray paths, Sokolovskiy et al. (2005) introduced the nonlocal excess phase operator, which has been shown in several studies to significantly improve the assimilation of GPS RO data (Sokolovskiy et al. 2005; Liu et al. 2008; Chen et al. 2009; Ma et al. 2009; Shao et al. 2009). However, because of the high computational cost and the parallelization difficulties associated with the nonlocal operator, it has been tested only in research configurations with a limited number of cases. A parallel implementation of GPS RO data assimilation with a nonlocal operator is urgently needed to advance its applications in both research and operational data assimilation systems.

Because both the local refractivity and nonlocal excess phase operators have been implemented in the data assimilation system for the Weather Research and Forecasting Model (WRFDA; Barker et al. 2012; Chen et al. 2009), the three-dimensional variational data assimilation (3DVAR) approach in WRFDA will be used throughout this paper to demonstrate the parallelization of the GPS nonlocal operator. We believe that the parallel strategies for the nonlocal operator are general and applicable to other parallel data assimilation systems (such as the four-dimensional variational data assimilation approach and the ensemble Kalman filter).

This article is organized as follows. In section 2, we briefly introduce both the local refractivity operator and the nonlocal excess phase operator implemented in WRFDA and their computational costs. Section 3 provides the technical details of how the nonlocal GPS RO operator is parallelized in a domain-decomposition parallel context. The strategy to solve the load imbalance problem is presented in section 4. Section 5 presents some further optimization to save the computational cost. The summary and discussion are given in section 6.

## 2. Local and nonlocal GPS RO operators

The local refractivity *N* (in *N* units) is calculated from the model state with the Smith and Weintraub (1953) formula

*N* = 77.6(*p*/*T*) + 3.73 × 10^{5}(*e*/*T*^{2}), (1)

where *e* = *pq*/(0.622 + 0.378*q*) is the water vapor pressure (hPa),

*p* is the total atmospheric pressure (hPa),

*T* is the atmospheric temperature (K), and

*q* is the specific humidity (kg kg^{−1}).

The local GPS RO refractivity operator models the observed refractivity as the local refractivity at the perigee point where the GPS ray is closest to the earth. The background fields of *p*, *T*, and *q* are interpolated horizontally and vertically to the perigee point of the RO observation, and Eq. (1) is used to calculate the local refractivity. The local GPS RO refractivity operator is relatively simple and the computational cost is low.
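The local operator can be sketched in a few lines of plain Python (illustrative values only, not WRFDA code); the conversion from specific humidity to water vapor pressure is the standard relation assumed here:

```python
def refractivity(p_hpa, t_k, q_kgkg):
    """Smith-Weintraub refractivity (N units) from pressure (hPa),
    temperature (K), and specific humidity (kg/kg)."""
    # Water vapor partial pressure (hPa) derived from specific humidity.
    e_hpa = p_hpa * q_kgkg / (0.622 + 0.378 * q_kgkg)
    # Dry term plus wet term, per Eq. (1).
    return 77.6 * p_hpa / t_k + 3.73e5 * e_hpa / t_k**2

# Illustrative mid-troposphere state interpolated to a perigee point.
n_local = refractivity(p_hpa=700.0, t_k=270.0, q_kgkg=0.003)
```

For this state the dry term dominates (about 201 *N* units) and the wet term adds roughly 17 *N* units, consistent with typical mid-tropospheric refractivities.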

The nonlocal operator models the observed GPS excess phase *S*, defined as

*S* = 10^{−6} ∫ *N* d*l*, (2)

where *l* is the ray path and *N* is the refractivity (Sokolovskiy et al. 2005). Our implementation proceeds in three steps. First, we obtain the retrieved vertical profiles from the COSMIC Data Analysis and Archive Center (CDAAC). One radio occultation includes several ray paths; thus, for a single GPS profile there are several tangent points, with their latitudes, longitudes, heights, associated refractivities, the azimuth of the incoming ray, etc. The smearing of the observation along the ray path is accounted for by the nonlocal excess phase operator by construction. Second, we calculate the model mean heights by averaging the heights of all the model grid points on each background model vertical level. The observed refractivity at each model mean height is calculated as the vertical average of the observed refractivities near that mean height level. We believe that such a vertical average retains enough signal from the observations while discarding detail that the model cannot resolve (Chen et al. 2009). Last, the averaged refractivities (i.e., the observations) at the model mean heights are integrated along the ray path under the assumption of spherical symmetry, yielding the observed excess phases via Eq. (2). Similarly, the model refractivities are integrated along the ray path to obtain the model-simulated *S*. For the details of the implementation of the nonlocal operator, please refer to Chen et al. (2009), Ma et al. (2009), and Liu et al. (2008).
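In practice the integral in Eq. (2) is evaluated discretely along a sampled ray path. A minimal sketch (plain Python, not the WRFDA code) using the trapezoidal rule:

```python
def excess_phase(n_along_ray, dl_m):
    """Discretized Eq. (2): S = 1e-6 * integral of N dl, with refractivity
    samples n_along_ray (N units) spaced dl_m meters apart along the ray
    path, summed with the trapezoidal rule."""
    s = 0.0
    for i in range(len(n_along_ray) - 1):
        s += 0.5 * (n_along_ray[i] + n_along_ray[i + 1]) * dl_m
    return 1e-6 * s

# Sanity check: constant N = 300 N units over a 100-km ray
# gives an excess phase of 30 m.
s_test = excess_phase([300.0] * 101, dl_m=1000.0)
```

The 10^{−6} factor converts the dimensionless *N* units of refractivity back to a path delay in the same units as d*l*.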

Compared to the local GPS refractivity operator, the computational cost of the nonlocal excess phase operator is increased dramatically because of both the ray-path-searching algorithm and the integration of refractivity along the GPS ray. Liu et al. (2008) report that the nonlocal operator costs at least 100 times more than the local operator on a Linux cluster. Moreover, the nonlocal operator is difficult to parallelize; the difficulty lies in the integral nature of Eq. (2) along the GPS ray path, which may pass through multiple subdomains of a decomposed model domain. To demonstrate the need to parallelize the GPS RO nonlocal operator, an Antarctic domain with 30-km horizontal resolution, shown in Fig. 1, is chosen to illustrate the computational cost of the GPS RO operator. This is an Advanced Research WRF (Skamarock et al. 2008) domain with a 401 × 401 mesh size and 55 vertical layers between the surface and 10 hPa. In addition to GPS RO data, we also assimilate conventional observational data in this case, such as radiosonde, surface, and ship observations. Because the computational cost of assimilating the conventional data is trivial compared to that of the GPS RO data with a nonlocal operator, the following discussion focuses on the GPS RO data only. Figure 1 shows the locations of the 106 GPS RO profiles within a ±3-h window centered at 1800 UTC 11 December 2007. The 106 GPS RO profiles are assimilated with WRFDA 3DVAR on the National Center for Atmospheric Research (NCAR)’s supercomputer Yellowstone (http://www2.cisl.ucar.edu/resources/yellowstone), and the wall clock time of five iterations of minimization is recorded to demonstrate the computational costs. With a single processing core, it takes only 146 s to assimilate the 106 GPS RO profiles with the local refractivity operator. However, 12 762 s (≈3.5 h) are needed to run five iterations with the nonlocal excess phase operator.
The ratio of the computational cost of the nonlocal operator to that of the local operator is about 87:1 for this case. For a production 3DVAR run with approximately 60 iterations (assuming two outer loops), the wall clock time for running this case serially would be more than 42 h on Yellowstone. Therefore, the cost of using a nonlocal operator with a single processing core is unaffordable, and the parallelization of the GPS RO nonlocal operator in the data assimilation system is critical for its applications in either research or operations.
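The 42-h figure follows directly from scaling the measured five-iteration time; a quick check of the arithmetic:

```python
five_iter_s = 12762.0            # measured wall clock: 5 iterations, 1 core
per_iter_s = five_iter_s / 5.0   # ~2552 s per minimization iteration
production_iters = 60            # ~60 iterations, assuming two outer loops
serial_hours = production_iters * per_iter_s / 3600.0  # just over 42 h
```

The same scaling applied to the local operator (146 s for five iterations) gives under half an hour, which is why only the nonlocal operator needs parallelization.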

## 3. Parallelization strategy

Most data assimilation systems employ the horizontal domain-decomposition method for parallel processing. In data assimilation systems, an observation is usually assigned to the subdomain where it is geographically located. The difficulty in parallelizing the nonlocal operator in the domain-decomposition context is rooted in the nonlocal integral nature of Eq. (2) along the ray paths at all vertical levels above the tangent point and below the model top. The ray path might intercept several subdomains located on different processing cores, and each processing core is only aware of the atmospheric states of its assigned subdomain. Apparently, the most suitable strategy to parallelize the nonlocal GPS RO operator is the ray-path-wise distribution among processing cores (Zhang et al. 2004), which distributes the workload among processing cores equally based upon the number of the GPS ray paths, instead of the geographic location of the observations. However, in our parallelization strategy, we must consider the existing domain-decomposition method to minimize the implementation cost.

Although Eq. (2) is an integration of the refractivity along the ray path, which might go through the whole model domain, we noticed that only one derived variable, the model-simulated refractivity (*N*_{mod}), is used for the integration. Therefore, the key to the parallelization strategy is to let every processing core know the global *N*_{mod} for the whole domain. If every processing core (subdomain) is aware of the global *N*_{mod}, then the integration along the whole ray path can be performed by the processing core associated with the observation. There are two steps in the parallel implementation: first, each processing core calculates *N*_{mod} for every grid point within its subdomain; second, parallel collective communications (such as mpi_allgatherv) are used to collect the locally calculated *N*_{mod} from each subdomain into a global *N*_{mod} array, which is then available on every processing core. The costs of this strategy are the additional memory storage for a couple of global arrays on each processing core plus the collective communications in each iteration.
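In MPI terms, the two steps are a local computation followed by an mpi_allgatherv of the subdomain *N*_{mod} arrays. The collective can be sketched in plain Python, with list entries standing in for the cores (no actual MPI, and the per-gridpoint calculation is reduced to a dry-term stand-in):

```python
def compute_local_nmod(subdomain_state):
    # Stand-in for the per-gridpoint refractivity calculation on one core.
    return [77.6 * p / t for (p, t) in subdomain_state]

# Four "cores," each owning a slice of a (here 1D) decomposed domain.
subdomains = [[(1000.0, 280.0)], [(900.0, 275.0)],
              [(800.0, 270.0)], [(700.0, 265.0)]]

# Step 1: every core computes N_mod on its own subdomain.
local_nmod = [compute_local_nmod(s) for s in subdomains]

# Step 2: allgatherv semantics -- every core ends up with the same global
# array, so any core can integrate along a ray crossing any subdomain.
global_nmod = [n for local in local_nmod for n in local]
nmod_on_every_core = [list(global_nmod) for _ in subdomains]
```

The memory cost is visible here: each "core" holds a full copy of the global array, which is the price paid for making the ray-path integration purely local.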

Figure 2 shows the parallel wall clock timing results with up to 512 processing cores on NCAR’s Yellowstone for three experiments. The red bars represent the experiment “Parallel,” which shows the parallel performance of the strategy implemented in this section; the green and blue bars show the parallel performance of the load balance strategy (described in section 4) and the additional optimization (described in section 5), respectively. With the above-mentioned parallelization strategy (experiment Parallel), the wall clock time of a five-iteration minimization is reduced from around 3.5 h for a serial run to 279 s (≈4.5 min) with 512 processing cores. To ensure the correctness of the parallel implementation, the values of the cost functions and gradients from the parallel and serial runs are compared; the two runs produce results identical to 14 significant digits, and the tiny differences are machine round-off errors.

The wall clock times for a five-iteration minimization of 3DVAR on NCAR Yellowstone.

Citation: Journal of Atmospheric and Oceanic Technology 31, 9; 10.1175/JTECH-D-13-00195.1

Figure 3 shows the parallel speedup, that is, the ratio of the wall clock time of the serial run to that of each parallel run. The black line is the linear speedup and represents the ideal acceleration when multiple processing cores are used. The red line represents experiment Parallel. The speedup with 512 processing cores is only 46 times. One may argue that the added collective communications might be part of the cause of this strategy not being cost effective. However, parallel profiling shows that the added collective communications cost only 1.61 s with 512 processing cores, so the overhead of a few more collective communications in this case is negligible. Therefore, a careful analysis of the parallel algorithm of the GPS RO nonlocal operator is needed to understand the reasons.
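With speedup taken as serial wall clock time divided by parallel wall clock time, the measured times give:

```python
t_serial_s = 12762.0   # 1 core, five-iteration minimization (section 2)
t_parallel_s = 279.0   # 512 cores, experiment "Parallel"
speedup = t_serial_s / t_parallel_s   # ~46, far below the ideal 512
```

A speedup of 46 on 512 cores corresponds to a parallel efficiency below 10%, which is what motivates the load balance analysis of section 4.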

As in Fig. 2, but for the parallel speedup.

Citation: Journal of Atmospheric and Oceanic Technology 31, 9; 10.1175/JTECH-D-13-00195.1

## 4. Load balance

In section 3, we emphasized that we have to consider the existing horizontal domain-decomposition method for the GPS RO data distribution, which means that each GPS RO profile is assigned to a subdomain based on its geographic location. However, the locations of the GPS RO profiles are not fixed and change with every occultation. The geographic distribution of the GPS RO profiles is not even (see Fig. 1), and it is very likely that some processing cores or subdomains get more GPS RO profiles than others. Because the computational cost of the nonlocal operator is very high, the load imbalance problem becomes serious with the uneven distribution of the GPS RO data. One may deduce that the overall performance is determined by the workload of the processing core that has the most profiles to process. Figure 4 shows the variation of the maximum number of assigned profiles per subdomain with the number of processing cores (red bars represent experiment Parallel). Because of the uneven geographic distribution of the 106 GPS RO profiles, even with 512 processing cores there is still 1 out of 512 subdomains that covers two GPS RO profiles, and most of the processing cores are idle. A visual comparison of Figs. 2 and 4 for experiment Parallel suggests that the parallel performance is highly correlated with the maximum number of observations per processing core/subdomain. This implies that the load imbalance is the bottleneck of the parallel performance. It is impossible to achieve a desirable speedup without solving the load imbalance issue, and assigning observations to processing cores based on geographic location is not a good choice for GPS RO data with a nonlocal operator.
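The bottleneck can be quantified with a simple model (an idealization, not a measurement from the paper): if the nonlocal operator dominates the cost, the parallel time is roughly proportional to the maximum number of profiles on any one core, so the achievable speedup is bounded by the total profile count divided by that maximum load:

```python
n_profiles = 106               # profiles in the Antarctic case
max_profiles_on_one_core = 2   # geographic assignment, 512 cores (Fig. 4)

# Serial time ~ n_profiles work units; parallel time ~ max load on one core.
speedup_bound = n_profiles / max_profiles_on_one_core   # bound of 53
```

The bound of 53 is consistent with the observed speedup of 46 on 512 cores: the single core holding two profiles, not the core count, sets the pace.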

The maximum number of observations per processing core.

Citation: Journal of Atmospheric and Oceanic Technology 31, 9; 10.1175/JTECH-D-13-00195.1

Among the load balance algorithms, round-robin scheduling stands out as both simple and easy to implement. The algorithm assigns observations to the processing cores in order, treating all processing cores equally and without priority. Taking advantage of the parallel strategy implemented in section 3, each processing core is aware of the global *N*_{mod}, which means that any processing core can perform the integration of Eq. (2) along any ray path of any profile. The processing of a GPS RO profile does not have to be bound to a specific processing core/subdomain. Therefore, the load balance strategy is to distribute the GPS RO data profile by profile among processing cores in a round-robin manner, so that each processing core has roughly an equal number of profiles to process.
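The round-robin assignment amounts to sending profile *i* to core *i* mod *P*; a sketch in plain Python:

```python
def round_robin_counts(n_profiles, n_cores):
    """Number of profiles each core receives under round-robin scheduling."""
    counts = [0] * n_cores
    for i in range(n_profiles):
        counts[i % n_cores] += 1   # profile i goes to core i mod n_cores
    return counts

# 106 profiles on 128 cores: 106 cores get one profile each, 22 get none,
# so the maximum load per core is 1 -- the best achievable balance.
counts = round_robin_counts(n_profiles=106, n_cores=128)
```

By construction the per-core counts differ by at most one, regardless of where the profiles happen to be located geographically.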

The experiment “Loadbalance” in Fig. 2 shows the wall clock time spent with up to 128 processing cores. Note that 106 is the minimum number of processing cores needed for this case to reach the maximum theoretical parallel efficiency, since there are 106 GPS RO profiles. With 128 processing cores, the wall clock time is reduced to 162 s (≈2.5 min) for a five-iteration minimization (the timing results with 106 cores are the same as with 128 cores). Compared to 492 s with 128 processing cores and 279 s with 512 processing cores in section 3, the load balance strategy tremendously increases the parallel efficiency of the nonlocal excess phase GPS RO operator. The experiment Loadbalance in Fig. 3 shows that the speedup with 128 processing cores is 80 times; more precisely, the speedup with 106 processing cores is 80 times. Without the load balance strategy, the speedup with 128 processing cores is only 33 times. Again, the high correlation between the wall clock times and the maximum number of observations per processing core for experiment Loadbalance in Figs. 2 and 4 confirms the importance of the load balance strategy in the parallel assimilation of GPS RO data with a nonlocal operator. Given the existing parallel overhead in the WRFDA system, this also confirms that the additional collective communications do not visibly impact the parallel performance.

One may argue that the effective number of processing cores is constrained by the number of GPS RO profiles and that some processing cores will be idle if the number of allocated processing cores is greater than the number of profiles. Note, however, that in real-world data assimilation we also assimilate other observation types. Most of the observation types are distributed to subdomains based on their geographic locations, so there is very little chance for a processing core to be truly idle. The GPS RO profiles are distributed based on their locations as well, unless the load balance switch is on.

## 5. Further optimization

In variational data assimilation methods, not only is the observation operator needed to calculate the innovation, but also the corresponding tangent linear and adjoint observation operators are required during the minimization. The tangent linear and adjoint operators are used to evolve the perturbations forward and to propagate the cost function gradients backward along the basic trajectory, respectively. Therefore, a common practice to save computational cost is to record the basic trajectory rather than recompute it. In terms of the nonlocal GPS RO operator implementation in the WRFDA system (Chen et al. 2009), the locations of ray paths are recorded during the innovation calculation and the location of each ray path is restored for the calculations of the tangent linear and adjoint operators. The experiment “Optimization” in Fig. 2 shows the wall clock timing results with this further optimization. On average, an additional 4%–10% acceleration is obtained.
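The optimization amounts to memoizing the ray-path geometry computed during the innovation calculation so that the tangent linear and adjoint passes reuse it instead of repeating the search. A schematic sketch (class and method names are illustrative, not WRFDA's):

```python
class RayPathCache:
    """Record ray-path locations once per profile; TL/AD passes reuse them."""
    def __init__(self):
        self.paths = {}
        self.searches = 0   # counts invocations of the expensive search

    def get_path(self, profile_id):
        if profile_id not in self.paths:
            self.searches += 1
            # Stand-in for the expensive ray-path-searching algorithm;
            # a real path would be a list of (lat, lon, height) points.
            self.paths[profile_id] = [(profile_id, level) for level in range(3)]
        return self.paths[profile_id]

cache = RayPathCache()
# The same path is requested by the innovation, TL, and AD computations.
for stage in ("innovation", "tangent_linear", "adjoint"):
    path = cache.get_path(profile_id=7)
```

The search runs once; the two subsequent calls restore the stored path, which is where the reported 4%–10% saving comes from.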

## 6. Summary and discussion

The nonlocal excess phase operator for GPS RO data has been demonstrated to be a robust model to simulate the observed GPS excess phase from the model states. However, because of the nonlocal nature of the algorithm, which includes integration along the GPS ray path across the model domain, it is not easy to implement this observation operator in data assimilation systems parallelized based on the popular horizontal domain-decomposition method. Therefore, it has been tested in some research configurations with only a relatively small number of observations.

To parallelize the GPS RO nonlocal excess phase operator, the first strategy is to make each processing core aware of the global simulated model refractivity, which is the only variable needed for the nonlocal operator and is calculated in advance. Thus, each processing core can process the GPS RO profiles geographically located within its subdomain, integrating along the whole ray path at each vertical level. However, the performance analysis reveals that the load imbalance associated with the default geographic observation distribution among processing cores seriously constrains the parallel efficiency. Leveraging the implementation of the first strategy, GPS RO profiles can alternatively be distributed among processing cores in a round-robin manner, which ensures the best possible load balance with the available computing resources. The demonstration case with 106 GPS RO profiles over an Antarctic domain shows that the wall clock time for the five-iteration minimization with WRFDA was reduced from about 3.5 h with one processing core to approximately 2.5 min with 128 processing cores. This is affordable for both research and operational practices. For the COSMIC-2 mission, nearly an order of magnitude more RO atmospheric soundings are expected, and the parallel implementation and load balance strategy described in this paper should scale well with the number of processing cores because of the high correlation between the maximum number of observations per processing core and the parallel performance.

The “best” parallel strategy will certainly be application dependent. The parallel strategies presented in this paper are identified as the most suitable for the WRFDA system in terms of the code modification, algorithm changes, and parallel performance. There are other approaches we considered and explored at an early stage, and they might be applicable for other data assimilation systems.

The first alternative parallel strategy is to distribute the GPS RO profiles among the subdomains/processors based on the location of the tangent point. For a given ray path of a profile, each subdomain only needs to compute the fraction of the excess phase corresponding to that subdomain, and these contributions would be added together across the subdomains through which the ray passes to obtain the final result of Eq. (2). This method requires calculating the locations of the ray paths of each profile in advance, and also broadcasting the point-by-point locations of the ray paths to all the subdomains/processors, so that every subdomain/processor knows which ray paths pass through its domain. It would involve substantial code modification and workflow changes in the WRFDA system, as well as complicated parallel communication of the ray-path locations.
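The partial-sum idea can be sketched as each subdomain integrating only its own segment of the ray, with a final reduction (here a plain sum standing in for an MPI reduce; a simple rectangle rule replaces the real integration):

```python
def segment_phase(n_samples, dl_m):
    # Excess-phase contribution (m) of one subdomain's segment of the ray,
    # using a rectangle rule: 1e-6 * sum(N) * dl.
    return 1e-6 * sum(n_samples) * dl_m

# A ray crossing three subdomains, each holding part of the N samples.
segments = [[300.0] * 40, [250.0] * 30, [200.0] * 30]
partials = [segment_phase(seg, dl_m=1000.0) for seg in segments]
total = sum(partials)   # reduction across subdomains

# Summing the per-subdomain partials is equivalent to integrating the
# full ray at once, because the integral is additive over segments.
full_ray = [n for seg in segments for n in seg]
assert abs(total - segment_phase(full_ray, dl_m=1000.0)) < 1e-12
```

The additivity of the integral is what makes this strategy correct; the difficulty lies entirely in distributing the ray-path geometry and orchestrating the reductions.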

Since different GPS RO profiles may include different numbers of ray paths at vertical levels above the tangent point and below the model top, another strategy is to distribute the GPS RO data ray by ray, rather than profile by profile, among processing cores in a round-robin manner. This may lead to a better load balance, as it increases the number of work units available for distribution to the processors. Because of the constraints of the existing workflow, implementation of the ray-path-wise distribution would require some substantial code changes in WRFDA, and we will explore this potential improvement in the future. Given the anticipated significant increase in GPS observations with the launch of the COSMIC-2 mission, the current parallel strategy based on profile distribution might work fine. For developments from scratch, however, the ray-path-wise distribution is the preferred approach to parallelize GPS RO data assimilation with the nonlocal operator.
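The effect of ray-wise distribution can be seen by flattening the (profile, ray) pairs before the round-robin; a sketch under assumed (hypothetical) ray counts:

```python
def max_load(ray_counts, n_cores, per_ray):
    """Max rays assigned to any core, distributing either whole profiles
    or individual rays round-robin. ray_counts[p] = rays in profile p."""
    loads = [0] * n_cores
    if per_ray:
        # Flatten to one work unit per ray, then deal them out in order.
        rays = [(p, r) for p, n in enumerate(ray_counts) for r in range(n)]
        for i in range(len(rays)):
            loads[i % n_cores] += 1
    else:
        # Whole profiles dealt out in order; a big profile stays together.
        for p, n in enumerate(ray_counts):
            loads[p % n_cores] += n
    return max(loads)

# Three profiles with very unequal ray counts on two cores.
ray_counts = [100, 10, 10]
profile_wise = max_load(ray_counts, n_cores=2, per_ray=False)  # 110 rays
ray_wise = max_load(ray_counts, n_cores=2, per_ray=True)       # 60 rays
```

With profile-wise distribution, one core inherits the 100-ray profile plus another; ray-wise distribution splits the same work almost exactly in half, which is why it is the preferred approach for a from-scratch design.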

## Acknowledgments

This work is supported by the National Science Foundation to UCAR for the continued operation of the COSMIC mission, with Grant AGS-1033112, and by the National Space Organization through a UCAR–NSPO AIT–TECRO Agreement. Michael Kavulich helped edit the manuscript. The authors thank three anonymous reviewers for their careful reviews of this paper.

## REFERENCES

Barker, D. M., and Coauthors, 2012: The Weather Research and Forecasting (WRF) Model’s community variational/ensemble data assimilation system: WRFDA. *Bull. Amer. Meteor. Soc.*, **93**, 831–843, doi:10.1175/BAMS-D-11-00167.1.

Chen, S.-Y., C.-Y. Huang, Y.-H. Kuo, Y.-R. Guo, and S. Sokolovskiy, 2009: Assimilation of GPS refractivity from FORMOSAT-3/COSMIC using a nonlocal operator with WRF 3DVAR and its impact on the prediction of a typhoon event. *Terr. Atmos. Oceanic Sci.*, **20**, 133–154, doi:10.3319/TAO.2007.11.29.01(F3C).

Cucurull, L., and J. C. Derber, 2008: Operational implementation of COSMIC observations into NCEP’s Global Data Assimilation System. *Wea. Forecasting*, **23**, 702–711, doi:10.1175/2008WAF2007070.1.

Cucurull, L., Y.-H. Kuo, D. Barker, and S. R. H. Rizvi, 2006: Assessing the impact of simulated COSMIC GPS radio occultation data on weather analysis over the Antarctic: A case study. *Mon. Wea. Rev.*, **134**, 3283–3296, doi:10.1175/MWR3241.1.

Healy, S. B., 2008: Forecast impact experiment with a constellation of GPS radio occultation receivers. *Atmos. Sci. Lett.*, **9**, 111–118, doi:10.1002/asl.169.

Healy, S. B., and J. N. Thepaut, 2006: Assimilation experiments with CHAMP GPS radio occultation measurements. *Quart. J. Roy. Meteor. Soc.*, **132**, 605–623, doi:10.1256/qj.04.182.

Healy, S. B., A. M. Jupp, and C. Marquardt, 2005: Forecast impact experiment with GPS radio occultation measurements. *Geophys. Res. Lett.*, **32**, L03804, doi:10.1029/2004GL020806.

Huang, C.-Y., Y.-H. Kuo, S.-H. Chen, and F. Vandenberghe, 2005: Improvements on typhoon forecast with assimilated GPS occultation refractivity. *Wea. Forecasting*, **20**, 931–953, doi:10.1175/WAF874.1.

Kuo, Y.-H., X. Zou, and W. Huang, 1998: The impact of global positioning system data on the prediction of an extratropical cyclone: An observing system simulation experiment. *Dyn. Atmos. Oceans*, **27**, 439–470, doi:10.1016/S0377-0265(97)00023-7.

Kuo, Y.-H., S. V. Sokolovskiy, R. A. Anthes, and F. Vandenberghe, 2000: Assimilation of GPS radio occultation data for numerical weather prediction. *Terr. Atmos. Oceanic Sci.*, **11**, 157–186.

Kuo, Y.-H., T.-K. Wee, S. Sokolovskiy, C. Rocken, W. Schreiner, D. Hunt, and R. A. Anthes, 2004: Inversion and error estimation of GPS radio occultation data. *J. Meteor. Soc. Japan*, **82**, 507–531, doi:10.2151/jmsj.2004.507.

Kuo, Y.-H., W. S. Schreiner, J. Wang, D. L. Rossiter, and Y. Zhang, 2005: Comparison of GPS radio occultation soundings with radiosondes. *Geophys. Res. Lett.*, **32**, L05817, doi:10.1029/2004GL021443.

Liu, H., and X. Zou, 2003: Improvements to GPS radio occultation ray-tracing model and their impacts on assimilation of bending angle. *J. Geophys. Res.*, **108**, 4548, doi:10.1029/2002JD003160.

Liu, H., J. Anderson, Y.-H. Kuo, C. Snyder, and A. Caya, 2008: Evaluation of a nonlocal quasi-phase observation operator in assimilation of CHAMP radio occultation refractivity with WRF. *Mon. Wea. Rev.*, **136**, 242–256, doi:10.1175/2007MWR2042.1.

Ma, Z., Y.-H. Kuo, B. Wang, W.-S. Wu, and S. Sokolovskiy, 2009: Comparison of local and nonlocal observation operators for the assimilation of GPS RO data with the NCEP GSI system: An OSSE study. *Mon. Wea. Rev.*, **137**, 3575–3587, doi:10.1175/2009MWR2809.1.

Poli, P., P. Moll, D. Puech, F. Rabier, and S. Healy, 2009: Quality control, error analysis and impact assessment of FORMOSAT-3/COSMIC in numerical weather prediction. *Terr. Atmos. Oceanic Sci.*, **20**, 101–113, doi:10.3319/TAO.2008.01.21.02(F3C).

Rennie, M. P., 2010: The impact of GPS radio occultation assimilation at the Met Office. *Quart. J. Roy. Meteor. Soc.*, **136**, 116–131, doi:10.1002/qj.521.

Shao, H., X. Zou, and G. A. Hajj, 2009: Test of a non-local excess phase delay operator for GPS radio occultation data assimilation. *J. Appl. Remote Sens.*, **3**, 033508, doi:10.1117/1.3094060.

Skamarock, W. C., and Coauthors, 2008: A description of the Advanced Research WRF version 3. NCAR Tech. Note NCAR/TN-475+STR, 113 pp. [Available online at http://www.mmm.ucar.edu/wrf/users/docs/arw_v3_bw.pdf.]

Smith, E. K., and S. Weintraub, 1953: The constants in the equation for atmospheric refractive index at radio frequencies. *Proc. IRE*, **41**, 1035–1037, doi:10.1109/JRPROC.1953.274297.

Sokolovskiy, S., Y.-H. Kuo, and W. Wang, 2005: Evaluation of a linear phase observation operator with CHAMP radio occultation data and high-resolution regional analysis. *Mon. Wea. Rev.*, **133**, 3053–3059, doi:10.1175/MWR3006.1.

Zhang, X., Y. Liu, B. Wang, and Z. Ji, 2004: Parallel computing of a variational data assimilation model for GPS/MET observation using the ray-tracing method. *Adv. Atmos. Sci.*, **21**, 220–226, doi:10.1007/BF02915708.

Zou, X., and Coauthors, 1999: A ray-tracing operator and its adjoint for the use of GPS/MET refraction angle measurements. *J. Geophys. Res.*, **104**, 22 301–22 318, doi:10.1029/1999JD900450.

Zou, X., B. Wang, H. Liu, R. A. Anthes, T. Matsumura, and Y.-J. Zhu, 2000: A ray-tracing operator and its adjoint for the use of GPS/MET refraction angle measurements. *Quart. J. Roy. Meteor. Soc.*, **126**, 3013–3040, doi:10.1002/qj.49712657003.