OTCliM: Generating a Near-Surface Climatology of Optical Turbulence Strength (Cn2) Using Gradient Boosting

Maximilian Pierzyna Department of Geoscience and Remote Sensing, Delft University of Technology, Delft, Netherlands

https://orcid.org/0000-0002-0596-8374
,
Sukanta Basu Atmospheric Sciences Research Center, University at Albany, State University of New York, Albany, New York
Department of Environmental and Sustainable Engineering, University at Albany, State University of New York, Albany, New York

, and
Rudolf Saathof Faculty of Aerospace Engineering, Delft University of Technology, Delft, Netherlands

Open access

Abstract

This study introduces Optical Turbulence Climatology Using Machine Learning (OTCliM), a novel approach for deriving comprehensive climatologies of atmospheric optical turbulence strength (Cn2) using gradient boosting machines. OTCliM addresses the challenge of efficiently obtaining reliable site-specific Cn2 climatologies near the surface, crucial for ground-based astronomy and free-space optical communication. Using gradient boosting machines and global reanalysis data, OTCliM extrapolates 1 year of measured Cn2 into a multiyear time series. We assess OTCliM’s performance using Cn2 data from 17 diverse stations in New York State, evaluating temporal extrapolation capabilities and geographical generalization. Our results demonstrate accurate predictions of four held-out years of Cn2 across various sites, including complex urban environments, outperforming traditional analytical models. Nonurban models also show good geographical generalization compared to urban models, which capture nongeneral site-specific dependencies. A feature importance analysis confirms the physical consistency of the trained models. It also indicates the potential to uncover new insights into the physical processes governing Cn2 from data. OTCliM’s ability to derive reliable Cn2 climatologies from just 1 year of observations can potentially reduce resources required for future site surveys or enable studies for additional sites with the same resources.

© 2025 American Meteorological Society. This published article is licensed under the terms of the default AMS reuse license. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Maximilian Pierzyna, m.pierzyna@tudelft.nl


1. Introduction

Atmospheric optical turbulence is highly relevant for optical ground-based astronomy and future free-space optical communication (FSOC). Both applications suffer from light getting distorted when propagating through the turbulent atmosphere. In astronomy, turbulent fluctuations of the atmospheric refractive index, known as optical turbulence (OT), cause blurry images and limit the detection of small objects (Hardy 1998). FSOC links, which use optical beams to transmit data instead of traditional radio waves, experience reduced data rates or even link interruptions due to OT (Kaushal and Kaddoum 2017; Jahid et al. 2022). Therefore, the optical turbulence strength (Cn2, where index n denotes the refractive index) must be carefully considered in the design and operation of such optical systems. This requires robust statistical evaluation of Cn2 over time for the operational sites of interest, known as Cn2 climatology. Such a climatology, derived from long-term Cn2 data, portrays trends, seasonal variations, or potential anomalies in OT strength.

Obtaining Cn2 climatologies is challenging because turbulence strongly depends on the local topography and varying meteorological conditions. That is why long-term on-site surveys measuring local OT conditions are crucial. For instance, before a new optical observatory is built, OT strength is typically measured for multiple years at a few carefully selected locations to identify the best one (see, e.g., Hill et al. 2006; Schöck et al. 2009). When envisioning a global FSOC network, site surveys have so far been focused on cloud cover (see, e.g., Fuchs and Moll 2015; Poulenard et al. 2015; Pham et al. 2023), but similar site surveys targeting OT are expected to become highly relevant in the future.

However, conducting long-term site surveys at various locations of interest is time consuming and resource intensive. While Cn2 can be obtained by postprocessing numerical weather model output (e.g., Masciadri et al. 1999), running such models for multiple years is computationally expensive, and the accuracy of the resulting Cn2 is very sensitive to the model configuration (Pierzyna et al. 2023a) and the selected Cn2 parameterization (Pierzyna et al. 2024). We address these issues by proposing a novel machine learning (ML)-based approach called OT Climatology Using ML (OTCliM). OTCliM aims to extrapolate just 1 year of measured near-surface Cn2 into a multiyear time series, enabling the generation of a comprehensive site-specific Cn2 climatology with less data than a conventional site survey. Our approach does not yet provide full vertical Cn2 profiles as often sought in astronomy or FSOC. But for near-surface conditions, which are highly relevant for, e.g., horizontal near-surface FSOC links, site survey costs can potentially be reduced, climatologies can be obtained faster, and more sites can be surveyed within a given time frame. By leveraging ML, OTCliM can also model complex input–output relations, offering an advantage over traditional empirical Cn2 models, which are discussed in the following section.

This study assesses OTCliM’s performance using an extensive Cn2 dataset containing measurements from 17 diverse stations. We evaluate the temporal extrapolation capabilities of the trained models and analyze their potential for geographical generalization to other sites. Additionally, we examine the importance of each input variable for predicting Cn2 to probe if the ML models have learned physically plausible relations. That analysis increases confidence in the ML models and could uncover new dependencies of Cn2 from the data.

2. Relevant Cn2 regression studies

The experimental determination of Cn2 is challenging because it requires expensive instruments, such as scintillometers (Beyrich et al. 2021), or meticulous postprocessing of high-frequency temperature measurements (Beason et al. 2024). Consequently, there has been a sustained effort over several decades to develop parameterizations of Cn2 using more readily available meteorological variables. Conventional approaches to Cn2 parameterization typically involve the application of physics-based analytical equations in conjunction with empirically derived coefficients or regression functions. A popular example is the formulation proposed by Wyngaard et al. (1971), based on Monin–Obukhov similarity theory (MOST) (Monin and Obukhov 1954). That parameterization, called W71 in the following, relates the sensible heat flux $\overline{w'\theta'}$ and the friction velocity $u_*$ to the strength of temperature fluctuations ($C_T^2$) at height $z$ as
$$ C_T^2 = \left( \overline{w'\theta'} / u_* \right)^2 z^{-2/3} \, g(\zeta). \quad (1) $$
The sensible heat flux $\overline{w'\theta'}$ and the friction velocity $u_* = \left( \overline{u'w'}^2 + \overline{v'w'}^2 \right)^{1/4}$ describe the vertical (i.e., surface-normal) turbulent transport of heat ($\theta'$) and momentum ($u'$, $v'$) due to fluctuations of the vertical wind component ($w'$) (Stull 1988). As these parameters capture the two effects modulating turbulence, buoyancy (sensible heat flux) and wind shear (friction velocity), they are commonly used in turbulence parameterizations. Since refractive index fluctuations ($C_n^2$) are driven by density fluctuations due to temperature and moisture fluctuations, $C_n^2$ is linked to $C_T^2$ as (Wesely 1976; Moene 2003)
$$ C_n^2 = \left( \frac{A\,P}{\overline{T}^2} \right)^2 \left( 1 + \frac{0.03}{\beta} \right)^2 C_T^2, \quad (2) $$
with $A \approx 7.9 \times 10^{-5}$ K hPa$^{-1}$ for optical wavelengths (Andreas 1988), station pressure $P$ in hectopascals (hPa), and mean air temperature $\overline{T}$ in kelvin (K). The contribution of moisture fluctuations to $C_n^2$ is captured by the Bowen ratio $\beta$, the ratio of the sensible heat flux and the latent heat flux due to evaporation. If $C_n^2$ is estimated above the boundary layer in the free atmosphere, a modified formulation of Eq. (2) can be used (Cherubini and Businger 2013).
The function $g(\zeta)$ in Eq. (1) is a similarity function, empirically determined from observations:
$$ g(\zeta) = \begin{cases} 4.9\,(1 - 6.1\,\zeta)^{-2/3}, & \zeta < 0, \\ 4.9\,(1 + 2.2\,\zeta^{2/3}), & \zeta \ge 0, \end{cases} \quad (3) $$
with stability parameter $\zeta = z/L$, where $L = -u_*^3 \overline{T} / (\kappa g \overline{w'\theta'})$ is the Obukhov length and $z$ is the height above ground. In $L$, $g = 9.81$ m s$^{-2}$ is the gravitational acceleration of Earth and $\kappa = 0.4$ is the von Kármán constant. Various alternative regression-based similarity functions $g(\zeta)$ have been proposed in the literature, aimed at improving the accuracy and applicability of $C_n^2$ estimates across diverse meteorological and topographic conditions [see, e.g., Savage (2009) for an extensive review].
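To make the W71 baseline concrete, the chain of Eqs. (1)–(3) can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code; the function names and the example inputs are hypothetical, and the stable/unstable branches follow the similarity function given above.

```python
import numpy as np

A = 7.9e-5   # K hPa^-1, Eq. (2) constant for optical wavelengths (Andreas 1988)
KAPPA = 0.4  # von Karman constant
G = 9.81     # m s^-2, gravitational acceleration

def g_similarity(zeta):
    """Empirical similarity function g(zeta) of Eq. (3)."""
    zeta = np.asarray(zeta, dtype=float)
    z_unst = np.minimum(zeta, 0.0)  # unstable branch argument
    z_stab = np.maximum(zeta, 0.0)  # stable branch argument
    return np.where(
        zeta < 0,
        4.9 * (1.0 - 6.1 * z_unst) ** (-2.0 / 3.0),
        4.9 * (1.0 + 2.2 * z_stab ** (2.0 / 3.0)),
    )

def w71_cn2(wtheta, u_star, z, T, P, beta):
    """W71 estimate of Cn2 (Eqs. 1-3) from sensible heat flux wtheta (K m/s),
    friction velocity u_star (m/s), height z (m), mean temperature T (K),
    station pressure P (hPa), and Bowen ratio beta."""
    L = -u_star**3 * T / (KAPPA * G * wtheta)  # Obukhov length
    zeta = z / L
    ct2 = (wtheta / u_star) ** 2 * z ** (-2.0 / 3.0) * g_similarity(zeta)
    return (A * P / T**2) ** 2 * (1.0 + 0.03 / beta) ** 2 * ct2

# Hypothetical convective-afternoon inputs; Cn2 comes out on the order of 1e-13
cn2 = w71_cn2(wtheta=0.2, u_star=0.3, z=9.0, T=290.0, P=1000.0, beta=0.5)
```

The helper evaluates both stability branches on clipped arguments so that neither branch raises a fractional power of a negative number.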

In recent years, different ML techniques have been utilized to derive Cn2 parameterizations. Similar to W71, ML-based approaches aim to obtain regression models that estimate Cn2 from routine meteorological variables (temperature, pressure, wind speed), gradients (potential temperature gradient or wind shear), or heat and radiation fluxes. However, instead of deriving physics-based functional expressions [cf. Eqs. (1) and (2)], which are fitted to observations [cf. Eq. (3)], ML directly models the relation between observed Cn2 and the input variables. The power of such ML regression models is that they can model complex multivariate relations, yielding potentially better Cn2 estimates compared to traditional approaches.

The ML-based Cn2 parameterizations in the literature differ in their input variables, the ML regression technique employed, and whether Cn2 is measured at a single level or at multiple vertical levels. Much work has been devoted to single-level near-surface Cn2 obtained as point measurements (e.g., from a sonic anemometer) or via path averaging (e.g., from a scintillometer). Wang and Basu (2016), for example, used fully connected neural networks to estimate single-level Cn2 from routine meteorological data and gradients. Vorontsov et al. (2020) also utilized neural networks but aimed at estimating path-averaged Cn2 from images of a received laser beam that is broken up into speckles due to turbulence. Jellen et al. (2020, 2021) employed random forests and gradient boosting machines (GBMs) to estimate path-averaged Cn2 in a maritime environment from routine parameters and radiation fluxes. Most ML-based models do not yield analytical equations of the fitted model unless they explicitly target one, as shown by Bolbasova et al. (2021). The authors obtained a complex site-specific equation for single-level Cn2, similar to non-ML empirical models proposed by, e.g., Sadot and Kopeika (1992), Wang et al. (2015), or Arockia Bazil Raj et al. (2015). The difference between these empirical models and MOST-based models is that they are not derived physically but aim to capture the local optical turbulence strength well. Going beyond single-level Cn2 estimates, Pierzyna et al. (2023b) proposed a physics-inspired framework to derive a data-driven similarity theory of optical turbulence that combines Cn2 observations from multiple levels within the surface layer into a single nondimensional model. In contrast to the near-surface studies, Su et al. (2021) proposed a method to estimate a vertical Cn2 profile by modeling OT strength as a function of radiosonde measurements. Cherubini et al. (2021) presented an ML framework to estimate seeing, which is the vertical integral of the Cn2 profile, close to the surface and in the free atmosphere, but it does not explicitly resolve vertical levels. Milli et al. (2020) tackled yet another estimation angle by using ML to forecast OT conditions for the next 2 h based on past observations and auxiliary features. However, all these approaches require in situ data, rendering them unsuitable for temporal extrapolation as targeted in the present study. Additionally, in situ–based models need sensors deployed at the site of interest for Cn2 estimation, which hinders their applicability to sites without instrumentation. To alleviate this issue, we propose OTCliM, which parameterizes Cn2 based on reanalysis data, which are available globally and for multiple decades into the past.

3. Methodology

Our proposed OTCliM approach aims to extrapolate 1 year of observed Cn2 to multiple years to obtain a robust statistical description of OT strength at a particular site. An overview of OTCliM is given in Fig. 1, which we base on the measure–correlate–predict (MCP) framework popular in wind energy (Carta et al. 2013; Kartal et al. 2023). In the first step, Cn2 is measured for 1 year (golden yellow) and then correlated to a reference dataset. For OTCliM, we utilize variables from the ERA5 reanalysis (Hersbach et al. 2020) extracted at the location of the Cn2 measurements as reference data and GBM to regress the ERA5 time series to Cn2 where they overlap (blue). In the second step, the trained model is utilized to extrapolate Cn2 based on ERA5 to multiple years (prediction step in MCP), which can then be used to obtain site-specific seasonal statistics as presented in step 3.

Fig. 1.

Proposed OTCliM approach to extrapolate a measured 1-yr time series of OT strength (golden yellow) to multiple years (orange) based on ERA5 reference data (blue). Robust yearly Cn2 statistics can be obtained from the extrapolated data.

Citation: Artificial Intelligence for the Earth Systems 4, 2; 10.1175/AIES-D-24-0076.1

We utilize GBMs for the regression step (cf. section 3a) because they are known for their high performance in nonlinear tabular regression tasks. For comparison, we also train GBM models that use in situ observations instead of ERA5 as input data, which serve as performance baselines for OTCliM. Similarly, the traditional W71 model is evaluated with ERA5 inputs to provide a second baseline (cf. section 3b). To ensure our GBM models capture physically reasonable dependencies, we quantify the feature importance assigned to each ERA5 input variable (cf. section 3c). Finally, the performance of the trained models is assessed in terms of their temporal extrapolation and geographical generalization capabilities (cf. section 3d).

a. Gradient boosting regression

Following the MCP framework presented in Fig. 1, we aim to train one ML model per site using training data from 1 year. The corresponding training data are the input data $X_{s,t} \in \mathbb{R}^{n \times p}$ and the target vector $y_{s,t} \in \mathbb{R}^{n}$, where $s \in S$ is one out of multiple sites $S$ and $t \in T$ is one out of multiple training years $T$ available at that site. The ML task is to learn a regression function $\hat{f}_{s,t}$ that approximates a sample $y_i \in y_{s,t}$ of the target vector based on a sample $x_i \in X_{s,t}$ of the input data: $\hat{f}_{s,t}(x_i) = \hat{y}_i \approx y_i$. The $X_{s,t}$ and $y_{s,t}$ contain a matching number of $n$ samples, i.e., the samples of the 1-yr time series, where $X_{s,t}$ is composed of $p$ features, i.e., concurrent time series of different meteorological variables, and $y_{s,t}$ contains the scaled $\log_{10} C_n^2$.

The OT strength Cn2 is log10-transformed because it varies over multiple orders of magnitude throughout the day, which is challenging to capture in non-log space. Also, the range of yearly Cn2 variation differs between sites, so log10Cn2 is scaled before training to make the performance scores and feature importance values comparable between sites. More details about the Cn2 preprocessing are presented in section 4a.

The GBM regression models f^s,t are trained using the AutoML library Fast and Lightweight AutoML (FLAML) (Wang et al. 2021). FLAML is a time-constrained hyperparameter optimizer that aims to find the optimal GBM model configuration within a specified time budget. FLAML optimizes not only the hyperparameters of a single algorithm but also explores switching between algorithms: here, the two popular algorithms extreme gradient boosting (XGBoost) (Chen and Guestrin 2016) and light GBM (LightGBM) (Ke et al. 2017). The internal loss functions of XGBoost and LightGBM are also treated as hyperparameters to be optimized by FLAML with L1 norm, L2 norm, and Huber norm as options. The hyperparameter optimization uses fivefold cross-validation based on the training data and aims to minimize the root-mean-square error between the predicted and hold-out y. Each model is trained for 45 min on 8 CPU cores of an Intel Xeon 2648R, resulting in six core hours per model.

b. Baseline models

We employ two baseline models to put the performance of OTCliM into perspective. The first model is the traditional empirical W71 model as given by Eqs. (1) and (3). Since the similarity function g(ζ) does not contain site-specific model coefficients, the accuracy of the W71 Cn2 estimates is expected to be lower than that of the site-specific OTCliM models. Second, we utilize GBM models trained with in situ observations as input instead of reanalysis. Reanalysis datasets are coarse-resolution model outputs [spatial resolution typically O(10) km], which are expected to miss some local and complex processes and patterns modulating turbulence. Such processes would be contained in the in situ data, so the performance of in situ GBM models is expected to be higher than that of ERA5-based OTCliM models. These baselines allow us to disentangle the influence of the modeling approach (traditional versus GBM) from that of the input data (ERA5 versus in situ). Appendix C gives more technical details about both baseline models.

c. Physical consistency checking using feature importance

To assess whether the trained models captured physically consistent relationships and to potentially discover currently unknown dependencies between features and the regression target, we quantify the importance of each feature for the prediction using Shapley additive explanations (SHAP) values (Lundberg and Lee 2017). SHAP values explain how a trained model $\hat{f}_{s,t}$ arrives at its predictions $\hat{y}$ on a per-sample basis. For each sample $x_i$, the prediction $\hat{y}_i$ is explained as
$$ \hat{f}_{s,t}(x_i) = \hat{y}_i = E[\hat{y}] + \sum_{j=1}^{p} \phi_{j,i}, \quad (4) $$
where $\phi_{j,i}$ are the SHAP values of the $i$th sample and $E[\hat{y}]$ is the expected value of all predictions. In plain words, the $j$th SHAP value $\phi_{j,i}$ describes the contribution of the $j$th feature to the prediction, assuming a local linear model. A large $|\phi_{j,i}|$ compared to the other SHAP values indicates that the $j$th feature contributes strongly to that prediction and must be important. Consequently, a global view of the importance of the $j$th feature can be obtained by averaging the absolute $j$th SHAP values over all $n$ samples of the dataset (Molnar 2022) such that
$$ I_j = \frac{1}{n} \sum_{i=1}^{n} |\phi_{j,i}|. \quad (5) $$

Note that SHAP values explain a feature’s contribution to a model’s prediction $\hat{y}_i$, not to the true value $y_i$. Accurate feature importance (FI) estimates thus require well-performing models $\hat{f}_{s,t}$, where $\hat{y}_i$ closely approximates $y_i$ for training and testing data. As predictions typically align more closely with true values in training data than in testing data, we compute SHAP values using the former. This approach allows us to examine the model’s internal workings without the influence of the generalization error. Moreover, it aligns with practical scenarios where all available data are used for training the final model, meaning that the FI analysis must be performed using the training dataset (Molnar 2022).
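Given a matrix of per-sample SHAP values (as produced, e.g., by the `shap` library's tree explainer for GBM models), the global importance of Eq. (5) is a simple aggregation. A minimal sketch; the SHAP matrix below is a hypothetical example, not data from the paper:

```python
import numpy as np

def global_importance(phi):
    """Global feature importance, Eq. (5): mean absolute SHAP value per feature.

    phi: (n_samples, p_features) matrix of per-sample SHAP values phi_{j,i}.
    Returns a length-p vector of importances I_j.
    """
    return np.mean(np.abs(phi), axis=0)

# Hypothetical SHAP matrix for 4 samples and 3 features
phi = np.array([
    [ 0.5, -0.1,  0.0],
    [-0.3,  0.2,  0.0],
    [ 0.4, -0.2,  0.1],
    [-0.4,  0.1, -0.1],
])
importances = global_importance(phi)  # ~ [0.4, 0.15, 0.05]
```

The absolute value matters: without it, positive and negative contributions of a strongly used feature would cancel and understate its importance.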

d. Performance evaluation

The performance of the trained OTCliM models is quantified through the Pearson correlation coefficient r and the root-mean-square error (RMSE) ϵ:
$$ r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{n \, \sigma_y \, \sigma_{\hat{y}}}, \quad (6) $$
$$ \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}. \quad (7) $$

In Eq. (6), the overbar denotes the mean, and $\sigma_y$ and $\sigma_{\hat{y}}$ are the standard deviations of the true ($y$) and estimated ($\hat{y}$) scaled Cn2, respectively. Two validation strategies are employed to assess the temporal and geographical extrapolation capabilities of the OTCliM models. Note that only the testing data used to compute the scores differ between the evaluation strategies; the training data and the trained model remain the same.
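The two scores follow directly from Eqs. (6) and (7); note that the population standard deviation (ddof = 0) matches the $n$ in the denominator of Eq. (6). A minimal NumPy sketch with illustrative function names:

```python
import numpy as np

def pearson_r(y, y_hat):
    """Pearson correlation coefficient, Eq. (6)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    num = np.sum((y - y.mean()) * (y_hat - y_hat.mean()))
    # np.std defaults to ddof=0, consistent with the factor n in Eq. (6)
    return num / (len(y) * y.std() * y_hat.std())

def rmse(y, y_hat):
    """Root-mean-square error, Eq. (7)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y - y_hat) ** 2))
```

Any affine transformation of the predictions leaves r unchanged but changes the RMSE, which is why the two metrics are reported together.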

1) Temporal extrapolation

For the MCP application, the temporal extrapolation, the model $\hat{f}_{s,t}$ trained on data from year $t \in T$ needs to accurately predict OT strength at the same site for the held-out testing years $T \setminus t$: $\hat{f}_{s,t}(x_i) = \hat{y}_i$ with $x_i \in X_{s, T \setminus t}$ and $\hat{y}_i \in \hat{y}_{s, T \setminus t}$. The scores are computed between the true held-out data $y_{s, T \setminus t}$ and the predictions $\hat{y}_{s, T \setminus t}$ and are denoted $\bullet_{s,t}$, where $\bullet$ is a placeholder for the performance metrics $r$ and $\epsilon$. Training happens in a round-robin fashion, so one model is trained per training year and evaluated on the held-out years. This process is repeated until each year has been used for training once. The average performance of these models for a specific station $s$, called the MCP score, is obtained by averaging the scores across all training years: $\bar{\bullet}_s = \langle \bullet_{s,t} \rangle_{t \in T}$.
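The round-robin scheme can be sketched in plain Python. The score values below are hypothetical placeholders for one station's per-training-year correlations, used only to illustrate how the MCP score is aggregated:

```python
years = [2018, 2019, 2020, 2021, 2022]

def round_robin_splits(years):
    """One (training year, held-out years) pair per year, as in the MCP evaluation."""
    return [(t, [y for y in years if y != t]) for t in years]

# Five models per station: each year serves as the training year exactly once.
splits = round_robin_splits(years)

# MCP score for one station: average the held-out-year scores over all
# training years. The r values here are hypothetical.
scores_r = {2018: 0.82, 2019: 0.80, 2020: 0.85, 2021: 0.81, 2022: 0.83}
mcp_score = sum(scores_r.values()) / len(scores_r)
```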

2) Cross-site evaluation

The second evaluation strategy probes the geographical generalization capability of a model trained on station $s$ when applied to another station $\tilde{s} \in S \setminus s$. The trained model is tasked to predict OT strength at $\tilde{s}$ given the full multiyear input data $X_{\tilde{s}, T}$ at $\tilde{s}$, from which the cross-site scores using $r$ and RMSE can be obtained: $\bullet_{(s,t) \to (\tilde{s}, T)}$. For readability, we average the cross-site scores achieved by the individual models per training site $s$: $\bullet_{s \to \tilde{s}} = \langle \bullet_{(s,t) \to (\tilde{s}, T)} \rangle_{t \in T}$.

4. Dataset

To evaluate OTCliM’s performance, we train GBM models for 17 diverse locations across New York State (cf. section 4a). Five years of Cn2 are available at each site for which we select collocated and concurrent ERA5 data (cf. section 4b).

a. NYSM

The New York State Mesonet (NYSM) comprises 127 standard weather stations (as of 2024) spread across New York State, United States, and has been fully operational since 2018 (Brotzge et al. 2020). These weather stations measure routine meteorological parameters such as 2-m temperature, 10-m wind speed and direction, surface pressure, and several other variables. The sampling rate of these measurements is on the order of seconds, and final values are reported as 10-min aggregates (mean and variance). However, measuring Cn2 requires high-frequency observations to resolve the inertial range of turbulence. Such high-frequency measurements are available at a subset of 17 stations, the flux stations, which are additionally equipped with Campbell Scientific CSAT3A sonic anemometers mounted at 9-m height. The NYSM sonic anemometers measure the three wind components and the sonic temperature at $f_s = 10$ Hz (Brotzge et al. 2020), which is high enough to obtain Cn2.

We utilize the data from these NYSM flux stations to obtain training targets for the OTCliM models. The flux stations are placed in diverse topographical and climatological environments as listed in Table 1. The stations Brooklyn (BKLN), Queens (QUEE), and Staten Island (STAT) are located on rooftops in urban environments where measurements are strongly influenced by their immediate surroundings. Neighboring buildings can, for example, cast shadows or cause wakes, which influence radiation, wind, and, therefore, also the local turbulence (WMO 2023). Since we expect these urban stations to behave differently than the rural stations, they are marked with (*) in the table. For each of the 17 stations, 5 years of measurements are available. Following the notation introduced previously, the set of flux stations is denoted as S, and T represents the set of five training years 2018–22 available at each site. The corresponding target vector ys,t is obtained by estimating Cn2 from sonic anemometer measurements, applying quality assurance (QA) and quality control (QC) steps, and scaling the OT data to make them comparable between different sites. These stages are described in detail below.

Table 1.

Flux stations of the New York State Mesonet used to benchmark the OTCliM approach. The three urban stations are marked with asterisk (*).


1) Structure function approach

As noted in the context of Eq. (2), Cn2 quantifies the strength of refractive index fluctuations due to density fluctuations caused by turbulent temperature and moisture fluctuations. The temperature Ts measured by sonic anemometers also contains a humidity contribution as Ts = T(1 + 0.51q) with specific humidity q (Kaimal and Gaynor 1991). We assume that using the simplified version
$$ C_n^2 \approx \left( A\,P / \overline{T}^2 \right)^2 C_{T_s}^2 \quad (8) $$
of Eq. (2) implicitly accounts for moisture if the strength of these sonic temperature fluctuations $C_{T_s}^2$ is used. That is because $C_{T_s}^2$ can be decomposed as $C_{T_s}^2 \approx C_T^2 + 1.02\,\overline{T}\,C_{Tq} + 0.26\,\overline{T}^2 C_q^2$, with $C_T^2$ and $C_q^2$ representing the strength of pure temperature and moisture fluctuations and $C_{Tq}$ representing the cross-term between the two. For readability, we will refer to $C_{T_s}^2$ simply as $C_T^2$ but would like to reiterate that a moisture contribution is included implicitly.
For $\Delta x$ within the inertial range, $C_T^2$ is the coefficient of the second-order structure function of the (sonic) temperature, defined as
$$ S_T^2(\Delta x) = \left\langle \left[ T(x) - T(x + \Delta x) \right]^2 \right\rangle = C_T^2 \, \Delta x^{2/3}. \quad (9) $$

The structure function is computed from 5-min nonoverlapping windows of the sonic temperature signal and CT2 is obtained by fitting a 2/3 slope to the inertial range of ST2 in log–log coordinates.
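A minimal sketch of this estimate, under a simplifying assumption: instead of fitting a free slope in log–log coordinates and checking it afterward (as the QA/QC step below does), the slope is fixed at the ideal 2/3, which reduces the fit to averaging $S_T^2 / \Delta x^{2/3}$. Function names are illustrative; Taylor's hypothesis with $\Delta x = \overline{M}/f_s$ converts sample lags to separations:

```python
import numpy as np

def structure_function(T, max_lag):
    """Second-order structure function S_T2 for sample lags 1..max_lag, Eq. (9)."""
    lags = np.arange(1, max_lag + 1)
    s2 = np.array([np.mean((T[k:] - T[:-k]) ** 2) for k in lags])
    return lags, s2

def fit_ct2(T, mean_wind, fs=10.0, max_lag=32):
    """Estimate CT2 from one 5-min window, assuming an ideal 2/3 slope.

    Taylor's hypothesis maps lag k to separation dx = k * mean_wind / fs.
    With the slope fixed at 2/3, the log-log fit collapses to the mean of
    S_T2 / dx^(2/3) over the considered lags.
    """
    T = np.asarray(T, dtype=float)
    lags, s2 = structure_function(T, max_lag)
    dx = lags * mean_wind / fs
    return float(np.mean(s2 / dx ** (2.0 / 3.0)))
```

Two exact properties follow from Eq. (9) and are useful sanity checks: scaling the temperature fluctuations by a factor a scales the estimate by a², and doubling the mean wind divides it by 2^(2/3).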

2) QA/QC

Since our GBM models are purely data driven, the quality of the fitted model strongly depends on the quality of the training data, so we apply several QA/QC steps to obtain a clean dataset. First, $C_T^2$ has to be determined from the inertial range of the structure function, i.e., the region with a 2/3 slope. Since the inertial range is not always resolved or well captured in every 5-min window, only $C_T^2$ values with well-fitted slopes are kept (2/3 ± 5% slope, $R^2 > 0.95$). A similar procedure was followed by He and Basu (2016) for the analysis of simulated data. Second, $S_T^2$ in Eq. (9) is defined spatially as a function of the separation distance $\Delta x$, but the sonic anemometer measures $T$ temporally. Taylor's frozen turbulence assumption is commonly made to convert the temporal sonic signal into a spatial form using the horizontal mean wind speed $\overline{M}$ and $\Delta x = \overline{M}/f_s$. However, Taylor's assumption might break down in low-wind conditions, so all $C_T^2$ values for $\overline{M} < 1$ m s$^{-1}$ are discarded. Finally, precipitation strongly affects sonic temperature measurements (Zhang et al. 2016), so only $C_T^2$ values for dry conditions can be used. This filter is also justified from an operational perspective because optical telescopes are typically not operated in the rain, and FSOC links are strongly attenuated by snow [International Telecommunication Union (ITU) 2007].

Being an observed dataset, the sonic temperature signal contains gaps due to, e.g., power or communication outages or instrument malfunctioning (NYS Mesonet 2023), leading to ∼16% missing values on average. That leaves ∼84% of the 5-min nonoverlapping windows gap-free and suitable for the computation of CT2. The three QA/QC steps further reduce the number of valid samples, leaving ∼66% after inertial-range fitting, ∼55% after ensuring the applicability of the Taylor assumption, and ∼53% after discarding precipitation events. Although only about half of the dataset is considered valid, it still contains ca. 50 000 samples per site and training year spread across diverse meteorological conditions. This claim is supported in appendix A, where we show that the QA/QC procedure largely preserves the ratio of stable to unstable atmospheric conditions. Thus, the final dataset is considered adequate for training the OTCliM models.
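The three QA/QC filters can be expressed as boolean masks over the 5-min windows and combined with a logical AND. A sketch with hypothetical per-window diagnostics; thresholds follow the text (2/3 ± 5% slope, R² > 0.95, mean wind of at least 1 m/s, dry conditions):

```python
import numpy as np

def qaqc_mask(slope, r2, mean_wind, precip):
    """Combine the three QA/QC filters into a single validity mask.

    slope, r2:  fitted inertial-range slope and fit quality per 5-min window
    mean_wind:  horizontal mean wind speed (m/s), for Taylor's hypothesis
    precip:     True where precipitation was recorded
    """
    ideal = 2.0 / 3.0
    good_fit = (np.abs(np.asarray(slope) - ideal) <= 0.05 * ideal) \
        & (np.asarray(r2) > 0.95)
    taylor_ok = np.asarray(mean_wind) >= 1.0          # discard M < 1 m/s
    dry = ~np.asarray(precip)                         # discard precipitation
    return good_fit & taylor_ok & dry

# Five hypothetical windows: only the first passes all three filters
slope  = np.array([0.667, 0.75, 0.667, 0.667, 0.667])
r2     = np.array([0.99, 0.99, 0.90, 0.99, 0.99])
wind   = np.array([2.0, 2.0, 2.0, 0.5, 2.0])
precip = np.array([False, False, False, False, True])
mask = qaqc_mask(slope, r2, wind, precip)
```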

3) Distribution scaling

The quality-controlled CT2 time series is converted to Cn2 using the Gladstone equation [Eq. (2)] and then scaled. The scaling is needed because the yearly distribution of OT strength differs between sites. Some locations show stronger OT (shifted distribution) or a larger range of Cn2 values (scaled distribution) than other sites. To make the performance scores and SHAP FI values comparable between sites, we scale the site-specific log10Cn2 before training based on the 50% log range such that
$$ \left[ \log_{10} C_n^2(t) \right]_{\mathrm{scaled}} = \frac{ \log_{10} C_n^2(t) - \left[ \log_{10} C_n^2 \right]_{p25} }{ \left[ \log_{10} C_n^2 \right]_{p75} - \left[ \log_{10} C_n^2 \right]_{p25} }. \quad (10) $$

Here, the subscripts p25 and p75 indicate the 25th and 75th percentiles of the original distribution used for scaling. More details are given in appendix B.
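Equation (10) amounts to a robust min–max-style scaling by the interquartile range in log space. A minimal sketch; the function name and sample values are illustrative, not from the paper:

```python
import numpy as np

def scale_log_cn2(log_cn2, ref=None):
    """Scale log10(Cn2) by the site's 50% log range, Eq. (10).

    ref: series from which the 25th/75th percentiles are taken
    (defaults to log_cn2 itself, e.g., the training year).
    """
    log_cn2 = np.asarray(log_cn2, dtype=float)
    ref = log_cn2 if ref is None else np.asarray(ref, dtype=float)
    p25, p75 = np.percentile(ref, [25, 75])
    return (log_cn2 - p25) / (p75 - p25)

x = np.array([-16.0, -15.0, -14.0, -13.0])  # hypothetical log10(Cn2) values
scaled = scale_log_cn2(x)
```

After scaling, the 25th and 75th percentiles map to 0 and 1 by construction, so distributions from different sites become directly comparable.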

b. ERA5 reanalysis data

The ERA5 reanalysis (Hersbach et al. 2020) serves as the input dataset for OTCliM from which Cn2 shall be estimated. ERA5 is available globally from 1950 until the present and, thus, includes our NYSM training locations and times. We extract ERA5 input data for all NYSM stations and training years at the ERA5 grid points closest to the respective stations. ERA5 grid points represent 1/4° grid boxes (ca. 30 km × 30 km), so by selecting the closest point, we obtain data representative for the grid box containing the station. Figure 2 depicts the location of the flux stations (gray markers) and their collocated grid boxes (orange square) and also gives an impression of how the coarse ERA5 represents (Fig. 2a) land, sea, and (Fig. 2b) terrain. The temporal resolution/sampling rate also differs between ERA5 (1 h) and the observed Cn2 dataset (5 min). ERA5 data are instantaneous snapshots available at each full hour, so we match each ERA5 sample with the average of the six 5-min Cn2 values ±15 min around the hour. This 30-min average centered around the hour smooths the Cn2 signal to counteract potential temporal misalignment between observations and reanalysis.
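The temporal matching described above can be sketched as follows. The exact window convention is our assumption (a half-open ±15-min window, which captures six 5-min samples, matching the count stated in the text); timestamps and values are synthetic:

```python
import numpy as np

def hourly_cn2_targets(minutes, cn2, hours):
    """Average the 5-min Cn2 samples within +/-15 min of each full hour.

    minutes: sample timestamps in minutes since the series start
    cn2:     Cn2 values at those timestamps
    hours:   ERA5 snapshot timestamps (full hours, in minutes)
    Hours with no valid samples yield NaN.
    """
    minutes = np.asarray(minutes)
    cn2 = np.asarray(cn2, dtype=float)
    out = np.full(len(hours), np.nan)
    for k, h in enumerate(hours):
        # half-open window (h - 15, h + 15]: six 5-min samples when complete
        sel = (minutes - h > -15) & (minutes - h <= 15)
        if sel.any():
            out[k] = cn2[sel].mean()
    return out

minutes = np.arange(0, 120, 5)                      # 5-min sampling, two hours
cn2 = np.linspace(1e-15, 2e-14, len(minutes))       # hypothetical Cn2 series
targets = hourly_cn2_targets(minutes, cn2, hours=[60])
```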

Fig. 2.

ERA5 representation of NYSM domain with true locations of NYSM flux stations (gray) and corresponding ERA5 grid box (orange) containing the stations. Urban sites are marked with (*).


We aim to incorporate all variables linked to the two processes driving atmospheric turbulence: wind shear and buoyancy. Commonly used variables include sensible heat flux and friction velocity, as utilized in W71. However, an advantage of using ML over deriving analytical equations from theory is the ability to include variables that may indirectly influence turbulence. For example, the ERA5 gravity wave dissipation (GWD) rate could be significant in complex mountainous regions where orographic gravity wave drag is known to modulate momentum fluxes (Lilly 1972; Palmer et al. 1986). If there is a relationship between GWD and Cn2, ML models will likely identify and utilize it, potentially revealing new dependencies.

A complete list of the ERA5 variables selected as features is presented in Table 2. Many listed features are partially redundant and/or correlated. Since it is initially unknown which features are most suitable for estimating near-surface Cn2, we aim for a broad feature space. During training, the GBM algorithm identifies and bases its predictions on the most important features. Through the posttraining feature importance analysis, we can identify which features the trained model deems relevant and assess their physical relevance. In ML, it is usually preferred to minimize the number of features and reduce model complexity. However, this concern does not apply to tree-based algorithms like the GBM used in this study because such algorithms only select features during fitting that increase model accuracy. Consequently, features contributing little or nothing to the model's accuracy are not selected (for splitting) while the trees are grown (Chen and Guestrin 2016), so their impact on model complexity is low (Spiliotis 2022).

Table 2.

ERA5 variables and variables derived thereof serve as features for the OTCliM approach. Derived features do not have ERA5 variable names and are marked with (—).


The features in Table 2 without ERA5 variable names are so-called engineered features, meaning that they are derived from one or more ERA5 variables. In the following, we detail the variable selection and the feature engineering.

1) Shear-related features

Table 2 lists multiple wind-related features aimed at capturing wind shear, i.e., the vertical change of the wind. Like traditional Cn2 parameterizations, such as W71, we select the friction velocity u* as the first shear-related feature. We also include the horizontal wind speed obtained from the zonal and meridional wind fields of ERA5, Uz and Vz, as
$M_z = \sqrt{U_z^2 + V_z^2}$
at z = 10 m and z = 100 m above ground. Assuming a power-law wind profile of the form M(z) = Mref(z/zref)α between the 10- and 100-m level, we utilize the exponent of the power law α as an additional shear-related feature:
$\alpha(z_1, z_2) = \dfrac{\log M_{z_2} - \log M_{z_1}}{\log z_2 - \log z_1}.$
The directional shear, i.e., wind turning with height, is included through the absolute angular difference between the 10- and 100-m wind directions, defined as $\beta = |X_{10} - X_{100}|$, where the wind direction $X_z$ is given as
$X_z = \operatorname{arctan2}(U_z, V_z).$

The wind direction also serves as a proxy for upstream effects, or fetch, that might influence the observed turbulence. For example, atmospheric turbulence measured at a station close to the coast can be very different if the wind blows from land to sea or from sea to land. The periodicity of Xz is accounted for by including the sines and cosines of Xz as fetch features instead of Xz directly.
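Under the definitions above, the shear- and fetch-related features can be computed from the ERA5 wind components as in this sketch (the function name and return structure are illustrative, not the paper's implementation):

```python
import numpy as np

def shear_features(u10, v10, u100, v100):
    """Shear- and fetch-related features from 10- and 100-m wind components."""
    m10 = np.hypot(u10, v10)     # horizontal wind speed M_10
    m100 = np.hypot(u100, v100)  # horizontal wind speed M_100
    # Power-law exponent alpha between z1 = 10 m and z2 = 100 m
    alpha = (np.log(m100) - np.log(m10)) / (np.log(100.0) - np.log(10.0))
    # Wind directions X_z = arctan2(U_z, V_z) and directional shear beta
    x10, x100 = np.arctan2(u10, v10), np.arctan2(u100, v100)
    beta = np.abs(x10 - x100)
    # Periodic fetch features replace the raw 10-m direction
    return {"M10": m10, "M100": m100, "alpha": alpha, "beta": beta,
            "sin_X10": np.sin(x10), "cos_X10": np.cos(x10)}
```

For a neutral logarithmic-like profile that doubles between 10 and 100 m, alpha evaluates to log(2)/log(10) ≈ 0.3.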

2) Buoyancy and stability

The prime candidate to reflect the influence of buoyancy on Cn2 is the sensible heat flux QH. The sensible heat flux is also featured in W71 because it captures buoyancy and static atmospheric stability (in ERA5's convention, QH > 0 during stable/nighttime conditions and QH < 0 during unstable/daytime conditions). Additionally, static stability can be estimated from temperature gradients, so we utilize the absolute values of the 2-m temperature, skin temperature, and soil temperature, as well as the differences between them, as complementary features.

3) Surface energy budget

The strength of buoyancy depends on the radiation that reaches the surface, so we also include ERA5 radiation fluxes and cloud cover as features to complement the buoyancy information from QH. The simplified surface energy budget describes how radiation forces the surface fluxes. If the small soil heat flux into the ground is neglected, the steady-state energy balance at the surface is (Stull 1988)
$Q_L + Q_H \approx R_S + R_L,$
where QL is the latent heat flux due to evaporation and RS and RL are the net shortwave and longwave surface radiation fluxes. The net fluxes $R = R_{\downarrow} - R_{\uparrow}$ are the difference between the incoming/downwelling (index $\downarrow$) and the reflected/upwelling (index $\uparrow$) components of the shortwave ($S$) and longwave ($L$) radiation. ERA5 contains the net and downward radiation fluxes as variables. Further, it splits the shortwave downwelling radiation $R_{S\downarrow}$ into a direct component $R_{S\downarrow,\mathrm{dir}}$ and a diffuse component $R_{S\downarrow,\mathrm{diff}}$ resulting from scattering by clouds. Since clouds affect how much shortwave radiation reaches the surface and how much longwave radiation from the surface is reflected, we include ERA5's low cloud cover (z < ca. 2 km) and total cloud cover in the dataset. Close to water bodies, QL is expected to be higher due to stronger evaporation, leading to lower QH and, thus, lower Cn2. Therefore, QL and other moisture-related ERA5 variables are listed as moisture-related features in Table 2.
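Numerically, the budget can be used as a consistency check or to estimate one flux as the residual of the others; the sign conventions below follow the text, and the function names are illustrative:

```python
def net_flux(downwelling, upwelling):
    """Net radiation R = R_down - R_up."""
    return downwelling - upwelling

def latent_heat_residual(q_h, r_s_net, r_l_net):
    """Q_L from the simplified steady-state budget Q_L + Q_H ~ R_S + R_L,
    neglecting the soil heat flux (illustrative; not an ERA5 variable)."""
    return r_s_net + r_l_net - q_h
```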

4) Auxiliary features

Finally, we include several auxiliary features more loosely related to wind shear and buoyancy, such as the boundary layer dissipation (BLD) rate, convective available potential energy (CAPE), or the aforementioned GWD. Also, certain daily and seasonal patterns exist in meteorology, which we aim to capture through synthetic time-dependent features. Based on the timestamp of each data point, we compute the sines and cosines (for periodicity) of the normalized hour of the day (h′ = 2π h/24 h), the normalized day of the year (day′ = 2π day/365 days), and the normalized month of the year (month′ = 2π month/12 months).
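The cyclic encodings can be generated directly from each timestamp; a minimal sketch following the normalizations in the text:

```python
import numpy as np
import pandas as pd

def time_features(ts: pd.Timestamp) -> dict:
    """Sine/cosine encodings of hour of day, day of year, and month."""
    h = 2 * np.pi * ts.hour / 24
    d = 2 * np.pi * ts.dayofyear / 365
    m = 2 * np.pi * ts.month / 12
    return {"sin_h": np.sin(h), "cos_h": np.cos(h),
            "sin_day": np.sin(d), "cos_day": np.cos(d),
            "sin_month": np.sin(m), "cos_month": np.cos(m)}
```

The sine/cosine pair ensures that, e.g., hour 23 and hour 0 are encoded as neighbors rather than as opposite ends of a range.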

5. Results: Feature importance

The physics governing optical turbulence are the same everywhere—in urban or rural environments and for inland stations or coastal sites. The modulating processes are always buoyancy and wind shear. However, the ERA5 features that best reflect and predict these processes locally can differ between sites. To assess and quantify potential differences, we present the SHAP-based feature importance values for all trained models in Fig. 3. To make the FI analysis less verbose, we make use of the linearity of the SHAP values [cf. Eq. (4)] and group the SHAP values of features related to similar physical processes as presented in Table 2. These groups are prefixed with Σ.
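Because SHAP attributions are additive, the attribution of a feature group is simply the sum of its members' attributions. A sketch of the grouping (array shapes and group names are assumptions, not the paper's exact implementation):

```python
import numpy as np

def grouped_importance(shap_values, feature_names, groups):
    """Mean-|SHAP| importance per feature group.

    shap_values: (n_samples, n_features) array of per-feature attributions.
    groups: mapping of group name -> list of member feature names.
    """
    importance = {}
    for gname, members in groups.items():
        cols = [feature_names.index(f) for f in members]
        # Linearity of SHAP: sum member attributions per sample first,
        # then average the magnitudes over all samples.
        group_attr = shap_values[:, cols].sum(axis=1)
        importance[gname] = float(np.abs(group_attr).mean())
    return importance
```

Summing before taking absolute values matters: opposing attributions within a group cancel, as they do in the model output itself.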

Fig. 3.

SHAP value–based FI of all OTCliM models (a) aggregated and (b) per training site. In (b), urban stations are marked with (*), the top 10 features of each station are highlighted in orange, and the global FI averages are indicated as black dashes. (c) Repeating the geopotential height from Fig. 2 to aid geographical interpretation of the results.


Figure 3a depicts the importance distributions of all features for all models. This global view reveals that buoyancy-related features such as radiation ΣR, sensible heat flux QH, and cloud cover Σcc are key variables for Cn2 prediction. Also, the shear exponent α is picked up, indicating that the OTCliM models identify wind shear as a modulating factor of turbulence. The dependency on wind direction ΣX suggests that upstream effects influence Cn2 prediction at many sites, while atmospheric stability is likely captured next to QH by the temperature differences ΣΔT and the boundary layer height hi. All of the above aligns with our physical knowledge of atmospheric turbulence, indicating that the models picked up physically meaningful relations between ERA5-generated variables and Cn2. However, the boxplots also show outliers for several features, indicating that not all models agree with the global view.

The FI values are displayed per station in Fig. 3b, allowing us to assess the outliers in more detail. The ten features with the highest average FI for each location are highlighted in orange. Since different stations exhibit different top 10 features, 20 features are shown in total, but their markers are colored in light gray if the features are not in the top 10 of the site in question. For reference, the black lines for each feature represent the global average FI values according to Fig. 3a. Comparing all per-station plots on a high level reveals two key points. First, although the FI values associated with the five models trained for each site exhibit some scatter, the overall importance of features at one site is consistent between models. Therefore, we conclude that the models captured features representative of that site throughout the training years. Second, the color coding highlights that different features are important for different sites, suggesting that the processes dominating turbulence vary between locations.

The three urban stations, BKLN, QUEE, and STAT, and the coastal Southold (SOUT) site deviate most obviously from the mean FI distribution. In particular, BKLN and QUEE show an above-average dependency of estimated Cn2 on wind direction. Both stations are located on rooftops and have some tall buildings in their neighborhood (Brotzge et al. 2020). Such inhomogeneous urban conditions are known to influence local measurements (WMO 2023), so it seems reasonable that turbulence strength observed at BKLN and QUEE depends more strongly on wind direction than at other, more rural locations. All four stations (urban + SOUT) also show a below-average dependency on QH, which we view as a feature capturing the diurnal cycle. Instead, the models seem to have picked Σh′ or hi as predictors for this information. This behavior differs from all other models, where QH is typically the second most important input, in line with more traditional physics-based Cn2 models, such as W71. This shift, however, need not be viewed as unfavorable; it demonstrates the ability of ML-based modeling to move from traditional to "unconventional" features if complex flow conditions require it. The downside is that we expect such models to perform more poorly when applied to other sites where the more traditional features are relevant. BKLN, especially, is expected to generalize poorly because it also depends strongly on the temperature features T2, Td2, and Tsoil, which almost all other models deem irrelevant.

Less drastic but distinct differences are also visible between the nonurban models. Wind shear α, for example, is assigned lower-than-average importance by models trained on lake shores [Fredonia (FRED), Burt (BURT), Ontario (ONTA)]. The models of another set of stations [Whitehall (WHIT), Voorheesville (VOOR), Schuylerville (SCHU), Penn Yan (PENN), Owego (OWEG)] picked up the GWD rate. These sites are located in valleys or mountainous areas where gravity waves could modulate near-surface turbulence (Lilly 1972; Palmer et al. 1986). Still, the dependency is small, and the stations Chazy (CHAZ) and Red Hook (REDH), located at the ends of valleys but still surrounded by mountains, do not depend on GWD. Explaining these differences would require a more detailed study of the sites' climatologies, which is beyond the scope of this work.

Overall, the FI values of most models represent the known physical dependencies of atmospheric turbulence. That supports our confidence that our OTCliM approach is well suited for MCP. The unconventional features picked up by some models are viewed as an advantage of ML-based methods to arrive at accurate predictions even in complex environments. However, we assume that geographical generalization will be more difficult for such models, which is addressed later in this paper.

6. Results: OTCliM performance

After establishing that all OTCliM models picked up physically meaningful dependencies, we turn our analysis to quantifying their performance. The foundation for this analysis is 85 models trained individually for the 17 NYSM stations and the five training years. The temporal extrapolation capabilities of each model are quantified in section 6a to assess the suitability of OTCliM for MCP. To put the MCP scores achieved by our approach into perspective, we compare them to the two baseline models, W71 and the in situ–based GBM models. Since we have a network of stations available, we also assess the geographical generalizability of the OTCliM models by applying models trained on one site to all other sites. The results of this cross-site evaluation study are discussed in section 6b.

a. Temporal extrapolation

The temporal extrapolation performance of each model $\hat{f}_{s,t}$ with respect to the correlation coefficient r and the scaled RMSE ϵ is presented in Figs. 4a and 4b, respectively. The heatmaps for both metrics show the MCP scores $r_{s,t}$ and $\epsilon_{s,t}$, with their site-averaged values $\bar{r}_s$ and $\bar{\epsilon}_s$ being compared to the baseline models in the accompanying scatterplots. The heatmaps reveal similar patterns for both metrics: the score variance across training years (rows) is low for almost all stations, while the site-specific performance (columns) varies notably. In other words, regardless of the year of Cn2 observations used for training, the MCP performance stays consistent, whereas Cn2 at some locations (e.g., QUEE, BKLN, SOUT) is easier to predict from ERA5 than at others (e.g., VOOR, WHIT). It is unexpected but impressive that the urban models marked with (*) are the highest performing ones, given the typically complex urban climatology, which makes traditional modeling difficult (Rotach et al. 2005). Figures 4c and 4d show a few randomly drawn batches of observed Cn2 with their corresponding predictions. These curves illustrate two reasons why the overall performance at SOUT (Fig. 4c) is better than at WHIT (Fig. 4d). First, the observed Cn2 (black) at WHIT exhibits more high-frequency oscillations than at SOUT, which are missing in the predictions (red). Second, complex short-duration events seem to occur more frequently at WHIT and are only partially captured. Both effects indicate that WHIT is subject to more complex flows than SOUT and that OTCliM misses these details. We attribute this to missing details in ERA5, as discussed later. For now, we conclude that the models for both locations capture the general trends of Cn2 well and that the observed intersite performance variation is primarily due to smoothed predictions. For practical applications such as obtaining a Cn2 climatology, the large-scale behavior of Cn2 is the most relevant, so the smoothing is of limited concern.
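The two scores can be computed per model as below; the correlation follows the standard Pearson definition, while the exact normalization of the scaled RMSE ϵ is not reproduced in this section, so an IQR-based scaling consistent with appendix B is assumed:

```python
import numpy as np

def correlation(y_true, y_pred):
    """Pearson correlation coefficient r between observations and predictions."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def scaled_rmse(y_true, y_pred, q25, q75):
    """RMSE normalized by the training IQR (assumed definition of epsilon)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean(((y_true - y_pred) / (q75 - q25)) ** 2)))
```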

Fig. 4.

Performance of OTCliM models compared to the baseline models. (a),(b) The heatmaps show the performance of individual OTCliM models trained with 1 year of Cn2 observations from site s when evaluated on the four hold-out years of the same site. The scatterplots compare the site-averaged performance against the baseline estimates. (c),(d) Seven-day batches of Cn2 are randomly drawn to compare observations (black) against predictions (red) for (c) a high-accuracy and (d) a lower-accuracy OTCliM model, with complex turbulence conditions shaded in gray.


The insight that the choice of training year has little influence on model performance is of high practical relevance. It suggests that models can be trained on archived data and do not necessarily require recent observations, provided that the site's climatology remains comparable between the training period and the envisioned period of model application. Drastic changes in the station's surroundings, e.g., through construction projects or climate change, would likely break this assumption and, thus, the model's applicability.

The variability across models trained for different stations is likely due to ERA5 not fully capturing all local effects modulating the turbulence. For example, some landscape features, such as the small lakes around PENN and WARS, are smaller than the ERA5 resolution, so they are missed. The relevance of these local features is highlighted when the OTCliM performance (black circles) is compared to that of the in situ baseline models in the $\bar{r}_s$ and $\bar{\epsilon}_s$ scatterplots. While the in situ models exhibit a small performance scatter around their mean accuracy (dashed lines), a widening performance gap is visible, which starts small (QUEE) and grows toward the lower end of the performance ranking. Remember that only the input data differ between the OTCliM and in situ models (ERA5 vs in situ observations); the GBM training framework remains the same. Consequently, the growing gap between the model variants can be attributed to ERA5 holding less predictive power for some locations than for others. Expectedly, the models with lower accuracy (e.g., WHIT, VOOR, PENN) belong to stations located close to lakes or in mountainous areas, which are usually subject to complex flows that ERA5 might not capture well. Surprisingly, the ERA5-based urban models perform almost on par with the in situ-based models. This result is impressive, given that urban flows are complex and likely not well represented in ERA5. Nevertheless, as discussed earlier, the GBM seemingly circumvented this issue by shifting from traditional to unconventional features in the urban cases. These features evidently hold enough predictive power for the high performance observed here.

A final comparison between OTCliM and the traditional W71 parameterization in the scatterplots of Figs. 4a and 4b shows that OTCliM clearly outperforms W71. That is because the underlying similarity function [cf. Eq. (3)] is empirically determined from flow over a flat, unobstructed plain (Wyngaard et al. 1971) and does not adapt to local topography or climatology. This limitation becomes especially evident for the urban sites QUEE and BKLN, where $\bar{r}_s$ is well below the W71 average (dashed line). Our OTCliM models, on the other hand, perform well at these locations, highlighting the advantage of accounting for local climatology in complex atmospheric conditions. In summary, the presented scores demonstrate the capability of our OTCliM approach to accurately extrapolate a 1-yr Cn2 time series to multiple years from ERA5 input for a diverse set of locations.

b. Cross-site evaluation

Next, we assess the geographical generalization capabilities of the OTCliM models by evaluating their performance across different sites. Each model $\hat{f}_{s,t}$ trained on one site $s$ is asked for Cn2 predictions based on 5 years of ERA5 input data from all other sites $\tilde{s} \in S \setminus \{s\}$. The resulting cross-site (c/s) correlation coefficient and RMSE scores are presented in Fig. 5. Each cell $(s, \tilde{s})$ of the heatmaps corresponds to the average score achieved by the five OTCliM models trained on site $s$ (rows) and evaluated on site $\tilde{s}$ (columns). The heatmaps on the left present the scores normalized by the MCP scores, reflecting the performance degradation of a model when applied to other sites compared to its performance on the hold-out years of the original training site. The heatmaps on the right show the absolute scores, and the histograms in the middle display the distributions of both relative and absolute scores with the diagonals excluded, i.e., without the MCP performance. The rows and columns of all heatmaps are ordered by mean generalization performance, quantified as $\langle r_{s \to \tilde{s}} \rangle_{\tilde{s} \in S \setminus \{s\}}$.
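The relative scores follow from the absolute score matrix by normalizing each row with its diagonal (MCP) entry; a minimal sketch:

```python
import numpy as np

def relative_scores(scores):
    """Normalize a cross-site score matrix by its diagonal MCP scores,
    i.e., cell (s, s~) becomes score(s -> s~) / score(s -> s)."""
    scores = np.asarray(scores, dtype=float)
    return scores / np.diag(scores)[:, None]
```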

Fig. 5.

The c/s evaluation performance of OTCliM models. The rows of the heatmaps present the performance of the models trained on site $s$ when evaluated on all other sites $\tilde{s} \in S$. The left heatmaps show the performance degradation compared to the MCP case, i.e., the relative scores $\bullet'_{s \to \tilde{s}} = \bullet_{s \to \tilde{s}} / \bullet_{s \to s}$, with $\bullet$ representing $r$ or $\epsilon$, and the right ones show the absolute scores for reference. The histograms depict the distributions of both heatmaps. (a) Relative and absolute correlation coefficient $r$ (higher is better). (b) Relative and absolute RMSE $\epsilon$ (lower is better).


Overall, the geographical generalization works very well for the nonurban sites. In half of all cases, the correlation and RMSE performance degrades by no more than 16% for r and 25% for ϵ compared to the original MCP scores. Considering 75% of the cases, corresponding to the c/s scores of almost all nonurban sites, the absolute performance falls as low as r = 0.56 and ϵ = 0.63, which is still significantly better than the W71 baseline scores of $\bar{r}_s = 0.3$ and $\bar{\epsilon}_s = 1.15$ (cf. Fig. 4). In other words, any nonurban OTCliM model applied to any other nonurban site yields significantly better performance than the traditional approach.

As expected from the FI analysis, the urban models do not perform well in nonurban locations. Both relative and absolute c/s scores in Fig. 5 are low for STAT, QUEE, and BKLN, resulting in long tails in the c/s histograms. However, the urban models perform better at other urban sites, as indicated by the small square of higher performance in the bottom-right corner of the c/s matrices. It seems that the performance of urban generalization could be linked to the degree of urbanization. BKLN is the most urban site, QUEE has more vegetation than BKLN, and STAT has more vegetation than QUEE, which is reflected in the progression of c/s performance: BKLN generalizes most poorly and depends on the most uncommon features, QUEE performs slightly better, and STAT performs even better with FI values similar to the average case. Nevertheless, this observation might be unique to these New York City stations, which are all relatively close to each other and experience similar mesoscale conditions.

7. Conclusions

This study presents OTCliM, a gradient-boosting-based measure–correlate–predict approach to obtain climatologies of optical turbulence strength from 1 year of Cn2 observations and multiyear ERA5 reference data. A feature importance (FI) analysis based on SHAP values revealed that most models, especially the nonurban ones, learn similar dependencies. The dominating features reflect the known and expected physical dependencies, suggesting that the OTCliM models captured physically meaningful relations between ERA5 and Cn2. The FI results also suggest features such as boundary layer dissipation rate or gravity wave dissipation rate as Cn2 dependencies, which are not considered in traditional models. This result demonstrates how machine learning models could hint at new insights into turbulent processes. In contrast to most stations, the three urban locations differed significantly from the average FI distribution. We assume that such models learned very different input–output relations, hindering them from generalizing well to other sites. In a temporal extrapolation study, we demonstrate that the OTCliM models accurately predict four held-out years of Cn2 from ERA5 inputs across 17 diverse sites in New York State. The models’ performances stay consistent regardless of the year of data selected for training, suggesting archive data can also be used for training. The trained models are also found suitable for stations located in complex environments such as valleys or cities, which are often difficult to model traditionally. Minor accuracy variations between models from different sites are visible but can be attributed to local details missing in the coarse ERA5 input data. For some regions in the world, regional reanalysis datasets are available at higher resolution [e.g., Europe (Copernicus Climate Change Service 2021) or Australia (Su et al. 2019)] or with more assimilated data [e.g., North America (Mesinger et al. 2006)], which might help to reduce this problem. 
Already with ERA5 input, OTCliM models significantly outperform traditional analytical Cn2 models and achieve scores close to those of GBM models trained on in situ inputs. A geographical generalization study also showed that applying nonurban models to other nonurban sites yields high scores in many cases, indicating that these models generalize well. The performance only degrades significantly when the urban models are evaluated at nonurban sites. While these models performed very well in the temporal extrapolation test case, the low scores for cross-site application indicate that the urban models learned very site-specific relations.

The key conclusion for practice from our work is that 1 year of Cn2 observations is sufficient to obtain reliable site-specific Cn2 climatologies using OTCliM. Measuring Cn2 at sites of interest for only 1 year instead of multiple years would free up instruments sooner. As a result, site survey costs could be reduced, or more locations could be surveyed for the same costs. OTCliM’s high performance in urban environments is highly relevant for FSOC, where terminals can be expected to be located in both rural and urban locations. The good geographical generalization results of the nonurban models indicate that future OTCliM models trained on a few stations can potentially be used to estimate the spatial distribution of Cn2 over larger areas. Such Cn2 maps could provide a first indication of locations suitable for FSOC or astronomy with respect to optical turbulence. However, the region-specific features identified in the FI analysis for lake or mountain environments also suggest that the models must account for local effects. Developing such multiregion models requires more work before large-scale Cn2 estimates can be considered reliable. As a first step, users can perform an FI analysis on their trained models and use the results as guidance for expected geographical generalization. For example, if the models pick up unconventional features, they will likely not generalize to other sites, but if features are more traditional, they might.

Finally, we believe that OTCliM can be highly relevant also for Cn2 forecasts. The rapidly advancing ML weather prediction (MLWP) models, such as GraphCast (Lam et al. 2023) or Artificial Intelligence Forecasting System (AIFS) (Lang et al. 2024), are typically trained on ERA5 and, therefore, produce ERA5-like forecasts. Currently, not all variables required for OTCliM are available from MLWP, but in principle, OTCliM can be used to translate the MLWP forecasts to Cn2 forecasts. Similarly, one could train OTCliM on historical data from the Global Forecast System (GFS) (NCEP 2015), a traditional numerical weather prediction system, and convert the GFS forecasts to Cn2 forecasts.

Besides all these advantages, the critical assumption of OTCliM is that the observed year of Cn2 is representative of the temporal extrapolation range. If the surroundings change drastically, e.g., through construction, this assumption breaks down, and OTCliM predictions can no longer be considered valid. Also, our proposed approach currently only predicts near-surface Cn2, but for astronomy and FSOC, higher-level Cn2 or full profiles are also relevant. In a previous study, we presented an ML-based framework that can combine Cn2 from multiple levels into one physics-inspired model (Pierzyna et al. 2023b) for multilevel Cn2 predictions. How this approach can be integrated into OTCliM remains open for future work. In summary, our presented OTCliM approach is shown to be highly accurate in diverse meteorological conditions with the potential for geographical generalization. We believe that OTCliM is a relevant tool for the optical turbulence community for future site surveys and related studies.

Acknowledgments.

This publication is part of the project FREE—Optical Wireless Superhighways: Free photons (at home and in space) (with project P19-13) of the research programme TTW-Perspectief, which is (partly) financed by the Dutch Research Council (NWO). This research is made possible by the New York State (NYS) Mesonet. Original funding for the NYS Mesonet (NYSM) buildup was provided by Federal Emergency Management Agency Grant FEMA-4085-DR-NY. The continued operation and maintenance of the NYSM is supported by the National Mesonet Program, the University at Albany, and federal and private grants, among others. Sukanta Basu is grateful for financial support from the State University of New York's Empire Innovation Program.

Data availability statement.

The Python code for training and the trained models are available on GitHub: https://github.com/mpierzyna/otclim.

APPENDIX A

Ratio of Stable to Unstable Conditions in Dataset Before and After QA/QC

A three-step quality assurance (QA) and quality control (QC) procedure is described in section 4a(2), which aims at filtering unphysical values from our Cn2 dataset to lay the foundation for high-quality OTCliM models. As discussed in that section, the QA/QC procedure discards approximately half of the data points across all stations and years. We aim for the QA/QC not to significantly change the ratio of unstable to stable atmospheric conditions so that the OTCliM models are exposed to turbulence conditions representative of the respective site. In Fig. A1, we present the distribution of atmospheric stability before and after QA/QC for each location. Stability is quantified through the bulk potential temperature gradient Γ = (θ9 − θ2)/7 m, which is computed from the potential temperature θ measured by two thermometers at 2 and 9 m above ground (NYS Mesonet 2023). These instruments differ from the sonic anemometers used to estimate Cn2 and are more reliable, resulting in only 1.5% missing values on average due to instrumentation problems, compared to 16% for the sonic anemometers. The only exception is the QUEE site, which did not have any 9-m observations during our period of interest and is therefore omitted from this discussion. Figure A1 shows that the unstable/stable ratio (see bar chart insets) does not change much at most sites due to quality control. In general, the number of stable-condition samples decreases because stable conditions occur more frequently in winter, when instrument malfunctioning due to rain, ice, or snow is more common. Additionally, the inertial range shrinks with increasing stability and can even disappear (Grachev et al. 2013), resulting in more failed fits of the 2/3 slope (QA/QC step one). Since the balance between stable and unstable cases remains better than approximately 30/70 for all stations, we view our dataset as reasonably well balanced and adequate for training.

Fig. A1.

Distribution of the bulk potential temperature gradient Γ = (θ9 − θ2)/7 m before (blue) and after (orange) applying the QA/QC steps described in section 4a(2). The split between unstable (Γ < 0, solid bars) and stable (Γ > 0, hatched bars) atmospheric conditions is visualized by the bar charts in each panel and quantified in the respective legends. The QUEE site is omitted entirely because of a malfunctioning instrument.


APPENDIX B

Scaling of log10Cn2 Target Data

As described in section 4a(3) of the main text, the log10Cn2 data are normalized using the 25th and 75th percentiles of the site-specific log10Cn2 training data [cf. Eq. (10)]. The effect of this scaling is illustrated in Fig. B1, which presents the unscaled (left) and scaled (right) data colored by location. The plots demonstrate that the scaling makes the range of log10Cn2 more comparable between sites, which is crucial to obtain comparable performance and FI scores. That is especially important for the urban BKLN site (dark blue), where the unscaled right tail of the distribution is shifted to the right by half an order of magnitude (higher OT strength) compared to the other sites. To keep the scaling robust and general (i.e., applicable to non-Gaussian distributions), we scale using the interquartile range and not the commonly used standard deviation.
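A robust-scaling sketch in the spirit of Eq. (10); centering on the median is an assumption here, as the text only specifies that the 25th and 75th percentiles of the training data are used:

```python
import numpy as np

def iqr_scale(x_train, x):
    """Scale x by the interquartile range of the training data
    (robust alternative to standard-deviation scaling)."""
    q25, q75 = np.percentile(x_train, [25, 75])
    return (np.asarray(x) - np.median(x_train)) / (q75 - q25)
```

Unlike standardization by the mean and standard deviation, this scaling is insensitive to the heavy tails visible in the log10 Cn2 distributions.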

Fig. B1.

Comparison between distributions of (left) unscaled and (right) scaled log10Cn2. Colors indicate the different sites at which Cn2 is observed with urban sites marked by (*).


APPENDIX C

Details about Baseline Models

This section details how the traditional W71 Cn2 parameterization and the in situ–based GBM models are set up and utilized as performance baselines for our proposed OTCliM approach.

a. W71 Cn2 parameterization

All variables on which the W71 equations [cf. Eqs. (1)–(3)] depend are available from ERA5 (cf. Table 2). Only the dynamic sensible heat flux QH from ERA5 needs to be converted to its kinematic form $\overline{w\theta}$ as $\overline{w\theta} = Q_H/(\rho c_p)$, where ρ and cp are the density and specific heat capacity of air (Stull 1988). Also, the sign of QH is flipped because ERA5 treats an upward QH as negative, whereas W71 expects the upward $\overline{w\theta}$ to be positive. After this conversion, we evaluate the W71 equations with the same ERA5 data extracted for the 17 NYSM stations that we use to train the OTCliM models. The results are seventeen 5-yr Cn2 time series, which can be compared to the observed Cn2 evolution at each station. The corresponding RMSE and r scores are presented in Fig. 4.
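The conversion, including the sign flip, can be written compactly; the default values for ρ and cp are typical near-surface values and are an assumption of this sketch:

```python
def kinematic_heat_flux(q_h_era5, rho=1.225, cp=1005.0):
    """Convert the ERA5 dynamic sensible heat flux (W m^-2, upward negative)
    to the kinematic flux (K m s^-1, upward positive) expected by W71."""
    return -q_h_era5 / (rho * cp)
```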

b. In situ GBM models

These upper-bound models are similar to the OTCliM models but use different input data. The OTCliM model of a specific site s is a GBM model (cf. section 3a) that employs ERA5 input data extracted from the grid box containing site s and Cn2 observed at site s. Instead of ERA5, one can utilize in situ weather data observed at s. The resulting GBM in situ models are quite similar, for example, to the work presented by Wang and Basu (2016), Jellen et al. (2021), or Pierzyna et al. (2023b). As for the selection of ERA5 variables (cf. section 4b), we aim to capture buoyancy and wind shear with multiple features based on the instruments deployed at the NYSM stations. As buoyancy features, we compute the temperature gradient Γ = (T9 − T2)/(9 m − 2 m) from the observed 2- and 9-m temperatures and utilize the observed 30-min sensible heat flux and observed incoming radiation. Assuming the velocity at the ground to be close to zero, we obtain a crude estimate of the bulk wind shear S = M10/10 m, where M10 is the horizontal wind speed at 10 m. Additionally, the observed friction velocity u*, latent heat flux wq¯, and dewpoint spread ΔTd = T2 − Td2 are included as features, where Td2 is the 2-m dewpoint temperature. To account for atmospheric stability, Γ and S are combined into a bulk Richardson number Ri = (g/T̄)Γ/S², with g = 9.81 m s⁻². Upstream effects are captured by adding the sine and cosine of the 10-m wind direction. All these variables form the in situ dataset, which is used to train GBM models in the same fashion as OTCliM: each in situ model is trained on 1 year of observations from site s and evaluated on the hold-out years of the same site. The resulting scores are presented as the upper baseline in Fig. 4.
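The derived features above (bulk gradient, bulk shear, Richardson number, and wind-direction encoding) can be sketched as in the following snippet. Variable and function names are assumptions for illustration; directly observed quantities (heat fluxes, u*, radiation) would be appended to this feature set unchanged.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m s^-2)

def in_situ_features(t2, t9, m10, wdir_deg):
    """Compute the derived in situ GBM features described above.

    A minimal sketch under stated assumptions: t2, t9 are 2-m and 9-m
    temperatures (K), m10 is the 10-m wind speed (m s^-1), and wdir_deg
    is the 10-m wind direction (deg).
    """
    gamma = (t9 - t2) / 7.0                 # bulk temperature gradient (K m^-1)
    shear = m10 / 10.0                      # crude bulk wind shear (s^-1)
    t_mean = 0.5 * (t2 + t9)                # mean temperature for buoyancy term
    ri = (G / t_mean) * gamma / shear**2    # bulk Richardson number (stability)
    wdir_rad = np.deg2rad(wdir_deg)
    return {
        "gamma": gamma,
        "shear": shear,
        "Ri": ri,
        "wdir_sin": np.sin(wdir_rad),       # cyclic encoding of upstream
        "wdir_cos": np.cos(wdir_rad),       # direction effects
    }
```

An unstable afternoon (T2 > T9) yields Γ < 0 and hence Ri < 0, while a stably stratified night gives Ri > 0, so the GBM can separate stability regimes from these features alone.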

REFERENCES

  • Andreas, E. L., 1988: Estimating Cn2 over snow and sea ice from meteorological data. J. Opt. Soc. Amer. A, 5, 481–495, https://doi.org/10.1364/JOSAA.5.000481.
  • Arockia Bazil Raj, A., J. Arputha Vijaya Selvi, and S. Durairaj, 2015: Comparison of different models for ground-level atmospheric turbulence strength (Cn2) prediction with a new model according to local weather data for FSO applications. Appl. Opt., 54, 802–815, https://doi.org/10.1364/AO.54.000802.
  • Beason, M., G. Potvin, D. Sprung, J. McCrae, and S. Gladysz, 2024: Comparative analysis of Cn2 estimation methods for sonic anemometer data. Appl. Opt., 63, E94–E106, https://doi.org/10.1364/AO.520976.
  • Beyrich, F., O. K. Hartogensis, H. A. R. de Bruin, and H. C. Ward, 2021: Scintillometers. Springer Handbook of Atmospheric Measurements, T. Foken, Ed., Springer Handbooks, Springer, 969–997, https://doi.org/10.1007/978-3-030-52171-4_34.
  • Bolbasova, L. A., A. A. Andrakhanov, and A. Y. Shikhovtsev, 2021: The application of machine learning to predictions of optical turbulence in the surface layer at Baikal Astrophysical Observatory. Mon. Not. Roy. Astron. Soc., 504, 6008–6017, https://doi.org/10.1093/mnras/stab953.
  • Brotzge, J. A., and Coauthors, 2020: A technical overview of the New York State Mesonet standard network. J. Atmos. Oceanic Technol., 37, 1827–1845, https://doi.org/10.1175/JTECH-D-19-0220.1.
  • Carta, J. A., S. Velázquez, and P. Cabrera, 2013: A review of Measure-Correlate-Predict (MCP) methods used to estimate long-term wind characteristics at a target site. Renewable Sustainable Energy Rev., 27, 362–400, https://doi.org/10.1016/j.rser.2013.07.004.
  • Chen, T., and C. Guestrin, 2016: XGBoost: A scalable tree boosting system. Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, Association for Computing Machinery, 785–794, https://doi.org/10.1145/2939672.2939785.
  • Cherubini, T., and S. Businger, 2013: Another look at the refractive index structure function. J. Appl. Meteor. Climatol., 52, 498–506, https://doi.org/10.1175/JAMC-D-11-0263.1.
  • Cherubini, T., R. Lyman, and S. Businger, 2021: Forecasting seeing for the Maunakea observatories with machine learning. Mon. Not. Roy. Astron. Soc., 509, 232–245, https://doi.org/10.1093/mnras/stab2916.
  • Copernicus Climate Change Service, 2021: CERRA sub-daily regional reanalysis data for Europe on single levels from 1984 to present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS), accessed 1 August 2024, https://doi.org/10.24381/CDS.622A565A.
  • Fuchs, C., and F. Moll, 2015: Ground station network optimization for space-to-ground optical communication links. J. Opt. Commun. Networking, 7, 1148–1159, https://doi.org/10.1364/JOCN.7.001148.
  • Grachev, A. A., E. L. Andreas, C. W. Fairall, P. S. Guest, and P. O. G. Persson, 2013: The critical Richardson number and limits of applicability of local similarity theory in the stable boundary layer. Bound.-Layer Meteor., 147, 51–82, https://doi.org/10.1007/s10546-012-9771-0.
  • Hardy, J. W., 1998: Adaptive Optics for Astronomical Telescopes. Oxford Series in Optical and Imaging Sciences, Vol. 16, Oxford University Press, 438 pp.
  • He, P., and S. Basu, 2016: Development of similarity relationships for energy dissipation rate and temperature structure parameter in stably stratified flows: A direct numerical simulation approach. Environ. Fluid Mech., 16, 373–399, https://doi.org/10.1007/s10652-015-9427-y.
  • Hersbach, H., and Coauthors, 2020: The ERA5 global reanalysis. Quart. J. Roy. Meteor. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803.
  • Hill, F., and Coauthors, 2006: Site testing for the advanced technology solar telescope. Proc. SPIE, 6276, 62671T, https://doi.org/10.1117/12.673677.
  • ITU, 2007: Prediction methods required for the design of terrestrial free-space optical links. Recommendation Rec. ITU-R P.1814, International Telecommunication Union, 12 pp., https://www.itu.int/dms_pubrec/itu-r/rec/p/R-REC-P.1814-0-200708-I!!PDF-E.pdf.
  • Jahid, A., M. H. Alsharif, and T. J. Hall, 2022: A contemporary survey on free space optical communication: Potentials, technical challenges, recent advances and research direction. J. Network Comput. Appl., 200, 103311, https://doi.org/10.1016/j.jnca.2021.103311.
  • Jellen, C., J. Burkhardt, C. Brownell, and C. Nelson, 2020: Machine learning informed predictor importance measures of environmental parameters in maritime optical turbulence. Appl. Opt., 59, 6379–6389, https://doi.org/10.1364/AO.397325.
  • Jellen, C., M. Oakley, C. Nelson, J. Burkhardt, and C. Brownell, 2021: Machine-learning informed macro-meteorological models for the near-maritime environment. Appl. Opt., 60, 2938–2951, https://doi.org/10.1364/AO.416680.
  • Kaimal, J. C., and J. E. Gaynor, 1991: Another look at sonic thermometry. Bound.-Layer Meteor., 56, 401–410, https://doi.org/10.1007/BF00119215.
  • Kartal, S., S. Basu, and S. J. Watson, 2023: A decision-tree-based measure–correlate–predict approach for peak wind gust estimation from a global reanalysis dataset. Wind Energy Sci., 8, 1533–1551, https://doi.org/10.5194/wes-8-1533-2023.
  • Kaushal, H., and G. Kaddoum, 2017: Optical communication in space: Challenges and mitigation techniques. IEEE Commun. Surv. Tutorials, 19, 57–96, https://doi.org/10.1109/COMST.2016.2603518.
  • Ke, G., Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, 2017: LightGBM: A highly efficient gradient boosting decision tree. NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Curran Associates Inc., 3149–3157, https://dl.acm.org/doi/10.5555/3294996.3295074.
  • Lam, R., and Coauthors, 2023: Learning skillful medium-range global weather forecasting. Science, 382, 1416–1421, https://doi.org/10.1126/science.adi2336.
  • Lang, S., and Coauthors, 2024: AIFS—ECMWF’s data-driven forecasting system. arXiv, 2406.01465v2, https://doi.org/10.48550/arXiv.2406.01465.
  • Lilly, D. K., 1972: Wave momentum flux—A GARP problem. Bull. Amer. Meteor. Soc., 53, 17–24, https://doi.org/10.1175/1520-0477-53.1.17.
  • Lundberg, S. M., and S.-I. Lee, 2017: A unified approach to interpreting model predictions. NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Curran Associates Inc., 4768–4777, https://dl.acm.org/doi/10.5555/3295222.3295230.
  • Masciadri, E., J. Vernin, and P. Bougeault, 1999: 3D mapping of optical turbulence using an atmospheric numerical model. Astron. Astrophys. Suppl. Ser., 137, 185–202, https://doi.org/10.1051/aas:1999474.
  • Mesinger, F., and Coauthors, 2006: North American regional reanalysis. Bull. Amer. Meteor. Soc., 87, 343–360, https://doi.org/10.1175/BAMS-87-3-343.
  • Milli, J., T. Rojas, B. Courtney-Barrer, F. Bian, J. Navarrete, F. Kerber, and A. Otarola, 2020: Turbulence nowcast for the Cerro Paranal and Cerro Armazones observatory sites. arXiv, 2012.05674v2, https://doi.org/10.48550/arXiv.2012.05674.
  • Moene, A. F., 2003: Effects of water vapour on the structure parameter of the refractive index for near-infrared radiation. Bound.-Layer Meteor., 107, 635–653, https://doi.org/10.1023/A:1022807617073.
  • Molnar, C., 2022: Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 2nd ed. Christoph Molnar, 317 pp.
  • Monin, A. S., and A. M. Obukhov, 1954: Basic laws of turbulent mixing in the surface layer of the atmosphere. Contrib. Geophys. Inst. Acad. Sci., 151, 163–187.
  • NCEP, 2015: NCEP GFS 0.25 Degree Global Forecast Grids Historical Archive. Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory, accessed 1 August 2024, https://doi.org/10.5065/D65D8PWK.
  • NYS Mesonet, 2023: New York State Mesonet flux network data. 9 pp., https://nysmesonet.org/documents/NYSM_Readme_Flux.pdf.
  • Palmer, T. N., G. J. Shutts, and R. Swinbank, 1986: Alleviation of a systematic westerly bias in general circulation and numerical weather prediction models through an orographic gravity wave drag parametrization. Quart. J. Roy. Meteor. Soc., 112, 1001–1039, https://doi.org/10.1002/qj.49711247406.
  • Pham, T. V., H. Yamano, and I. Susumu, 2023: A placement method of ground stations for optical satellite communications considering cloud attenuation. IEICE Commun. Express, 12, 568–571, https://doi.org/10.23919/comex.2023XBL0092.
  • Pierzyna, M., R. Saathof, and S. Basu, 2023a: A multi-physics ensemble modeling framework for reliable Cn2 estimation. Proc. SPIE, 12731, 127310N, https://doi.org/10.1117/12.2680997.
  • Pierzyna, M., R. Saathof, and S. Basu, 2023b: Π-ML: A dimensional analysis-based machine learning parameterization of optical turbulence in the atmospheric surface layer. Opt. Lett., 48, 4484–4487, https://doi.org/10.1364/OL.492652.
  • Pierzyna, M., O. Hartogensis, S. Basu, and R. Saathof, 2024: Intercomparison of flux-, gradient-, and variance-based optical turbulence (Cn2) parameterizations. Appl. Opt., 63, E107–E119, https://doi.org/10.1364/AO.519942.
  • Poulenard, S., M. Crosnier, and A. Rissons, 2015: Ground segment design for broadband geostationary satellite with optical feeder link. J. Opt. Commun. Networking, 7, 325–336, https://doi.org/10.1364/JOCN.7.000325.
  • Rotach, M. W., and Coauthors, 2005: BUBBLE—An urban boundary layer meteorology project. Theor. Appl. Climatol., 81, 231–261, https://doi.org/10.1007/s00704-004-0117-9.
  • Sadot, D., and N. S. Kopeika, 1992: Forecasting optical turbulence strength on the basis of macroscale meteorology and aerosols: Models and validation. Opt. Eng., 31, 200–212, https://doi.org/10.1117/12.56059.
  • Savage, M. J., 2009: Estimation of evaporation using a dual-beam surface layer scintillometer and component energy balance measurements. Agric. For. Meteor., 149, 501–517, https://doi.org/10.1016/j.agrformet.2008.09.012.
  • Schöck, M., and Coauthors, 2009: Thirty meter telescope site testing I: Overview. Publ. Astron. Soc. Pac., 121, 384–395, https://doi.org/10.1086/599287.
  • Spiliotis, E., 2022: Decision trees for time-series forecasting. Foresight Int. J. Appl. Forecasting, 1, 30–44.
  • Stull, R. B., 1988: An Introduction to Boundary Layer Meteorology. Kluwer Academic Publishers, 666 pp.
  • Su, C., X. Wu, S. Wu, Q. Yang, Y. Han, C. Qing, T. Luo, and Y. Liu, 2021: In situ measurements and neural network analysis of the profiles of optical turbulence over the Tibetan Plateau. Mon. Not. Roy. Astron. Soc., 506, 3430–3438, https://doi.org/10.1093/mnras/stab1792.
  • Su, C.-H., and Coauthors, 2019: BARRA v1.0: The Bureau of Meteorology atmospheric high-resolution Regional Reanalysis for Australia. Geosci. Model Dev., 12, 2049–2068, https://doi.org/10.5194/gmd-12-2049-2019.
  • Vorontsov, A. M., M. A. Vorontsov, G. A. Filimonov, and E. Polnau, 2020: Atmospheric turbulence study with deep machine learning of intensity scintillation patterns. Appl. Sci., 10, 8136, https://doi.org/10.3390/app10228136.
  • Wang, C., Q. Wu, M. Weimer, and E. Zhu, 2021: FLAML: A fast and lightweight AutoML library. Proc. Mach. Learn. Syst., 3, 434–447.
  • Wang, H., B. Li, X. Wu, C. Liu, Z. Hu, and P. Xu, 2015: Prediction model of atmospheric refractive index structure parameter in coastal area. J. Mod. Opt., 62, 1336–1346, https://doi.org/10.1080/09500340.2015.1037801.
  • Wang, Y., and S. Basu, 2016: Using an artificial neural network approach to estimate surface-layer optical turbulence at Mauna Loa, Hawaii. Opt. Lett., 41, 2334–2337, https://doi.org/10.1364/OL.41.002334.
  • Wesely, M. L., 1976: The combined effect of temperature and humidity fluctuations on refractive index. J. Appl. Meteor., 15, 43–49, https://doi.org/10.1175/1520-0450(1976)015<0043:TCEOTA>2.0.CO;2.
  • WMO, 2023: Volume III—Observing systems. WMO 8, 428 pp., https://community.wmo.int/en/activity-areas/imop/wmo-no_8.
  • Wyngaard, J. C., Y. Izumi, and S. A. Collins, 1971: Behavior of the refractive-index-structure parameter near the ground. J. Opt. Soc. Amer., 61, 1646–1650, https://doi.org/10.1364/JOSA.61.001646.
  • Zhang, R., J. Huang, X. Wang, J. A. Zhang, and F. Huang, 2016: Effects of precipitation on sonic anemometer measurements of turbulent fluxes in the atmospheric surface layer. J. Ocean Univ. China, 15, 389–398, https://doi.org/10.1007/s11802-016-2804-4.
  • Fig. 1.

    Proposed OTCliM approach to extrapolate a measured 1-yr time series of OT strength (golden yellow) to multiple years (orange) based on ERA5 reference data (blue). Robust yearly Cn2 statistics can be obtained from the extrapolated data.

  • Fig. 2.

    ERA5 representation of NYSM domain with true locations of NYSM flux stations (gray) and corresponding ERA5 grid box (orange) containing the stations. Urban sites are marked with (*).

  • Fig. 3.

    SHAP value–based FI of all OTCliM models (a) aggregated and (b) per training site. In (b), urban stations are marked with (*), the top 10 features of each station are highlighted in orange, and the global FI averages are indicated as black dashes. (c) The geopotential height map from Fig. 2 is repeated to aid geographical interpretation of the results.
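As an illustration of the aggregation shown in this figure, per-site feature importance can be computed as the mean absolute SHAP value of each feature and then averaged across sites to obtain a global FI. The site names, array shapes, and number of features below are hypothetical, not taken from the study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-site SHAP value arrays of shape (n_samples, n_features);
# in practice these would come from explaining each trained OTCliM model.
n_features = 5
sites = {f"site_{i}": rng.normal(size=(100, n_features)) for i in range(3)}

# Per-site feature importance: mean absolute SHAP value of each feature.
fi_per_site = {s: np.mean(np.abs(v), axis=0) for s, v in sites.items()}

# Global (aggregated) importance: average the per-site importances.
fi_global = np.mean(np.stack(list(fi_per_site.values())), axis=0)

# Rank features per site (descending importance), e.g., to pick a top 10.
ranking = {s: np.argsort(fi)[::-1] for s, fi in fi_per_site.items()}
```

This mean-|SHAP| aggregation is one common convention for turning local SHAP explanations into a global importance score; other summaries (e.g., max or median) are possible.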

  • Fig. 4.

    Performance of OTCliM models compared to the baseline models. The heatmaps show the performance of individual OTCliM models trained with 1 year of Cn2 observations from site s when evaluated on the four hold-out years of the same site. The scatterplots compare the site-averaged performance against the baseline estimates. Seven-day batches of Cn2 are randomly drawn to compare observations (black) against predictions (red) for (c) a high-accuracy and (d) a lower-accuracy OTCliM model, with complex turbulence conditions shaded in gray.

  • Fig. 5.

    The cross-site (c/s) evaluation performance of OTCliM models. The rows of the heatmaps present the performance of the model trained on site s when evaluated on all other sites s̃ ∈ S. The left heatmaps show the performance degradation compared to the MCP case, i.e., the relative scores s′_ss̃ = s_ss̃ / s_ss, with s representing r or ϵ, and the right ones show the absolute scores for reference. The histograms depict the distributions of both heatmaps. (a) Relative and absolute correlation coefficient r (higher is better). (b) Relative and absolute RMSE ϵ (lower is better).
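The relative cross-site scores described in this caption amount to normalizing each row of the absolute score matrix by its diagonal (same-site) entry. A minimal sketch with a hypothetical 3 × 3 score matrix:

```python
import numpy as np

# Hypothetical absolute score matrix: entry [i, j] is the score of the
# model trained on site i evaluated on site j (e.g., correlation r).
S = np.array([
    [0.90, 0.80, 0.70],
    [0.75, 0.85, 0.65],
    [0.60, 0.70, 0.95],
])

# Relative scores: divide each row by its diagonal (same-site) entry,
# so 1.0 means no degradation when transferring the model to another site.
S_rel = S / np.diag(S)[:, None]
```

For an error metric such as the RMSE ϵ, values of the relative score above 1 would indicate degradation; for the correlation r, values below 1 do.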

  • Fig. A1.

    Distribution of bulk potential temperature gradient Γ = (θ_9m − θ_2m)/7 m before (blue) and after (orange) applying the QA/QC steps in section 4a(2). The split between unstable (Γ < 0, solid bar) and stable (Γ > 0, hatched bar) atmospheric conditions is visualized by the bar charts in each panel and quantified in the respective legend. The QUEE site is omitted entirely because of a malfunctioning instrument.
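The bulk gradient and stability split in this caption are straightforward to compute from potential temperatures at the two measurement heights. The sample values below are invented for illustration:

```python
import numpy as np

# Hypothetical potential temperatures (K) observed at 2 m and 9 m.
theta_2m = np.array([300.1, 299.8, 301.0, 298.5])
theta_9m = np.array([299.6, 300.2, 300.4, 299.3])

# Bulk potential temperature gradient over the 7 m layer between the levels.
gamma = (theta_9m - theta_2m) / 7.0  # K per m

unstable = gamma < 0   # Γ < 0: unstable stratification
stable = gamma > 0     # Γ > 0: stable stratification
frac_unstable = unstable.mean()
```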

  • Fig. B1.

    Comparison between distributions of (left) unscaled and (right) scaled log10 Cn2. Colors indicate the different sites at which Cn2 is observed, with urban sites marked by (*).
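One simple way to make per-site log10 Cn2 distributions comparable, as in the right panel, is to standardize each site's samples to zero mean and unit variance. This z-score scaling is an illustrative assumption, not necessarily the exact transformation used in the paper, and the site names and offsets below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical log10(Cn2) samples for three sites; the differing means
# mimic site-to-site climatological offsets.
obs = {s: rng.normal(loc=mu, scale=0.5, size=200)
       for s, mu in [("site_a", -14.2), ("site_b", -15.0), ("site_c", -15.4)]}

# Scale each site's log10(Cn2) to zero mean and unit variance so the
# distributions collapse onto a common range across sites.
scaled = {s: (x - x.mean()) / x.std() for s, x in obs.items()}
```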
