1. Introduction
Radiation is a key component of the global energy budget. In the shortwave part of the spectrum (mostly solar radiation, with wavelengths ≲ 4 µm), incoming radiation is much greater in the tropics than at the poles. This imbalance, which is due to Earth–sun geometry, sets up a meridional gradient in absorbed shortwave radiation that drives the global circulation [sections 4.6 and 10.1.1 of Wallace and Hobbs (2006)]. Surface albedo has a secondary effect on absorbed shortwave radiation: at high latitudes the surface is often covered by snow and ice, which increases albedo and causes less shortwave radiation to be absorbed. This enhances the meridional gradient in absorbed shortwave radiation. In the longwave part of the spectrum (mostly terrestrial radiation, with wavelengths ≳ 4 µm), there is also an albedo effect: areas with high albedo, typically at high latitude, are colder and emit less longwave radiation. In terms of net radiation (absorbed shortwave minus emitted longwave), the two albedo effects approximately cancel out. Thus, in a globally and annually averaged sense, the meridional distribution of net radiation is similar to that of absorbed shortwave radiation (Stone 1978).
Radiative transfer is inherently complex, due to the spectral (wavelength-dependent) nature of gaseous absorption, as well as changes in the refractive index and shape of particles acting to scatter and absorb radiation. The most accurate radiative-transfer models (RTMs) are line-by-line models, which explicitly simulate gaseous absorption in each spectral band (Turner et al. 2004; Mlawer and Turner 2016). However, the radiative properties of clouds and aerosols are much smoother in spectral space than those of gaseous molecules, so simpler scattering models can be used for clouds and aerosols (e.g., Stamnes et al. 1988). Nonetheless, both line-by-line and scattering models are extremely computationally expensive, so they cannot be used as parameterizations in numerical weather prediction (NWP). There is an inherent trade-off between computational cost and accuracy, and the goal is typically to reduce computational cost by orders of magnitude without a large reduction in accuracy.
Perhaps the most common approach is the correlated-k method, used by models like the Rapid Radiative Transfer Model (RRTM; Mlawer et al. 1997), which emulates line-by-line models but is many orders of magnitude faster. When implemented as a parameterization, an RTM must provide three variables to the parent NWP model for both the shortwave and longwave spectra: a vertical profile of radiative-heating rates, surface downwelling flux (F↓sfc), and top-of-atmosphere upwelling flux (F↑TOA). Even accelerated versions such as the RRTM for general circulation models (RRTMG; Iacono et al. 2005) remain expensive enough that NWP models typically call the radiation parameterization less frequently than the rest of the model physics, which degrades forecast accuracy (Pincus and Stevens 2013).
Due to these issues, some groups have used neural networks (part II of Goodfellow et al. 2016), a type of machine learning (ML), to emulate RTMs (Krasnopolsky 2020, and references therein). Neural networks are also popular for emulating other atmospheric processes, especially subgrid-scale convection in NWP models (Gentine et al. 2018; Brenowitz and Bretherton 2018; Brenowitz et al. 2020; Krasnopolsky 2020; Beucler et al. 2021). Because neural networks can theoretically approximate a function of arbitrary complexity, they are often called “universal function-approximators.” Although neural networks are often slow to train, at inference time (when applying a trained neural network to new data), they are much faster than process-based RTMs, even the RRTMG. Neural networks often contain many layers with many weights in each layer, allowing them to represent important features at various levels of abstraction, which they ultimately transform into predictions. However, each weight is one degree of freedom and neural networks often contain millions of weights, which makes them prone to overfitting. Also, ML is typically poor at extrapolating to conditions outside those seen in the training data. This diminishes the trustworthiness of ML, which is a key requirement for transitioning ML to operational products such as NWP (Gil et al. 2019).
We have developed neural networks to emulate shortwave radiative transfer, with three main characteristics that make our work unique. First, we use U-net++ models (Zhou et al. 2020), as opposed to the fully connected networks [sometimes called “dense” or “feed-forward”; see chapter 6 of Goodfellow et al. (2016)] used in previous work. U-net++ models are a type of deep learning, which can exploit spatial patterns in gridded data to make better predictions. Second, we have built physical constraints and vertical nonlocality into the U-net++ models, allowing them to handle nonadjacent cloud layers and better extrapolate to different conditions (e.g., from nontropical to tropical sites). Third, we train U-net++ models to emulate the RRTM, instead of the less accurate RRTMG used in previous work (Krasnopolsky et al. 2010, henceforth K10; Krasnopolsky 2020). Although line-by-line models are the most accurate, they are only slightly more accurate than the RRTM (Iacono et al. 2008) and many orders of magnitude slower, so emulating line-by-line models would vastly increase the time required to create training data for the U-net++ models.
The rest of this paper is organized as follows. Section 2 describes the inner workings of a U-net++, section 3 describes the input data and methods used to train the U-net++ models, section 4 describes experiments to find the best U-net++ configuration (hyperparameters), sections 5 and 6 evaluate and interpret the selected U-net++ models, and section 7 concludes.
2. Background on U-net++
This section focuses mainly on traditional U-nets, extending the discussion to U-net++ at the end. We use the Keras library for Python (Chollet et al. 2020) to implement all U-net++ models, and our code is freely available on the internet (see data availability statement).
U-nets are a specialized type of convolutional neural network (CNN; Fukushima 1980; Fukushima and Miyake 1982). CNNs are a deep-learning method (section 1.1.4 of Chollet 2018) designed to exploit spatial patterns in gridded data, which they achieve via convolution and pooling, spatial operations defined later in this section. CNNs have become popular tools in atmospheric science (Wang et al. 2016; Racah et al. 2017; Kurth et al. 2018; Bolton and Zanna 2019; Gagne et al. 2019; McGovern et al. 2019; Wimmers et al. 2019; Lagerquist et al. 2019; Ebert-Uphoff and Hilburn 2020; Lagerquist et al. 2020a,b). U-nets (Ronneberger et al. 2015) retain all the advantages of CNNs but are designed for pixelwise prediction1—i.e., to make a prediction at every grid point. CNNs are typically used for full-image prediction—i.e., to make one prediction based on the full grid. There are several U-net applications to atmospheric science in the refereed literature (Chen et al. 2021; Kumler-Bonfanti et al. 2020; Sadeghi et al. 2020; Sha et al. 2020a,b), and we are aware of several other atmospheric scientists currently adopting U-nets (Stewart et al. 2020; Berthomier and Pradel 2021; Felt et al. 2021; Hayatbini et al. 2021).
As shown in Fig. 1, a U-net contains four types of specialized components: convolutional layers, pooling (downsampling) layers, upsampling layers, and skip connections. The left side of the U-shape is the downsampling side, where spatial resolution decreases with depth, and the right side is the upsampling side, where resolution increases with depth. The convolutional layers detect spatial features, and the other components allow convolutional layers to detect features at various spatial resolutions, which is important due to the multiscale nature of atmospheric phenomena. Inputs to the first convolutional layer (top-left green box in Fig. 1) consist of raw predictors (here, physical variables like temperature and pressure), while inputs to all other layers consist of feature maps, which are transformed versions of the raw predictors. As the spatial resolution decreases, the number of feature maps (“channels”) typically increases, to offset the loss of spatial information. Convolution is both a spatial and multivariate transformation, so the feature maps encode spatial patterns that include all predictor variables. Most CNN applications involve data with two spatial dimensions (2D), for which the inner workings of a convolutional layer are illustrated in supplemental Fig. S1 of Lagerquist et al. (2020b). For 1D data like those used in the current work, see our online supplemental Fig. S1 (an animation). In general, a convolutional layer is followed by an activation function and possibly batch normalization (supplemental Table S2).
Each pooling layer downsamples the feature maps to a lower resolution (larger grid spacing), using either a maximum or mean filter. On the downsampling side of the U-net (left side of Fig. 1), feature maps at deeper layers contain higher-level abstractions, because they contain information from a wider variety of spatial scales and have passed through more convolutions. For 2D data, the inner workings of a pooling layer are illustrated in supplemental Fig. S2 of Lagerquist et al. (2020b). For 1D data, see our supplemental Fig. S2 (an animation).
Each upsampling layer upsamples the feature maps to a higher resolution, using an interpolation method such as nearest-neighbor or linear; in this work we use nearest-neighbor. However, the choice of interpolation method is relatively unimportant, because upsampling always consists of interpolation followed by convolution: interpolation alone cannot adequately reconstruct high-resolution information from low-resolution information, so the subsequent convolution learns to do so. On the upsampling side of the U-net (right side of Fig. 1), as spatial resolution increases, the number of channels decreases, terminating in the number of output channels. In this work there is one output channel (radiative-heating rate, as discussed in section 3). For 1D data, the inner workings of an upsampling layer are shown in supplemental Fig. S3 (an animation).
Skip connections preserve high-resolution information from the downsampling side of the U-net and carry it to the upsampling side, as shown in Fig. 1. Without skip connections, the U-net would simply perform downsampling followed by upsampling, which is a lossy operation. In other words, upsampling cannot fully recover the high-resolution information lost during downsampling. On the upsampling side of the U-net, at each spatial resolution r (each row in Fig. 1), some feature maps are provided by the upsampling layer at the next-coarsest resolution (the row below in Fig. 1), while some are provided by a skip connection. The advantage of feature maps from the upsampling layer is that they contain higher-level abstractions, because they include information from more spatial scales and more convolutions. The advantage of feature maps from the skip connection is that they are truly at resolution r, not merely upsampled to r. In other words, for the skip connection the nominal and effective resolutions are both r, whereas for the upsampling layer the effective resolution is coarser than r. Feature maps from the skip connection and upsampling layer are both passed through a convolutional layer, which combines information from both (“the best of both worlds”).
Fully connected layers (sometimes called “dense”; see chapter 6 of Goodfellow et al. 2016) are designed for full-image prediction, so they are not typically included in a U-net. However, we include fully connected layers in our U-nets, because the task is a combination of pixelwise prediction (a vertical profile of radiative-heating rates) and full-image prediction (scalar fluxes). See section 3 for more on the output variables. Since fully connected layers are spatially agnostic, feature maps are flattened into a vector before they are passed to the fully connected layers (in Fig. 1, this is a vector of length 4 × 1024 = 4096). Each feature in one fully connected layer is a weighted sum of those in the previous layer. Like convolutional layers, each fully connected layer is followed by an activation function and possibly batch normalization.
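To make these components concrete, below is a minimal Keras sketch of a 1D U-net with a single downsampling level, one skip connection, and a fully connected head for scalar fluxes. All layer sizes are illustrative (chosen so the shapes divide evenly); they do not match our actual architecture, which is specified in Figs. 1 and 2 and supplemental Tables S1 and S2.

```python
from tensorflow.keras import layers, models

# Illustrative sizes only: 64 heights and 10 vector predictors.
inp = layers.Input(shape=(64, 10))

# Downsampling side: convolution detects spatial features; pooling
# coarsens the resolution while the number of channels increases.
c1 = layers.Conv1D(64, 3, padding="same", activation="relu")(inp)
p1 = layers.MaxPooling1D(pool_size=2)(c1)                  # 64 -> 32 heights
c2 = layers.Conv1D(128, 3, padding="same", activation="relu")(p1)

# Upsampling side: nearest-neighbor upsampling, then convolution, with a
# skip connection carrying high-resolution feature maps from c1.
u1 = layers.UpSampling1D(size=2)(c2)                       # 32 -> 64 heights
u1 = layers.Concatenate()([u1, c1])
c3 = layers.Conv1D(64, 3, padding="same", activation="relu")(u1)

# Pixelwise output: one heating rate per height.
heating_rate = layers.Conv1D(1, 3, padding="same", name="heating_rate")(c3)

# Full-image output: flatten the coarsest feature maps, then use fully
# connected layers to predict the scalar flux components.
dense = layers.Dense(128, activation="relu")(layers.Flatten()(c2))
fluxes = layers.Dense(2, name="fluxes")(dense)             # F-down-sfc, F-up-TOA

model = models.Model(inputs=inp, outputs=[heating_rate, fluxes])
```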
Figure 1 shows a U-net with the traditional architecture (Ronneberger et al. 2015), but we have adopted the U-net++ architecture (Zhou et al. 2020), shown in Fig. 2. The U-net++ architecture contains more skip connections, allowing features from more than two scales to be combined at each level. For example, the set of feature maps labeled “D” in Fig. 2 is produced by combining A, B, and the upsampled version of C. Although these feature maps all have a nominal resolution of 18h (18 heights in the profile, or ~1/4 the resolution of the predictors), their effective resolutions, due to upsampling, are, respectively, 18h, 9h, and 4h. This ability to combine information from many scales at once can allow the U-net++ to make better predictions than the U-net (Zhou et al. 2020).
Before training, all weights (in the convolutional, fully connected, and batch-normalization layers) are initialized to random values; during training, they are adjusted to minimize the loss function. Our particular loss function is discussed in section 3c(2).
3. Data and methods
a. Data description
Like the RRTM, our U-net++ models assume horizontal independence and thus treat each vertical column separately. To create inputs (predictors) for the RRTM and U-net++ models, we use data from the Rapid Refresh (RAP) model (Benjamin et al. 2016). The RAP is a nonhydrostatic, mesoscale, operational NWP model, run every hour with 13-km horizontal grid spacing and 51 vertical levels. We have obtained RAP data from an internal NOAA archive in height coordinates, running from 10 to 50 000 m above ground level (m AGL), with 20-m vertical spacing near the surface and 4000-m vertical spacing near the top. We extract 0-h analyses of 14 variables (Table 1 and Fig. 3) from 30 sites throughout the Northern Hemisphere (Fig. 4), at every hour in the years 2017–20. We are currently emulating a simplified version of the RRTM, which assumes a climatological profile of trace gases (O3, CO2, CH4, etc.) and does not consider aerosols or precipitation (see future work in section 7), which is why the predictors do not include this information. Other than trace gases, aerosols, and precipitation, the main controls on radiative transfer are the solar zenith angle, albedo, profiles of atmospheric state variables (temperature and pressure), and profiles of the three water species. This explains our choice of predictors (Table 1).
Table 1. Description of predictor variables. “Vector” means that the variable is defined at all 73 heights; if the cell does not contain a check mark, the variable is a scalar. Downward LWP at height z is LWC integrated from the top of the atmosphere down to z, and upward LWP at height z is LWC integrated from the surface up to z. The definitions of IWP and WVP are analogous.
To create desired outputs (“targets” or “labels” in the ML literature), we run the RRTM separately for each example, where one “example” is one profile at one time. The output variables are those required by an NWP model from a shortwave RTM, namely, the heating-rate profile and the two flux components: F↓sfc and F↑TOA.
b. Preprocessing
Before training U-net++ models, we preprocess the data in two ways. First, we split the data into training, validation, and testing sets. We split the data differently for the two experiments (section 4), as shown in Table 2. For each experiment, the datasets are mutually independent—i.e., any pair of datasets contains different years and/or different sites. Also, there is a 1-week gap between each pair of consecutive datasets, to eliminate temporal autocorrelation. Second, we normalize predictor and target variables, using the methods listed in Table 3. The procedure is described below for each scalar predictor2 x; only step 1 is applied to the target variables. Note that only the U-net++-training data (Table 2) are used for scaling, i.e., to compute percentiles in step 1. This ensures that no information from the isotonic-regression-training, validation, or testing set is used to train the U-net++. If it were, the four datasets would no longer be independent.
1) Uniformization. Transform x to a uniform distribution over [0, 1], by converting each value to its percentile over all x values in the U-net++-training set. Let the transformed variable be x′.
2) z-score normalization. Transform x′ to a standard Gaussian distribution (with mean of 0.0 and variance of 1.0), by applying the inverse of the standard Gaussian cumulative distribution function (CDF).
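A minimal sketch of this two-step procedure, using a hypothetical helper built on NumPy and SciPy (our actual implementation lives in the ML4RT library; see data availability statement):

```python
import numpy as np
from scipy.stats import norm

def two_step_normalize(training_values, new_values):
    # Step 1 (uniformization): convert each new value to its percentile
    # rank, in [0, 1], within the U-net++-training distribution.
    sorted_train = np.sort(np.ravel(training_values))
    uniform = np.searchsorted(sorted_train, new_values) / sorted_train.size
    # Step 2 (z-score): apply the inverse CDF of the standard Gaussian.
    # Clip first, because the inverse CDF is infinite at exactly 0 and 1.
    return norm.ppf(np.clip(uniform, 1e-6, 1.0 - 1e-6))
```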
Table 2. Training, validation, and testing data for each experiment. “Nontropical” means both Arctic and midlatitude. “Assorted1” contains sites from all regions; “Assorted2” also contains sites from all regions, but none overlapping those in Assorted1. The validation and testing sets are used to evaluate bias-corrected U-net++ models (with isotonic regression).
Table 3. Normalization of predictor and target variables for U-net++ models.
The purpose of normalizing target variables is similar: to ensure that they have equal ranges, so that one target variable cannot dominate the loss function. For example, in our dataset, the flux components (in W m−2) span a much larger numeric range than the heating rates (in K day−1), so without normalization the fluxes would dominate the loss function.
Note that we normalize only two target variables: F↓sfc and F↑TOA. Heating rates are left in raw physical units (K day−1), for reasons discussed in section 3c(2).
c. Knowledge-guided machine learning
We have devised three ways to make the U-net++ models knowledge-guided—i.e., to include physical relationships in the training—which is a key priority in ML applications to the geosciences (Reichstein et al. 2019; Gil et al. 2019).
1) Physically consistent and skillful net flux
The three flux components are related by a simple physical law:
Fnet = F↓sfc − F↑TOA. (2)
One option is to predict only F↓sfc and F↑TOA, then compute Fnet in postprocessing via Eq. (2); this guarantees physical consistency but does not encourage the model to predict Fnet skillfully. Instead, we incorporate Eq. (2) into the U-net++ models themselves, computing Fnet from the other two components inside the network and including all three flux components in the loss function. The predicted Fnet is thus physically consistent by construction, while the loss function rewards skillful Fnet predictions.
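As a sketch of how Eq. (2) can be hard-coded into the network, extending the illustrative Keras model in section 2 (where `dense` is the last fully connected layer):

```python
from tensorflow.keras import layers

f_down_sfc = layers.Dense(1, name="f_down_sfc")(dense)
f_up_toa = layers.Dense(1, name="f_up_toa")(dense)

# Eq. (2) as a network layer: the predicted net flux is physically
# consistent with the other two components by construction, yet all
# three components can still be penalized in the loss function.
f_net = layers.Subtract(name="f_net")([f_down_sfc, f_up_toa])
```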
2) Custom loss function to emphasize large heating rates
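Schematically, with E examples, H heights per profile, and three (normalized) flux components, the loss is

$$\text{loss} = \frac{1}{EH} \sum_{i=1}^{E} \sum_{j=1}^{H} \max(\hat{h}_{ij}, h_{ij})\,(\hat{h}_{ij} - h_{ij})^2 \;+\; \alpha \frac{1}{3E} \sum_{i=1}^{E} \sum_{k=1}^{3} (\hat{f}_{ik} - f_{ik})^2, \quad (3)$$

where $\hat{h}_{ij}$ and $h_{ij}$ are the predicted and actual heating rates for the ith example at the jth height; $\hat{f}_{ik}$ and $f_{ik}$ are the predicted and actual values of the kth flux component; and α controls the relative weight of the flux term.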
The first term in Eq. (3) is the dual-weighted mean squared error (MSE) for heating rates, and the second term is the MSE for fluxes. The dual-weighted MSE, unlike the standard MSE, weights points with a large predicted or actual heating rate more heavily. In early experiments (not shown), we found that this is necessary to skillfully predict large heating rates. Large heating rates are important in many atmospheric regimes, including stratocumulus clouds and the upper stratosphere. Shortwave radiation is absorbed by liquid water at the top of a stratocumulus cloud, leading to diabatic heating and a turbulent circulation that maintains the cloud; this is why stratocumulus clouds tend to be long-lived (Morrison et al. 2012; Wood 2012). In the upper stratosphere, shortwave radiation is absorbed by ozone, leading to extreme diabatic heating (Iacono et al. 2008); this is why temperature increases with height in the stratosphere. However, large heating rates in the troposphere are rare (Fig. 5d), making them difficult to predict unless they are emphasized with a custom loss function such as the dual-weighted MSE. The flux components follow less skewed distributions (Figs. 5a–c), so no custom loss function is needed to make the U-net++ models skillfully predict extreme fluxes.
The U-net++ models predict heating rates in raw physical units (K day−1), and values in our dataset range from 0 to 42 K day−1, so the weight ranges from approximately 0 to 42. Meanwhile, the U-net++ models predict flux components in normalized units, ranging from 0 to 1. In early experiments (not shown), we tried balancing the two terms by setting α ≥ 1 in Eq. (3). However, we found that regardless of α, training is effectively partitioned into two phases. During early training, heating-rate predictions improve rapidly while flux predictions improve slowly; during late training, heating-rate predictions improve slowly while flux predictions improve rapidly. In other words, the U-net++ models learn to predict heating rates well, then learn to predict fluxes well. Thus, for models shown in the paper, we use α = 1.
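A minimal Keras sketch of the dual-weighted MSE, continuing the illustrative model from section 2 (the optimizer and exact settings shown are illustrative, not our production configuration):

```python
import tensorflow as tf

def dual_weighted_mse(hr_true, hr_pred):
    # Weight each (example, height) pair by the larger of the predicted
    # and actual heating rates, so that large heating rates dominate.
    weights = tf.maximum(hr_true, hr_pred)
    return tf.reduce_mean(weights * tf.square(hr_true - hr_pred))

# Heating rates use the dual-weighted MSE; fluxes use the standard MSE.
# With alpha = 1, Keras simply sums the two terms, as in Eq. (3).
model.compile(optimizer="adam",
              loss={"heating_rate": dual_weighted_mse, "fluxes": "mse"})
```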
3) Custom predictors to account for nonlocal effects
Our choice of predictors allows the U-net++ models to consider vertically nonlocal effects, which occur when the heating rate at height z is affected by predictors far away from z. Specifically, we include height-integrated paths of the three water species: downward and upward LWP, IWP, and WVP (Table 1). The raw RAP data include only concentrations of the three water species: LWC, IWC, and humidity. Height-integrated paths are crucial in many scenarios—e.g., to predict the heating-rate profile in a column with multilayer liquid cloud, like that shown in Fig. 3. The top cloud layer attenuates a lot of downwelling solar radiation, leading to large heating rates in the top cloud layer (around 5.5 km AGL in Fig. 3; the cloud layer itself is shown in Fig. 3b, and the resulting radiative heating is shown in Fig. 3d). However, lower cloud layers do not produce large heating rates, because at lower heights most downwelling solar radiation has already been attenuated by the top cloud layer (e.g., Turner et al. 2018). This is exemplified in Fig. 3 for the lower cloud layer, stretching from 0 to 2.4 km AGL. When trained with only concentrations and not paths, the U-net++ models cannot represent these relationships, which are typically vertically nonlocal because the cloud layers are far apart (more than a few grid cells from each other).
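A sketch of how such paths can be computed from a concentration profile (a simple midpoint-style integration, assuming ascending heights; the actual integration scheme may differ):

```python
import numpy as np

def water_paths(concentration, heights_m):
    # concentration: profile in g m-3 (e.g., LWC); heights_m: m AGL.
    d_z = np.gradient(heights_m)                  # layer thicknesses (m)
    layer_mass = concentration * d_z              # g m-2 in each layer
    upward = np.cumsum(layer_mass)                # integrated surface up to z
    downward = np.cumsum(layer_mass[::-1])[::-1]  # integrated TOA down to z
    return downward, upward
```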
d. Isotonic regression for bias correction
Isotonic regression (Barlow and Brunk 1972) fits a monotonically nondecreasing function that maps a model’s raw predictions to bias-corrected predictions. Because isotonic regression is a univariate method (with one input variable and one output variable), we apply it separately to the heating rate at each height and to each flux component.
We use separate training data (sites and times) for U-net++ and isotonic regression, as shown in Table 2. If we used the same training data, isotonic regression would learn to bias correct the U-net++ models only for data that they have already “seen,” for which the U-net++ predictions are unrepresentatively good.
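A sketch of the bias-correction step with scikit-learn (variable names hypothetical; one such model is fit per target variable):

```python
from sklearn.isotonic import IsotonicRegression

# Fit on the isotonic-regression-training set: inputs are raw U-net++
# predictions for one target (e.g., heating rate at one height), and
# outputs are the corresponding RRTM values.
iso_model = IsotonicRegression(out_of_bounds="clip")
iso_model.fit(unet_predictions_train, rrtm_targets_train)

# At inference time, map new U-net++ predictions to corrected values.
corrected_predictions = iso_model.predict(unet_predictions_new)
```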
4. Hyperparameter experiments
A hyperparameter is a property of an ML model that, unlike the weights (sometimes called “parameters”), cannot be adjusted by training. We conduct two experiments to find the best U-net++ hyperparameters for emulating the shortwave RRTM. In experiment 1, we train ML models (U-net++ and isotonic regression) with data from nontropical sites in 2018–20, then test with data from tropical sites in 2017 (Table 2). This tests the ability of the ML models to generalize in both space and time. It is crucial that we test the ability to generalize in space, because although 30 sites are used for model development3 (Fig. 4), an ML-based parameterization would be applied to every site (horizontal grid location) in the NWP model. Also, extreme differences between the training and application data might be seen in other scenarios, such as climate change (if an ML model remains in production for long enough, it may be applied to a different climate than in the training data) and rare events (the application data may contain a weather pattern not found in the training data). In experiment 2, we train ML models with data from “Assorted1” sites in 2018–20, then test with data from “Assorted2” sites in 2017 (Table 2). The difference here is that both the Assorted1 and Assorted2 sites include all three regions: Arctic, midlatitude, and tropical. Thus, although to some extent the testing data for experiment 2 test the models’ ability to generalize in space (to different sites), this test is less stringent than in experiment 1 (to a completely different region). The goal of experiment 2 is to create the best possible ML model for use as a parameterization in NWP. We hypothesize that a model trained with data from all three regions will perform better than one trained with only nontropical data.
In both experiments we perform a grid search (section 11.4.3 of Goodfellow et al. 2016) to optimize hyperparameters. A grid search involves four steps: 1) define the experimental hyperparameters and values to be attempted for each, 2) train a model with every possible combination of values, 3) evaluate all models on the validation data, 4) select the model that performs best on validation data and evaluate it on testing data. We choose three experimental hyperparameters and attempt the values listed in Table 4: the number of fully connected layers, dropout rate for fully connected layers, and L2 weight for convolutional layers. The number of fully connected layers (dashed black arrows in Fig. 2) controls the complexity of features used to predict flux components, with more layers allowing for higher complexity. Although higher complexity would ideally improve predictions, the number of weights increases dramatically with the number of fully connected layers, which can lead to overfitting. Meanwhile, dropout (Hinton et al. 2012) and L2 are both regularization methods; regularization encourages a simpler model, which reduces overfitting. The amount of regularization increases with both the dropout rate and L2 weight [see section 4b of Lagerquist et al. (2020b) for details].
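In pseudocode-like Python (with hypothetical helpers `build_unet_pp`, `train`, and `validation_loss`), the grid search amounts to:

```python
import itertools

# Candidate values are illustrative; Table 4 lists the values attempted.
num_dense_layers = [1, 2, 3, 4]
dropout_rates = [0.0, 0.1, 0.2]
l2_weights = [1e-7, 10 ** -6.5, 1e-6]

results = {}
for nd, dr, l2 in itertools.product(num_dense_layers, dropout_rates, l2_weights):
    model = build_unet_pp(num_dense_layers=nd, dropout_rate=dr, l2_weight=l2)
    train(model)                                  # on U-net++-training data
    results[(nd, dr, l2)] = validation_loss(model)

best_hyperparams = min(results, key=results.get)  # lowest validation loss
```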
U-net++ models have many hyperparameters, and it is impossible to experiment with them all, due to combinatorial explosion. For example, at a conservative estimate of 20 hyperparameters, if we attempted 5 values for each, we would need to train 5^20 ≈ 9.5 × 10^13 U-net++ models. Training one U-net++ takes approximately 192 core hours on graphics-processing units (GPU) and 480 core hours on central processing units (CPU), so training more than a few hundred to a few thousand U-net++ models is infeasible. Some important fixed (nonexperimental) hyperparameters are listed in supplemental Tables S1 and S2, along with the value chosen for each and a justification. This leaves the three experimental hyperparameters listed in Table 4.
5. Model evaluation
a. Evaluation methods
For both experiments 1 and 2, we evaluate the selected model overall (on the whole testing set) and in three regime-based settings. First, we evaluate the model by cloud regime: on profiles with no liquid cloud, single-layer liquid cloud, and multilayer liquid cloud. For this purpose, a cloud layer is defined as a contiguous set of heights with LWC > 0 g m−3 and total LWP ≥ 25 g m−2. Clouds add immense complexity to radiative transfer, because they both absorb and scatter radiation, creating a discontinuity in the profile of extinction optical depth. Thus, a model that performs well in cloud-free situations is not guaranteed to perform well in cloudy situations. Also, radiative heating is a key process in the maintenance of stratocumulus clouds, which makes it key for climate prediction. Second, we evaluate the model by solar zenith angle. The zenith angle determines the amount of incoming top-of-atmosphere solar radiation, as well as its incidence angle, which determines the amount of atmosphere through which radiation must pass en route to the surface. A model that performs well for intermediate zenith angles may not perform well when the sun is directly overhead (zenith angle of 0°) or on the horizon (90°). Third, we evaluate the model by site. Different sites around the globe have different properties not accounted for in the partitioning by cloud regime and zenith angle, such as temperature, albedo, and cloud type (e.g., stratocumulus clouds are very common in the Arctic).
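One way to operationalize the cloud-layer definition above (assuming ascending heights and a simple midpoint-style integration for LWP):

```python
import numpy as np

def count_liquid_cloud_layers(lwc, heights_m, min_lwp=25.0):
    # lwc: profile of liquid-water content (g m-3); heights_m: m AGL.
    cloudy = (lwc > 0.0).astype(int)
    edges = np.diff(np.concatenate(([0], cloudy, [0])))
    starts = np.where(edges == 1)[0]    # first height of each cloudy run
    ends = np.where(edges == -1)[0]     # one past the last height
    d_z = np.gradient(heights_m)

    num_layers = 0
    for i0, i1 in zip(starts, ends):
        layer_lwp = np.sum(lwc[i0:i1] * d_z[i0:i1])   # g m-2
        if layer_lwp >= min_lwp:                      # LWP threshold
            num_layers += 1
    return num_layers  # 0 = no cloud; 1 = single-layer; >= 2 = multilayer
```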
We make abundant use of the reliability curve and attributes diagram. Although both graphics were initially developed for classification (i.e., to evaluate probabilistic predictions of an event), we have adapted them for regression (i.e., to evaluate real-valued predictions). For classification, the reliability curve plots predicted probability versus conditional event frequency and answers the question, “For a given probability, what is the expected event frequency?” For regression, the reliability curve plots the predicted value versus conditional mean observed value and answers the question, “For a given prediction, what is the expected observation?” For both classification and regression, a perfect reliability curve follows the x = y line (diagonal gray line in Fig. 6a). Meanwhile, the attributes diagram (Hsu and Murphy 1986) is a reliability curve with extra reference lines: the no-resolution line (horizontal gray in Fig. 6a), climatology line (vertical gray in Fig. 6a), and positive-skill area (blue shading in Fig. 6a). For classification, the no-resolution and climatology lines both correspond to the event frequency in the dataset; for regression, these lines correspond to the mean observation (in Fig. 6a, the mean observed F↓sfc).
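A sketch of the regression version of the reliability curve (equally spaced bins for illustration; our actual binning may differ):

```python
import numpy as np

def reliability_curve(predictions, observations, num_bins=20):
    # Bin examples by predicted value, then compute the conditional mean
    # observation in each bin; a perfect curve follows the x = y line.
    # Note: empty bins yield NaN in this simple sketch.
    edges = np.linspace(predictions.min(), predictions.max(), num_bins + 1)
    bin_ids = np.clip(np.digitize(predictions, edges) - 1, 0, num_bins - 1)
    mean_pred = [predictions[bin_ids == k].mean() for k in range(num_bins)]
    mean_obs = [observations[bin_ids == k].mean() for k in range(num_bins)]
    return np.array(mean_pred), np.array(mean_obs)
```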
Both the reliability curve and attributes diagram are useful for diagnosing conditional bias. For example, if a model has positive bias for low predictions and negative bias for high predictions, these biases may offset, making overall bias (on the whole testing set) negligible. Thus, using the reliability curve and attributes diagram fits our motif of conducting regime-based evaluation, since averaging over the whole testing set may obscure issues that occur in certain regimes. For the scalar target variables (flux components), we plot one attributes diagram for each (e.g., Figs. 6a-c). For the vector target variable (heating rate), we plot one reliability curve for each height (e.g., Figs. 6g-i), omitting the reference lines in the attributes diagram. The reference lines would be different for each of the 73 heights, and it is not feasible to show 73 sets of reference lines.
b. Experiment 1
Results of the hyperparameter experiment, used to select the preferred model, are relegated to the supplemental material. The main conclusion to note here is that the U-net++ performs best when the dropout rate and L2 weight are small (less regularization), which suggests that overfitting is not a serious problem for emulating the shortwave RRTM. This is surprising, as our experience with ML for atmospheric science indicates that overfitting is a serious problem and aggressive regularization is needed (e.g., Lagerquist et al. 2019, 2020b). We suspect that overfitting is less problematic for our task because it is a perfect-model experiment, where the ML model is trained to emulate another model (the shortwave RRTM), rather than to fit real-world observations, which have more noise and uncertainty. Ultimately, we select the model with three fully connected layers, a dropout rate of 0.1, and L2 weight of 10^−6.5. Results shown in the rest of this section, for the selected model only, are based on testing data rather than validation data.
Figure 6 shows the model’s performance on the whole testing set (tropical sites in 2017). The mean absolute error (MAE) skill score is defined as (MAEclimo − MAE)/MAEclimo, where MAEclimo is the MAE of a climatological model that always predicts the training-set mean; positive values indicate an improvement over climatology.
Figure 7 shows the model’s performance by cloud regime.
Supplemental Fig. S20 is analogous to Fig. 7, except for a U-net++ that does not include Fnet in the loss function [i.e., one that uses the postprocessing approach discussed in section 3c(1)]. For examples with liquid cloud, predictions of Fnet are markedly worse than those in Fig. 7, confirming the value of including Fnet in the loss function.
Figure 8 shows the model’s performance by site. In attributes diagrams for the flux components (Figs. 8a–c), reliability is nearly perfect, except that at a few sites, small positive flux predictions show minor conditional bias.
Figure 9 shows the model’s performance by zenith angle. For the sake of brevity, we show results for 1-km heating rate (lower troposphere), 10-km heating rate (upper troposphere in the testing data, which contain only tropical sites), 46-km heating rate (upper stratosphere; the height with the largest climatological heating rate), and Fnet. Correlation is the Pearson correlation between predictions and observations, which ranges over [−1, 1] and has an optimal value of 1. Kling–Gupta efficiency (KGE; Gupta et al. 2009) ranges over (−∞, 1], and the optimal value of 1 occurs when the predictions and observations have perfect correlation, equal means, and equal variances. The unitless scores (left column of Fig. 9) show that performance is worst at the extreme zenith angles, when the sun is close to directly overhead or the horizon. However, except for correlation and KGE for 46-km heating rate, unitless scores are close to their optimal values, even at local minima. Meanwhile, scores with units (MAE, RMSE, and bias) are shown in the right column of Fig. 9. These scores are generally close to their optimum (0), except at zenith angles below 20°. At these zenith angles, the model has a negative bias for heating rate through most of the troposphere (including heights not shown) and negative bias for Fnet, caused by a large negative bias for F↓sfc.
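For reference, the KGE of Gupta et al. (2009) combines the three properties just listed:

$$\text{KGE} = 1 - \sqrt{(r - 1)^2 + \left(\frac{\sigma_{\text{pred}}}{\sigma_{\text{obs}}} - 1\right)^2 + \left(\frac{\mu_{\text{pred}}}{\mu_{\text{obs}}} - 1\right)^2},$$

where r is the Pearson correlation and σ and μ are the standard deviation and mean of the predictions (pred) or observations (obs).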
c. Experiment 2
Again, results of the hyperparameter experiment are relegated to the supplemental material. The main conclusion to note here is the same as for experiment 1: the U-net++ performs better with less regularization, again suggesting that overfitting is not a serious problem. Because models in experiment 2 are trained with data from all latitudes, this is the model that would be used in NWP. Ultimately, we select the model with 4 fully connected layers, a dropout rate of 0.0, and L2 weight of 10^−7. Results shown in the rest of this section, for the selected model only, are based on testing data rather than validation data.
Figure 10 shows the model’s performance on the whole testing set (Assorted2 sites in 2017). For each flux component, the reliability is nearly perfect, as is the match between the observed and predicted histograms (Figs. 10a–c). For heating rate, all heights have an absolute bias ≪ 0.1 K day−1, including in the upper stratosphere (Fig. 10d). As for the tropical testing data in experiment 1, there is a spike in MAE at 46 km (Fig. 10e), due to absorption by ozone, but the corresponding dip in MAE skill score is small (Fig. 10f). In the reliability curve for heating rate (Fig. 10g), all heights are nearly perfect, except in the lower troposphere, where higher predictions are up to ~0.25 K day−1 too low.
Figure 11 shows the model’s performance by cloud regime. For each flux component and each cloud regime, the reliability is nearly perfect (Figs. 11a–c). For heating rate, all heights and all cloud regimes have an absolute bias ≪ 0.1 K day−1 (Fig. 11d), while values of MAE (Fig. 11e) and MAE skill score (Fig. 11f) are similar to the whole testing set. For all three cloud regimes, the reliability curves for heating rate (Figs. 11g–i) are nearly perfect, with two exceptions: jagged curves for multilayer cloud, due to small sample size, and the lower troposphere for single-layer cloud, where higher predictions are up to ~0.5 K day−1 too low.
Figure 12 shows the model’s performance by site. For each flux component and each site, the reliability is nearly perfect (Figs. 12a–c), except an underprediction of ~50 W m−2 at the north pole for the lowest zenith angles.
Figure 13 shows the model’s performance by zenith angle. The unitless scores (left column) show that performance is worst at the extreme zenith angles, but in general scores are better than for experiment 1 (Fig. 9), including at the lowest zenith angles. This is because the model from experiment 2 is trained with more low zenith angles, due to the inclusion of tropical sites. Meanwhile, scores with units (right column of Fig. 13) are very close to their optimum (0), especially bias, at all zenith angles. This contrasts starkly with the results for experiment 1 (Fig. 9), where every target variable has substantial bias for zenith angles < 20°.
d. Additional analyses
Supplemental section Ca presents a Kolmogorov–Smirnov and bias-variance analysis for both selected models (from experiments 1 and 2). The main conclusions are (i) the models have more random variance than systematic bias; (ii) although the differences between the predicted and observed distributions of heating rate are small, they are generally significant at the 99% level (as determined by the Kolmogorov–Smirnov p value), because the sample sizes are large. Supplemental section Cb shows results on training, validation, and testing data for both selected models. Although both models overfit to some extent, results on the testing data are highly skillful, as discussed in sections 5b and 5c. Also, the model from experiment 1 overfits more, because it performs more extreme spatial generalization (from nontropical to tropical sites).
e. Comparison of selected models
Overall, the model from experiment 2 appears to outperform the model from experiment 1 on testing data, consistent with our hypothesis. The comparison is not perfectly apples-to-apples, because the two testing sets contain different collections of sites, but they have two sites in common, both in the tropics: the Perdido oil rig and Bishop, Grenada. According to the site-specific reliability curves for heating rate (cf. Figs. 8g–i and 12g–i), there is no substantial difference between the two models. According to the site-specific attributes diagrams for flux components (cf. Figs. 8a–c and 12a–c), site-specific error profiles for heating rates (cf. Figs. 8d–f and 12d–f), and results for the lowest zenith angles (cf. Figs. 9 and 13)—seen primarily in the tropics—the model from experiment 2 is significantly better.
Supplemental section Cc compares the two models on a second testing set, containing nontropical sites in 2017. The purpose of this analysis is to achieve a fairer comparison, using the same data. The model from experiment 2 performs better on the second testing set as well, even though it was trained with only some nontropical sites, while the model from experiment 1 was trained with all nontropical sites. We suspect that training on tropical sites allowed the model from experiment 2 to learn additional relationships that improve its performance on nontropical sites.
At inference time, both models (including the U-net++ and isotonic regression) can generate predictions for ~500 000 profiles in 1 min, while the shortwave RRTM can process ~50 profiles in 1 min. Thus, the ML models are ~10^4 times faster than the shortwave RRTM, which they emulate with impressive skill.
Last, supplemental section Cd compares the selected model (U-net++) from experiment 1 to a traditional U-net and fully connected neural network (FCNN), developed via similar hyperparameter searches. The U-net and U-net++ clearly and significantly (at the 99% level) outperform the FCNN, demonstrating the advantage of spatially aware layers (convolution and pooling). However, differences between the U-net and U-net++ are mixed, with the U-net performing better on some target variables and the U-net++ performing better on others. Nonetheless, we believe that a major advantage of the U-net++ is superior performance on Fnet in profiles with multilayer cloud; Fnet is arguably the single most important target variable (i.e., more important than F↓sfc or F↑TOA individually).
6. Model interpretation
The permutation test measures the overall importance of each predictor variable, averaged over all grid points (i.e., all heights for vector predictors) and testing examples. There are four versions of the permutation test—forward single-pass, forward multipass, backward single-pass, and backward multipass—which each handle correlated predictors differently. The backward multipass test begins with all predictors permuted—i.e., randomly shuffled so that values are assigned to the wrong examples—and iteratively restores (puts back in the correct order) the most important predictor still permuted, until all predictors have been restored. The kth predictor to be restored is considered the kth-most important. For more details on the permutation test, see McGovern et al. (2019). We run the permutation test with one of two loss functions—the dual-weighted MSE for heating rates [first term in Eq. (3)] or standard MSE for flux components [second term in Eq. (3)]—so that we can determine the most important predictors for each type of output. Figure 14 shows results for the backward multipass test, and supplemental Figs. S21–S23 show results for the other versions, which are very similar. We run the permutation test for both selected models, from experiments 1 and 2.
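For illustration, a simplified forward single-pass version can be sketched as follows (the multipass versions iterate this procedure; `loss_fn` is one of the two loss functions above, and the helper is hypothetical):

```python
import numpy as np

def single_pass_importance(model, predictor_matrix, targets, loss_fn):
    # predictor_matrix: examples x heights x predictors (vector predictors).
    base_loss = loss_fn(targets, model.predict(predictor_matrix))
    importances = {}
    for j in range(predictor_matrix.shape[-1]):
        permuted = predictor_matrix.copy()
        # Shuffle predictor j across examples, destroying its relationship
        # with the targets while preserving its marginal distribution.
        permuted[..., j] = np.random.permutation(permuted[..., j])
        importances[j] = loss_fn(targets, model.predict(permuted)) - base_loss
    return importances  # larger increase in loss = more important predictor
```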
With the heating-rate-only loss function, results for the two models (Figs. 14a,c) agree on the top four predictors: zenith angle, LWC, downward LWP, and relative humidity. In other words, the most important factors for radiative heating are sun angle, liquid water, and water vapor, with ice being much less important—likely because the dual-weighted MSE emphasizes large heating rates, which typically are not caused by ice clouds (Turner et al. 2018). With the flux-only loss function, results for the two models (Figs. 14b,d) agree on the top four predictors: downward LWP, LWC, zenith angle, and surface albedo. Surface albedo is especially important for F↑TOA, because radiation reflected by the surface is a major contributor to upwelling flux at the top of the atmosphere.
7. Summary and future work
We developed U-net++ models, a type of deep learning, to emulate the shortwave RRTM. The U-net++ architecture contains more skip connections than the traditional U-net architecture, which improved our flux predictions in profiles with multilayer cloud, while the inclusion of physical constraints improved both flux and heating-rate predictions for multilayer cloud. We bias-corrected the U-net++ models with isotonic regression, a simple ML method often used for this purpose. We conducted two hyperparameter experiments to find the best U-net++ configurations for predicting two output types: a heating-rate profile and three flux components (F↓sfc, F↑TOA, and Fnet).
We performed two experiments, with sites split among training and testing in different ways. In experiment 1, we trained the models on nontropical sites and tested on tropical sites, with the purpose of testing the models’ spatial-generalization ability under extreme conditions (to a completely different region). In experiment 2, we trained the models on assorted sites from all regions and tested on a different set of assorted sites from all regions, with the purpose of creating the best model possible for use as a parameterization in NWP. The selected model from experiment 1 showed impressive skill on the testing set (tropical sites), but with notable deficiencies. First, it has a large bias and MAE for heating rate in the upper stratosphere, where radiative heating is dominated by ozone absorption. Second, at the lowest zenith angles (below 20°), which are rare at the nontropical training sites, it has substantial negative biases for F↓sfc, Fnet, and heating rate through most of the troposphere.
The remainder of this section focuses on the model from experiment 2, which outperforms the model from experiment 1. In addition to closely emulating the shortwave RRTM, this model is ~10^4 times faster than the shortwave RRTM. In terms of heating rate, our performance is better than the emulator of K10, which is a traditional (or fully connected) neural network. Their neural network achieves a profile root mean squared error (PRMSE; defined in K10) of 0.15 K day−1 (their Table 1), versus our 0.056 K day−1 on testing data.4
We attribute the success of our models to four factors. The first is the adoption of U-nets, which are specially designed to learn from gridded data and make pixelwise predictions. The second is the adoption of the U-net++ architecture, which outperforms the traditional U-net architecture in predicting fluxes with multilayer cloud. The third factor is using isotonic regression for bias correction, and the fourth is knowledge-guided ML. We achieved knowledge-guided ML by incorporating a physical law [Eq. (2)] into the U-net++ models to ensure physically consistent and skillful Fnet predictions, developing a custom loss function [Eq. (3)] to emphasize large heating rates, and including custom predictors to allow vertical nonlocality in heating-rate predictions, which is especially important for examples with multilayer cloud.
We will continue this work along five lines. The first is developing models to emulate the full shortwave RRTM, including the effects of aerosols, precipitation, and nonclimatological profiles of trace gases. Second, we will also emulate the longwave RRTM, using a similar framework. Third, we will make the models grid-agnostic (insensitive to exact heights in the profile), so that they can be applied to NWP models with different vertical grids. Fourth, we will experiment with other neural-network architectures, such as the U-net 3+ (Huang et al. 2020), which contains “full-scale” skip connections, combining data from all spatial resolutions at once, rather than just neighboring resolutions as in the U-net++. Fifth, we will test the new models (emulating the full shortwave and longwave RRTM) online, i.e., inside an NWP model as parameterizations. Since the models developed herein are orders of magnitude faster than the RRTM, if they were integrated stably into NWP, they could also be called at every atmospheric time step, which should improve the overall accuracy of the NWP model and free up computing time for other improvements to NWP.
Acknowledgments
We acknowledge Christina Kumler for ideological input during this project, as well as exploratory work during the preparation phase. This work was partially supported by the NOAA Global Systems Laboratory, Cooperative Institute for Research in the Atmosphere, and NOAA Award NA19OAR4320073. Author Ebert-Uphoff’s work was partially supported by NSF AI Institute Grant 2019758 and NSF Grant 1934668.
Data availability statement
Input data (predictors and targets from 2017 to 2020) are available upon request from the authors, as well as the selected models (U-net++ and isotonic regression) for both experiments 1 and 2. We used version 1.0.0 of Machine Learning for Radiative Transfer (ML4RT; doi:10.5281/zenodo.4470077), a Python library managed by author Lagerquist, to train, evaluate, and interpret all ML models (both U-net++ and isotonic regression) in this work. Since U-net++ architecture is complicated, for each experiment we have included a script that creates the architecture for the selected U-net++. These can be found at scripts/make_best_architecture_exp1.py and scripts/make_best_architecture_exp2.py, respectively, in the Python library.
REFERENCES
Barlow, R., and H. Brunk, 1972: The isotonic regression problem and its dual. J. Amer. Stat. Assoc., 67, 140–147, https://doi.org/10.1080/01621459.1972.10481216.
Benjamin, S., and Coauthors, 2016: A North American hourly assimilation and model forecast cycle: The Rapid Refresh. Mon. Wea. Rev., 144, 1669–1694, https://doi.org/10.1175/MWR-D-15-0242.1.
Berthomier, L., and B. Pradel, 2021: Cloud cover nowcasting with deep learning. Conf. on Artificial Intelligence for Environmental Science, Virtual, Amer. Meteor. Soc., 12.9, https://ams.confex.com/ams/101ANNUAL/meetingapp.cgi/Paper/380983.
Beucler, T., M. Pritchard, S. Rasp, J. Ott, P. Baldi, and P. Gentine, 2021: Enforcing analytic constraints in neural networks emulating physical systems. Phys. Rev. Lett., 126, 098302, https://doi.org/10.1103/PhysRevLett.126.098302.
Bolton, T., and L. Zanna, 2019: Applications of deep learning to ocean data inference and subgrid parameterization. J. Adv. Model. Earth Syst., 11, 376–399, https://doi.org/10.1029/2018MS001472.
Brenowitz, N., and C. Bretherton, 2018: Prognostic validation of a neural network unified physics parameterization. Geophys. Res. Lett., 45, 6289–6298, https://doi.org/10.1029/2018GL078510.
Brenowitz, N., T. Beucler, M. Pritchard, and C. Bretherton, 2020: Interpreting and stabilizing machine-learning parametrizations of convection. J. Atmos. Sci., 77, 4357–4375, https://doi.org/10.1175/JAS-D-20-0082.1.
Chen, M., X. Shi, Y. Zhang, D. Wu, and M. Guizani, 2017: Deep features learning for medical image analysis with convolutional autoencoder neural network. IEEE Trans. Big Data, 7, 750–758, https://doi.org/10.1109/TBDATA.2017.2717439.
Chen, Y., L. Bruzzone, L. Jiang, and Q. Sun, 2021: ARU-net: Reduction of atmospheric phase screen in SAR interferometry using attention-based deep residual U-net. IEEE Trans. Geosci. Remote Sens., 59, 5780–5793, https://doi.org/10.1109/TGRS.2020.3021765.
Chollet, F., 2018: Deep Learning with Python. Manning, 361 pp.
Chollet, F., and Coauthors, 2020: Keras. GitHub, https://github.com/fchollet/keras.
Ebert-Uphoff, I., and K. Hilburn, 2020: Evaluation, tuning and interpretation of neural networks for working with images in meteorological applications. Bull. Amer. Meteor. Soc., 101, E2149–E2170, https://doi.org/10.1175/BAMS-D-20-0097.1.
Felt, V., S. Samsi, and M. Veillette, 2021: A comprehensive evaluation of deep neural network architectures for precipitation nowcasting. Conf. on Artificial Intelligence for Environmental Science, Virtual, Amer. Meteor. Soc., 2.4, https://ams.confex.com/ams/101ANNUAL/meetingapp.cgi/Paper/383115.
Fukushima, K., 1980: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern., 36, 193–202, https://doi.org/10.1007/BF00344251.
Fukushima, K., and S. Miyake, 1982: Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognit., 15, 455–469, https://doi.org/10.1016/0031-3203(82)90024-3.
Gagne, D., S. Haupt, D. Nychka, and G. Thompson, 2019: Interpretable deep learning for spatial analysis of severe hailstorms. Mon. Wea. Rev., 147, 2827–2845, https://doi.org/10.1175/MWR-D-18-0316.1.
Gentine, P., M. Pritchard, S. Rasp, G. Reinaudi, and G. Yacalis, 2018: Could machine learning break the convection parameterization deadlock? Geophys. Res. Lett., 45, 5742–5751, https://doi.org/10.1029/2018GL078202.
Gil, Y., and Coauthors, 2019: Intelligent systems for geosciences: An essential research agenda. Commun. ACM, 62, 76–84, https://doi.org/10.1145/3192335.
Goodfellow, I., Y. Bengio, and A. Courville, 2016: Deep Learning. MIT Press, 781 pp., https://www.deeplearningbook.org.
Gupta, H., H. Kling, K. Yilmaz, and G. Martinez, 2009: Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. J. Hydrol., 377, 80–91, https://doi.org/10.1016/j.jhydrol.2009.08.003.
Hayatbini, N., A. Badrinath, W. Chapman, L. D. Monache, F. Cannon, P. Gibson, A. Subramanian, and F. Ralph, 2021: A two-stage deep learning framework to improve short range rainfall prediction. Conf. on Artificial Intelligence for Environmental Science, Virtual, Amer. Meteor. Soc., 819, https://ams.confex.com/ams/101ANNUAL/meetingapp.cgi/Paper/381949.
Hinton, G., N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, 2012: Improving neural networks by preventing co-adaptation of feature detectors. arXiv, https://arxiv.org/abs/1207.0580.
Hsu, W., and A. Murphy, 1986: The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. Int. J. Forecasting, 2, 285–293, https://doi.org/10.1016/0169-2070(86)90048-8.
Huang, H., and Coauthors, 2020: UNet 3+: A full-scale connected UNet for medical image segmentation. Int. Conf. on Acoustics, Speech, and Signal Processing, Barcelona, Spain, IEEE, https://doi.org/10.1109/ICASSP40776.2020.9053405.
Iacono, M., E. Mlawer, S. Clough, and J. Morcrette, 2000: Impact of an improved longwave radiation model, RRTM, on the energy budget and thermodynamic properties of the NCAR Community Climate Model, CCM3. J. Geophys. Res., 105, 14 873–14 890, https://doi.org/10.1029/2000JD900091.
Iacono, M., E. Mlawer, J. Delamere, S. Clough, J. Morcrette, and Y. Hou, 2005: Application of the Shortwave Radiative Transfer Model, RRTMG_SW, to the National Center for Atmospheric Research and National Centers for Environmental Prediction general circulation models. Atmospheric Radiation Measurement Science Team Meeting, Daytona Beach, FL, ARM, https://www.arm.gov/publications/proceedings/conf15/extended_abs/iacono_mj.pdf.
Iacono, M., J. Delamere, E. Mlawer, M. Shephard, S. Clough, and W. Collins, 2008: Radiative forcing by long-lived greenhouse gases: Calculations with the AER radiative transfer models. J. Geophys. Res., 113, D13103, https://doi.org/10.1029/2008JD009944.
Krasnopolsky, V., 2020: Using machine learning for model physics: An overview. arXiv, https://arxiv.org/abs/2002.00416.
Krasnopolsky, V., M. Fox-Rabinovitz, Y. Hou, S. Lord, and A. Belochitski, 2010: Accurate and fast neural network emulations of model radiation for the NCEP coupled Climate Forecast System: Climate simulations and seasonal predictions. Mon. Wea. Rev., 138, 1822–1842, https://doi.org/10.1175/2009MWR3149.1.
Kumler-Bonfanti, C., J. Stewart, D. Hall, and M. Govett, 2020: Tropical and extratropical cyclone detection using deep learning. J. Appl. Meteor. Climatol., 59, 1971–1985, https://doi.org/10.1175/JAMC-D-20-0117.1.
Kurth, T., and Coauthors, 2018: Exascale deep learning for climate analytics. Int. Conf. for High Performance Computing, Networking, Storage, and Analysis, Dallas, TX, IEEE, https://doi.org/10.1109/SC.2018.00054.
Lagerquist, R., A. McGovern, and D. Gagne, 2019: Deep learning for spatially explicit prediction of synoptic-scale fronts. Wea. Forecasting, 34, 1137–1160, https://doi.org/10.1175/WAF-D-18-0183.1.
Lagerquist, R., J. Allen, and A. McGovern, 2020a: Climatology and variability of warm and cold fronts over North America from 1979 to 2018. J. Climate, 33, 6531–6554, https://doi.org/10.1175/JCLI-D-19-0680.1.
Lagerquist, R., A. McGovern, C. Homeyer, D. Gagne, and T. Smith, 2020b: Deep learning on three-dimensional multiscale data for next-hour tornado prediction. Mon. Wea. Rev., 148, 2837–2861, https://doi.org/10.1175/MWR-D-19-0372.1.
Long, J., E. Shelhamer, and T. Darrell, 2015: Fully convolutional networks for semantic segmentation. Conf. on Computer Vision and Pattern Recognition, Boston, MA, IEEE, https://doi.org/10.1109/CVPR.2015.7298965.
McGovern, A., R. Lagerquist, D. Gagne, G. Jergensen, K. Elmore, C. Homeyer, and T. Smith, 2019: Making the black box more transparent: Understanding the physical implications of machine learning. Bull. Amer. Meteor. Soc., 100, 2175–2199, https://doi.org/10.1175/BAMS-D-18-0195.1.
Mlawer, E., and D. Turner, 2016: Spectral radiation measurements and analysis in the ARM Program. The Atmospheric Radiation Measurement Program: The First 20 Years, Meteor. Monogr., No. 57, Amer. Meteor. Soc., https://doi.org/10.1175/AMSMONOGRAPHS-D-15-0027.1.
Mlawer, E., S. Taubman, P. Brown, M. Iacono, and S. Clough, 1997: Radiative transfer for inhomogeneous atmospheres: RRTM, a validated correlated-k model for the longwave. J. Geophys. Res., 102, 16 663–16 682, https://doi.org/10.1029/97JD00237.
Morrison, H., G. de Boer, G. Feingold, J. Harrington, M. Shupe, and K. Sulia, 2012: Resilience of persistent Arctic mixed-phase clouds. Nat. Geosci., 5, 11–17, https://doi.org/10.1038/ngeo1332.
Pincus, R., and B. Stevens, 2013: Paths to accuracy for radiation parameterizations in atmospheric models. J. Adv. Model. Earth Syst., 5, 225–233, https://doi.org/10.1002/jame.20027.
Racah, E., C. Beckham, T. Maharaj, S. Kahou, Prabhat, and C. Pal, 2017: ExtremeWeather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. Advances in Neural Information Processing Systems, Long Beach, CA, NeurIPS, https://proceedings.neurips.cc/paper/2017/hash/519c84155964659375821f7ca576f095-Abstract.html.
Reichstein, M., G. Camps-Valls, B. Stevens, M. Jung, J. Denzler, and N. Carvalhais, 2019: Deep learning and process understanding for data-driven Earth system science. Nature, 566, 195–204, https://doi.org/10.1038/s41586-019-0912-1.
Ronneberger, O., P. Fischer, and T. Brox, 2015: U-net: Convolutional networks for biomedical image segmentation. Int. Conf. on Medical Image Computing and Computer-assisted Intervention, Munich, Germany, Technical University of Munich, https://doi.org/10.1007/978-3-319-24574-4_28.
Sadeghi, M., P. Nguyen, K. Hsu, and S. Sorooshian, 2020: Improving near real-time precipitation estimation using a U-net convolutional neural network and geographical information. Environ. Modell. Software, 134, 104856, https://doi.org/10.1016/j.envsoft.2020.104856.
Sha, Y., D. Gagne, G. West, and R. Stull, 2020a: Deep-learning-based gridded downscaling of surface meteorological variables in complex terrain. Part I: Daily maximum and minimum 2-m temperature. J. Appl. Meteor. Climatol., 59, 2057–2073, https://doi.org/10.1175/JAMC-D-20-0057.1.
Sha, Y., D. Gagne, G. West, and R. Stull, 2020b: Deep-learning-based gridded downscaling of surface meteorological variables in complex terrain. Part II: Daily precipitation. J. Appl. Meteor. Climatol., 59, 2075–2092, https://doi.org/10.1175/JAMC-D-20-0058.1.
Shanker, M., M. Hu, and M. Hung, 1996: Effect of data standardization on neural network training. Omega, 24, 385–397, https://doi.org/10.1016/0305-0483(96)00010-2.
Stamnes, K., S. Tsay, W. Wiscombe, and K. Jayaweera, 1988: Numerically stable algorithm for discrete-ordinate-method radiative transfer in multiple scattering and emitting layered media. Appl. Opt., 27, 2502–2509, https://doi.org/10.1364/AO.27.002502.
Stewart, J., C. Kumler, D. Hall, and M. Govett, 2020: Deep learning approach for the detection of areas likely for convection initiation. Conf. on Artificial Intelligence for Environmental Science, Boston, MA, Amer. Meteor. Soc., 4.5, https://ams.confex.com/ams/2020Annual/meetingapp.cgi/Paper/365670.
Stone, P., 1978: Constraints on dynamical transports of energy on a spherical planet. Dyn. Atmos. Oceans, 2, 123–139, https://doi.org/10.1016/0377-0265(78)90006-4.
Turner, D. D., and Coauthors, 2004: The QME AERI LBLRTM: A closure experiment for downwelling high spectral resolution infrared radiance. J. Atmos. Sci., 61, 2657–2675, https://doi.org/10.1175/JAS3300.1.
Turner, D. D., M. Shupe, and A. Zwink, 2018: Characteristic atmospheric radiative heating rate profiles in Arctic clouds as observed at Barrow, Alaska. J. Appl. Meteor. Climatol., 57, 953–968, https://doi.org/10.1175/JAMC-D-17-0252.1.
Wallace, J., and P. Hobbs, 2006: Atmospheric Science: An Introductory Survey. 2nd ed. Elsevier, 483 pp.
Wang, L., K. Scott, L. Xu, and D. Clausi, 2016: Sea ice concentration estimation during melt from dual-pol SAR scenes using deep convolutional neural networks: A case study. IEEE Trans. Geosci. Remote Sens., 54, 4524–4533, https://doi.org/10.1109/TGRS.2016.2543660.
Wimmers, A., C. Velden, and J. Cossuth, 2019: Using deep learning to estimate tropical cyclone intensity from satellite passive microwave imagery. Mon. Wea. Rev., 147, 2261–2282, https://doi.org/10.1175/MWR-D-18-0391.1.
Wood, R., 2012: Stratocumulus clouds. Mon. Wea. Rev., 140, 2373–2423, https://doi.org/10.1175/MWR-D-11-00121.1.
Zhou, Z., M. Siddiquee, N. Tajbakhsh, and J. Liang, 2020: Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging, 39, 1856–1867, https://doi.org/10.1109/TMI.2019.2959609.
1 U-nets are not the only type of CNN designed for pixelwise prediction. Other examples, in the encoder-decoder family along with U-nets, include convolutional autoencoders (Chen et al. 2017) and fully convolutional networks (Long et al. 2015).
2 A scalar predictor may be zenith angle, albedo, or one vector predictor at one height.
3 We have obtained RAP data from only 30 sites, because (i) the native RAP-output files are large and stored on a tape archive, which makes processing computationally slow; (ii) the 30 sites chosen are important for other NOAA projects, so the data will be reused; (iii) extracting millions of examples from 30 sites yields a large sample size at each site, as opposed to extracting millions of examples from thousands of sites. This allows us to robustly test the models’ generalization ability to each site in the testing data.
4 Even for the model from experiment 1, which is trained on nontropical sites and tested on tropical sites, the testing PRMSE is 0.108 K day−1.