1. Introduction
Characterizing the response of the Earth system to future emissions scenarios is key to informing climate change mitigation and adaptation strategies. Modern Earth system models (ESMs) are invaluable tools for this task, providing detailed climate projections far into the future for scenario-based analysis. However, as the process complexity and spatial resolution of such ESMs increase, so does their computational cost (Collins et al. 2012). Consequently, it becomes impractical to explore a wide range of possible emissions scenarios with modern ESMs, limiting their applicability to a restricted set of future scenarios (O’Neill et al. 2016).
To address this large computational cost, the climate modeling research community frequently makes use of emulators. Emulators are a set of statistical tools that aim to approximate the complex physical relationships in a full ESM at a fraction of the computational cost (Meinshausen et al. 2011; Tebaldi et al. 2020). Recent developments have demonstrated the considerable promise of machine learning (ML) techniques in these emulation tasks for both individual model components (Seifert and Rasp 2020; Silva et al. 2021; Mooers et al. 2021; Chantry et al. 2021; Lee et al. 2019) and entire ESM predictions (Rasp et al. 2020; Watson-Parris et al. 2022; Mansfield et al. 2020; Beusch et al. 2020). Such ML models often enable accurate predictions at greatly reduced computational cost relative to full ESMs.
The model classes investigated in state-of-the-art ML research in the climate sciences range widely and include linear models (Rasp 2020; Silva et al. 2020; Mansfield et al. 2020), tree-based methods (Yuval and O’Gorman 2020; Yuval et al. 2021; Silva et al. 2021), and various implementations of neural networks (Price and Rasp 2022; Bretherton et al. 2022; Krasnopolsky et al. 2005; Rasp and Thuerey 2021). This is broadly indicative of a very large potential design space for ML model architectures. Here, we investigate “randomly wired neural networks” in an effort to further explore this architecture design space for climate model emulation. Randomly wired neural networks are a special class of neural network where components are connected in a random, rather than structured, manner. This is in direct contrast to widely used feed-forward multilayer perceptron (FFMLP)-style architectures, where components are connected in series. At its core, random wiring is a form of neural architecture search (NAS) (Xie et al. 2019; Kasim et al. 2021) that searches a more-complete space of connectivity patterns than other common NAS strategies (Elsken et al. 2019) to identify high-performing model architectures. This class of neural network introduces novel connectivity patterns between layers, rather than novel types of layers, and has demonstrated skill in a variety of domains, including handwriting recognition (Gelenbe and Yin 2016), internet network attack detection (Brun et al. 2018), and emulation of aerosol optics (Geiss et al. 2023).
In this work, we evaluate the suitability of randomly wired neural networks for climate emulation using the ClimateBench benchmarking dataset. We specifically investigate the use of random wiring to predict future temperature and precipitation statistics. To that end, we compare the performance of randomly wired networks with their serially connected counterparts within three types of neural network models covering a wide range of complexities: multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and convolutional long short-term memory (LSTM) networks (CNN-LSTMs). We find that random network wiring shows competitive results across model architectures and prediction tasks, with the greatest improvements (up to 30.4%) in MLPs. These performance improvements for randomly wired models decrease with increasing model architecture complexity, but of the 24 combinations of model architecture, parameter count, and prediction task, only one showed a statistically significant performance deficit for randomly wired models, whereas 14 showed statistically significant performance improvements. Furthermore, we find that randomly wired neural networks take no longer to make predictions than their standard serially connected counterparts. Our work indicates that, in many cases, randomly wired networks can serve as direct replacements for traditional feedforward neural networks to improve predictive skill.
The remainder of the paper is structured as follows. In section 2 we provide the details of how we construct our randomly wired neural networks and highlight their differences with more-standard networks. Section 3 describes the ClimateBench dataset and how we use it to train, validate, test, and evaluate our models. Our experimental setup to compare randomly wired neural networks with their conventional counterparts is discussed in section 4. Section 5 provides the results of these experiments, with further discussion and analysis in section 6. Section 7 concludes the paper with discussion of future work.
2. Randomly wired neural networks
A typical MLP neural network is structured such that information (a tensor) only flows in a single direction from one dense layer to the next. This is illustrated in Fig. 1, which shows a graph representation of an MLP with six hidden dense layers. In this graph representation, each layer is represented as a node in a directed graph where edges indicate the flow of tensors. We use this graph representation and terminology in part for consistency with previous work (Xie et al. 2019). In our random networks, as well as those of Xie et al. (2019) and Geiss et al. (2023), data still flow through the network in a feedforward manner. However, unlike dense layers in MLPs, dense layers in our random networks may receive inputs from any number of preceding layers and pass their outputs to any number of subsequent layers. That is, there may be “skip connections” between dense layers. An example of this class of random neural network, which we will henceforth refer to as “RandDense” networks, is illustrated in Fig. 1. With this in mind, we will detail our random network generation process and further highlight differences between our networks and typical MLPs.
Graph representations of (left) six-layer MLP and (right) six-layer RandDense networks shown side by side, with their respective node operations in the insets. Open circles represent hidden dense layers and their activation functions. Brown-filled circles represent the output layer discussed in section 2. Last, the upper rectangular block represents neural network layers preceding the dense layers. In this case, it is a simple input layer that performs no operation, but other choices such as a convolutional block are possible (see section 4). For the RandDense network, aggregation from three previous input nodes is done via weighted sum with weights w0, w1, and w2. The summation is followed by ReLU activation and the dense layer. Last, two identical copies of the output are sent to two separate nodes downstream.
a. Network connectivity
Since the layers within our RandDense networks are randomly wired, the first step in generating such a network is determining which layers are connected to which, or the network connectivity. Following Geiss et al. (2023), given a fixed number of layers n, we randomly select the number of neurons per dense layer and generate an adjacency matrix representing the connections between layers in our RandDense network. Since our RandDense networks are still feedforward in nature, several constraints may be placed on the adjacency matrix. If we let each row represent a layer, and column values represent connections (inbound edges) from previous layers, then the adjacency matrix must be lower triangular. Thus, there are n(n + 1)/2 possible connections for an n-layer random network. For each randomly wired model, we generate an adjacency matrix from these possibilities by randomly assigning 1s and 0s to each entry of an n × n lower triangular matrix. We do not do any further filtering or selection on the connectivity patterns within the randomly wired models. The corresponding adjacency matrices for the networks in Fig. 1 are shown in Fig. 2. Some of these connectivity patterns may have layers without an inbound or outbound tensor. Thus, to ensure the network is valid, we follow Geiss et al. (2023) and iterate through each row and column, randomly activating an edge in each row/column if it has no active edges.
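A minimal sketch of this sampling procedure in Python (the language of our implementation; function and variable names here are illustrative rather than taken from our released code):

```python
import numpy as np

def random_adjacency(n_layers, rng):
    """Sample a lower-triangular 0/1 adjacency matrix with n(n + 1)/2 free
    entries; row i lists the inbound edges of layer i, so keeping the matrix
    lower triangular guarantees a feedforward (acyclic) wiring."""
    adj = np.tril(rng.integers(0, 2, size=(n_layers, n_layers)))
    # Repair pass (following Geiss et al. 2023): activate a random edge in any
    # row with no inbound edge and any column with no outbound edge so that no
    # layer is left dangling.
    for i in range(n_layers):
        if adj[i, : i + 1].sum() == 0:
            adj[i, rng.integers(0, i + 1)] = 1
    for j in range(n_layers):
        if adj[j:, j].sum() == 0:
            adj[rng.integers(j, n_layers), j] = 1
    return adj

adj = random_adjacency(6, np.random.default_rng(seed=0))
```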
Graph representations of the (left) six-layer MLP and (right) six-layer RandDense networks from Fig. 1 shown side by side, with their respective adjacency matrices.
b. Node operations
Nodes in a RandDense network may have multiple inbound or outbound edges, unlike the nodes of an MLP, which have only one of each (see Fig. 1). To accommodate this difference, we must define a new operation that occurs at every node.
The node operation of our RandDense networks begins with an aggregation of the incoming tensors. If a node has more than one inbound edge, we follow Xie et al. (2019) and aggregate the incoming tensors via a weighted sum with learnable, positive weights. The weights are kept positive by applying the sigmoid function. Exclusively summing inbound tensors instead of concatenating them (Geiss et al. 2023) requires a consistent number of neurons per layer throughout the network, but avoids extremely large input tensors that might occur using concatenation. In the style of Xie et al. (2019), our random networks have the ReLU activation function before the dense layer so that the outbound tensor can contain both positive and negative values. This avoids extremely large weighted sums when the number of inputs to a given layer is high. Figure 1 illustrates the differences between a node, which contains the dense layer and its activation, in a standard MLP and our random networks. Notice that a node in the graph representation of an MLP in Fig. 1 contains a dense layer followed by ReLU activation, while a node of the RandDense network in Fig. 1 contains a weighted summation followed by ReLU activation, then the dense layer. Last, if a node is connected to multiple downstream nodes, it will simply send out identical copies of its output along each outbound edge.
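A minimal TensorFlow 2 sketch of this node operation (the class name and weight handling are illustrative; see the data availability statement for our released code):

```python
import tensorflow as tf

class RandDenseNode(tf.keras.layers.Layer):
    """One node: sigmoid-positive weighted sum -> ReLU -> dense layer."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(units)  # linear; ReLU is applied first

    def build(self, input_shape):
        # One learnable scalar per inbound edge, kept positive via sigmoid.
        self.edge_logits = self.add_weight(
            name="edge_logits", shape=(len(input_shape),), initializer="zeros")

    def call(self, inputs):
        # `inputs` is a list of inbound tensors of equal width (see the text).
        w = tf.sigmoid(self.edge_logits)
        agg = tf.add_n([w[k] * x for k, x in enumerate(inputs)])
        return self.dense(tf.nn.relu(agg))
```

A node with a single inbound edge is simply called with a one-element list.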
c. Input and output nodes
In our implementation of RandDense networks, each hidden layer has the same number of neurons. To accommodate smaller input vectors, the node immediately following the input layer will contain a number of neurons equal to the difference between the selected layer size and the input size. For example, if the selected layer size is 100 and the input size is 12, the first randomly wired node following the input layer will contain a dense layer of size 88. Then, this dense layer is concatenated with the input to create a layer of the correct size. This is done so that any nodes downstream of the first node may still have direct access to the input (Geiss et al. 2023).
For the output node, which may have multiple inbound edges, we simply take the average of all inbound tensors and send this value to a final dense layer of the desired output size. This final output node, which is colored brown in Fig. 1, has linear activation instead of ReLU so that both positive and negative values may be output. Our randomly wired neural networks are implemented using Tensorflow 2 in the Python programming language (Abadi et al. 2016), with code available on GitHub (see the data availability statement).
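A sketch of these input and output conventions, under the same caveats as above:

```python
import tensorflow as tf

def first_node(net_input, layer_size):
    """First hidden node: a dense layer of size (layer_size - input size),
    concatenated with the input so that downstream nodes keep direct access
    to the input (the activation choice here is illustrative)."""
    hidden = tf.keras.layers.Dense(
        layer_size - net_input.shape[-1], activation="relu")(net_input)
    return tf.keras.layers.Concatenate()([hidden, net_input])

def output_node(inbound, output_size):
    """Output node: average the inbound tensors, then apply a final dense
    layer with linear activation so outputs can take either sign."""
    avg = tf.keras.layers.Average()(inbound) if len(inbound) > 1 else inbound[0]
    return tf.keras.layers.Dense(output_size, activation="linear")(avg)
```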
3. Data
Standard benchmarks are a valuable tool for the intercomparison of ML methods. The ClimateBench dataset, along with its associated evaluation criteria and metrics, seeks to provide a standardized framework for objectively evaluating ML-driven climate model emulators (Watson-Parris et al. 2022). We use the ClimateBench dataset to train, validate, and test our models in this work. The dataset is constructed from several simulations performed by the Norwegian Earth System Model, version 2 (NorESM2; Seland et al. 2020), as part of the Scenario Model Intercomparison Project (ScenarioMIP; O’Neill et al. 2016), phase 6 of the Coupled Model Intercomparison Project (CMIP6; Eyring et al. 2016), and the Detection and Attribution Model Intercomparison Project (DAMIP; Gillett et al. 2016). ClimateBench provides four main anthropogenic forcing agents as predictors: carbon dioxide (CO2), sulfur dioxide (SO2), black carbon (BC), and methane (CH4), with the goal of predicting the annual mean surface air temperature (TAS), annual mean diurnal temperature range (max TAS − min TAS, or DTR), annual mean total precipitation (PR), and 90th percentile of daily precipitation (PR90) from 2015 to 2100. SO2, BC, and the predictands are provided as annual mean spatial distributions across 96 latitude × 144 longitude global grid points, and the longer-lived and well-mixed CO2 and CH4 inputs are provided as annual global total concentration and global average emissions, respectively. For each experiment, the ClimateBench dataset also includes the postprocessed output of three NorESM2 ensemble members that sample the internal variability of the model using different initial model states.
Our models are trained to predict one of TAS, DTR, PR, or PR90 following the ClimateBench benchmarking framework. Specifically, model inputs are global total concentrations of CO2 and global average CH4 emissions, as well as spatial distributions (96 × 144) of annual average BC and SO2 for a range of years from a given set of experiments. The models predict an annual average output variable value for each of the 96 × 144 grid points for each of the years 1850–2100, although they are evaluated on a smaller window as specified in the following sections.
a. Training and validation
Following the training, validation, and testing strategy in the original ClimateBench paper, we select the historical, SSP126, SSP370, and SSP585 experiments from CMIP6 (Eyring et al. 2016), as well as the hist-GHG and hist-aer experiments from DAMIP (Gillett et al. 2016) for training. Each historical experiment spans 1850–2014, and each of the SSP experiments contains data for 2015–2100, for a total of 753 yr. These experiments together cover a wide range of values for each of the input anthropogenic forcers and predicted output variables, making it an ideal suite on which to train our models.
In any machine learning application, it is good practice to reserve a portion of the training data for validation so that model performance may be monitored on both sets to prevent overfitting. Since climate data are highly correlated in time, it is recommended to select a continuous portion of the dataset as validation, rather than a random subset. We choose to use the first 2 yr of every decade from the historical, hist-GHG, hist-aer, SSP126, SSP370, and SSP585 training datasets as validation. Here, the validation set is used for early stopping, a form of regularization where training is halted once performance on the validation set stops improving.
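The year selection can be expressed as a simple mask (a sketch with illustrative names):

```python
import numpy as np

years = np.arange(1850, 2015)   # e.g., one historical experiment
val_mask = (years % 10) < 2     # first 2 yr of every decade: 1850, 1851, 1860, ...
train_years = years[~val_mask]
val_years = years[val_mask]
```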
b. Testing and evaluation
In addition to the validation set, a third test dataset is used as a final evaluation to ensure that neural network hyperparameter tuning or model selection has not overfit the validation data. As in the original ClimateBench paper, we select the SSP245 experiment from ScenarioMIP for the years 2080–2100 as our test dataset. This experiment lies between the extremes of the SSP126 and SSP585 experiments and represents a medium forcing and mitigation scenario.
4. Experimental setup
To explore the effects of random wiring, we first define three baseline model architectures that contain MLP dense layers. These models are meant to represent realistic deep learning architectures a climate scientist would use for emulation tasks and include a standard MLP, a convolutional neural network, and a convolutional-LSTM network. For each of these model architectures, we explore performance differences across the four output variables (TAS, DTR, PR, and PR90) described in section 3 at two different parameter counts within the dense layers of each respective network: 1 million and 10 million (hereinafter 1 M and 10 M, respectively). With each parameter count, we also explore performance for nine different numbers of hidden dense layers (2–10). Then, the standard MLP dense layers are replaced with randomly wired ones for comparison. We expect that the three baseline model architectures should cover a wide range of performances, illustrating how random wiring impacts performance when placed within both good and bad models. Example networks (with six hidden layers) for each of the three baseline models, along with example randomly wired variations, are shown in Fig. 3.
Each of the three baseline models with six MLP dense layers shown next to an example RandDense variation.
a. MLP network
A standard MLP is the most common and basic type of neural network, containing only dense layers. MLPs have been extensively used in climate modeling tasks such as subgrid process representation (Rasp et al. 2018), rainfall downscaling (Tran Anh et al. 2019), longwave radiation emulation (Krasnopolsky et al. 2005), pan evaporation prediction (Ghorbani et al. 2018), and stochastic synthesis in hydrology simulations (Rozos et al. 2021).
We begin with the first parameter limit of 1 M. We generate MLP models for a given number of hidden layers by selecting a fixed layer size for all of the hidden layers such that the network’s parameter count is 1 M ± 10%. We generate 50 such MLP models with two hidden dense layers, 50 MLPs with three hidden dense layers, and so on up to 10 hidden dense layers. This gives 450 MLP models, each with different parameter initializations. The process is repeated for a parameter limit of 10 M, yielding a total of 900 MLPs. For the RandDense comparison networks, we repeat a similar process. Using the network generation method described in section 2, we generate 50 RandDense networks for 2–10 hidden layers at both the 1 M and 10 M parameter count, for a total of 900 RandDense networks. For small numbers of hidden layers, many of these models will have identical layer connectivity patterns since there are very few pattern choices. However, for larger numbers of hidden layers, many of the generated models will have unique wirings.
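For a given depth, the fixed layer size can be found with a small search over the closed-form parameter count (a sketch; the input and output sizes follow sections 3 and 4a):

```python
def mlp_param_count(width, n_hidden, n_in=12, n_out=96 * 144):
    """Trainable parameters of an MLP with n_hidden equal-width hidden layers
    (each dense layer contributes a weight matrix plus biases)."""
    params = (n_in + 1) * width                      # input -> first hidden
    params += (n_hidden - 1) * (width + 1) * width   # hidden -> hidden
    params += (width + 1) * n_out                    # last hidden -> output
    return params

def width_for_budget(n_hidden, budget=1_000_000, tol=0.10):
    """Smallest layer width whose parameter count lands within budget +/- tol."""
    for width in range(1, 20_000):
        if abs(mlp_param_count(width, n_hidden) - budget) <= tol * budget:
            return width
    raise ValueError("no width found within tolerance")
```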
As discussed in section 3, the inputs to the models are global CO2 and CH4, as well as 96 × 144 grids of SO2 and BC. However, in MLPs, each input is handled by one neuron in the input layer. As such, if we directly fed the anthropogenic forcer inputs to the model, the input layer would be over 26 000 neurons wide. To avoid such an extremely large layer, we follow Watson-Parris et al. (2022) and perform dimensionality reduction on the SO2 and BC inputs. Specifically, we use the first five empirical orthogonal functions (EOFs) of each field. Thus, the total input size to the MLP is 12: the first five EOFs each of SO2 and BC, plus global CO2 and CH4. The output of the MLP is a flattened map of the predicted variable, which is then reshaped to the correct grid for final predictions.
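The reduction can be sketched with principal component analysis on the flattened fields as a stand-in for the EOF truncation (the actual ClimateBench preprocessing may differ in weighting and detrending details):

```python
import numpy as np
from sklearn.decomposition import PCA

def leading_modes(field, n_modes=5):
    """Project (n_years, 96, 144) fields onto their first n_modes principal
    components; in practice the projection should be fit on training data only."""
    flat = field.reshape(field.shape[0], -1)
    return PCA(n_components=n_modes).fit_transform(flat)

# Assumed arrays: co2, ch4 of shape (n_years,); so2, bc of shape (n_years, 96, 144).
# x = np.column_stack([co2, ch4, leading_modes(so2), leading_modes(bc)])  # 12 features
```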
Both the MLPs and RandDense networks are trained with a batch size of 25 and use the Adam optimizer (Kingma and Ba 2014) and mean-square error loss for 100 epochs. We also use early stopping during training with a patience of 10 epochs.
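In Keras terms, this training configuration looks roughly as follows (`model`, `x_train`, and the other arrays are assumed to be defined; callback options beyond the patience value are our choices):

```python
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10)  # halt once validation loss stops improving
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=25, epochs=100, callbacks=[early_stop])
```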
b. Convolutional network
CNNs have been widely adopted in object tracking and image recognition tasks because they are able to model spatial dependencies (Li et al. 2021; O’Shea and Nash 2015). CNNs have also been recently adopted for various weather and climate tasks such as precipitation nowcasting (Trebing et al. 2021), bow echo detection (Mounier et al. 2022), climate zone classification (Yoo et al. 2019), and ocean modeling (Nikolaev et al. 2020).
Following a similar process as in section 4a, we generate 50 MLPs and 50 RandDense networks with 2–10 hidden dense layers for both 1 M and 10 M parameter counts. However, rather than being standalone networks, these models are appended to a convolutional block. This block consists of a single convolutional layer with 20 filters, a kernel size of 3, ReLU activation, and L2 regularization (Cortes et al. 2012). The convolutional layer is followed by global average pooling.
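A Keras sketch of this block (the filter count, kernel size, activation, and pooling follow the text; the L2 strength is an assumed value):

```python
import tensorflow as tf

conv_block = tf.keras.Sequential([
    tf.keras.layers.Conv2D(
        filters=20, kernel_size=3, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),
        input_shape=(96, 144, 4)),  # four input channels; see below
    tf.keras.layers.GlobalAveragePooling2D(),
])  # output: a length-20 feature vector fed to the dense head
```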
Since CNNs are designed to handle images with multiple channels, we preserve the original dimensionality of the input variables, unlike the transformation we conducted for the standalone MLP and RandDense networks in section 4a. Specifically, we are able to feed the 96 × 144 maps of SO2 and BC to the convolutional network, as well as global CO2 and CH4. To treat the four input variables as four channels of one 96 × 144 “image” of the globe, the CO2 and CH4 data are transformed to 96 × 144 grids, where each grid cell has the same value.
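For example (a sketch with illustrative names):

```python
import numpy as np

def stack_channels(co2, ch4, so2, bc):
    """Broadcast the global CO2 and CH4 scalars to constant 96 x 144 grids and
    stack all four forcers as channels: output shape (n_years, 96, 144, 4)."""
    ones = np.ones_like(so2)  # so2, bc: (n_years, 96, 144); co2, ch4: (n_years,)
    return np.stack([ones * co2[:, None, None],
                     ones * ch4[:, None, None], so2, bc], axis=-1)
```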
We refer to the full models, consisting of the convolutional block followed by either a conventional MLP or a RandDense block, as “CNN networks” and “CNN RandDense networks,” respectively. The CNNs use a batch size of 25 and are trained with the Adam optimizer using the mean-square error loss function for 100 epochs. We again use early stopping during training with a patience of 10 epochs.
c. Convolutional-LSTM network
LSTM networks are a type of recurrent neural network (RNN) that model temporal dependencies. This time-aware property makes them useful for a range of forecasting tasks such as precipitation downscaling (Tran Anh et al. 2019), soil moisture estimation (Ahmed et al. 2021), and flood forecasting (Liu et al. 2020). For the specific task of climate model emulation, a combined CNN-LSTM model has been shown to outperform CNN and LSTM models in isolation (Watson-Parris et al. 2022).
As in sections 4a and 4b, for this architecture we generate 50 MLPs and 50 RandDense networks with 2–10 hidden layers for 1 M and 10 M parameter counts. These models are then appended to a CNN-LSTM block. This block first consists of the same convolutional, average pooling, and global average pooling layers as in section 4b, but with each of them time distributed (i.e., applied identically at each year) across every 10-yr time window within the training samples. These time-distributed layers are followed by an LSTM with 25 units.
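A sketch of this block, reusing the `conv_block` sketch from section 4b (the window length and LSTM size follow the text; other details are assumptions):

```python
import tensorflow as tf

cnn_lstm_block = tf.keras.Sequential([
    # Apply the same convolutional block to each year of a 10-yr window.
    tf.keras.layers.TimeDistributed(conv_block, input_shape=(10, 96, 144, 4)),
    tf.keras.layers.LSTM(25),  # summarize the window into a length-25 vector
])  # output is fed to the conventional MLP or RandDense dense head
```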
To enable the CNN-LSTM block to make time-aware predictions, the training data are transformed slightly by slicing them into 10-yr moving windows with a 1-yr stride. For example, two such windows might be 2015–24 and 2016–25. This results in each historical dataset losing 9 data points because the last starting year of the 10-yr window is 2005 instead of 2014. Thus, the CNN-LSTM dataset contains 753 − (3 × 9) = 726 data points.
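The slicing itself is straightforward (a sketch; pairing each window with the target for its final year is one consistent reading of the setup above):

```python
import numpy as np

def make_windows(x, y, length=10):
    """Slice (n_years, ...) arrays into 10-yr windows with a 1-yr stride;
    e.g., 165 historical years yield 156 windows (9 fewer data points)."""
    x_win = np.stack([x[i:i + length] for i in range(len(x) - length + 1)])
    return x_win, y[length - 1:]  # target: the final year of each window (assumed)
```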
We refer to the full models, consisting of the CNN-LSTM block followed by either a conventional MLP or a RandDense block, as “CNN-LSTM networks” and “CNN-LSTM RandDense networks,” respectively. We train them with a batch size of 25 using the Adam optimizer and mean-square error loss for 100 epochs. We also use early stopping with a patience of five epochs.
5. Results
A summary of our experimental results is shown in Table 1, which reports the best total RMSE (NRMSEt) performance for each model class across all generated models, including all numbers of hidden layers and both parameter counts. For nearly every model architecture and predicted variable, the RandDense variations outperformed their standard counterparts, the two exceptions being the TAS prediction task in both convolutional models. In general, the best performance differences were starker for the precipitation variables (PR and PR90). Additionally, the margins by which the RandDense networks outperformed were generally larger than the margins by which the standard networks did. A more-detailed breakdown is shown in Table 2, which shows the relative total RMSE performance improvement provided by the addition of random wiring. There are a few key trends. First, the performance improvement provided by random wiring is most stark for MLPs with 1 M parameters but decreases with increasing model architecture complexity, with some variation across parameter counts. We hypothesize that this is because the randomly wired layers in the more-complex convolutional architectures are farther removed from the input and so do less direct computation on it, whereas the CNN or LSTM blocks handle the spatial and temporal dependencies in the data directly. Second, there appears to be a more-consistent statistically significant benefit provided by random wiring for the precipitation prediction tasks. Last, while not every performance change was statistically significant at the α = 0.05 significance level, there were far more instances of statistically significant performance improvements provided by random wiring than deficits. In fact, there was only one instance (MLPs with 10 M parameters predicting DTR) in which the addition of random wiring incurred a statistically significant performance loss, as compared with the 14 cases in which random wiring showed a statistically significant performance improvement.
Best total RMSE performance for each model class and predicted variable across all generated models, along with the original CNN-LSTM and pattern-scaling models from Watson-Parris et al. (2022). Lower is better, and the better RMSE between the standard and RandDense models is highlighted with boldface type.
Relative change in best RMSE performance with the addition of random wiring averaged over all generated models for each architecture, parameter count, and predicted variable. Statistically significant changes (p < 0.05) are in boldface type.
When compared with the CNN-LSTM model from the original ClimateBench paper (Watson-Parris et al. 2022) (see Table 1), we find that the CNN-LSTM RandDense models outperform across all prediction tasks, with the greatest improvements for temperature variables. As in Watson-Parris et al. (2022), the CNN-LSTM models also consistently outperform a linear pattern-scaling approach that uses independent linear regressions to predict each variable within each grid box. Additional results, including best performances for spatial and global RMSE, can be found in the appendix. The following sections are dedicated to discussing the experimental results for each of the baseline model architectures in greater detail.
a. MLP
A summary of our MLP and RandDense experiments is shown in Fig. 4. Broadly, we find that, on average, RandDense networks outperform their standard counterparts at 1 M parameters, with mixed results at 10 M (most notably for the DTR prediction task). Points below the black y = x line indicate that the RandDense networks outperformed the MLP networks on average, and vice versa for points above. The farther a point lies from the y = x line, the larger the mean performance difference. For the TAS prediction task, the RandDense networks had better mean RMSE performance across nearly all hidden layer and parameter counts, with the largest mean performance differences being between MLP and RandDense networks with 1 M parameters. For the models with 1 M parameters, both the standard and RandDense variations perform worse with more hidden layers, but, at 10 M parameters, the trend reverses for standard networks and vanishes for the RandDense networks. For the DTR prediction task, the RandDense models performed better than the MLPs at 1 M parameters, and vice versa at 10 M parameters. For both parameter counts, mean performance decreases as the number of hidden layers increases. Trends for both precipitation variables are quite similar. In general, the RandDense networks outperform their MLP counterparts, with some exceptions for networks with high numbers of hidden layers at 10 M parameters. For the MLPs with 1 M parameters, performance decreased slightly with higher numbers of hidden layers, but increased slightly with more layers for 10 M-parameter MLPs. Across both parameter counts, performance for RandDense networks decreased as more layers were added.
Mean total RMSE of 50 MLP models vs mean total RMSE of 50 RandDense models for TAS, DTR, PR, and PR90. The color heatmaps to the right indicate the number of hidden layers. Error bars show ± the standard error of the mean.
b. CNN
A summary of our CNN and CNN RandDense experiments is shown in Fig. 5. In general, the CNN RandDense networks outperform the CNNs for precipitation variables, with mixed results for the temperature variables. For the TAS prediction task, the CNN RandDense networks have better overall mean performance than their standard counterparts at 10 M parameters and vice versa at 1 M parameters. Interestingly, this pattern is reversed for the DTR prediction task. At 1 M parameters, both the CNN and CNN RandDense models showed better mean TAS performance at higher numbers of layers, with the opposite trend at 10 M parameters. For DTR, however, neither network architecture showed much change in mean performance with network depth.
As in Fig. 4, but for 50 CNN models vs 50 CNN RandDense models.
Like the MLP/RandDense networks, the trends in CNN/CNN RandDense mean performance for both precipitation prediction tasks are generally similar. With few exceptions, the CNN RandDense networks outperform their standard counterparts, with the largest mean performance differences at 10 M parameters. Furthermore, the mean performance gains for the RandDense models in the PR prediction task are generally greater than in the PR90 prediction task. Mean performance with respect to the number of hidden layers shows no clear trends across prediction variable or number of parameters.
c. CNN-LSTM
A summary of our CNN-LSTM and CNN-LSTM RandDense experiments is shown in Fig. 6. Broadly, the CNN-LSTM RandDense networks outperform, on average, their standard counterparts for precipitation prediction tasks, but show mixed average performance for temperature prediction tasks. Unlike the previously discussed model architectures, the RandDense variations of CNN-LSTM models generally perform worse than their standard counterparts in the TAS prediction task, regardless of network depth or number of parameters. For the DTR prediction task, trends in mean performance differ by parameter count. At 1 M parameters, CNN-LSTM models have slightly better mean performance, with the difference becoming less stark as the number of layers increases. At 10 M parameters, the CNN-LSTM RandDense models have generally better mean performance, with larger performance gains at higher numbers of layers.
As in Fig. 4, but for 50 CNN-LSTM models vs 50 CNN-LSTM RandDense models.
For both precipitation variables, the CNN-LSTM RandDense networks outperform their standard counterparts with few exceptions across all network depths and parameter counts. Mean performance differences are most stark for the PR90 prediction task at high numbers of layers and 10 M parameters. For both PR and PR90, the CNN-LSTM RandDense networks have slight performance gains for deeper networks at 1 M parameters and slight performance losses for deeper networks at 10 M parameters.
Since the CNN-LSTM RandDense model had the best performance overall (see Table 1), we further analyze its difference from the NorESM2 ground truth predictions. Figure 7 shows the NorESM2 predictions alongside the best CNN-LSTM RandDense predictions, as well as the mean difference between the two. Across prediction tasks, the mean difference in predictions during the testing period (2080–2100) was statistically insignificant in most locations. Furthermore, significant differences, when present, were relatively small. For example, about 54% of the significant differences in TAS prediction are within 0.1 K (∼4.5% error), and 90% of significant differences are within 0.22 K (∼11.8% error).
(left) NorESM2 ground truth predictions for each variable averaged from 2080 to 2100, alongside (center) the predictions averaged over the same time period of the best-performing CNN-LSTM RandDense models. (right) The mean difference between the NorESM2 and CNN-LSTM RandDense predictions, once again averaged over the 2080–2100 testing period. Statistically insignificant differences (p > 0.05) are masked.
In general, the CNN-LSTM RandDense model has lower RMSE over the ocean than over land for temperature variables. This is especially true for DTR. Among statistically significant errors (see Fig. 7), the CNN-LSTM RandDense model had an average DTR error of 0.05 K over sea as compared with 0.17 K over land. Additionally, it tends to overestimate warming in the Northern Hemisphere but underestimate it in the Southern Hemisphere. For the precipitation variables, the model generally makes better predictions over land. The area in which the model makes the worst predictions is the intertropical convergence zone (ITCZ) between South America and Africa. Specifically, the model underestimates both average and extreme precipitation over the ITCZ’s winter location (Schneider et al. 2014). We hypothesize that it struggles in this region because of variability in climate change–induced ITCZ shift across the simulations on which the model was trained.
6. Discussion
The experimental results presented here share a common theme across model architectures: for each architecture, there is at least one performance metric and prediction task where randomization provides clear benefits. For example, even though randomization did not provide clear mean DTR performance benefits, Table 1 shows that the best DTR performance favored the RandDense models across all tested architectures. This suggests that for the task of climate model emulation, scientists and practitioners are likely to achieve some performance benefit from generating a limited number of randomly wired variations of their existing models with a similar number of parameters. In scenarios where finely tuned and optimized model predictive skill is of critical importance, investigating network randomization is an important design choice to consider. With this in mind, the remainder of this section is dedicated to further exploring random networks and why they might achieve the results they do.
a. Prediction speed
The speed with which neural networks make their predictions is important for both standalone models and models within a larger system. Researchers seeking to quickly construct, validate, and test standalone models for multiple applications would clearly benefit from faster models. Similarly, researchers seeking to use a neural network to model a subprocess in a large climate model (Irrgang et al. 2021; Holder et al. 2022; Silva et al. 2021; Geiss et al. 2023) would also benefit from quicker predictions, since such submodels are often called many thousands of times per time step in a full Earth system model. As such, we investigate the prediction speeds of randomly wired neural networks and compare them with their feedforward counterparts.
For a fair comparison, we took a best-performing standard model, say a six-layer MLP, and the best-performing RandDense equivalent and used each to make predictions on the test data 10 times, since the time required for a single prediction is quite small. We repeated this nine more times for a total of 10 trials. Last, we conducted an independent-sample t test on the two samples of 10 run times to investigate whether the mean run times of the two models differed in a statistically significant way.
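A sketch of this procedure (`standard_model`, `rand_dense_model`, and `x_test` are assumed to be defined):

```python
import timeit
from scipy import stats

def run_times(model, x_test, n_trials=10, n_preds=10):
    """Time n_trials runs, each making n_preds predictions on the test set."""
    return [timeit.timeit(lambda: model.predict(x_test), number=n_preds)
            for _ in range(n_trials)]

t_stat, p_value = stats.ttest_ind(run_times(standard_model, x_test),
                                  run_times(rand_dense_model, x_test))
```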
Across the several different model architectures, layer counts, and prediction tasks, we find no consistent difference between the run times of the standard models and their randomly wired variations. In a large majority of comparisons, the difference was statistically insignificant. This is partially illustrated in Fig. 8, which shows 95% confidence intervals for PR prediction speed across model architectures. Notice that in nearly every case, the confidence intervals for the standard and RandDense models overlap. Furthermore, neither the standard nor RandDense networks consistently had faster average run times. We conclude that random wiring neither benefits nor detracts from prediction speed, making RandDense networks suitable replacements for standard MLPs in either standalone or submodel applications.
The average PR prediction times for the best-performing models within each model class, along with 95% confidence intervals.
b. Sensitivity to the node operation
The node operation of the random networks discussed in section 2 and illustrated in Fig. 1 was a specific design choice by the authors, inspired by the choices of Xie et al. (2019) and Geiss et al. (2023). There are many valid alternative designs, however. Here, we investigate two modifications to our node operation (sketched in code after this list):

- placing the ReLU activation after the dense layer instead of before and

- replacing the weighted sum with a simple unweighted sum.
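Expressed against the node sketch in section 2, both variants amount to small changes in the forward pass (a sketch combining the two modifications, which we tested separately):

```python
import tensorflow as tf

class RandDenseNodeVariant(tf.keras.layers.Layer):
    """Node with both modifications applied: unweighted-sum aggregation and
    ReLU placed after, rather than before, the dense layer."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(units)

    def call(self, inputs):
        agg = tf.add_n(inputs)              # modification 2: simple unweighted sum
        return tf.nn.relu(self.dense(agg))  # modification 1: ReLU after the dense layer
```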
With the ReLU activation after the dense layer, we found different results across the three model architectures. RandDense network performance with 1 M parameters decreased for all prediction tasks by about 20%, but in many cases the RandDense networks still outperformed their standard counterparts. The same was observed for networks with 10 M parameters, except when predicting DTR, for which RandDense performance improved by 20%. For the CNN RandDense models, there were mixed mean performance changes for the temperature variables, with an improvement of just a few percentage points at 1 M parameters and a comparable decline at 10 M parameters. There was little change in performance (<1%) when predicting precipitation. Last, for the CNN-LSTM RandDense models, placing the ReLU activation after the dense layer yielded a 6% improvement in TAS prediction performance, an 8% decrease in DTR prediction performance, and limited change in performance (<1%) for both precipitation variables.
Removing the weighted sum and replacing it with a simple unweighted sum made no significant performance difference across all model types, predicted variables, layer counts, or parameter counts. Together, these results indicate which components of the node operation matter: the placement of the activation function can change performance substantially, whereas the learnable weighting of the sum does not.
c. Further investigating the effects of random wiring
We hypothesize that random wiring may affect a neural network’s performance by either

- benefiting/hindering the training of the full model or

- processing the output of the previous layers in a better/worse way.
To explore these hypotheses, we attempted to isolate the source of performance benefits or deficits for randomly wired neural networks. Specifically, for the CNN and CNN-LSTM architectures, we trained the best-performing conventional models from scratch. We then isolated the trained CNN or CNN-LSTM block and froze its weights. Last, we attached to this block the best-performing randomly wired architecture with randomly initialized weights and retrained the model. This is a similar idea to that of previous work that configured artificially evolved nanoparticle systems into Boolean logic gates (Bose et al. 2015). Since the CNN or CNN-LSTM block has been frozen, only the weights in the randomly wired layers are trained in this iteration. If the performance of this new model was different from the original random model, which had both its convolutional block and random layers trained from scratch, then it would indicate that random wiring was beneficial or detrimental to the training process. On the other hand, if the frozen model and the original random model performed the same, it would indicate that the randomly wired layers are helping or hurting the model in the way they process the output of the CNN/CNN-LSTM block.
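In Keras terms, the freezing step looks roughly like this (`trained_conv_block` and `build_rand_dense_head` are assumed names):

```python
import tensorflow as tf

trained_conv_block.trainable = False        # keep the trained weights fixed
inputs = tf.keras.Input(shape=(96, 144, 4))
features = trained_conv_block(inputs, training=False)
outputs = build_rand_dense_head(features)   # freshly initialized random head
frozen_model = tf.keras.Model(inputs, outputs)
frozen_model.compile(optimizer="adam", loss="mse")
# Training frozen_model now updates only the randomly wired layers.
```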
We repeated this experimental procedure 10 times for both the CNN and CNN-LSTM, examining both 1 M and 10 M parameter counts at 2, 6, and 10 layers for all four prediction tasks. Figure 9 shows a summary of the results for the CNN RandDense networks; the figure illustrating the CNN-LSTM RandDense results may be found in the appendix. Overall, we find no consistent trend in performance differences between the RandDense models trained from scratch and those with frozen CNN or CNN-LSTM blocks, with only one exception: the DTR prediction task for CNN networks. While the confidence intervals for RandDense networks and their frozen counterparts may not always overlap, neither variant performs better in general. This provides evidence that random wiring impacts performance through the way it handles the output of the convolutional block, and not by affecting the training of the convolutional block itself.
Mean RMSE performance of 10 CNN RandDense networks vs mean RMSE performance of 10 CNN RandDense networks with frozen weights from the best-performing standard CNN network. The x axis indicates the number of hidden layers and parameters. For example, “6, 1M” means six hidden layers and 1 million parameters.
7. Conclusions
Motivated by the promise of randomly wired neural networks in related applications, we explore random wirings between dense layers of neural networks for the task of climate model emulation. We replaced the traditional feedforward dense layers in several model architectures with our randomly wired networks, coined “RandDense” networks, and conducted performance experiments using the ClimateBench dataset. Across several architectures, model complexities, and predicted variables, we find performance benefits for models containing randomly wired layers, indicating that in many cases, standard feedforward networks may be effectively replaced with RandDense networks, which achieve better performance at no additional computational cost.
Additionally, we explore various node operations for RandDense networks and conduct preliminary experiments to understand their performance (section 6). However, this leaves the door open for future work. Many more node operation modifications, such as the inclusion of batch normalization and various activation functions, remain to be evaluated. Furthermore, our method of generating random graph structures described in section 2 is only one way of accomplishing the task. Specially defined families of random graphs have been explored in other applications (Xie et al. 2019), but assessing their efficacy for climate model emulation remains a topic for future work. Last, while the experiments in section 6 provide a brief exploration into why RandDense networks perform better or worse in certain scenarios, there may be other reasons why RandDense networks perform differently than their standard counterparts, such as by alleviating the vanishing gradient problem (Orhan and Pitkow 2017). Further analysis of RandDense networks may shed light on additional reasons behind their performance discrepancies.
Acknowledgments.
The authors thank Joseph Hardin for helpful discussions and feedback. Author Watson-Parris acknowledges funding from the European Union’s Horizon 2020 research and innovation programme iMIRACLI under Marie Skłodowska-Curie Grant Agreement 860100. Author Geiss is partially supported by the “Enabling Aerosol–Cloud Interactions at Global Convection-Permitting Scales (EAGLES)” project (project 74358), sponsored by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, Earth System Model Development (ESMD) program area.
Data availability statement.
The ClimateBench data are available online (https://doi.org/10.5281/zenodo.5196512). The code used to generate, train, and evaluate our models can also be found online (https://github.com/yikwill/randomly-wired-nn).
APPENDIX
Additional Results
In this section, we present additional results from our RMSE and analytical experiments. Tables A1 and A2 show the best spatial, global, and total RMSE performance of each model at each parameter count tested. Figure A1 shows the results of the weight-freezing analysis experiment for the CNN-LSTM RandDense architecture.
Best spatial, global, and total RMSE performance for each model class and predicted variable across all network depths at 1 M parameters, along with the original CNN-LSTM model from Watson-Parris et al. (2022). Lower is better, and the better RMSE between the standard and RandDense models is highlighted in boldface type.
As in Fig. 9, but for 10 CNN-LSTM RandDense networks vs 10 CNN-LSTM RandDense networks with frozen weights from the best-performing standard CNN-LSTM network.
REFERENCES
Abadi, M., and Coauthors, 2016: TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv, 1603.04467v2, https://doi.org/10.48550/arXiv.1603.04467.
Ahmed, A., R. C. Deo, A. Ghahramani, N. Raj, Q. Feng, Z. Yin, and L. Yang, 2021: LSTM integrated with Boruta-random forest optimiser for soil moisture estimation under RCP4.5 and RCP8.5 global warming scenarios. Stochastic Environ. Res. Risk Assess., 35, 1851–1881, https://doi.org/10.1007/s00477-021-01969-3.
Beusch, L., L. Gudmundsson, and S. I. Seneviratne, 2020: Emulating Earth system model temperatures with MESMER: From global mean temperature trajectories to grid-point-level realizations on land. Earth Syst. Dyn., 11, 139–159, https://doi.org/10.5194/esd-11-139-2020.
Bose, S. K., C. P. Lawrence, Z. Liu, K. Makarenko, R. M. J. van Damme, H. J. Broersma, and W. G. van der Wiel, 2015: Evolution of a designless nanoparticle network into reconfigurable Boolean logic. Nat. Nanotechnol., 10, 1048–1052, https://doi.org/10.1038/nnano.2015.207.
Bretherton, C. S., and Coauthors, 2022: Correcting coarse-grid weather and climate models by machine learning from global storm-resolving simulations. J. Adv. Model. Earth Syst., 14, e2021MS002794, https://doi.org/10.1029/2021MS002794.
Brun, O., Y. Yin, E. Gelenbe, Y. M. Kadioglu, J. Augusto-Gonzalez, and M. Ramos, 2018: Deep learning with dense random neural networks for detecting attacks against IoT-connected home environments. First International ISCIS Security Workshop 2018, E. Gelenbe et al., Eds., Communications in Computer and Information Science, Vol. 821, Springer, 79–89, https://doi.org/10.1007/978-3-319-95189-8.
Chantry, M., S. Hatfield, P. Dueben, I. Polichtchouk, and T. Palmer, 2021: Machine learning emulation of gravity wave drag in numerical weather forecasting. J. Adv. Model. Earth Syst., 13, e2021MS002477, https://doi.org/10.1029/2021MS002477.
Collins, M., R. E. Chandler, P. M. Cox, J. M. Huthnance, J. Rougier, and D. B. Stephenson, 2012: Quantifying future climate change. Nat. Climate Change, 2, 403–409, https://doi.org/10.1038/nclimate1414.
Cortes, C., M. Mohri, and A. Rostamizadeh, 2012: L2 regularization for learning kernels. arXiv, 1205.2653v1, https://doi.org/10.48550/arXiv.1205.2653.
Elsken, T., J. H. Metzen, and F. Hutter, 2019: Neural architecture search: A survey. J. Mach. Learn. Res., 20, 1–21, https://doi.org/10.48550/arXiv.1808.05377.
Eyring, V., S. Bony, G. A. Meehl, C. A. Senior, B. Stevens, R. J. Stouffer, and K. E. Taylor, 2016: Overview of the Coupled Model Intercomparison Project phase 6 (CMIP6) experimental design and organization. Geosci. Model Dev., 9, 1937–1958, https://doi.org/10.5194/gmd-9-1937-2016.
Geiss, A., P.-L. Ma, B. Singh, and J. C. Hardin, 2023: Emulating aerosol optics with randomly generated neural networks. Geosci. Model Dev., 16, 2355–2370, https://doi.org/10.5194/gmd-16-2355-2023.
Gelenbe, E., and Y. Yin, 2016: Deep learning with random neural networks. 2016 Int. Joint Conf. on Neural Networks (IJCNN), Vancouver, BC, Canada, Institute of Electrical and Electronics Engineers, 1633–1638, https://doi.org/10.1109/IJCNN.2016.7727393.
Ghorbani, M. A., R. C. Deo, Z. M. Yaseen, M. H. Kashani, and B. Mohammadi, 2018: Pan evaporation prediction using a hybrid Multilayer Perceptron-Firefly Algorithm (MLP-FFA) model: Case study in north Iran. Theor. Appl. Climatol., 133, 1119–1131, https://doi.org/10.1007/s00704-017-2244-0.
Gillett, N. P., and Coauthors, 2016: The Detection and Attribution Model Intercomparison Project (DAMIP v1.0) contribution to CMIP6. Geosci. Model Dev., 9, 3685–3697, https://doi.org/10.5194/gmd-9-3685-2016.
Holder, C., A. Gnanadesikan, and M. Aude-Pradal, 2022: Using neural network ensembles to separate ocean biogeochemical and physical drivers of phytoplankton biogeography in Earth system models. Geosci. Model Dev., 15, 1595–1617, https://doi.org/10.5194/gmd-15-1595-2022.
Irrgang, C., N. Boers, M. Sonnewald, E. A. Barnes, C. Kadow, J. Staneva, and J. Saynisch-Wagner, 2021: Towards neural Earth system modelling by integrating artificial intelligence in Earth system science. Nat. Mach. Intell., 3, 667–674, https://doi.org/10.1038/s42256-021-00374-3.
Kasim, M., and Coauthors, 2021: Building high accuracy emulators for scientific simulations with deep neural architecture search. Mach. Learn.: Sci. Technol., 3, 015013, https://doi.org/10.1088/2632-2153/ac3ffa.
Kingma, D. P., and J. Ba, 2014: Adam: A method for stochastic optimization. arXiv, 1412.6980v9, https://doi.org/10.48550/arXiv.1412.6980.
Krasnopolsky, V. M., M. S. Fox-Rabinovitz, and D. V. Chalikov, 2005: New approach to calculation of atmospheric model physics: Accurate and fast neural network emulation of longwave radiation in a climate model. Mon. Wea. Rev., 133, 1370–1383, https://doi.org/10.1175/MWR2923.1.
Lee, C. S., E. Sohn, J. D. Park, and J.-D. Jang, 2019: Estimation of soil moisture using deep learning based on satellite data: A case study of South Korea. GIsci. Remote Sens., 56, 43–67, https://doi.org/10.1080/15481603.2018.1489943.
Li, Z., F. Liu, W. Yang, S. Peng, and J. Zhou, 2021: A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Networks Learn. Syst., 33, 6999–7019, https://doi.org/10.1109/TNNLS.2021.3084827.
Liu, M., Y. Huang, Z. Li, B. Tong, Z. Liu, M. Sun, F. Jiang, and H. Zhang, 2020: The applicability of LSTM-KNN model for real-time flood forecasting in different climate zones in China. Water, 12, 440, https://doi.org/10.3390/w12020440.
Mansfield, L. A., P. J. Nowack, M. Kasoar, R. G. Everitt, W. J. Collins, and A. Voulgarakis, 2020: Predicting global patterns of long-term climate change from short-term simulations using machine learning. npj Climate Atmos. Sci., 3, 44, https://doi.org/10.1038/s41612-020-00148-5.
Meinshausen, M., S. C. B. Raper, and T. M. L. Wigley, 2011: Emulating coupled atmosphere-ocean and carbon cycle models with a simpler model, MAGICC6—Part 1: Model description and calibration. Atmos. Chem. Phys., 11, 1417–1456, https://doi.org/10.5194/acp-11-1417-2011.
Mooers, G., M. Pritchard, T. Beucler, J. Ott, G. Yacalis, P. Baldi, and P. Gentine, 2021: Assessing the potential of deep learning for emulating cloud superparameterization in climate models with real-geography boundary conditions. J. Adv. Model. Earth Syst., 13, e2020MS002385, https://doi.org/10.1029/2020MS002385.
Mounier, A., L. Raynaud, L. Rottner, M. Plu, P. Arbogast, M. Kreitz, L. Mignan, and B. Touzé, 2022: Detection of bow echoes in kilometer-scale forecasts using a convolutional neural network. Artif. Intell. Earth Syst., 1, e210010, https://doi.org/10.1175/AIES-D-21-0010.1.
Nikolaev, A., I. Richter, and P. Sadowski, 2020: Deep learning for climate models of the Atlantic Ocean. AAAI Spring Symp.: MLPS, 2020, Alexandria, VA, National Science Foundation, https://par.nsf.gov/biblio/10273992.
O’Neill, B. C., and Coauthors, 2016: The Scenario Model Intercomparison Project (ScenarioMIP) for CMIP6. Geosci. Model Dev., 9, 3461–3482, https://doi.org/10.5194/gmd-9-3461-2016.
Orhan, A. E., and X. Pitkow, 2017: Skip connections eliminate singularities. arXiv, 1701.09175v8, https://doi.org/10.48550/arXiv.1701.09175.
O’Shea, K., and R. Nash, 2015: An introduction to convolutional neural networks. arXiv, 1511.08458v2, https://doi.org/10.48550/arXiv.1511.08458.
Price, I., and S. Rasp, 2022: Increasing the accuracy and resolution of precipitation forecasts using deep generative models. Proc. 25th Int. Conf. on Artificial Intelligence and Statistics, Online, PMLR, 151, 10 555–10 571, https://proceedings.mlr.press/v151/price22a.html.
Rasp, S., 2020: Coupled online learning as a way to tackle instabilities and biases in neural network parameterizations: General algorithms and Lorenz 96 case study (v1.0). Geosci. Model Dev., 13, 2185–2196, https://doi.org/10.5194/gmd-13-2185-2020.
Rasp, S., and N. Thuerey, 2021: Data-driven medium-range weather prediction with a Resnet pretrained on climate simulations: A new model for WeatherBench. J. Adv. Model. Earth Syst., 13, e2020MS002405, https://doi.org/10.1029/2020MS002405.
Rasp, S., M. S. Pritchard, and P. Gentine, 2018: Deep learning to represent subgrid processes in climate models. Proc. Natl. Acad. Sci. USA, 115, 9684–9689, https://doi.org/10.1073/pnas.1810286115.
Rasp, S., P. D. Dueben, S. Scher, J. A. Weyn, S. Mouatadid, and N. Thuerey, 2020: WeatherBench: A benchmark data set for data-driven weather forecasting. J. Adv. Model. Earth Syst., 12, e2020MS002203, https://doi.org/10.1029/2020MS002203.
Rozos, E., P. Dimitriadis, K. Mazi, and A. D. Koussis, 2021: A multilayer perceptron model for stochastic synthesis. Hydrology, 8, 67, https://doi.org/10.3390/hydrology8020067.
Schneider, T., T. Bischoff, and G. H. Haug, 2014: Migrations and dynamics of the intertropical convergence zone. Nature, 513, 45–53, https://doi.org/10.1038/nature13636.
Seifert, A., and S. Rasp, 2020: Potential and limitations of machine learning for modeling warm-rain cloud microphysical processes. J. Adv. Model. Earth Syst., 12, e2020MS002301, https://doi.org/10.1029/2020MS002301.
Seland, Ø., and Coauthors, 2020: Overview of the Norwegian Earth System Model (NorESM2) and key climate response of CMIP6 DECK, historical, and scenario simulations. Geosci. Model Dev., 13, 6165–6200, https://doi.org/10.5194/gmd-13-6165-2020.
Silva, S. J., C. L. Heald, and A. B. Guenther, 2020: Development of a reduced-complexity plant canopy physics surrogate model for use in chemical transport models: A case study with GEOS-Chem v12.3.0. Geosci. Model Dev., 13, 2569–2585, https://doi.org/10.5194/gmd-13-2569-2020.
Silva, S. J., P.-L. Ma, J. C. Hardin, and D. Rothenberg, 2021: Physically regularized machine learning emulators of aerosol activation. Geosci. Model Dev., 14, 3067–3077, https://doi.org/10.5194/gmd-14-3067-2021.
Tebaldi, C., A. Armbruster, H. P. Engler, and R. Link, 2020: Emulating climate extreme indices. Environ. Res. Lett., 15, 074006, https://doi.org/10.1088/1748-9326/ab8332.
Tran Anh, D., S. P. Van, T. D. Dang, and L. P. Hoang, 2019: Downscaling rainfall using deep learning long short-term memory and feedforward neural network. Int. J. Climatol., 39, 4170–4188, https://doi.org/10.1002/joc.6066.
Trebing, K., T. Staǹczyk, and S. Mehrkanoon, 2021: SmaAt-UNet: Precipitation nowcasting using a small attention-UNet architecture. Pattern Recognit. Lett., 145, 178–186, https://doi.org/10.1016/j.patrec.2021.01.036.
Watson-Parris, D., and Coauthors, 2022: ClimateBench v1.0: A benchmark for data-driven climate projections. J. Adv. Model. Earth Syst., 14, e2021MS002954, https://doi.org/10.1029/2021MS002954.
Xie, S., A. Kirillov, R. Girshick, and K. He, 2019: Exploring randomly wired neural networks for image recognition. Proc. IEEE/CVF Int. Conf. on Computer Vision, Seoul, South Korea, Institute of Electrical and Electronics Engineers, 1284–1293, https://doi.org/10.1109/ICCV.2019.00137.
Yoo, C., D. Han, J. Im, and B. Bechtel, 2019: Comparison between convolutional neural networks and random forest for local climate zone classification in mega urban areas using Landsat images. ISPRS J. Photogramm. Remote Sens., 157, 155–170, https://doi.org/10.1016/j.isprsjprs.2019.09.009.
Yuval, J., and P. A. O’Gorman, 2020: Stable machine-learning parameterization of subgrid processes for climate modeling at a range of resolutions. Nat. Commun., 11, 3295, https://doi.org/10.1038/s41467-020-17142-3.
Yuval, J., P. A. O’Gorman, and C. N. Hill, 2021: Use of neural networks for stable, accurate and physically consistent parameterization of subgrid atmospheric processes with good performance at reduced precision. Geophys. Res. Lett., 48, e2020GL091363, https://doi.org/10.1029/2020GL091363.