1. Introduction
Climate networks are constructed to find complex structures such as teleconnections (Boers et al. 2019), clusters (Rheinwalt et al. 2015), hubs, regime transitions (Fan et al. 2018), and bottlenecks (Donges et al. 2009a) in the climatic system. Network-based approaches have shown considerable improvements in the prediction of several climate phenomena (Ludescher et al. 2021) such as El Niño events (Ludescher et al. 2014), extreme regional precipitation patterns (Boers et al. 2014), and anomalous polar vortex dynamics (Kretschmer et al. 2017). Typically, climate networks are constructed using a three-step procedure. First, choose a dataset of climatic variables, such as temperature or precipitation, measured on a fixed spatial grid. Then choose a notion of similarity between pairs of locations based on the corresponding time series in the dataset. Finally, construct a network with spatial locations as nodes and with edges between those pairs of locations that have the strongest similarities. Since we only have access to noisy time series of finite length, the calculated similarity values between pairs of locations will be noisy themselves: they are subject to estimation variability. As a consequence, any climate network that is constructed from a finite amount of data may contain false edges (which should not be present) and miss true edges (which should be present). This leads us to the following important questions that have not received enough attention so far. Which kinds of distortions are induced in climate networks by the sampling variability of the underlying time series? Which features of climate networks can be attributed to underlying structure, and which are random artifacts due to finite-sample variation? These are the questions we discuss in this paper from a decidedly statistical point of view.
First observe that a climate network is built on a large number of pairwise similarity estimates: if our grid consists of 10^{4} locations, a naive procedure needs to estimate 10^{8} pairwise similarities. Even extremely well-behaved estimators with small variability will create a nonnegligible number of wrong edges in the network. But it does not take many "wrong" or "missing" edges to distort important structural network characteristics. Even a single false long-range edge can substantially distort important network measures such as shortest path lengths, small-world properties, centrality and betweenness measures, or the emergence of teleconnections. Moreover, through the local correlation structure that is inherent in climate data, wrong edges can propagate, leading to many wrong edges and even inducing wrong "link bundles," that is, distinct regions connected by multiple edges.
To assess the severity of this problem, we introduce a new null model for sampling time series that shares important properties with Earth's climate system but at the same time is simple enough that we can control it, understand it, and simulate from it. To achieve this, we employ a locally correlated, isotropic data-generating process: isotropic random fields on the sphere. The key feature of this model is that the similarity of two time series depends only on the distance between the respective locations, nothing else. Locations that are close by tend to have more similar time series than locations that are far apart. Our model can thus capture important properties of real climate networks, such as link-length distributions, but through its isotropic nature it is simple enough that erroneous patterns in the network can be clearly identified as statistical distortions. We introduce time dependence via a vector autoregression process [VAR(1)], which allows us to adjust the autocorrelation at each node. Consequently, the temporal autocorrelation structure can depend on the location, but the spatial correlation structure, and with it the ground truth network, remains approximately isotropic.
Sampling our null model allows us to systematically investigate the connection between noise in the similarity estimates and distortions in the network. Although the simulated data are only locally correlated, we find that complex network structures arise in the estimated networks because of imperfect estimation. For example, global spatially coherent betweenness patterns emerge (Fig. 5) that do not represent any ground truth structure. We also study the influence of choosing different similarity estimators, the influence of network sparsity on betweenness, distortions of other popular network measures, the emergence of spurious link bundles and high-degree clusters, and the biases introduced through anisotropic estimation variability. For example, we find that inappropriate estimators can result in arbitrarily wrong network estimates (Fig. 2). On the other hand, we illustrate that a conscious choice of network-construction techniques may increase robustness with respect to ground truth networks and may uncover different dynamics in the system. To filter out spurious edges, Boers et al. (2019) consider only those links significant that appear not alone but in bundled form. We show that when the data are locally highly correlated, the presence of one spurious long link makes the presence of neighboring links quite likely as well, leading to entire spurious link bundles.
In addition to our simulation results, we validate our findings with reanalysis data from the ERA5 project. We find that the tendency to form bundled connections increases with the strength of local correlations (Fig. 10). The betweenness structure in temperature-based networks depends strongly on the network density and the dataset used. This raises the question of whether finding a "betweenness backbone," as in Donges et al. (2009a), is possible and meaningful. For most climatic variables, we detect severe instability for long links. The nodes of highest degree tend to have high autocorrelation (cf. Paluš et al. 2011). We conjecture that some edges from these nodes are spurious and are induced by the increased estimation variability at these nodes.
The wide range of potential empirical distortions makes a reassessment of many of the previous findings in the climate network literature desirable. However, this poses a big challenge: while our simulation study is based on a model with known ground truth, such a ground truth is not available for real-world climate networks. Yet, as our simulations show, it is extremely important to assess the reliability and robustness of findings based on empirical climate networks. Typically, researchers use approaches based on node-wise reshuffling of the time series or edge-wise reshuffling of the given network. But we demonstrate that such techniques are inadequate to capture the inherent uncertainty of the network. Instead, we propose to estimate the variability in the network by computing multiple correlation estimates for each edge while retaining the original spatial similarity structure. With this approach, we can obtain a statistically meaningful sense of the reliability of network patterns constructed from real, noisy time series.
Our main contributions are summarized as follows:

We introduce a VAR(1) process of isotropic random fields on the sphere as a suitable null model for geophysical processes, for which deviations from the ground truth are easily detectable.

We identify systematically occurring random artifacts and distortions in empirical networks and analyze why they arise.

We show which design decisions increase the robustness of constructed networks.

We validate our findings with ERA5 data.

We discuss the shortcomings of common network resampling procedures for significance evaluations and propose a statistically more meaningful framework based on jointly resampling the underlying time series.
The rest of the paper is organized as follows. In section 2 we describe typical network-construction steps and introduce the isotropic data-generating process we employ in our simulations. We present intuitions about the ground truth networks and explain when spurious behavior is to be expected in the empirical networks. Section 3 demonstrates several common patterns of spurious behavior in typically constructed networks, categorized into 1) estimator selection, 2) network measures, 3) link bundles, and 4) anisotropy. Section 4 points out problematic practices in significance testing and potential improvements. Finally, section 5 provides conclusions and possibilities for future work. For readers who are unfamiliar with climate network methodology, we have assembled an introduction in the online supplemental material (section A).
2. Network construction for data from spatiotemporal random fields
To study artifacts that are introduced by estimation procedures, we need access to a "ground truth network," which is not available for real-world climate data. We therefore introduce a manageable stochastic process over Earth with known ground truth structure. We then use the model to evaluate how estimation procedures introduce random artifacts into the network estimates depending on network-construction steps and the features of the data distribution.
a. Climate network construction
The generic procedure of constructing climate networks from spatiotemporal data is described in algorithm 1: most studies deal with univariate real-valued data at each point in time and space, such as temperature, pressure, or precipitation, and so do our experiments. Given a dataset of such time series on a fixed grid, the similarity between pairs of grid points is estimated. Popular similarity measures include the Pearson correlation, mutual information (MI), and event synchronization (Quian Quiroga et al. 2002). There are several ways to construct a network based on all pairwise similarity estimates. Most often, unweighted density-threshold graphs are constructed (Tsonis and Roebber 2004; Yamasaki et al. 2008; Agarwal et al. 2019; Kittel et al. 2021), which means that an edge of weight 1 is formed between two grid locations υ_{i} and υ_{j} when the corresponding similarity estimate ranks among the largest estimated values, with the number of edges determined by a prescribed network density.
Algorithm 1: Functional network construction from spatially gridded data
Input: Spatiotemporal data {X_{it}}_{i∈[p], t∈[n]}, X_{i·} = (X_{i1}, …, X_{in}) of time length n measured on p fixed locations V = {υ_{i} : i ∈ [p]} in some metric space
1) Estimate the similarity between two points υ_{i} and υ_{j} based on the data and some estimator
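The density-threshold step can be sketched in a few lines (an illustrative Python sketch of algorithm 1's thresholding, not the authors' code; the random toy data below merely stand in for gridded anomalies):

```python
import numpy as np

def density_threshold_network(similarity, density):
    """Unweighted density-threshold graph from a symmetric pairwise
    similarity matrix: keep the top `density` fraction of the
    p*(p-1)/2 possible edges."""
    p = similarity.shape[0]
    iu = np.triu_indices(p, k=1)             # all unordered node pairs
    vals = similarity[iu]
    n_edges = int(round(density * vals.size))
    keep = np.argsort(vals)[::-1][:n_edges]  # indices of largest similarities
    adj = np.zeros((p, p), dtype=int)
    adj[iu[0][keep], iu[1][keep]] = 1
    return adj + adj.T                       # symmetric 0/1 adjacency

# toy usage: Pearson similarity estimates from short time series at p locations
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))           # p = 20 nodes, n = 100 time steps
A = density_threshold_network(np.corrcoef(X), density=0.05)
```

Note that the threshold itself is implicit: it is whatever similarity value separates the kept edges from the discarded ones at the prescribed density.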
b. Stochastic ground truth model for spatiotemporal data
To quantify how the similarity estimation process influences the induced networks, we specify a ground truth model using random fields over the sphere, approximating Earth's surface. Our goal is not to give an accurate model of Earth's climate but to point out generic patterns of spurious behavior in networks constructed from a limited amount of spatiotemporal data. The simpler the data-generating model remains, the more accurately we can attribute spurious behavior to certain features of the data distribution or the employed network-construction steps. We use a data-generating process in which the correlation between data measured at different locations depends only on the distance between the locations. Such isotropic random fields are common in geostatistics (Cressie 1993; Lang and Schwab 2015), and they allow us to identify anisotropies in the estimated networks as erroneous.
Here, we first introduce the spatial stochastic process and, in a second step, add time dependence. The mathematical object that we are going to use is an "isotropic Gaussian random field." A random field assigns a real value to every point of the sphere; imagine a surface temperature field. Centering (and possibly detrending and normalizing) the data at each point in space yields a zero-mean random field, representing so-called (detrended standardized) anomalies. When a Gaussian random field is evaluated at finitely many points, its values are jointly Gaussian distributed. For isotropic random fields, the covariance between two points is solely determined by the distance between the points. Hence, a zero-mean isotropic Gaussian random field is fully characterized by its covariance function k, which determines how smoothly and to what extent the random field varies across space.
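Sampling such a field at finitely many grid points amounts to drawing from a multivariate Gaussian whose covariance matrix is built from pairwise great-circle distances. The sketch below is our own minimal version and uses an exponential covariance k(d) = exp(−d/ℓ) as a simple stand-in for the Matérn family used in the paper:

```python
import numpy as np

def great_circle_dist(lat, lon):
    """Pairwise great-circle distances on the unit sphere; lat/lon in radians."""
    cos_d = (np.sin(lat[:, None]) * np.sin(lat[None, :])
             + np.cos(lat[:, None]) * np.cos(lat[None, :])
             * np.cos(lon[:, None] - lon[None, :]))
    return np.arccos(np.clip(cos_d, -1.0, 1.0))

def sample_isotropic_field(lat, lon, length_scale, n_time, seed=0):
    """n_time independent snapshots of a zero-mean isotropic Gaussian field.
    The covariance k(d) = exp(-d / length_scale) depends on the locations
    only through their distance, which is exactly the isotropy property."""
    cov = np.exp(-great_circle_dist(lat, lon) / length_scale)
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(len(lat)), cov, size=n_time)
```

Nearby points then receive systematically more similar values than distant ones, which is the only structure a ground truth network built on this field can contain.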
One popular covariance function is the Matérn covariance function (section B.2 in the supplemental material), whose smoothness parameter ν and scale parameter control how rough the field is and over which distances it remains appreciably correlated.
We introduce time dependence via a vector autoregression [VAR(1)] process (section B.1 in the supplemental material) that allows us to assign any desired lag-1 autocorrelation to each node of the network. Under this basic time dependence, we will be able to separate the effect of autocorrelation on the estimation procedure from other influences.
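A minimal sketch of this construction (our own simplified version with a diagonal autoregression matrix diag(φ) and a fixed burn-in; the paper's exact innovation scaling may differ):

```python
import numpy as np

def var1_field(cov, phi, n_time, seed=0):
    """VAR(1) process x_t = diag(phi) * x_{t-1} + eps_t with spatially
    correlated innovations eps_t ~ N(0, cov). With a diagonal
    autoregression matrix, phi[i] is exactly the stationary lag-1
    autocorrelation at node i."""
    p = cov.shape[0]
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(cov)              # factor for correlated innovations
    x = np.zeros(p)
    out = np.empty((n_time, p))
    for t in range(-200, n_time):            # 200 burn-in steps toward stationarity
        x = phi * x + L @ rng.standard_normal(p)
        if t >= 0:
            out[t] = x
    return out
```

Because the stationary lag-1 cross-covariance satisfies Γ(1) = diag(φ) Γ(0), the node-wise autocorrelation Γ(1)_{ii}/Γ(0)_{ii} equals φ_{i} regardless of the innovation covariance, while the spatial structure is inherited from cov.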
c. Ground truth networks and imprecise estimates
1) Ground truth networks
If we fix a grid and a network-construction method, a ground truth data distribution leads to a "true network" on this grid. Given the underlying data distribution, as in our model, we can calculate the true pairwise similarities between grid points. The network-construction procedure then determines the ground truth network based on the true similarities. For example, a ground truth density-threshold graph simply consists of the edges corresponding to the largest similarity values. How much ground truth structure of a climatic process can be captured in the ground truth network depends on the choice of climatic variable, grid, similarity measure, and network-construction scheme. Another question is then whether this ideal network can be approximated with the available empirical data and estimators.
2) Errors in the estimated networks
Given a finite amount of data, we only have access to imperfect estimates of the true similarity values. Consequently, networks constructed from data as well as their characteristics will only be estimates of the corresponding ground truth quantities and inherit intrinsic variability. When the chosen similarity estimator is not suitable for the estimation task, the constructed graphs can look arbitrarily wrong (Fig. 2). However, by simple inspection, it is not possible to judge whether a constructed climate network reflects “true” aspects of the physical system or whether it is dominated by random artifacts introduced through the estimation procedure. For this reason, in our simulations we mainly address the following question: How do the estimated networks and their characteristics differ from their corresponding ground truth quantities? The answer depends on the properties of the random field, the employed estimator, and the considered network characteristic (see section 3). To get started, let us discuss how wrong individual edges occur and then how wrong link bundles arise.
3) Errors in individual edges
Errors in the network occur because the similarity estimates between locations vary around the ground truth similarity values. False-positive edges are wrongly included in the empirical network but are not present in the ground truth network; false-negative edges appear in the ground truth network but are missing in the empirical network. Let us understand when these two cases arise in threshold graphs. Assume that the similarity estimate
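Both error types are easy to produce in a toy version of the setting; the sketch below uses a one-dimensional ring instead of the sphere (an illustrative configuration of ours, not the paper's spherical model):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, density = 40, 150, 0.05
ang = 2 * np.pi * np.arange(p) / p
d = np.abs(ang[:, None] - ang[None, :])
d = np.minimum(d, 2 * np.pi - d)             # circular (ring) distance
true_corr = np.exp(-d / 0.5)                 # isotropic ground truth similarity

def top_edge_set(sim, density):
    """Edge set of the density-threshold graph for a similarity matrix."""
    iu = np.triu_indices(sim.shape[0], k=1)
    k = int(density * iu[0].size)
    keep = np.argsort(sim[iu])[::-1][:k]
    return set(zip(iu[0][keep].tolist(), iu[1][keep].tolist()))

X = rng.multivariate_normal(np.zeros(p), true_corr, size=n)
truth = top_edge_set(true_corr, density)
est = top_edge_set(np.corrcoef(X.T), density)
false_pos = est - truth    # edges included although absent from the ground truth
false_neg = truth - est    # ground-truth edges missed by the estimate
```

Since both graphs contain the same number of edges by construction, every false positive forces a false negative, so in density-threshold graphs the two error sets always have equal size.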
4) How errors spread locally due to covariance
When the data are locally highly correlated, as is typical for climatic variables, this correlation may carry over to the joint distribution of the similarity estimates. As a result, an error may propagate from one edge to edges on neighboring nodes in the following way: when the similarity estimate on one false edge is spuriously large, it is likely that the correlation estimates on edges of neighboring nodes are similarly large, so that these neighboring edges are also falsely included in the empirical network, resulting in false bundles of edges. In density-threshold graphs, this makes some regions spuriously appear denser than others. A formal argument is given in section C in the supplemental material. Combining the thoughts above, false bundles of edges occur with high probability when measurements from nearby points are highly correlated and the similarity estimates are imprecise. Related simulation results can be found in section 3e.
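This mechanism can be demonstrated with just four nodes, two tightly correlated pairs in two truly independent regions (again an illustrative toy configuration of ours, not the paper's simulation):

```python
import numpy as np

rng = np.random.default_rng(4)
rho_local, n, reps = 0.9, 50, 2000
# a1, a2 lie close together in region A; b1, b2 close together in region B;
# the two regions are truly uncorrelated, so edges a1-b1 and a2-b2 are false
cov = np.eye(4)
cov[0, 1] = cov[1, 0] = rho_local
cov[2, 3] = cov[3, 2] = rho_local
e1, e2 = [], []
for _ in range(reps):
    x = rng.multivariate_normal(np.zeros(4), cov, size=n)
    c = np.corrcoef(x.T)
    e1.append(c[0, 2])     # estimate on candidate edge a1-b1
    e2.append(c[1, 3])     # estimate on the neighbouring edge a2-b2
r = np.corrcoef(e1, e2)[0, 1]
# r is clearly positive: whenever one long edge is spuriously strong,
# the neighbouring long edge tends to be strong as well, yielding a bundle
```

Both candidate edges have ground truth similarity zero, yet their estimation errors are strongly correlated through the local correlation within each region.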
3. Spurious behavior in networks from finite samples
In this section, we explore the effects that imprecise estimates impose on commonly constructed climate networks. We do so by simulating the isotropic Gaussian random fields introduced above.
a. Network construction
We construct networks following algorithm 1. To approximately remove the effects of anisotropic grids, we generate a Fekete grid (Bendito et al. 2007) after 1000 iterations with 5981 points, approximately realizing an isotropic grid of 2.5° resolution. Unless stated otherwise, we sample an
b. Visualizations
For network visualizations we use a Fekete grid with 1483 points, approximately realizing a 5° resolution. In our figures, dashed lines always denote ground truth values. Uncertainty bands cover the range between the empirical 0.025 and 0.975 quantiles from 30 independent repetitions. The letter x and the circle denote the 95% and 99% quantiles of a distribution, respectively, and triangles denote the minimal and maximal values of a distribution.
c. Estimation
1) Unsuitable estimators can induce many wrong edges
(i) Problem
When the marginal distribution on the nodes is heavy tailed, as in the case of precipitation data, commonly applied estimators become inadequate if they are sensitive to outliers. For instance, the naive correlation estimator has unusably large variance under heavy-tailed distributions; yet it has been applied to precipitation data in several studies (Scarsoglio et al. 2013; Ekhtiari et al. 2019, 2021).
(ii) Simulation results
Choosing σ^{2} allows us to adjust the heaviness of the tails continuously: while small values of σ^{2} approximately recover the original correlation function k, increasing σ enhances the tail strength exponentially. For large σ^{2}, the correlation between grid points quickly drops to 0 with distance. We choose σ^{2} = 10, which is of the correct order of magnitude to fit precipitation tails in the global mean (Papalexiou 2018). Note that the precipitation distribution on Earth crucially depends on the location. Here, we solely aim to illustrate the intricacy of handling heavy-tailed data through isotropic simulations. Figure 2 demonstrates that the empirical correlation fails as a correlation estimator for data sampled from H. Because the empirical covariance is an average of log-normal random variables, it is a high-variance estimator of the population covariance. The large estimation variability induces many (possibly bundled) false and missing links. For short time series, single events can dominate the series at each node. When such events occur at the same time for a pair of nodes, the nodes will show high empirical correlation, although the true correlation may be zero.
(iii) Consequences
Removing outliers or finding a suitable data transformation mitigates this problem. By design, log(⋅) would transform the random field back to a Gaussian random field. Alternatively, we can employ an estimator that is robust to heavy-tailed distributions (Minsker and Wei 2017). Since the Spearman correlation is invariant under monotone transformations, it produces exactly the same results for the normal and the log-normal data. An alternative to the Spearman correlation with faster convergence rates is Kendall's tau (Gilpin 1993). Barber et al. (2019) consider several of the above ideas for estimating correlation in the context of hydrologic data.
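The rank invariance is easy to verify numerically: since exp(σ·) is strictly increasing, the ranks, and with them the Spearman correlation, of log-normal data agree exactly with those of the underlying Gaussian data (a self-contained sketch with a simple rank-based Spearman implementation, assuming continuous data without ties):

```python
import numpy as np

def spearman(a, b):
    """Spearman correlation via ranks (valid for continuous data, no ties)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(2)
n, rho = 200, 0.6
x, y = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
u, v = np.exp(np.sqrt(10.0) * x), np.exp(np.sqrt(10.0) * y)   # sigma^2 = 10

# identical to machine precision: ranks are unchanged by exp(sigma * .)
assert spearman(x, y) == spearman(u, v)
```

The empirical Pearson correlation of (u, v), in contrast, fluctuates wildly under these heavy tails, which is exactly the failure mode shown in Fig. 2.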
2) Comparing similarity measures as well as estimators
(i) Problem
While the empirical Pearson correlation estimator has often been equated with the corresponding similarity measure, we can strictly reduce the estimation variance in Pearson correlation networks by considering a different estimator, even for Gaussian data. Radebach et al. (2013) have shown that many characteristic network patterns are already visible in Pearson correlation networks, and historically, the Pearson correlation has been the most popular similarity measure (e.g., Tsonis and Roebber 2004; Tsonis et al. 2008; Yamasaki et al. 2008; Paluš et al. 2011; Fan et al. 2022). Since estimators of mutual information need to capture arbitrarily complex dependence structures, they tend to require even larger sample sizes than correlation estimators to achieve reliable accuracy, resulting in more spurious behavior at the same sample size.
(ii) Simulation results
We consider the following similarity measures and their estimators. For MI, we use a simple binning estimator as applied in the complex network Python package pyunicorn (Donges et al. 2015), where we use
Figure 3 shows the false discovery rate (FDR), which measures the fraction of false links, as a function of network density. Although fewer true links are available for small densities, sparse graphs are more accurate in terms of the FDR, because the correlation values of ground truth links can be empirically distinguished with high certainty from most false links under estimation variability. For random fields with long length scales, this empirical separability remains intact for longer edges; therefore, the FDR remains low up to larger network densities (see also Fig. 9). Given the same amount of data, more complex similarity measures perform worse. For sparse graphs, the Hilbert–Schmidt independence criterion (HSIC) shows promising performance compared with the mutual information estimators. The bias-corrected KSG estimator is computationally expensive with fluctuating performance, and the binned MI estimator is strictly worse than HSIC. Unweighted empirical Pearson and Ledoit–Wolf density-threshold networks coincide because they produce the same ranking of edge weights. Figure 3c shows the error of the estimated correlation matrix compared with the ground truth in Frobenius norm under various hyperparameter settings. The ground truth correlations grow monotonically from left to right. Note that the empirical correlation matrix makes large estimation errors irrespective of the parameters of the random field. The linear Ledoit–Wolf estimator improves the estimation in all cases. Consequently, fixed-threshold networks, as well as weighted networks, are better approximated by the Ledoit–Wolf estimator. The less correlated the grid points are, the lower the error of the Ledoit–Wolf estimator, as it shrinks the correlation estimates toward the identity matrix.
For real data, we cannot calculate the FDR because we do not know which links are false. Instead, we generate bootstrap samples of all time points and create perturbed datasets by including the measurements on the entire grid at these time points. We then construct several networks with the same density from these perturbed datasets and finally compute the fraction of differing links between pairs of sampled Pearson correlation networks (Fig. 4a). With this procedure, we approximate the network distribution induced by the dataset (see section 4b). High autocorrelation necessitates block-wise bootstrapping to obtain consistent estimates, as the network variability is increasingly underestimated with increasing autocorrelation. The results should be seen as a conservative preliminary insight into the intrinsic network variability and the number of unstable edges. A robust network-construction procedure should yield a low fraction of fluctuating links across bootstrap draws. Narrow uncertainty bands indicate that varying weighting of climatic regimes among the bootstrap samples has limited influence on the networks. Observe that in t2m and pr networks an alarming fraction of links fluctuate (Fig. 4a), while networks from smooth variables with long length scales, such as sp and z850, fluctuate less (consistent with Fig. 9). In contrast to the synthetic data, the curves do not grow monotonically in the sparse regime. Therefore, resampled networks may be helpful in choosing a maximally robust density that minimizes the fraction of varying edges in the empirical networks. A density of up to 0.01 seems to be an appropriate choice for t2m; larger densities dramatically decrease the network robustness. The differing links do not contain short edges (Fig. 4b), as the correlation values on short edges are consistently large. Longer links heavily depend on the sampled time points and are sensitive to slight perturbations of the correlation estimates.
Hence, the decision as to which long links to include should not be based on a single correlation estimate. Geopotential heights behave differently and become more stable at larger densities as they have a correlation structure with an extremely large length scale.
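The resampling procedure described above can be sketched as follows (our minimal version with an i.i.d. bootstrap of whole time slices; the block-wise variant required under high autocorrelation is omitted for brevity):

```python
import numpy as np

def edge_set(X, density):
    """Edges of the density-threshold Pearson network of X (n_time x p)."""
    sim = np.corrcoef(X.T)
    iu = np.triu_indices(sim.shape[0], k=1)
    k = int(density * iu[0].size)
    keep = np.argsort(sim[iu])[::-1][:k]
    return set(zip(iu[0][keep].tolist(), iu[1][keep].tolist()))

def fraction_differing_links(X, density, n_boot=10, seed=0):
    """Resample whole time slices (preserving the spatial field at every
    drawn time point), rebuild same-density networks, and average the
    fraction of links that differ between pairs of bootstrap networks."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    nets = [edge_set(X[rng.integers(0, n, size=n)], density)
            for _ in range(n_boot)]
    k = len(nets[0])
    pair_fracs = [len(a ^ b) / (2 * k)          # symmetric difference of edge sets
                  for i, a in enumerate(nets) for b in nets[i + 1:]]
    return float(np.mean(pair_fracs))
```

Resampling entire time slices is what retains the spatial similarity structure; shuffling each node's series independently, by contrast, would destroy exactly the dependence whose uncertainty we want to quantify.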
(iii) Consequences
The selection of appropriate similarity measures depends on how much data are available. Mutual information estimators require much more data to yield reliable results than do correlation estimators. HSIC shows promising performance in our experiments. It may be worth exploring other alternatives to MI, such as Romano et al. (2018), in the future. Although not particularly well suited for random field data, the Ledoit–Wolf estimator is a uniform improvement over naive empirical cross correlation when estimating weighted Pearson correlation networks. Future work should put more focus on which estimators perform best on meteorological data.
For most climatic observables, we detect severe network variability for long links. To quantify the structural and link robustness of constructed networks, we need resampling procedures that adequately capture the intrinsic network variability (section 4b). Small network densities yield more robust networks in terms of differing/fluctuating links in resampled networks.
d. Network measures
1) Extreme betweenness values are unreliable in sparse networks
(i) Problem
The betweenness centrality of a node υ_{k} is given by the expression BC(υ_{k}) = Σ_{i≠k≠j} σ_{ij}(υ_{k})/σ_{ij}, where σ_{ij} denotes the number of shortest paths between υ_{i} and υ_{j}, and σ_{ij}(υ_{k}) denotes the number of those paths that pass through υ_{k}.
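For unweighted graphs, betweenness is typically computed with Brandes' algorithm; the following compact reference sketch counts each unordered pair once and omits any normalization constant:

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for unweighted, undirected betweenness centrality.
    adj: dict node -> list of neighbours."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, pred = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:                           # BFS counting shortest paths
            v = q.popleft(); stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; pred[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                       # back-propagate pair dependencies
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    for v in bc:                           # each unordered pair was counted twice
        bc[v] /= 2.0
    return bc
```

On a path graph 0–1–2–3–4, for example, the middle node lies on the shortest paths of the four pairs (0,3), (0,4), (1,3), (1,4), illustrating how few pathways concentrate all the betweenness in sparse, locally connected networks.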
(ii) Simulation results
Figure 5 shows betweenness maps of networks, constructed as in Donges et al. (2009a), from independent draws of our locally correlated, isotropic model. Because of the standard Gaussian grid, randomly fluctuating betweenness "backbones" emerge that form global pathways but do not represent ground truth structure. The density is chosen such that there exist few pathways between the well-intraconnected poles. Which exact nodes lie on the few shortest paths between the node-dense polar regions depends on which false links are formed in the region between the poles. Phrased differently, the supposedly important nodes with high betweenness are precisely the ones with false edges, and they vary between different realizations of the process. The difference map (Fig. 2 of Donges et al. 2009a) between networks from different datasets shows strikingly similar north–south pathways. The lattice-like structure makes the sparse ground truth network highly susceptible in terms of betweenness. Since the data-generating process is isotropic and the Gaussian grid is symmetric with respect to longitudinal rotations, nodes on the same latitude have equal betweenness values in the ground truth network. The empirical networks, in contrast, consistently show systematically different betweenness distributions. While the maximal betweenness value in the ground truth network is 1.68 × 10^{−3}, the empirical networks have much more pronounced extreme betweenness values of 2.36 × 10^{−2} ± 8.23 × 10^{−4}. A visualization of the heavy-tailed betweenness distribution, as well as an analysis of Forman curvature, can be found in section F.1 in the supplemental material. Even sparse ground truth networks are highly sensitive to slight perturbations of the grid and network density, as there exist few important pathways connecting different regions in a sparse, locally connected network.
Figure 6 shows betweenness maps of networks, constructed from daily t2m as in Donges et al. (2009a), but on the approximately isotropic Fekete grid. The betweenness "backbone" fluctuates more in the sparse than in the dense networks. The maps in Fig. 6 and the maps presented in the original study all look different because betweenness is unstable with respect to grid choice, dataset, and network density. This raises the question of which, if any, map shows a true betweenness "backbone." The dense networks (Figs. 6d–f) are more stable than the sparse networks (Figs. 6a–c) with respect to perturbations of the network density. But only after validating that the finite-sample network variability is also low (see section 4b) should patterns uncovered by network methods be interpreted with domain knowledge to generate novel insights. Here, a domain expert might point out stable ENSO-like patterns in the eastern Pacific in the stable dense networks (Figs. 6d–f) and be inspired to investigate surprising patterns revealed by the betweenness map more closely.
(iii) Consequences
Although forming sparse networks yields a better false discovery rate, some network characteristics become extremely sensitive to small perturbations of the networks and their explanatory power diminishes, even in ground truth networks. For each network measure of interest, the choice of network density constitutes a tradeoff between the false discovery rate and the robustness of the measure. Here too, independent ensemble members can help to identify stable patterns (see section 4b). When a network measure fluctuates too much, as betweenness does in sparse networks, results should not be overinterpreted.
2) Empirical distributions of network characteristics are distorted
(i) Problem
Here, we present further perspectives on systematic empirical distortions of network measures. Most studies focus on the extremal nodes for any network measure, interpreting these as particularly important. Our simulation results show that, under data scarcity, random nodes appear spuriously important in the empirical networks, not representing important nodes in the ground truth network. Several studies have constructed Pearson correlation networks from sliding windows with 2.5° and finer resolution (Radebach et al. 2013; Hlinka et al. 2014; Fan et al. 2017, 2018, 2022). Our simulation results suggest that the naive correlation estimator and the short time scale are risk factors for false edges and distortions in global measures and extreme values of the networks.
(ii) Simulation results
Although all nodes have roughly the same degree/clustering coefficient in the ground truth graph, the observed degree/clustering coefficient distribution is more spread out in empirical networks (Fig. 7). The random distortions in empirical networks are similar in type and extent across independent realizations. Spuriously extreme nodes in the empirical networks vary between independent realizations and do not reflect important or clustered nodes in the ground truth graphs. While the average unweighted degree is consistent by construction, the weighted degree is systematically biased upward, because more links with low ground truth correlation are available to be overestimated than links with large ground truth correlation to be underestimated. The empirical (weighted) clustering coefficient is strongly biased downward, as spurious links connect otherwise disconnected regions, and bundled connections are not formed between entire neighborhoods. Spurious teleconnections serve as shortcuts in the networks and lead to systematically smaller shortest path lengths. Another network measure related to the clustering coefficient and shortest path lengths is small-worldness. Since both network measures are extremely distorted in the empirical networks, a reliable conclusion about ground truth small-worldness cannot be drawn from the empirical networks in our setting. A more detailed treatment of small-worldness in spatially extended systems can be found in Bialonski et al. (2010) and Hlinka et al. (2017); the conclusions of both studies resemble ours. The length distribution of the spurious links (longer than the dashed lines in Fig. 7c) follows the number of available links at each distance and is therefore sinusoidal. This occurs when the corresponding true correlation values are empirically indistinguishable and has also been found in climate networks based on event synchronization (Boers et al. 2019).
Under large length scales, empirical networks contain fewer erroneous edges and a more accurate link-length distribution up to higher network densities. On the other hand, empirical networks show larger spreads of the degree, clustering coefficient, and shortest path length distributions, as false links occur in bundles (see section 3e). Under small length scales, bundling behavior is less pronounced, so that the number of spurious links averages out, resulting in a more concentrated degree distribution, although more false links occur overall.
(iii) Consequences
When empirical networks are constructed with scarce data, they possess systematically different characteristics compared with the ground truth structure. In our setting, distributions of popular node measures are more spread out as well as systematically biased. These distortions do not become apparent by considering empirical summary statistics based on time series resampling because the empirical behavior remains consistent between independent repetitions. However, given multiple sufficiently independent network estimates, spurious and ground truth extreme nodes can be distinguished, depending on how systematically they reappear in several networks. Consequently, the distribution as well as the extreme values of network measures in empirical networks can primarily be the result of estimation errors and should not be overinterpreted. In particular, whenever the number of formed links scales with the number of available links, large estimation variability can be the cause, so researchers should make additional efforts to justify the correctness of their network when this link-length distribution arises, as in Boers et al. (2019).
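This link-count scaling can be illustrated with a minimal simulation (a hypothetical sketch, not the paper's experimental setup): with independent noise on every node, the fraction of links formed at each distance in a density-threshold network simply matches the overall network density, so the link-length distribution mirrors the number of available links.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 100, 50

# random grid on the unit sphere
points = rng.standard_normal((p, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)

# iid noise on every node: there are no true links at all
data = rng.standard_normal((p, n))
S = np.abs(np.corrcoef(data))
iu = np.triu_indices(p, k=1)
sims = S[iu]
dists = np.arccos(np.clip(points @ points.T, -1.0, 1.0))[iu]  # great-circle distances

density = 0.05
thr = np.quantile(sims, 1 - density)
selected = sims > thr

# fraction of available links that get formed, per distance bin
bins = np.quantile(dists, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
fractions = [selected[(dists >= lo) & (dists <= hi)].mean()
             for lo, hi in zip(bins[:-1], bins[1:])]
# every fraction is close to the overall density, so the number of
# formed links at each distance scales with the number of available links
```

In this pure-noise setting, the distance histogram of formed links reproduces the distance histogram of available links, which on a spherical grid is sinusoidal.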
e. Local correlations give rise to spurious link bundles and high-degree clusters
1) Problem
As single false links occur with high probability in estimated networks, Boers et al. (2019) considered teleconnections in a climate network as significant only when a bundle of edges from one region to another is formed. As discussed in section 2c, this approach is unreliable when the underlying data are locally correlated, because edges tend to be formed in bundles. Spuriously dense regions in density-threshold graphs are another possible repercussion. Here, we provide empirical evidence of spurious bundling behavior.
2) Simulation results
Figure 8 shows the maximal distance of occurring link bundles (Figs. 8a,c) and the fraction of false links that belong to some bundle (Figs. 8b,d) for various notions of link bundles and for unweighted (Figs. 8a,b) and weighted (Figs. 8c,d) networks. The hyperparameters υ = 0.5 and
While large smoothness and length scales have a positive impact on the FDR (Fig. 9a), because the random field varies less in total, the links that do form tend to occur in bundles. The essential distributional parameter for sparse networks is the smoothness of the random field, as mostly short links are formed. Varying the length scale has a larger impact on denser graphs, as it determines the radius of spurious link bundles and the distance/network density at which ground truth correlations become empirically indistinguishable from 0. To measure whether there are regions of spuriously high degree due to the dependence between nodes, we find the ε-ball of maximal average degree (MAD) among all ε-balls B_{ε}(υ_{i}) (Fig. 9b). We then compute the same quantity for randomly permuted degree values, so that nodes with spuriously high degree are no longer spatially clustered. The MAD values of the empirical networks are consistently larger than the MAD values of shuffled nodes, so a clustering of high-degree nodes occurs irrespective of the hyperparameters of the random field. The pronounced bundling behavior for larger length scales is reflected in larger MAD values.
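The MAD comparison can be sketched as follows (an illustrative implementation with hypothetical helper names, using Euclidean distances on a toy planar grid rather than great-circle distances on the sphere):

```python
import numpy as np

def max_avg_degree(points, degrees, eps):
    """Maximal average degree (MAD): for every node, average the
    degrees of all nodes within distance eps (the eps-ball around
    it), then take the maximum over all balls."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return max(degrees[dists[i] <= eps].mean() for i in range(len(points)))

rng = np.random.default_rng(0)
points = rng.uniform(size=(200, 2))            # toy planar grid
degrees = rng.poisson(10, size=200).astype(float)

mad_observed = max_avg_degree(points, degrees, eps=0.1)
# baseline: permute the degrees so that high-degree nodes
# are no longer spatially clustered
mad_shuffled = max_avg_degree(points, rng.permutation(degrees), eps=0.1)
```

Comparing `mad_observed` against the distribution of `mad_shuffled` over many permutations indicates whether high-degree nodes are more spatially clustered than chance would suggest.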
For real data, we observe a strikingly monotonic relationship between the average local correlation and the fraction of long links (longer than 5000 km or 0.25π radians) that belong to some link bundle (Fig. 10). Most differing links do not belong to a bundle, but under large local correlations the number of fluctuating long links in bundles can become nonnegligible. We also observe that our simulated data, at a given local correlation level, tend to underestimate the fraction of links in bundles, indicating the existence of true bundled teleconnections in climatic variables.
3) Consequences
We have seen that bundled connections do not necessarily represent ground truth structure but can occur spuriously when the similarity estimates are locally correlated. Even without teleconnections, random regions can appear spuriously dense. These experiments also explain the distortion of the degree distribution in Fig. 7. Using edge weights can be helpful to distinguish strong from weak connections, as spurious connections tend to lie marginally above the threshold. We conclude that when the data are locally correlated, questions of edge significance cannot be easily addressed by considering bundling behavior. Only bundling behavior that exceeds the effects of the localized correlation structure can be considered significant. Given multiple sufficiently independent empirical networks, only ground truth connections would reappear in many networks with high probability.
f. Anisotropy
1) Anisotropic autocorrelation on the nodes causes biased empirical degree
(i) Problem
In practice, different locations have different autocorrelation patterns. Because of the higher effective heat capacity of the ocean, temperature over oceans has higher autocorrelation than over land (Eichner et al. 2003; Vallis 2011). Guez et al. (2014) argue that the disagreement between their similarity measures is primarily caused by high autocorrelation. Our simulation results suggest that the cause of this disagreement might more fundamentally be estimation errors that vary between similarity measures.
(ii) Simulation results
We simulate anisotropic autocorrelation (Fig. 11) by employing our VAR(1) model (section B.1 in the supplemental material). We initialize a random half of the points with a low lag-1 autocorrelation of 0.2 and the other half with a high lag-1 autocorrelation of 0.7. On average, empirical Spearman correlation estimates do not depend on the autocorrelation of adjacent nodes (Fig. 11a). However, the increased variance on highly autocorrelated nodes leads both to more spuriously low similarity estimates on edges with high ground truth correlation and to more spuriously high estimates on edges with small ground truth correlation. Since most ground truth correlations are small (Fig. 11b), overall, the number of high similarity values increases. Thus, nodes of higher autocorrelation show an increased average degree in threshold graphs (Fig. 11c).
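The effect of autocorrelation on estimation variance can be reproduced in a toy experiment (a simplified AR(1) sketch with hypothetical helper names, not the VAR(1) model from the supplement): the empirical correlation between two independent series fluctuates more when their autocorrelation is higher, producing more spuriously high similarity values.

```python
import numpy as np

def ar1(phi, n, rng):
    """Simulate an AR(1) series x_t = phi * x_{t-1} + eps_t."""
    x = np.zeros(n)
    eps = rng.standard_normal(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return x

def corr_spread(phi, n=200, reps=500, seed=0):
    """Standard deviation of the empirical correlation between two
    independent AR(1) series with lag-1 autocorrelation phi."""
    rng = np.random.default_rng(seed)
    rs = [np.corrcoef(ar1(phi, n, rng), ar1(phi, n, rng))[0, 1]
          for _ in range(reps)]
    return float(np.std(rs))

low_spread = corr_spread(0.2)
high_spread = corr_spread(0.7)
# although the true correlation is 0 in both cases, the estimates on
# highly autocorrelated node pairs fluctuate substantially more
```

Under a threshold rule, the wider estimate distribution for the highly autocorrelated pair translates directly into more spuriously formed edges.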
In real climate networks, the nodes of highest degree consistently have high lag-1 autocorrelation (Fig. 12). Together with our simulations, this suggests that anisotropic autocorrelation has nonnegligible spurious effects on the networks. Recalling Eq. (1), forming false links between highly autocorrelated nodes is much more likely than between nodes of small or intermediate autocorrelation. Hence, both false edges at nodes with high autocorrelation and missing edges at nodes with low autocorrelation have to be expected when some nodes attain autocorrelation values close to one, as for t2m.
(iii) Consequences
Under large isotropic autocorrelation, density-threshold networks have an increased variability but no degree bias. When the variability differs across locations, nodes with high variability receive more false edges than nodes with informative time series. State-of-the-art corrections are discussed in the next section. Using k-nearest neighbor (kNN) graphs prevents disregarding weakly autocorrelated locations. In kNN graphs, each node forms an edge to the k nodes with highest similarity. Although highly autocorrelated locations may still have more spuriously high empirical similarity values, weakly autocorrelated points attain comparable importance in terms of degree in unweighted as well as weighted kNN graphs.
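A kNN graph as described above might be constructed roughly as follows (a minimal sketch with hypothetical function names, symmetrizing the directed kNN relation with a maximum):

```python
import numpy as np

def knn_graph(similarity, k):
    """Build an unweighted kNN adjacency matrix: each node selects
    the k other nodes with highest similarity; the directed relation
    is symmetrized with a maximum (i~j if either selects the other)."""
    S = similarity.astype(float).copy()
    np.fill_diagonal(S, -np.inf)             # never select yourself
    A = np.zeros_like(S)
    top_k = np.argsort(-S, axis=1)[:, :k]    # top-k neighbors per row
    rows = np.repeat(np.arange(len(S)), k)
    A[rows, top_k.ravel()] = 1.0
    return np.maximum(A, A.T)

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 100))           # 30 nodes, 100 time steps
A = knn_graph(np.abs(np.corrcoef(X)), k=5)
# every node keeps at least k links, so weakly correlated
# locations are not disregarded
min_degree = int(A.sum(axis=1).min())
```

In contrast to a global density threshold, this construction guarantees a minimum degree of k for every node, which is the inductive bias discussed above.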
2) Fourier transform-based reshuffling reverses the autocorrelation-induced degree bias in sparse networks
(i) Problem
Instead of constructing density-threshold networks, some studies only include links that are significant with respect to similarity values from reshuffled data (Boers et al. 2014; Deza et al. 2015; Boers et al. 2019). For this purpose, the time series on each node is shuffled independently multiple times, and similarity values between these shuffled time series are calculated to determine the internal variability of the similarity estimates on each edge. High quantiles of this edgewise baseline distribution of similarity estimates impose restrictive thresholds above which ground truth dependence is likely. If the quantile and variance estimates are themselves noisy, yet another source of randomness is introduced into the network estimation procedure.
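For a single edge, this shuffle-based thresholding could look like the following sketch (hypothetical names; note that complete random shuffling, unlike IAAFT, also destroys the autocorrelation of each series):

```python
import numpy as np

def shuffle_threshold(x, y, q=0.95, n_shuffles=200, seed=0):
    """Edgewise baseline: correlate independently shuffled copies of
    both series many times and take a high quantile of the resulting
    null distribution as the significance threshold."""
    rng = np.random.default_rng(seed)
    null = [abs(np.corrcoef(rng.permutation(x), rng.permutation(y))[0, 1])
            for _ in range(n_shuffles)]
    return float(np.quantile(null, q))

rng = np.random.default_rng(2)
n = 300
signal = rng.standard_normal(n)
x = signal + 0.5 * rng.standard_normal(n)   # two series sharing a common signal
y = signal + 0.5 * rng.standard_normal(n)

r = abs(np.corrcoef(x, y)[0, 1])
thr = shuffle_threshold(x, y)
is_significant = r > thr                    # keep the link only if it clears the threshold
```

Since `thr` is itself estimated from a finite number of shuffles, its noise adds to the randomness of the resulting network, as noted above.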
(ii) Simulation results
We follow the popular approach to construct density-threshold networks not from the correlation estimates
Perfect quantile estimates allow us to determine, in the following way, network densities that lead to empirical networks containing few false edges. Under isotropic autocorrelation, perfect quantile estimates induce a common threshold on the entire network. Applying this threshold to the ground truth correlation matrix induces a density in the ground truth network. For our range of hyperparameters, the densities induced by the 0.95 quantile range from 0.007 (for υ = 0.5,
(iii) Consequences
In settings of low autocorrelation, completely random reshuffling yields reliable estimates of empirical correlation quantiles, resulting in a controlled false discovery rate. However, it cannot detect anisotropic autocorrelation and therefore cannot correct the autocorrelation-induced degree bias. The IAAFT-based empirical networks correct this bias in dense networks but are strongly biased toward edges with low estimation variance in sparse z-score networks, which results in many false links. The empirical density of significance-based networks only yields an upper bound on a desirable network density. Since no considered network-construction technique reduces the variance of the edge estimates, they cannot vastly improve on density-threshold networks.
3) Anisotropic noise levels on the nodes cause nodes to be disconnected
(i) Problem
Observational data are generally affected by measurement errors or other sources of noise. Under isotropic additive white noise, variance in the graph construction increases (see section F.3 in the supplemental material). Even worse, anisotropic noise levels crucially distort how well nodes are connected in the graph.
A central difficulty in recovering ground truth structure is distinguishing which part of the noise is inherent to the dynamical system (aleatoric noise) and which part could be reduced through more sophisticated measurement, preprocessing, and estimation procedures (epistemic noise). While aleatoric noise affects ground truth networks and can be seen as an offset of the ground truth correlation function, everything else is an empirical distortion. Nodes over land are commonly less connected in climate networks (Donges et al. 2009b) because the underlying distributional characteristics differ across sea and different geological conditions over land. This distributional difference is at least partially aleatoric. Varying availability and reliability of measurements, on the other hand, induce epistemic noise.^{1} In cases where we acknowledge that we cannot satisfactorily judge how much our data are affected by epistemic noise, a conservative approach is to reduce the effects of anisotropy in the network construction. kNN graphs may offer a useful inductive bias in such uncertain settings.
(ii) Simulation results
By adding white noise on the Northern Hemisphere, we decrease the population correlation for these nodes. As a result, we find that threshold graphs (especially sparse ones) mostly form edges on the nodes with less noise (Fig. 14). To represent all nodes equally in the graph, we propose to use kNN graphs instead. Using weighted edges preserves the spectrum of node and link importance in terms of weighted degree.
(iii) Consequences
When the available data are affected by epistemic noise, the connectivity structure in the network is spuriously altered. The effects of anisotropic noise on the empirical networks can be reduced by using kNN graphs. When the ground truth correlations are higher in some regions than in others (anisotropic aleatoric noise), kNN graphs can also be more informative, because weakly correlated nodes are not well represented in density-threshold networks. kNN graphs impose a different inductive bias, which may be useful to detect different patterns. Although ground truth kNN graphs differ severely from ground truth density-threshold networks in anisotropic settings, given useful weights, they have been shown to be useful and robust in machine-learning applications (von Luxburg 2007), while not sacrificing interpretability.
4) Ground truth networks on anisotropic grids
(i) Problem
Anisotropic grids usually introduce unintentional biases into networks, so that differing node connectivity does not reflect differing correlation structure in the data. Given an anisotropic grid, the nodes will have unequal characteristics in the ground truth network even under an isotropic correlation structure. It is well known that a regular Gaussian grid is geometrically undesirable due to its two singularities at the poles. Area weighting (Heitzig et al. 2011) becomes crucial to correct the distortions in the network. Another effect of anisotropy does not stem from the grid choice but from geographical reality. If we consider an isotropic field with a monotonically decaying correlation function on an approximately isotropic grid that is only defined over oceans, as is the case for sea surface temperatures, then the nodes in the population network will not be isotropic but will encode geometric information about the distribution of land and sea across Earth (Fig. 15).
(ii) Simulation results
The ground truth networks constructed from a monotonically decaying isotropic correlation structure simply consist of the shortest possible links. The anisotropic distribution of grid points introduces a bias to the networks that is visible in various network measures. For example, points on paths connecting different regions and points in geometric bottlenecks show higher betweenness values in sparse networks, points with large uninterrupted surroundings show higher degree, and points in inlets show a larger clustering coefficient, because neighbors toward similar directions are often close to each other and thus also connected. The network density functions as a scale parameter similar to a bandwidth in kernel-density estimation, since the connection radius increases with network density.
(iii) Consequences
Estimated networks suggest misleading conclusions when false edges distort their characteristics. Even the ground truth network is the result of many design decisions that can lead to prominent behavior, readily misinterpreted when its cause is not correctly identified. For betweenness, even the ground truth values are very sensitive to small variations in network density, so conclusions are not necessarily robust even in the ground truth network. Boundary correction (Rheinwalt et al. 2012) has been proposed for networks that do not cover the entire Earth. A similar correction, using locally connected networks, could be employed to remove the influence of the distribution of land and sea across Earth and thereby quantify anisotropic correlation behavior.
4. Assessing significance from network ensembles
In practice, researchers are usually confronted with datasets of limited size from an unknown distribution. Based on a single network constructed with state-of-the-art climate network techniques, they cannot judge how many edges are included or excluded because of estimation errors. Given time length n and number of grid points p, the regime of “small sample size,” where the observed networks significantly differ from the ground truth, can mean any order of magnitude for n and any ratio n/p, depending on the dynamics of the spatiotemporal system, measurement error, the employed estimator, and the subsequent network-construction and evaluation steps. This makes general rules of thumb untenable, and solid uncertainty estimation based on unrestrictive assumptions crucial for the value of a study. Constructing networks with various similarity measures, datasets, resolutions, and network-construction steps (see, e.g., Radebach et al. 2013) can offer qualitative reassurance that observed patterns do not just occur under a specific setting. Significance tests offer a more quantitative approach. Here, we first discuss the shortcomings of common procedures to quantify significance in section 4a and then offer a new probabilistic framework in section 4b that addresses these shortcomings.
a. Resampling in current practice
The usual approach to quantify the significance of certain findings such as hubs, pathways or teleconnections is to construct an ensemble of networks that share certain aspects of the originally constructed network while randomizing with respect to everything else through reshuffling. The effective null hypothesis of such a permutation test (also called a surrogate test) is the (limit) probability distribution over the networks that the ensemble induces. Needless to say, any permutation test can only be as meaningful as its effective null hypothesis.
All previously applied reshuffling approaches for climate networks that have been reported in the literature can be categorized into two types. Either reshuffling is directly performed on the edges to recover, for example, the original degree sequence or the degree sequence and link-length distribution (Wiedermann et al. 2015, GeoModel II), or the time series are shuffled nodewise to preserve the marginal time series dynamics with methods such as the iterative amplitude-adjusted Fourier transform (Schreiber and Schmitz 1996). In the latter case, the ensemble networks are then constructed from the shuffled dataset.
1) Nodewise reshuffling
Whenever researchers have performed permutation tests that recover marginal time series dynamics, these tests have disregarded the spatial distribution of the data completely. The nodes are assumed to be independent, so that the typical localized correlation structure, which results in a link-length distribution of predominantly short links, is replaced by a uniform one (Fig. 16a). Since the task is to construct spatial networks, such an ensemble is structurally unrealistic and does not induce a physically meaningful network distribution.
2) Edge reshuffling
Whenever fixing concrete network characteristics to be preserved, a preliminary question has to be addressed: which spatial, as well as temporal, dynamics of the system or network at hand need to be preserved by the ensemble? The authors of the influential paper by Donges et al. (2009a) use a permutation test with a preserved degree sequence. But what is the physical meaning of the observed degree sequence? One consistently reappearing feature of the underlying physical system is the localized correlation structure, which results in a link-length distribution of predominantly short links. As for nodewise reshuffling, this link-length distribution is destroyed and replaced by a sinusoidal one. As a consequence of this inaccurate ensemble distribution, the authors interpret the property that the nodes of highest betweenness show degrees below average in the original network as significant behavior. In contrast, we have shown in Fig. 5 that this property is a bias introduced through the Gaussian grid and the lattice-like connectivity behavior of the original network. The original connectivity structure gets destroyed by uniform edge reshuffling; hence, the ensemble members have different betweenness properties. We have seen that betweenness is a highly unstable measure. An indication of the robustness of the discovered betweenness “backbone” would be if it consistently reappeared for various subsets of the data, as well as for many network densities and similarity estimators. Since only a single instance of the climate network is presented, it is not clear whether the presented backbone appears by chance.
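Degree-preserving edge reshuffling of this kind is commonly implemented via repeated double-edge swaps. A minimal self-contained sketch (our own toy illustration with hypothetical helper names, not the published procedure): starting from a ring lattice of short links, swaps keep every node's degree fixed while destroying the link-length structure.

```python
import random

def double_edge_swap(edge_list, n_swaps, seed=0):
    """Degree-preserving randomization: repeatedly pick two edges
    (a, b) and (c, d) and rewire them to (a, d) and (c, b), rejecting
    swaps that would create self-loops or duplicate edges."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edge_list]
    present = {frozenset(e) for e in edges}
    swaps = 0
    while swaps < n_swaps:
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:          # would create a self-loop
            continue
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue                        # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        swaps += 1
    return edges

def degree_sequence(edge_list, p):
    deg = [0] * p
    for a, b in edge_list:
        deg[a] += 1
        deg[b] += 1
    return deg

# a toy spatially localized network: a ring lattice with only short links
p = 50
edges = [(i, (i + k) % p) for i in range(p) for k in (1, 2)]
shuffled = double_edge_swap(edges, n_swaps=200)
# degrees are preserved, but the short-link structure is destroyed
```

The swap operation makes explicit what this null model conserves (the degree sequence) and what it discards (the link-length distribution), which is exactly the point criticized above.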
A first improvement over random link redistribution or independence between the time series is GeoModel II (Wiedermann et al. 2015), which approximately preserves both the degree distribution and the nodewise link-length distribution. But it still does not recover the natural tendency of the network to form bundled links. Given that one link is formed, the likelihood of a neighboring link increases in a locally correlated random field. When simply recovering the total link-length distribution, as GeoModel II does, this likelihood does not increase (Fig. 16b). Furthermore, fixing the degree sequence of the nodes might not be representative of the distribution of the constructed graph: in section 3e, we have seen spurious high-degree clusters at random locations. In conclusion, no explicit resampling scheme has been proposed that recovers the joint link distribution of locally correlated random fields.
Proposing a concrete network resampling scheme always runs the risk of missing or distorting an important aspect of the underlying dynamical system and estimation procedure. The tendency to form bundled connections depends on the localized correlation structure. The expected number of spurious links depends on the (co)variability of the estimates, the spatial correlation structure of the random field, and the autocorrelation of the time series, among many other factors. All these aspects do not even cover the more complex time series dynamics that we might want to account for.
b. Distributionpreserving ensembles
Instead of trying to solve the impossible problem of finding which exact characteristics to preserve for the climatic question of interest, we propose to construct network ensembles such that all members approximately reflect the network distribution that originates in the underlying physical processes. As discussed above, state-of-the-art network resampling approaches calculate many estimates of surrogate networks that do not reflect the original distribution. Crucially, they only obtain one noisy estimate of the similarity value on each edge. With multiple estimates, we not only get a more robust total estimate but also an approximation of the estimation variability on each edge. Instead of a single network estimate, we have access to an ensemble of equally valuable estimates that allows us to judge whether a network estimation procedure really is trustworthy and empowers us to tackle most of the issues presented in section 3. Instead of reshuffling the data nodewise, we propose to jointly subsample or resample time windows for both end points of an edge, or even for all nodes simultaneously, preserving the original dynamics in space. There is an abundance of resampling techniques for multivariate time series; one popular approach is block bootstrapping (Lahiri 2003; Shao and Tu 1995), another is subsampling (Politis et al. 1999). Individual ensemble members should both reflect the same data distribution (induced by the selection of included time points) and be sufficiently independent (by representing different time windows that are far enough apart). Such an ensemble could be constructed by the following pipeline:

Decide on a networkconstruction procedure with unbiased edge estimates, such that the dynamics of interest behave robustly within the network distribution.

Construct many networks with the same procedure (i) by subsampling/block bootstrapping data in time on all grid points simultaneously.

Evaluate reoccurring edges and patterns such as link bundles.
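The pipeline above might be sketched as follows (a hypothetical minimal implementation on synthetic data, using a moving-block bootstrap and an absolute-Pearson density-threshold network; names and parameter choices are our own):

```python
import numpy as np

def moving_block_bootstrap(data, block_len, rng):
    """Resample time points in contiguous blocks, jointly for all
    nodes, so the spatial structure at each time step is preserved."""
    n_time = data.shape[1]
    n_blocks = int(np.ceil(n_time / block_len))
    starts = rng.integers(0, n_time - block_len + 1, size=n_blocks)
    cols = np.concatenate([np.arange(s, s + block_len) for s in starts])[:n_time]
    return data[:, cols]

def threshold_network(data, density):
    """Density-threshold network from absolute Pearson correlations."""
    S = np.abs(np.corrcoef(data))
    np.fill_diagonal(S, 0.0)
    thr = np.quantile(S[np.triu_indices_from(S, k=1)], 1 - density)
    return (S > thr).astype(int)

rng = np.random.default_rng(3)
p, n = 40, 400
data = rng.standard_normal((p, n))
data[1:] += 0.7 * data[:-1]          # weak correlation between neighboring nodes

# (ii) many networks from jointly resampled data
ensemble = [threshold_network(moving_block_bootstrap(data, 20, rng), 0.05)
            for _ in range(30)]
# (iii) evaluate reoccurring edges
edge_frequency = np.mean(ensemble, axis=0)
stable_edges = edge_frequency > 0.9
```

Because all nodes are resampled with the same blocks, the spatial dependence within each time step is carried into every ensemble member, in contrast to nodewise reshuffling.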
If the quantity of interest can be expressed by summary statistics, the pipeline as a whole should be unbiased to yield calibrated confidence intervals. We have seen that the maximal degree or the linklength distribution of the networks can be systematically biased. Crucially, estimates of single edges or fixed neighborhoods are unbiased when the employed similarity estimator is unbiased. In this case, given a large enough ensemble, uncertainty estimates and p values would be precise. In practice, unbiased estimators often increase estimation variance so much that the uncertainty becomes too large. In complex estimation tasks like mutual information estimation, this approach might reveal that empirically constructed networks are too incoherent due to a lack of available data. Instead of communicating false certainty, the bootstrap approach would suggest conclusions like “the amount of available data does not suffice to significantly detect this teleconnection with our smallbias mutual information estimator.”
From a dynamical systems perspective, one could argue that different time points represent a different state of the dynamical system. From a statistical perspective, a distribution over networks is implicitly chosen when selecting the dataset. The remaining task is to define what characterizes the state of the dynamical system one is interested in, such as different ENSO phases. If we construct an ensemble that is independent of the state, we simply recover links and patterns that are present most of the time. Teleconnections that are only active some of the time become difficult to distinguish from noisy connections.
A single network that is more accurate than the result of a single similarity estimate per edge can be derived by selecting the edges that appear in most ensemble members. A related approach is the variable selection method stability selection (Meinshausen and Bühlmann 2010). In practice, stability selection often markedly improves the baseline variable selection or structure estimation algorithm. Another approach would be to accept only links with small estimation variability.
While most studies directly perform bootstrapping on the graph structure and not on underlying data (Chen et al. 2018; Levin and Levina 2019), a similar idea has been previously suggested (Friedman et al. 1999). To the best of our knowledge, it has not yet been applied in climate science. In practice, data scarcity, distribution shift, and varying regimes of the dynamical system complicate finding a suitable resampling or subsampling scheme that produces both sufficiently independent and identically distributed ensemble members without biases. In both bootstrap as well as subsampling techniques, a suitable choice of window size depends on the autocorrelation of the time series at hand. Highly anisotropic autocorrelation (Fig. 12) complicates designing a consistent procedure for all nodes simultaneously. With these complications and the journal’s page limit in mind, we postpone proposing an explicit ensemble construction procedure to future work.
5. Conclusions
When constructing networks from data, it is not obvious that they reflect ground truth structure. Given a finite amount of data, similarity estimates contain estimation errors. Under such nonnegligible estimation variability, we find several types of spurious behavior using typical networkconstruction schemes:

Not only the choice of similarity measure, but also the choice of estimators, is an influential design decision. The properties of the estimator determine how well single empirical networks approximate the population network and if the uncertainty of an ensemble is accurate.

Global properties of finite-sample networks, such as averages, variances, and maxima of network measures, or the spectrum of the adjacency matrix, can be heavily distorted.

Links occur in bundles when the data are locally correlated and the estimator transmits the correlation structure. This leads to spurious link bundles and regions of spuriously high or low degree.

Under anisotropic autocorrelation or marginal distributions, differing data distributions on the nodes cause anisotropic estimation variability on the edges, which in turn introduces biases in the empirical networks. Anisotropic noise levels may lead to weak representation of some nodes in the network; weighted kNN graphs reduce anisotropic behavior via the inductive bias of representing all nodes equally, while differences can still be detected via the edge weights.

We find sparse networks to be more accurate in terms of false discovery rate and spurious teleconnections. Yet popular network measures such as betweenness become highly unstable in sparse networks. This constitutes a different tradeoff for each estimation task. Random fields with larger length scales allow for denser networks, but also lead to more pronounced bundling behavior.
Given the variety and extremeness of possible empirical distortions, it is crucial to reliably estimate how “trustworthy” an empirical network is. State-of-the-art resampling procedures only capture particular parts of the empirical network distribution and consequently miss other possibly relevant aspects of the dynamical system. When the implicit null hypothesis of the resampling technique does not capture all relevant properties of the dynamical system, the value of the significance test is questionable. Specifically, surrogate tests that hypothesize independence between nodes do not reflect a physically meaningful null hypothesis when the dynamical system is locally correlated, as the link-length distribution typically clearly shows. Random artifacts that stem from local correlation structures will then appear significant. In the past, climate network approaches have been based on calculating a single similarity estimate on each edge; we propose to generate multiple estimates via sub- or resampling in time, in order to estimate the estimation variance on each edge. This allows us to approximate the intrinsic distribution over the constructed networks induced by the underlying data distribution and the chosen estimation procedure.
Future work
Most importantly, future network studies in climate science should estimate the underlying estimation error on each edge in order to argue about significance in a statistically meaningful way. Given scarce data, the variability of similarities cannot be precisely estimated (Fig. 13a). Further challenges inherent to climatic data have to be addressed to successfully implement our proposed framework of network ensembles, which represent the underlying dynamics by design and are sufficiently independent in time. Adjusting the window length for block bootstrapping on each edge to the autocorrelation of the nodes could yield consistent estimates of the estimation variance on all edges. More work is needed to develop good similarity estimators, robust network-construction procedures, and resampling techniques for the climate context, respecting distribution shifts, varying regimes, anisotropy, and measurement errors in chaotic dynamical systems.
While we only construct undirected networks, directed empirical networks as well as EOFs suffer from insufficient data in analogous ways. From estimating lagged correlations to probabilistic graphical models (Koller and Friedman 2009) and causal networks (Runge et al. 2019a), a similar simulation analysis could quantify the strengths and weaknesses of different network-construction procedures. Some works have reduced the network size by clustering spatial areas of temporally coherent behavior before network construction (cf. Rheinwalt et al. 2015; Fountalis et al. 2018; Runge et al. 2019b). Whether such an approach is statistically beneficial is task dependent and deserves a more thorough consideration in future work. On the one hand, too many nodes may induce many errors and a systematic distortion in the spectrum of empirical covariance matrices (Donoho et al. 2018; Lam 2020; Morales-Jimenez et al. 2021); on the other hand, single errors have a higher impact in smaller networks, and node estimation constitutes yet another challenging task in the network-construction pipeline. A key question in this regard is: how can we minimize the edgewise estimation variance in the downsized network? Previous work has selected representative locations for each cluster, but perhaps an aggregation of regional information can boost statistical robustness further.
More complex time dynamics could introduce other kinds of spuriousness into empirical estimates that we were not able to cover with our simple autoregressive model. A theoretical analysis of networks constructed from spatiotemporal data would also be very insightful. Exploring alternatives to mutual information, such as the Hilbert–Schmidt independence criterion or the randomized information coefficient, could yield novel insights into the dynamics of the climatic system.
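For concreteness, a sketch of the biased empirical Hilbert–Schmidt independence criterion with Gaussian kernels (the bandwidth is fixed to 1 purely for illustration, and the samples are treated as i.i.d.; for serially dependent data, significance would have to be assessed with, e.g., the wild bootstrap of Chwialkowski and Gretton 2014):

```python
import numpy as np

def hsic(x, y, sigma=1.0):
    """Biased empirical Hilbert-Schmidt independence criterion with
    Gaussian kernels of bandwidth sigma (samples treated as i.i.d.)."""
    n = len(x)
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        return np.exp(-d2 / (2.0 * sigma**2))
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(gram(x) @ H @ gram(y) @ H) / n**2

rng = np.random.default_rng(2)
a = rng.standard_normal(200)
b = rng.standard_normal(200)
print(hsic(a, a**2), hsic(a, b))  # nonlinear dependence vs. independence
```

Unlike Pearson correlation, HSIC also detects purely nonlinear dependence such as that between a and a**2, which is why it is a candidate similarity measure here.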
Networks have been constructed in several scientific fields to detect complex structures in spatiotemporal data. Geneticists try to identify connections between certain genes and the development of diseases by estimating Pearson correlation networks under the term weighted gene co-expression network analysis (Horvath 2011; Niu et al. 2019). Neuroscientists (Sporns 2010) aim to understand the functional connectivity in the brain with weighted voxel coactivation network analysis (Mumford et al. 2010). While in this work we focus on the application of functional networks in climate and geoscience, our conceptual findings hold in any domain where networks are constructed from spatiotemporal data.
In reanalysis datasets, data at nodes in regions of high measurement density can be extrapolated with higher certainty. The density of weather stations in the United States or Europe, for example, is much higher than in parts of Africa or South America. Estimating the measurement/extrapolation error at each node could alleviate the effects of an anisotropic data collection and extrapolation process. Until then, anisotropic measurement/extrapolation noise will continue to distort the constructed climate networks, and efforts should be made to gather more reliable measurements in neglected regions [cf. the overview of WMO weather stations (ArcGIS 2022)].
Acknowledgments.
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy (EXC 2064/1—Project 390727645). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Moritz Haas. We thank all members of the Theory of Machine Learning and the Machine Learning in Climate Science groups in Tübingen for helpful discussions. We thank all reviewers and our editor for valuable feedback. Finally, we thank Joe Guinness for his advice concerning the use of Matérn covariance on the sphere.
Data availability statement.
The datasets of Hersbach et al. (2018, 2019a,b) were downloaded from the Copernicus Climate Change Service Climate Data Store. Python code for reproducing all results in this paper can be found at https://github.com/moritzhaas/climate_nets_from_random_fields/.
REFERENCES
Agarwal, A., L. Caesar, N. Marwan, R. Maheswaran, B. Merz, and J. Kurths, 2019: Network-based identification and characterization of teleconnections on different scales. Sci. Rep., 9, 8808, https://doi.org/10.1038/s41598019454235.
ArcGIS, 2022: Map of WMO weather stations. ArcGIS, accessed 24 October 2022, https://www.arcgis.com/apps/mapviewer/index.html?layers=c3cbaceff97544a1a4df93674818b012.
Barber, C., J. R. Lamontagne, and R. M. Vogel, 2019: Improved estimators of correlation and R2 for skewed hydrologic data. Hydrol. Sci. J., 65, 87–101, https://doi.org/10.1080/02626667.2019.1686639.
Bendito, E., A. Carmona, A. M. Encinas, and J. M. Gesto, 2007: Estimation of Fekete points. J. Comput. Phys., 225, 2354–2376, https://doi.org/10.1016/j.jcp.2007.03.017.
Bialonski, S., M. T. Horstmann, and K. Lehnertz, 2010: From brain to earth and climate systems: Small-world interaction networks or not? Chaos, 20, 013134, https://doi.org/10.1063/1.3360561.
Boers, N., B. Bookhagen, N. Marwan, J. Kurths, and J. Marengo, 2013: Complex networks identify spatial patterns of extreme rainfall events of the South American monsoon system. Geophys. Res. Lett., 40, 4386–4392, https://doi.org/10.1002/grl.50681.
Boers, N., B. Bookhagen, H. M. J. Barbosa, N. Marwan, J. Kurths, and J. A. Marengo, 2014: Prediction of extreme floods in the eastern central Andes based on a complex networks approach. Nat. Commun., 5, 5199, https://doi.org/10.1038/ncomms6199.
Boers, N., B. Goswami, A. Rheinwalt, B. Bookhagen, B. Hoskins, and J. Kurths, 2019: Complex networks reveal global pattern of extreme-rainfall teleconnections. Nature, 566, 373–377, https://doi.org/10.1038/s415860180872x.
Cellucci, C. J., A. M. Albano, and P. E. Rapp, 2005: Statistical validation of mutual information calculations: Comparison of alternative numerical algorithms. Phys. Rev. E, 71, 066208, https://doi.org/10.1103/PhysRevE.71.066208.
Chen, Y., Y. R. Gel, V. Lyubchich, and K. Nezafati, 2018: Snowboot: Bootstrap methods for network inference. R J., 10, 95–113, https://doi.org/10.32614/RJ2018056.
Chwialkowski, K., and A. Gretton, 2014: A kernel independence test for random processes. Proc. 31st Int. Conf. on Machine Learning, Vol. 32, Beijing, China, Association for Computing Machinery, 1422–1430, https://dl.acm.org/doi/10.5555/3044805.3045051.
Cressie, N. A. C., 1993: Statistics for Spatial Data. Wiley Series in Probability and Statistics, Wiley, 900 pp.
Deza, J. I., M. Barreiro, and C. Masoller, 2015: Assessing the direction of climate interactions by means of complex networks and information theoretic tools. Chaos, 25, 033105, https://doi.org/10.1063/1.4914101.
Donges, J. F., Y. Zou, N. Marwan, and J. Kurths, 2009a: The backbone of the climate network. Europhys. Lett., 87, 48007, https://doi.org/10.1209/02955075/87/48007.
Donges, J. F., Y. Zou, N. Marwan, and J. Kurths, 2009b: Complex networks in climate dynamics. Eur. Phys. J. Spec. Top., 174, 157–179, https://doi.org/10.1140/epjst/e2009010982.
Donges, J. F., and Coauthors, 2015: Unified functional network and nonlinear time series analysis for complex systems science: The pyunicorn package. Chaos, 25, 113101, https://doi.org/10.1063/1.4934554.
Donoho, D., M. Gavish, and I. Johnstone, 2018: Optimal shrinkage of eigenvalues in the spiked covariance model. Ann. Stat., 46, 1742–1778.
Eichner, J. F., E. Koscielny-Bunde, A. Bunde, S. Havlin, and H. Schellnhuber, 2003: Power-law persistence and trends in the atmosphere: A detailed study of long temperature records. Phys. Rev. E, 68, 046133, https://doi.org/10.1103/PhysRevE.68.046133.
Ekhtiari, N., A. Agarwal, N. Marwan, and R. V. Donner, 2019: Disentangling the multiscale effects of sea-surface temperatures on global precipitation: A coupled networks approach. Chaos, 29, 063116, https://doi.org/10.1063/1.5095565.
Ekhtiari, N., C. Ciemer, C. Kirsch, and R. V. Donner, 2021: Coupled network analysis revealing global monthly scale co-variability patterns between sea-surface temperatures and precipitation in dependence on the ENSO state. Eur. Phys. J. Spec. Top., 230, 3019–3032, https://doi.org/10.1140/epjs/s1173402100168z.
Fan, J., J. Meng, Y. Ashkenazy, S. Havlin, and H. J. Schellnhuber, 2017: Network analysis reveals strongly localized impacts of El Niño. Proc. Natl. Acad. Sci. USA, 114, 7543–7548, https://doi.org/10.1073/pnas.1701214114.
Fan, J., J. Meng, Y. Ashkenazy, S. Havlin, and H. J. Schellnhuber, 2018: Climate network percolation reveals the expansion and weakening of the tropical component under global warming. Proc. Natl. Acad. Sci. USA, 115, E12128–E12134, https://doi.org/10.1073/pnas.1811068115.
Fan, J., J. Meng, J. Ludescher, Z. Li, E. Surovyatkina, X. Chen, J. Kurths, and H. J. Schellnhuber, 2022: Network-based approach and climate change benefits for forecasting the amount of Indian monsoon rainfall. J. Climate, 35, 1009–1020, https://doi.org/10.1175/JCLID210063.1.
Fountalis, I., A. Bracco, B. Dilkina, C. Dovrolis, and S. Keilholz, 2018: δMAPS: From spatiotemporal data to a weighted and lagged network between functional domains. Appl. Network Sci., 3, 21, https://doi.org/10.1007/s411090180078z.
Friedman, N., M. Goldszmidt, and A. Wyner, 1999: Data analysis with Bayesian networks: A bootstrap approach. Proc. 15th Conf. on Uncertainty in Artificial Intelligence, Stockholm, Sweden, Association for Computing Machinery, 196–205, https://dl.acm.org/doi/10.5555/2073796.2073819.
Gao, W., S. Oh, and P. Viswanath, 2018: Demystifying fixed k-nearest neighbor information estimators. IEEE Trans. Inf. Theory, 64, 5629–5661, https://doi.org/10.1109/TIT.2018.2807481.
Gilpin, A. R., 1993: Table for conversion of Kendall’s tau to Spearman’s rho within the context of measures of magnitude of effect for meta-analysis. Educ. Psychol. Meas., 53, 87–92, https://doi.org/10.1177/0013164493053001007.
Guez, O. C., A. Gozolchiani, and S. Havlin, 2014: Influence of autocorrelation on the topology of the climate network. Phys. Rev. E, 90, 062814, https://doi.org/10.1103/PhysRevE.90.062814.
Guinness, J., and M. Fuentes, 2016: Isotropic covariance functions on spheres: Some properties and modeling considerations. J. Multivar. Anal., 143, 143–152, https://doi.org/10.1016/j.jmva.2015.08.018.
Heitzig, J., J. F. Donges, Y. Zou, N. Marwan, and J. Kurths, 2011: Node-weighted measures for complex networks with spatially embedded, sampled, or differently sized nodes. Eur. Phys. J. B, 85, 38, https://doi.org/10.1140/epjb/e2011206787.
Hersbach, H., and Coauthors, 2018: ERA5 hourly data on single levels from 1959 to present. Copernicus Climate Change Service Climate Data Store, accessed 13 April 2022, https://doi.org/10.24381/cds.adbb2d47.
Hersbach, H., and Coauthors, 2019a: ERA5 monthly averaged data on pressure levels from 1959 to present. Copernicus Climate Change Service Climate Data Store, accessed 13 April 2022, https://doi.org/10.24381/cds.6860a573.
Hersbach, H., and Coauthors, 2019b: ERA5 monthly averaged data on single levels from 1959 to present. Copernicus Climate Change Service Climate Data Store, accessed 13 April 2022, https://doi.org/10.24381/cds.f17050d7.
Hlinka, J., D. Hartman, N. Jajcay, M. Vejmelka, R. Donner, N. Marwan, J. Kurths, and M. Paluš, 2014: Regional and interregional effects in evolving climate networks. Nonlinear Processes Geophys., 21, 451–462, https://doi.org/10.5194/npg214512014.
Hlinka, J., D. Hartman, N. Jajcay, D. Tomeček, J. Tintěra, and M. Paluš, 2017: Small-world bias of correlation networks: From brain to climate. Chaos, 27, 035812, https://doi.org/10.1063/1.4977951.
Horvath, S., 2011: Weighted Network Analysis: Applications in Genomics and Systems Biology. Springer, 421 pp.
Kittel, T., C. Ciemer, N. Lotfi, T. Peron, F. Rodrigues, J. Kurths, and R. V. Donner, 2021: Evolving climate network perspectives on global surface air temperature effects of ENSO and strong volcanic eruptions. Eur. Phys. J. Spec. Top., 230, 3075–3100, https://doi.org/10.1140/epjs/s11734021002699.
Koller, D., and N. Friedman, 2009: Probabilistic Graphical Models: Principles and Techniques. MIT Press, 1272 pp.
Kraskov, A., H. Stoegbauer, and P. Grassberger, 2004: Estimating mutual information. Phys. Rev. E, 69, 066138, https://doi.org/10.1103/PhysRevE.69.066138.
Kretschmer, M., J. Runge, and D. Coumou, 2017: Early prediction of extreme stratospheric polar vortex states based on causal precursors. Geophys. Res. Lett., 44, 8592–8600, https://doi.org/10.1002/2017GL074696.
Lahiri, S. N., 2003: Resampling Methods for Dependent Data. Springer, 374 pp.
Lam, C., 2020: High-dimensional covariance matrix estimation. Wiley Interdiscip. Rev.: Comput. Stat., 12, e1485, https://doi.org/10.1002/wics.1485.
Lang, A., and C. Schwab, 2015: Isotropic Gaussian random fields on the sphere: Regularity, fast simulation and stochastic partial differential equations. Ann. Appl. Probab., 25, 3047–3094, https://doi.org/10.1214/14AAP1067.
Ledoit, O., and M. Wolf, 2004: A well-conditioned estimator for large-dimensional covariance matrices. J. Multivar. Anal., 88, 365–411, https://doi.org/10.1016/S0047259X(03)000964.
Levin, K., and E. Levina, 2019: Bootstrapping networks with latent space structure. arXiv, 1907.10821v2, https://doi.org/10.48550/arXiv.1907.10821.
Ludescher, J., and Coauthors, 2021: Network-based forecasting of climate phenomena. Proc. Natl. Acad. Sci. USA, 118, e1922872118, https://doi.org/10.1073/pnas.1922872118.
Ludescher, J., A. Gozolchiani, M. I. Bogachev, A. Bunde, S. Havlin, and H. J. Schellnhuber, 2014: Very early warning of next El Niño. Proc. Natl. Acad. Sci. USA, 111, 2064–2066, https://doi.org/10.1073/pnas.1323058111.
Meinshausen, N., and P. Bühlmann, 2010: Stability selection. J. Roy. Stat. Soc. B, 72, 417–473, https://doi.org/10.1111/j.14679868.2010.00740.x.
Minsker, S., and X. Wei, 2017: Estimation of the covariance structure of heavytailed distributions. 31st Conf. on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, Association for Computing Machinery, 2860–2870, https://proceedings.neurips.cc/paper/2017/file/10c272d06794d3e5785d5e7c5356e9ffPaper.pdf.
Morales-Jimenez, D., I. M. Johnstone, M. R. McKay, and J. Yang, 2021: Asymptotics of eigenstructure of sample correlation matrices for high-dimensional spiked models. Stat. Sin., 31, 571–601, https://doi.org/10.5705/ss.202019.0052.
Mumford, J. A., S. Horvath, M. C. Oldham, P. Langfelder, D. H. Geschwind, and R. A. Poldrack, 2010: Detecting network modules in fMRI time series: A weighted network analysis approach. Neuroimage, 52, 1465–1476, https://doi.org/10.1016/j.neuroimage.2010.05.047.
Niu, X., J. Zhang, L. Zhang, Y. Hou, S. Pu, A. Chu, M. Bai, and Z. Zhang, 2019: Weighted gene co-expression network analysis identifies critical genes in the development of heart failure after acute myocardial infarction. Front. Genet., 10, 1214, https://doi.org/10.3389/fgene.2019.01214.
Onnela, J.-P., J. Saramäki, J. Kertész, and K. Kaski, 2005: Intensity and coherence of motifs in weighted complex networks. Phys. Rev. E, 71, 065103, https://doi.org/10.1103/PhysRevE.71.065103.
Paluš, M., D. Hartman, J. Hlinka, and M. Vejmelka, 2011: Discerning connectivity from dynamics in climate networks. Nonlinear Processes Geophys., 18, 751–763, https://doi.org/10.5194/npg187512011.
Papalexiou, S. M., 2018: Unified theory for stochastic modelling of hydroclimatic processes: Preserving marginal distributions, correlation structures, and intermittency. Adv. Water Resour., 115, 234–252, https://doi.org/10.1016/j.advwatres.2018.02.013.
Politis, D. N., J. P. Romano, and M. Wolf, 1999: Subsampling. Springer Series in Statistics, Springer, 347 pp.
Quian Quiroga, R., T. Kreuz, and P. Grassberger, 2002: Event synchronization: A simple and fast method to measure synchronicity and time delay patterns. Phys. Rev. E, 66, 041904, https://doi.org/10.1103/PhysRevE.66.041904.
Radebach, A., R. V. Donner, J. Runge, J. F. Donges, and J. Kurths, 2013: Disentangling different types of El Niño episodes by evolving climate network analysis. Phys. Rev. E, 88, 052807, https://doi.org/10.1103/PhysRevE.88.052807.
Rheinwalt, A., N. Marwan, J. Kurths, P. Werner, and F.-W. Gerstengarbe, 2012: Boundary effects in network measures of spatially embedded networks. Europhys. Lett., 100, 28002, https://doi.org/10.1209/02955075/100/28002.
Rheinwalt, A., B. Goswami, N. Boers, J. Heitzig, N. Marwan, R. Krishnan, and J. Kurths, 2015: Teleconnections in climate networks: A network-of-networks approach to investigate the influence of sea surface temperature variability on monsoon systems. Machine Learning and Data Mining Approaches to Climate Science, V. Lakshmanan et al., Eds., Springer, 23–33.
Romano, S., N. X. Vinh, K. Verspoor, and J. Bailey, 2018: The randomized information coefficient: Assessing dependencies in noisy data. Mach. Learn., 107, 509–549, https://doi.org/10.1007/s1099401756642.
Runge, J., P. Nowack, M. Kretschmer, S. Flaxman, and D. Sejdinovic, 2019a: Detecting and quantifying causal associations in large nonlinear time series datasets. Sci. Adv., 5, eaau4996, https://doi.org/10.1126/sciadv.aau4996.
Runge, J., and Coauthors, 2019b: Inferring causation from time series in Earth system sciences. Nat. Commun., 10, 2553, https://doi.org/10.1038/s41467019101053.
Scarsoglio, S., F. Laio, and L. Ridolfi, 2013: Climate dynamics: A network-based approach for the analysis of global precipitation. PLOS ONE, 8, e71129, https://doi.org/10.1371/journal.pone.0071129.
Schreiber, T., and A. Schmitz, 1996: Improved surrogate data for nonlinearity tests. Phys. Rev. Lett., 77, 635–638, https://doi.org/10.1103/PhysRevLett.77.635.
Shao, J., and D. Tu, 1995: The Jackknife and Bootstrap. Springer Series in Statistics, Springer, 517 pp.
Sporns, O., 2010: Networks of the Brain. The MIT Press, 424 pp.
Stein, M. L., 1999: Interpolation of Spatial Data. Springer Series in Statistics, Springer, 249 pp.
Tsonis, A. A., and P. J. Roebber, 2004: The architecture of the climate network. Physica A, 333, 497–504, https://doi.org/10.1016/j.physa.2003.10.045.
Tsonis, A. A., K. L. Swanson, and G. Wang, 2008: On the role of atmospheric teleconnections in climate. J. Climate, 21, 2990–3001, https://doi.org/10.1175/2007JCLI1907.1.
Vallis, G. K., 2011: Climate and the Oceans. Princeton University Press, 248 pp.
von Luxburg, U., 2007: A tutorial on spectral clustering. Stat. Comput., 17, 395–416, https://doi.org/10.1007/s112220079033z.
Wiedermann, M., J. F. Donges, J. Kurths, and R. V. Donner, 2015: Spatial network surrogates for disentangling complex system structure from spatial embedding of nodes. Phys. Rev. E, 93, 042308, https://doi.org/10.1103/PhysRevE.93.042308.
Yamasaki, K., A. Gozolchiani, and S. Havlin, 2008: Climate networks around the globe are significantly affected by El Niño. Phys. Rev. Lett., 100, 228501, https://doi.org/10.1103/PhysRevLett.100.228501.