On Variability due to Local Minima and K-Fold Cross Validation

Caren Marzban, Applied Physics Laboratory and Department of Statistics, University of Washington, Seattle, Washington

Jueyi Liu, Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts

Philippe Tissot, Conrad Blucher Institute, Texas A&M University-Corpus Christi, Corpus Christi, Texas


Abstract

Resampling methods such as cross validation or bootstrap are often employed to estimate the uncertainty in a loss function due to sampling variability, usually for the purpose of model selection. In models that require nonlinear optimization, however, the existence of local minima in the loss function landscape introduces an additional source of variability that is confounded with sampling variability. In other words, some portion of the variability in the loss function across different resamples is due to local minima. Given that statistically sound model selection is based on an examination of variance, it is important to disentangle these two sources of variability. To that end, a methodology is developed for estimating each, specifically in the context of K-fold cross validation, and neural networks (NN) whose training leads to different local minima. Random effects models are used to estimate the two variance components—that due to sampling and that due to local minima. The results are examined as a function of the number of hidden nodes, and the variance of the initial weights, with the latter controlling the “depth” of local minima. The main goal of the methodology is to increase statistical power in model selection and/or model comparison. Using both simulated and realistic data, it is shown that the two sources of variability can be comparable, casting doubt on model selection methods that ignore the variability due to local minima. Furthermore, the methodology is sufficiently flexible so as to allow assessment of the effect of other/any NN parameters on variability.

© 2022 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Caren Marzban, marzban@stat.washington.edu


1. Introduction

Analysis of variance (ANOVA) is important for a wide range of problems. For example, in deciding which of two models is better in terms of some measure of performance, the variability of that measure is a quantity that plays a prominent role at all levels of analysis, from a simple comparison of boxplots, to performing a two-sample test of statistical significance. In general, larger variability is associated with reduced statistical power. In other words, even if there is a difference between the performance measures for the two models, larger variability renders it more difficult to detect. Indeed, for sufficiently large variability, the comparison of models becomes inconclusive, at best. Therefore, it is important to account for as many sources of variability as possible for the purpose of effectively reducing the variability. Although there exist other arenas in which an accurate assessment of variability is important, here the focus will be on model selection and/or model comparison.

Although the sources of variability are numerous, two that often attract attention are sampling variability and variability due to measurement error (Montgomery 2009; Fuller 1987; Buonaccorsi 2010). However, in problems whose models involve nonlinear optimization, there exist other sources of variability, hereinafter referred to as computational variability. One example is variability due to the existence of local minima. Although its existence is well known, its consequences are often not taken into account, at least not systematically. This source of variability is often a consequence of the sensitivity of the optimization routine with respect to the initial values of the parameters being optimized. In addition, there are indications that underspecification may also contribute to the existence of local minima (D’Amour et al. 2020). In neural networks (NN), for example, the parameters being optimized are the weights, and their estimation involves initializing them randomly. As a result, repeated training can lead to different local minima of the loss function.

Cross validation is often employed to assess the extent to which a model performs beyond the dataset on which it is trained. In one of its many variations, the so-called K-fold cross validation, a sample is partitioned into K nonoverlapping segments, a model is trained on K − 1 of the segments, and then it is validated on the unused segment. The sample mean across the K validation performance measures, for example, mean-square error (MSE), is then taken to estimate the expected test error, also known as prediction error. In model selection, the model with the lowest estimated prediction error is selected.

The above model-selection criterion, however, ignores the fact that the estimated prediction error is only an estimate of the true (or expected) prediction error. A number of attempts have been made to build a confidence interval for the latter, but the results are preliminary, at best, because of the lack of independence between the various K − 1 training segments (Hastie et al. 2001). There is even evidence that all estimators of variance in K-fold cross validation are biased (Bengio and Grandvalet 2004). A quantity that does play an important role in building a confidence interval is the sample variance of the K validation performance measures (James et al. 2015). It is, therefore, important to understand the sources of this variability.

The distribution of local minima has been studied in a number of articles. For example, in the limit of a large number of hidden nodes, certain distributional models for the training error and the weights suggest that the distribution of local minima tend to cluster around global minima (Choromanska et al. 2015). It has also been shown that the majority of critical points in NN optimization are in fact not local minima but saddle points (Dauphin et al. 2014). Both of these works are theoretical in nature in that they examine theoretical limits, for example, large number of hidden nodes, in the former, and very high-dimensional optimization, in the latter. The scope of the present work is entirely practical, and for that reason the variance of performance measure due to local minima is estimated empirically, by retraining an NN starting from different and random initial weights.

In the most common implementation of K-fold cross validation, the initial weights of each of the K NNs are set randomly. (The alternative, initializing all K NNs from a single set of weights, is undesirable because it does not allow an exploration of weight space.) Consequently, the variability across the K folds confounds the variability due to sampling with that due to initialization. In this paper, the latter is denoted σ²_IW [initialization of weights (IW)]. In what may seem like confusing nomenclature, the variability due to sampling is denoted σ²_CV [cross validation (CV)], and the total variability observed in cross validation is the sum of these two components.

Given this joint contribution to the total variability, it is important that the two components are estimated using a multivariate model. Here, a methodology is developed for decomposing the total variability in the loss function (both training and validation) into two components representing variability due to sampling and due to local minima. The methodology is applied to simulated as well as real data, and it is shown that the two sources of variability can be comparable in magnitude. That finding highlights the importance of taking into account both sources of variability when performing model selection and/or comparison.

The importance of variance

Consider the central limit theorem (CLT), according to which, under relatively general conditions, the quantity
$$ z = \frac{\bar{y} - \mu}{\sigma/\sqrt{n}} $$
has a standard normal distribution, where ȳ denotes the mean of a sample of size n drawn from a distribution with mean and variance given by μ and σ², respectively. Let y denote some performance measure with an approximately normal distribution [e.g., log(MSE)], in which case μ denotes the true (population) mean of that performance measure. Then, the CLT becomes the central foundation for comparing different models—for example, NNs with different numbers of hidden nodes—in terms of their performance. The appearance of σ in the standard variate z implies that the confidence interval for μ becomes wider, and the p value of a hypothesis test becomes larger, as σ increases. Consequently, an increase in σ is associated with a decrease in statistical power (i.e., the probability of detecting an effect, if it truly does exist). Said differently, a large σ can lead to an inability to distinguish between models. Therefore, it is important to take appropriate measures to reduce σ.
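To make the role of σ explicit, the two-sided confidence interval for μ implied by the CLT can be written in its standard textbook form (a generic expression, not specific to this paper):
$$ \bar{y} \;\pm\; z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} . $$
For fixed n, the width of this interval grows linearly with σ; the same σ appears in the denominator of the test statistic z, which pushes p values upward.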

Note that σ² denotes the population variance and not the sample variance. As such, it cannot be reduced through sampling techniques. The only way it can be “reduced,” effectively, is through its decomposition into other, smaller components. For example, if σ² = σ²_1 + σ²_2 + ⋯, then each variance component σ²_i, representing the variance of an identifiable source of variability in y, will be smaller than σ². Of relevance to the present work, σ²_1 and σ²_2 may measure variability due to sampling and due to initialization, respectively. With smaller variance components, the comparison of different models in terms of each component is guaranteed to have higher statistical power.

It is extremely important to also note that statistical power is the probability of detecting an effect, if it exists. In other words, the aforementioned guarantee is valid only in the long run, across multiple applications of inference, across multiple samples, that is, only in expectation. Increased power cannot be observed in an application of inference (confidence intervals and hypothesis tests) to a single sample. As such, given that in this paper the proposed methodology is applied to a single sample, increased power cannot be demonstrated. Indeed, demonstration of increased power is not the goal of this work. The main goal of this work is to demonstrate that variability due to local minima and variability due to sampling contribute comparably to the total variance observed in cross validation. It then follows that separating the two variance components leads to a “reduction” of the total variance and, therefore, to higher statistical power.

2. Data

The term data may refer to several completely different quantities. On the one hand, it can refer to data in the traditional sense of the word, generally characterized by the presence of experimental error in the observations. In this paper, references to simulated data and real data are both instances of traditional data. Each of these datasets will be used in performing several experiments involving NNs. The results of the experiments constitute data as well, but data of a different kind. The phrase computer data has been proposed to describe this type of data (Sacks et al. 1989; Santner et al. 2003; Fang et al. 2006; Welch et al. 1992). It is characterized by the absence of experimental error, because performing a computer experiment repeatedly leads to identical results. As discussed below, it is such computer data that are employed to estimate the aforementioned variance components. Henceforth, and to minimize confusion, the term data will be used to describe traditional data, and the phrase computer data will refer to the MSE values generated from the experiments on the NN.

a. Simulated and real data

The simulated dataset involves 101 cases on one predictor x and one response y. It is generated by varying the input of an NN with six hidden nodes on one hidden layer from −1 to +1, in 100 steps and recording the output. The weights of the NN are a random sample taken from a uniform distribution between −10 and 10. The true/underlying relationship is shown in Fig. 1. Then, the simulated dataset is generated by adding to the y values a random sample taken from a normal distribution with μ = 0 and σ = 1; these data are displayed as the circles in Fig. 1. This way of generating data assures that the underlying function is one that is learnable by an NN with six hidden nodes.
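For concreteness, the following Python sketch illustrates one way to generate data of this kind. The activation function (tanh), the random seed, and the exact weight layout are assumptions made for illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

H = 6                                       # hidden nodes of the generating NN
x = np.linspace(-1.0, 1.0, 101)             # 101 cases on one predictor

# Random weights from Uniform(-10, 10): input->hidden and hidden->output.
w_in = rng.uniform(-10, 10, size=(H, 2))    # [weight, bias] per hidden node
w_out = rng.uniform(-10, 10, size=H + 1)    # hidden->output weights plus bias

# Underlying (noise-free) function represented by the NN (tanh is assumed).
hidden = np.tanh(np.outer(x, w_in[:, 0]) + w_in[:, 1])   # shape (101, H)
f = hidden @ w_out[:H] + w_out[H]

# Simulated data: add N(0, 1) noise to the underlying function.
y = f + rng.normal(0.0, 1.0, size=x.shape)
```

Because the generating function is itself an NN with six hidden nodes, an NN of that size is, by construction, capable of learning the underlying relationship.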

Fig. 1. The simulated data (circles), generated by the addition of normally distributed error to the function shown by the solid line. The function itself is an example of a function represented by an NN with six hidden nodes.

The real dataset concerns the correction of a digital elevation model for a coastal marsh, estimated from a 3D point cloud measured by a terrestrial lidar survey (Nguyen et al. 2019). The aim of the NN is to predict the difference between the vertical measurements estimated from the point cloud (which covers the full study site) and more accurate but sparse measurements based on real-time kinematic ground surveys. The input data are composed of 12 features: 6 sensor-specific inputs and 6 other features computed from the point cloud. Examples of the former include range, median of the amplitude, and waveform deviation; the latter include point density and surface roughness, which are indicative of the land cover. The number of cases is 607.

b. Computer data

The computer data necessary for estimating the variance components are generated as follows:

  1. For an NN with a given number of hidden nodes H, the initial weights are sampled from a zero-mean normal distribution with standard deviation σw (whose value is addressed below). Training is initiated on the training set (simulated and real), and it is continued until convergence has occurred.1 The values of MSE on the training and validation sets are recorded.

  2. Step 1 is repeated for all K = 10 training sets in K-fold cross validation, with the value of MSE recorded for both the training and the validation sets. The same set of initial weights is used for all K = 10 runs.

  3. Steps 1 and 2 are repeated for W = 11 different initial weights. The resulting computer data can be organized in the form of the K×W matrix shown in Table 1, where only the training MSE values are shown. A similar matrix exists (not shown) for the validation MSE values.

  4. Steps 1–3 are performed for a range of H values (0, 2, 4, 6, 8, and 10 for the simulated dataset, and 0, 4, 8, and 12 for the real dataset).2 Varying H allows one to examine how the histogram of local minima (i.e., the histogram of the training MSE values) depends on the complexity of NN. The histogram of the validation MSE values can also be monitored for the purpose of model selection.

Table 1. An example of the computer data generated through the steps described in section 2b. These are training MSE values for K = 10 folds and W = 11 different initial weights, for an NN with H = 6 hidden nodes.

Note that step 2 is contrary to how traditional cross validation is performed, wherein the weights are randomly initialized for each of the K folds. In the present design, random initialization of the weights occurs only in step 3. An advantage of this design is that traditional cross validation is included as a special case: the diagonal elements of Table 1 are the MSE values that would be generated in traditional cross validation. This design is necessary for decomposing the total variability into its components.
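The looping structure of steps 1–3 can be summarized in the Python sketch below. The train_nn function is a hypothetical placeholder (in the actual experiments each NN is trained with conjugate gradient until convergence; see footnote 1), so the sketch illustrates the design of the K × W table rather than the authors' code.

```python
import numpy as np

K, W = 10, 11        # number of folds and number of initial-weight draws
sigma_w = 0.1        # std dev of the zero-mean normal used to draw initial weights

def train_nn(init_weights, x_trn, y_trn, x_vld, y_vld):
    # Stand-in for NN training to convergence from `init_weights` (hypothetical
    # signature).  Here it returns the MSE of a constant (mean) predictor so
    # that the sketch runs end to end.
    trn_mse = np.mean((y_trn - y_trn.mean()) ** 2)
    vld_mse = np.mean((y_vld - y_trn.mean()) ** 2)
    return trn_mse, vld_mse

def computer_data(x, y, n_weights, rng):
    """Return K x W tables of training and validation MSE (cf. Table 1)."""
    folds = np.array_split(rng.permutation(len(y)), K)
    trn_mse = np.empty((K, W))
    vld_mse = np.empty((K, W))
    for j in range(W):                                    # step 3: W initializations
        w0 = rng.normal(0.0, sigma_w, size=n_weights)
        for i in range(K):                                # step 2: same w0 for all folds
            vld_idx = folds[i]
            trn_idx = np.concatenate([folds[m] for m in range(K) if m != i])
            trn_mse[i, j], vld_mse[i, j] = train_nn(
                w0, x[trn_idx], y[trn_idx], x[vld_idx], y[vld_idx])
    return trn_mse, vld_mse

# Traditional K-fold cross validation (one random initialization per fold)
# corresponds to the diagonal of these tables, e.g., np.diag(trn_mse).
```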

The above four steps are sufficient for the development of models that estimate the variability of the MSE values due to sampling and due to local minima. However, given that σw is a user-specified parameter in NN development, steps 1–4 are repeated for two different values of σw. Ideally, to make a deeper local minimum more likely, one should increase the density of points sampled in weight space. That density can be increased either by increasing the number of sampled points, or by decreasing σw, or both. For a given σw, however, the number of points required to maintain a given density grows with the dimensionality of weight space and, therefore, with H, making it difficult to compare the results across different H values. To avoid that complexity, here the number of points in weight space is fixed at W = 11, and only σw is varied. To assure that different σw values do lead to NNs that occupy distinct regions in weight space, with distinctly different MSE values, the histogram of MSE is generated across 500 random initializations, at two different values of σw (0.1 and 10). The positions of these two histograms convey which of the two σw values is associated with generally deeper local minima. It is important to emphasize that varying σw is not necessary for the estimation of the two variance components; the 500 initializations with different values of σw are examined here only for the purpose of better understanding how the proposed methodology depends on that user-specified quantity.

3. Method

As explained previously, the computer data are employed for the estimation of the variance components. Therefore, the experimental design adopted here includes one response variable MSE and two discrete factors denoted CV and IW, which, respectively, take values i = 1, 2, …, K and j = 1, 2, …, W. Here, K = 10 is the number of partitions in K-fold cross validation and W = 11 is the number of different NN initializations. Consequently, the computer data generated in this design can be organized in the form of a K×W matrix with each element containing the corresponding MSE value (e.g., Table 1).

The model for the computer data is
$$ \mathrm{MSE}_{ij} = \mu + \mathrm{CV}_i + \mathrm{IW}_j + \epsilon_{ij}, \qquad (1) $$
where the response MSE_ij denotes the value of MSE when the factors CV and IW take values i and j, respectively. The term μ denotes the true mean of the response across all values of the factors, and ϵ accounts for any factor that has not been included in the model.3 We point out that this model is not a traditional regression model mapping the factors CV and IW to MSE, because that would require the factors to be quantitative variables. Here, although the factors take integer values, those values are simply labels; the factors are in fact categorical variables. The terms CV_i and IW_j that appear on the right-hand side of this model denote the conditional mean of the response MSE at different levels of the factors CV and IW. Models of this type are often called linear models (or ANOVA-type models) (Montgomery 2009).

In one realization of linear models the factors on the right-hand side of the model are fixed (nonrandom) quantities, with the exception of the error term, which is assumed to be a zero-mean, normally distributed random variable with variance σ²_ε. Such models are called fixed-effects models, and statistical tests exist for testing whether any of the factors, or a given factor, has an effect on the response. In fixed-effects models, the results of the tests (e.g., reject or not reject the null hypothesis) apply to only the specific values/levels of the factors appearing in the computer data. For example, if the CV factor is found to have a small p value, all one can conclude is that there is evidence that the true mean of the response varies across the specific values/levels taken by that factor in the computer data. To generalize that conclusion to all possible values/levels, one must treat the factors as random variables.

When all of the factors on the right-hand side (except μ) are treated as random variables, then the model is called a random-effects model. The simplest probability model for CV and IW is that they are zero-mean, normally distributed variables, with corresponding variances satisfying
$$ \sigma_{\mathrm{MSE}}^2 = \sigma_{\mathrm{CV}}^2 + \sigma_{\mathrm{IW}}^2 + \sigma_{\epsilon}^2 . \qquad (2) $$
The various terms on the right side of this expression are called variance components. It is estimates of these variance components that quantify the contribution of the two sources of variability to the total variability of MSE. Two common estimators for the variance components are called the ANOVA estimator, and the restricted-maximum-likelihood (REML) estimator (Montgomery 2009); the former is the simpler of the two, but it has the defect of sometimes leading to negative values for the estimates of the variance components—an undesirable occurrence. For this reason, the latter estimator is used in the present study.4
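For illustration, the simpler ANOVA (method-of-moments) estimator mentioned above can be written directly in terms of the K × W table. The sketch below is not the estimator used in the paper (which is REML, typically obtained from a mixed-model package such as lme4 in R or MixedLM in Python's statsmodels), but it makes the decomposition concrete.

```python
import numpy as np

def anova_variance_components(mse):
    """ANOVA (method-of-moments) estimates of the variance components for a
    two-way random-effects model without replication, applied to a K x W
    table of (possibly transformed) MSE values such as Table 1."""
    K, W = mse.shape
    grand = mse.mean()
    row_means = mse.mean(axis=1)            # CV (fold) means
    col_means = mse.mean(axis=0)            # IW (initialization) means

    ms_cv = W * np.sum((row_means - grand) ** 2) / (K - 1)
    ms_iw = K * np.sum((col_means - grand) ** 2) / (W - 1)
    resid = mse - row_means[:, None] - col_means[None, :] + grand
    ms_err = np.sum(resid ** 2) / ((K - 1) * (W - 1))

    var_cv = (ms_cv - ms_err) / W           # sampling (CV) component
    var_iw = (ms_iw - ms_err) / K           # local-minima (IW) component
    return {"cv": var_cv, "iw": var_iw, "eps": ms_err}
```

Unlike REML, this estimator can return negative component estimates, which is the defect noted above.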

Given that the variance components are population parameters, one can build confidence intervals for them. Although analytic formulas exist for the confidence interval when the ANOVA estimator is used, analytic results for the REML-based confidence intervals are not known; however, approximate formulas do exist (Montgomery 2009), and they are used here. Also, given that the main focus of the present work is the relative magnitude of the two variance components, the final results are reported in terms of the ratio σ²_IW/σ²_CV. The confidence interval for that ratio is straightforward to compute analytically, because the ratio of the respective sample variances follows the F distribution (Montgomery 2009).
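In generic form, if s²_1 and s²_2 are independent sample variances with ν_1 and ν_2 degrees of freedom, the standard F-based interval for the corresponding population-variance ratio is
$$ \frac{s_1^2/s_2^2}{F_{1-\alpha/2}(\nu_1,\nu_2)} \;\le\; \frac{\sigma_1^2}{\sigma_2^2} \;\le\; \frac{s_1^2}{s_2^2}\, F_{1-\alpha/2}(\nu_2,\nu_1) ; $$
the degrees of freedom appropriate to the present design follow Montgomery (2009).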

The reliability of the confidence intervals requires assuming that the response MSE has a normal distribution with a constant variance within every possible combination of i and j in Eq. (1). To that end, a Box–Cox transformation is performed on the MSE values to maximize normality. Then, Q–Q plots are employed to check the validity of the assumptions.
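A minimal sketch of this check is given below; for simplicity it pools all entries of a table, which is a simplification of the per-combination assumption stated above, and the Box–Cox transform assumes the MSE values are positive.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def check_normality(mse_table):
    """Box-Cox transform the MSE values (to improve normality) and produce a
    Q-Q plot of the transformed values against the normal distribution."""
    flat = np.asarray(mse_table).ravel()
    transformed, lam = stats.boxcox(flat)      # Box-Cox requires positive values
    fig, ax = plt.subplots()
    stats.probplot(transformed, dist="norm", plot=ax)
    ax.set_title(f"Q-Q plot of Box-Cox transformed MSE (lambda = {lam:.2f})")
    return transformed, lam
```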

4. Results

As mentioned in section 2, although the estimation of the variance components requires the computer data generated by the above experimental design, the training of the NNs requires data as well. Recall that two datasets will be examined: 1) a simulated, bivariate dataset constructed such that the true/underlying relationship between the predictor and response is learnable by an NN with H = 6 and 2) a real, multivariate dataset consisting of 12 predictors and 1 response variable. The statistics of these two datasets will not be presented, because the focus of the present study is the modeling of the computer data on MSE and the factors CV and IW. Suffice it to say that all of the usual preprocessing steps for the development of NNs are performed (Masters 1993); specifically, the inputs to the NNs are log transformed to assure bell-shaped histograms and then are standardized to have 0 mean and a standard deviation of 1. Also, collinearity is examined to assure no two predictors are highly correlated. As for the computer data, Q–Q plots are used to confirm that the MSE values satisfy the assumptions necessary for the construction of confidence intervals.

As explained in section 3, two values of σw are selected to examine the sensitivity of the results on the depth of the local minima. Figure 2 shows the histogram of the training MSE values across 500 initializations of NNs with H = 12 trained on the real data. This pattern is observed for H = 4 and H = 8, as well (not shown). It is evident that the smaller value of σw is associated with distinctly lower values of MSE. The extent to which σw controls the overall depth of the local minima is addressed in section 5.

Fig. 2. The histogram of MSE for 500 NNs initialized with random weights drawn from a zero-mean normal distribution with σw = 0.1 (left bars) and σw = 10 (right bars).

Before estimating the variance components in Eq. (2), it is instructive to examine the computer data on MSE, CV, and IW, for different values of H and of the standard deviation σw of the initial weights. To be specific, consider the computer data on training MSE values shown in Table 1, corresponding to the simulated dataset. The marginal medians of this matrix convey useful information; the histogram of the column medians conveys information about the distribution of local minima; similarly, the histogram of the row medians displays information about the sampling variability of MSE. Both of these histograms pertain to the training MSE values; there are two analogous histograms for the validation MSE values. Then, summarizing each histogram with a boxplot leads to four boxplots denoted trn_CV, trn_IW, vld_CV, and vld_IW, for every experiment performed here (i.e., every combination of H and σw).
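A minimal sketch of this summary, assuming the K × W training and validation MSE tables from section 2b are available as NumPy arrays:

```python
import numpy as np
import matplotlib.pyplot as plt

def marginal_boxplots(trn_mse, vld_mse, ax=None):
    """Summarize K x W tables of training and validation MSE by the four
    boxplots described in the text (rows = CV folds, columns = IW draws)."""
    summaries = {
        "trn_CV": np.median(trn_mse, axis=1),   # row medians: sampling variability
        "trn_IW": np.median(trn_mse, axis=0),   # column medians: local minima
        "vld_CV": np.median(vld_mse, axis=1),
        "vld_IW": np.median(vld_mse, axis=0),
    }
    ax = ax or plt.gca()
    ax.boxplot([np.log(v) for v in summaries.values()])   # log(MSE), as in Fig. 3
    ax.set_xticklabels(summaries.keys())
    return summaries
```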

Figure 3 shows all of the boxplots for different values for the number of hidden nodes H, and the standard deviation of the initial weights σw. To enhance visual acuity, the logarithm of MSE is shown. Evidently, for a given value of σw, the training MSE values (trn_CV and trn_IW) generally decrease as H increases. This finding is consistent with the expectation that a more complex NN leads to lower values of training MSE. For small σw (i.e., 0.1), associated with deeper local minima (see Fig. 2), the validation boxplots are consistent with the commonly expected “U” pattern, suggesting that larger values of H may lead to overfitting. The true/known value of H = 6 is also consistent with the observed pattern. For σw = 10, the validation boxplots do not follow the expected U pattern; although this is somewhat surprising, there has been recent evidence that such expectations are unwarranted when gradient methods are invoked for NN training (Sankararaman et al. 2019).

Fig. 3. Boxplots summarizing the MSE of NNs trained and validated on the simulated data shown in Fig. 1, with different standard deviations of the initial weights σw.

Of particular interest for the focus of this study are the spreads of the boxplots shown in Fig. 3. Intuitively, it is these spreads that approximate the variance components σ²_CV and σ²_IW that one seeks to estimate. Although the estimation of these variance components involves all of the MSE values in the computer data (e.g., Table 1) and not the marginal medians only, Fig. 3 already allows one to qualitatively compare the variability of MSE due to sampling with that due to local minima. For example, consider the validation boxplots in Fig. 3; first, note that for larger values of H, say 8 and 10, the large spread of the boxplots precludes any meaningful comparisons. However, for smaller H values, including the true value H = 6, the blue boxplots for σw = 10 are generally wider than those for σw = 0.1. Indeed, when the initial weights are sampled from a wider distribution, sampling variability is generally comparable to that due to local minima.

Figure 4 shows the analogous results for the real data. First, focusing on the training boxplots (top row), it can be seen that a larger variance for the initial weights is generally associated with larger MSE values. This observation is consistent with the finding in Fig. 2, based on 500 initializations (and no cross validation).

Fig. 4. Boxplots summarizing the MSE of NNs trained and validated on the real, multivariate data described in section 2.

These results can be employed for two purposes—1) model selection and 2) the comparison of variability due to sampling and local minima. For the former, it is evident that the “best” model is that with H = 0 (i.e., a linear regression model), because the lowest validation errors are generally associated with H = 0; and that conclusion is valid for both choices of σw, and regardless of the source of variability.

Another indication of overfitting—one that involves σw—is also evident in these figures. Consider the H = 12 boxplots: the training MSE values for σw = 0.1 are significantly lower than those for σw = 10. Again, this is consistent with Fig. 2. However, the corresponding validation MSE values follow the reverse order; that is, the σw = 0.1 boxplots are significantly higher than those for σw = 10. This implies that a smaller σw leads to overfitting. This, too, may seem surprising, but it simply highlights the importance of σw in NN training. This behavior can be inferred for all H values by noting that the larger slope in the training boxplots for σw = 0.1, as compared with that for σw = 10, is accompanied by a correspondingly larger slope in the validation boxplots.

As for the comparison of the two sources of variability, Fig. 4 implies that variability due to sampling and variability due to initialization are comparable in magnitude, regardless of σw. This finding is the reason why it is important to take into account both sources of variability.

All of the above conclusions are based on qualitative comparisons of boxplots that summarize the computer data (e.g., Table 1). Moreover, only the row and column marginals of the computer data are used in the boxplots. The random effects models described in section 3 can be employed to estimate the variance components rigorously, and from all of the elements in the table (not the marginals only). Also, given that the main focus of this work is to compare the two sources of variability, the results are shown in terms of the ratio σ²_IW/σ²_CV.

Figure 5 shows the variance ratio σ²_IW/σ²_CV inferred from the computer data generated from the simulated data. The approximate 95% confidence intervals are also shown.5 We also point out that in such figures the variance ratio at H = 0 is identically zero, because an NN with H = 0 (i.e., linear regression) has a strictly convex error function, and so there exists a unique global minimum.

Fig. 5. The ratio of variance components σ²_IW/σ²_CV estimated from the computer data on MSE generated from training NNs on the simulated data, for different values of H. Also shown are the approximate 95% confidence intervals. The (top) and (bottom) panels are for σw = 0.1 and 10, respectively.

A glance at the range of the y axis in the two panels in Fig. 5 reveals that a larger spread in the initial weights of the NN leads to larger variance ratios. More specifically, for σw = 0.1, variability due to local minima is less than that due to sampling. For σw = 10, however, the pattern is generally reversed for all values of H, with variability due to local minima dominating sampling variability by as much as an order of magnitude. Indeed, in such extreme limits, one would be justified in performing model selection based on variability due to local minima only. The apparent trend in the bottom panel is addressed in section 5; to that end, we note that the confidence intervals in the bottom panel are generally wider than those in the top panel.

The analogous results for the experiments based on real data are shown in Fig. 6. As in the case of the simulated data, a larger σw is generally associated with a larger variance ratio. The one possible exception is when H = 4, in which case the variance ratio is consistent with the value 1, for both values of σw. Further analysis of these results is presented in the next section.

Fig. 6. As in Fig. 5, but estimated from the computer data on MSE generated from training NNs on the real data.

5. Conclusions and discussion

The local minima in the loss function landscape of NNs are examined, and it is shown that variability due to local minima can be comparable to, or even larger than, that due to sampling in K-fold cross validation. This implies that both sources of variability must be taken into account when performing model selection and/or comparison. A methodology is developed for estimating the contribution of the two sources of variability. Moreover, the methodology is sufficiently general to allow for the estimation of other sources of computational variability.

One may question the practical utility of estimating variance components. After all, at no point in this work is a conclusion that would have been arrived at through traditional cross validation found to be reversed upon the examination of local minima. So, what is to be gained from decomposing total variance into its components? The answer to that question is derived from the discussion in section 1. Specifically, the payoff from examining both sources of variability is in higher statistical power, and that is not something that can be demonstrated on a single sample. Indeed, the basic equation for the decomposition of variance, Eq. (2), pertains to population variances and not sample variances. Yet, it is only the latter that have been examined in all of the figures presented here. In other words, the decomposition of variance ought to be commonly practiced because, in the long run, it is apt to lead to the selection of the correct model.

Given that the payoff from the proposed methodology is not immediate, the following guidance is provided, from the simplest to the most complex:

  1. The variance of the performance measure across cross validation ought to be taken into account when selecting or comparing models.

  2. That variance ought to be decomposed into components that separately gauge variability due to sampling and due to local minima. This decomposition is guaranteed to lead to smaller variance components, and therefore, higher statistical power. To that end, it is recommended that computer data tables (e.g., Table 1), be produced.

  3. The row and column marginals of the computer data table allow for qualitative comparisons of the sources of variability.

  4. Random effects models of the computer data allow for quantitative comparisons of the variance components.

In the real data examined here, two values of σw are identified as having significantly different values of MSE, and the behavior of the variance components, as a function of H, is examined at those values. In other words, the user-specified σw is used to “control” the depth of the local minima. Specifically, it was found that the two values σw = 0.1 and 10 correspond to deeper and shallower local minima, respectively. The study of the precise relationship between σw and the depth of the local minima is beyond the scope of this work, but it is important to note that the above guidance applies regardless of that specific relationship. Said differently, to estimate the two variance components, it is not necessary to vary σw at all. That said, it makes for good practice to do so.

As mentioned in footnote 3, the ϵ_ij term in the random effects model in Eq. (1) can also be interpreted as an interaction term. In that case, ϵ_ij measures the extent to which the effect of one factor (say, CV) on the response MSE varies with the other factor (IW). In ANOVA-type models like Eq. (1), two approaches are generally followed: 1) use the degrees of freedom associated with the interaction term to perform inference on the main effects of each factor, or 2) treat the ϵ term as an interaction, but then give up the ability to perform inference (Montgomery 2009). In the present work, the focus has been on the main effects of CV and IW, and the ability to perform inference on them. Another reason why the first approach is followed here is that, according to the hierarchy principle, interaction terms generally have a much smaller magnitude than main effects. This phenomenon is generally borne out by several “principles”: the principle of hierarchical ordering, the principle of effect sparsity, and the principle of effect hierarchy; see Montgomery (2009, pp. 192, 230, 272, 314, and 329) and Li et al. (2006, 33–34).

Several extensions of the present work are possible. For instance, among the sources of computational variability, another quantity that is often set by the user is the number of training epochs. It is commonly expected that increasing the number of training epochs may lead to a deeper local minimum. A natural question that then arises is, what proportion of the total variance is due to the number of training epochs? This suggests an extension of the model in Eq. (2) wherein there is also an “epoch” factor on the right side, as sketched below. This extension of the model, as well as other generalizations, will be examined elsewhere.
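For illustration only, one way such an extension could be written (the epoch factor EP, its indexing, and its treatment as an independent random effect are assumptions, not the authors' specification) is
$$ \mathrm{MSE}_{ijk} = \mu + \mathrm{CV}_i + \mathrm{IW}_j + \mathrm{EP}_k + \epsilon_{ijk}, \qquad \sigma_{\mathrm{MSE}}^2 = \sigma_{\mathrm{CV}}^2 + \sigma_{\mathrm{IW}}^2 + \sigma_{\mathrm{EP}}^2 + \sigma_{\epsilon}^2 . $$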

Some of the panels in Figs. 5 and 6 display trends in the way σ²_IW/σ²_CV varies with H. For example, the bottom panel in Fig. 5 suggests that for the simulated data, when σw is large, the variance ratio decreases with H. By contrast, that trend appears to be reversed in Fig. 6, where real data are examined. Such apparent patterns must be tempered by the confidence intervals. Indeed, given the relatively wide confidence intervals in these figures, it is possible that the apparent trends are not statistically significant. One of the causes of the wide confidence intervals is the choice of the REML estimator for the variance components, and the approximations employed here for the resulting confidence intervals. It may be worthwhile to examine other estimators and alternative means of computing confidence intervals, for example those proposed by Vanwinckelen and Blockeel (2012).

The formalism developed here is guided by a comparison of variability due to local minima with that due to sampling in 10-fold cross validation. K is set to 10 because there is evidence that 10-fold cross validation constitutes an acceptable trade-off between bias and variance (Hastie et al. 2001). However, most recently, evidence has begun to gather that the usual arguments suggesting that variance increases with K may be flawed (Bengio and Grandvalet 2004; Burman 1989; Zhang and Yang 2015). As such, it will be interesting to extend the analysis performed here to other values of K to see whether the relationship between the sources of variability depends on K. Indeed, it will be interesting to compare variability due to local minima with that due to other resampling methods, for example, bootstrap.

Here, the local minima are all minima in the loss function of NNs, trained with a conjugate gradient method. Given that the loss function landscape is a function of both the model (i.e., the NN) and the optimization algorithm, another useful extension of this work will be to examine other models and/or optimization algorithms, for example, root-mean-square propagation (RMSProp; Tieleman and Hinton 2012), commonly used in the training of deep NNs. The effect of regularization (e.g., weight decay) may also be worth pursuing.

It is not immediately clear how much of the substantive conclusions in this work extend to deep learning. For example, is it likely that in some applications of deep learning, the variability due to local minima may be as large as or larger than that due to sampling? We conjecture that the answer is yes simply because deeper NNs are more likely to have more local minima. That said, there is nothing in the proposed methodology itself that prevents it from being applied to deep learning.

1. All NNs are trained with conjugate gradient (Bishop 1996), with no regularization (e.g., weight decay), and convergence is defined as the point at which the algorithm is unable to reduce MSE by a factor of 10⁻⁸ at a step.

2. An NN with H = 0 is equivalent to a linear regression model.

3. The ϵ_ij term can also be interpreted as an interaction term; see section 5.

4. It turns out that the final results are similar for the two estimators.

5. First, in examining variance ratios, it is sufficient to employ only the training MSE values, because the variance ratio is not used for model selection. Also, although not shown here, based on examination of Q–Q plots of the responses in the computer data tables, there is no evidence that the usual assumptions of normality and constant variance are violated.

Acknowledgments.

We are grateful to Xiaopeng Cai, Chuyen Nguyen, and Mike Starek for providing the real data and related discussions and to Zaid Harchaoui for bringing to our attention the literature on the benefits of overparameterizing neural networks.

Data availability statement.

The (computer) data generated and used for this study are available upon request from the corresponding author.

REFERENCES

• Bengio, Y., and Y. Grandvalet, 2004: No unbiased estimator of the variance of k-fold cross-validation. J. Mach. Learn. Res., 5, 1089–1105.
• Bishop, C. M., 1996: Neural Networks for Pattern Recognition. Oxford University Press, 482 pp.
• Buonaccorsi, J. P., 2010: Measurement Error: Models, Methods, and Applications. Chapman & Hall/CRC, 464 pp.
• Burman, P., 1989: A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76, 503–514, https://doi.org/10.1093/biomet/76.3.503.
• Choromanska, A., M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, 2015: The loss surfaces of multilayer networks. Proc. 18th Int. Conf. on Artificial Intelligence and Statistics (AISTATS), Vol. 38, San Diego, CA, JMLR, 192–204.
• D’Amour, A., and Coauthors, 2020: Underspecification presents challenges for credibility in modern machine learning. arXiv, 2011.03395v2, https://doi.org/10.48550/arXiv.2011.03395.
• Dauphin, Y., R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, 2014: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems 27 (NIPS 2014), Z. Ghahramani et al., Eds., Vol. 4, Curran Associates, 2933–2941.
• Fang, K.-T., R. Li, and A. Sudjianto, 2006: Design and Modeling for Computer Experiments. Chapman & Hall/CRC, 290 pp.
• Fuller, W. A., 1987: Measurement Error Models. John Wiley and Sons, 440 pp.
• Hastie, T., R. Tibshirani, and J. Friedman, 2001: The Elements of Statistical Learning. Springer, 533 pp.
• James, G., D. Witten, T. Hastie, and R. Tibshirani, 2015: An Introduction to Statistical Learning. Springer, 426 pp.
• Li, X., N. Sudarsanam, and D. D. Frey, 2006: Regularities in data from factorial experiments. Complexity, 11, 32–45, https://doi.org/10.1002/cplx.20123.
• Masters, T., 1993: Practical Neural Network Recipes in C++. Academic Press, 493 pp.
• Montgomery, D. C., 2009: Design and Analysis of Experiments. 7th ed. John Wiley and Sons, 656 pp.
• Nguyen, C., M. J. Starek, P. E. Tissot, X. Cai, and J. Gibeaut, 2019: Ensemble neural networks for modeling DEM error. ISPRS Int. J. Geoinf., 8, 444, https://doi.org/10.3390/ijgi8100444.
• Sacks, J., W. J. Welch, T. J. Mitchell, and H. P. Wynn, 1989: Design and analysis of computer experiments. Stat. Sci., 4, 409–423, https://doi.org/10.1214/ss/1177012413.
• Sankararaman, K. A., S. De, Z. Xu, W. R. Huang, and T. Goldstein, 2019: The impact of neural network overparameterization on gradient confusion and stochastic gradient descent. arXiv, 1904.06963v5, https://doi.org/10.48550/arXiv.1904.06963.
• Santner, T. J., B. J. Williams, and W. Notz, 2003: The Design and Analysis of Computer Experiments. Springer, 299 pp.
• Tieleman, T., and G. Hinton, 2012: Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, Vol. 4, University of Toronto, 26–31.
• Vanwinckelen, G., and H. Blockeel, 2012: On estimating model accuracy with repeated cross-validation. Proc. 21st Belgian–Dutch Conf. on Machine Learning, Ghent, Belgium, Ghent University, 39–44.
• Welch, W. J., R. J. Buck, J. Sacks, H. P. Wynn, T. J. Mitchell, and M. D. Morris, 1992: Screening, predicting, and computer experiments. Technometrics, 34, 15–25, https://doi.org/10.2307/1269548.
• Zhang, Y., and Y. Yang, 2015: Cross-validation for selecting a model selection procedure. J. Econom., 187, 95–112, https://doi.org/10.1016/j.jeconom.2015.02.006.