Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory

Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, Tom Goldstein
2020-04-29
Abstract: We empirically evaluate common assumptions about neural networks that are widely held by practitioners and theorists alike. In this work, we: (1) prove the widespread existence of suboptimal local minima in the loss landscape of neural networks, and we use our theory to find examples; (2) show that small-norm parameters are not optimal for generalization; (3) demonstrate that ResNets do not conform to wide-network theories, such as the neural tangent kernel, and that the interaction between skip connections and batch normalization plays a role; (4) find that rank does not correlate with generalization or robustness in a practical setting.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
The paper evaluates whether several widely accepted hypotheses about neural networks hold in practice. The authors focus on four questions:

1. **Local Minima**: Many theoretical studies argue that all local minima of a neural network loss function are globally optimal or nearly so. The authors, however, find suboptimal local minima in the loss functions of actual networks and examine why such minima exist on the loss surface of deep networks.
2. **Weight Decay and Parameter Norms**: Work inspired by Tikhonov regularization suggests that low-norm solutions generalize better, which gives an intuitive justification for simple regularizers such as weight decay. The authors' experiments show that, for modern architectures, regularization that biases solutions toward a non-zero norm still works and can even improve performance.
3. **Neural Tangent Kernels and the Wide-Network Limit**: Theory shows that in the wide-network limit the neural tangent kernel (NTK) remains nearly constant during training, so the training dynamics can be described as gradient descent on a convex function. The authors find that these predictions do not carry over to practical networks, ResNets in particular, and show that the interaction between skip connections and batch normalization contributes to this departure.
4. **Rank**: Generalization theory provides performance guarantees for low-rank networks. The authors find, however, that regularizers encouraging high-rank weight matrices often outperform those promoting low-rank matrices, which suggests that low-rank structure is not a key factor in the generalization of practical networks.
In addition, the authors study the adversarial robustness of low-rank networks and find that it is usually lower than that of baseline networks or of specially constructed high-rank networks. Overall, by testing these hypotheses empirically, the paper exposes gaps between theory and practice and offers new insight into how deep networks actually behave.
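To make the contrast in the weight-decay point concrete, the following is a minimal sketch of a penalty that pulls parameters toward a non-zero norm rather than toward zero. The summary above does not state the exact form of the regularizer, so the form used here, `lam * | ||theta||^2 - mu^2 |`, is an assumption for illustration, as are the names `lam`, `mu`, and all function names.

```python
import numpy as np


def weight_decay_penalty(theta, lam):
    """Standard weight decay: lam * ||theta||^2, smallest at theta = 0."""
    return lam * float(np.sum(theta ** 2))


def norm_bias_penalty(theta, lam, mu):
    """Assumed norm-bias form: lam * | ||theta||^2 - mu^2 |.

    Smallest when the parameter norm equals the target mu (hypothetical name).
    """
    return lam * abs(float(np.sum(theta ** 2)) - mu ** 2)


def norm_bias_subgradient(theta, lam, mu):
    """Subgradient of the assumed penalty: 2 * lam * sign(||theta||^2 - mu^2) * theta."""
    return 2.0 * lam * np.sign(np.sum(theta ** 2) - mu ** 2) * theta


def descend_on_penalty(theta0, lam, mu, lr, steps):
    """Run plain (sub)gradient descent on the penalty alone, with no data loss.

    This isolates the regularizer's effect: the norm drifts toward mu
    instead of shrinking to zero as it would under weight decay.
    """
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * norm_bias_subgradient(theta, lam, mu)
    return theta


if __name__ == "__main__":
    start = np.array([3.0, 4.0])  # norm 5
    end = descend_on_penalty(start, lam=0.1, mu=2.0, lr=0.05, steps=500)
    print("final norm:", np.linalg.norm(end))
    print("weight decay penalty at start:", weight_decay_penalty(start, 0.1))
    print("norm-bias penalty at start:", norm_bias_penalty(start, 0.1, 2.0))
```

In a real training loop this penalty would be added to the data loss. The sketch drops the data loss only to show where each regularizer pushes the parameter norm on its own.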