Notes on Deep Learning Theory

Eugene A. Golikov
DOI: https://doi.org/10.48550/arXiv.2012.05760
2020-12-10
Abstract: These are the notes for the lectures that I was giving during Fall 2020 at the Moscow Institute of Physics and Technology (MIPT) and at the Yandex School of Data Analysis (YSDA). The notes cover some aspects of initialization, loss landscape, generalization, and a neural tangent kernel theory. While many other topics (e.g. expressivity, a mean-field theory, a double descent phenomenon) are missing in the current version, we plan to add them in future revisions.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses several key topics in deep learning theory: initialization, the loss landscape, generalization, and neural tangent kernel theory. Specifically:

1. **Initialization problem**:
   - The paper explores how to initialize the weights of a neural network so that the signal neither vanishes nor explodes as it passes through the layers. This amounts to keeping the variance stable in both forward and backward propagation.
   - The appropriate initialization depends on the activation function (ReLU, Tanh, etc.). For example, for ReLU the weight variance \( v_l = \frac{2}{n_l} \) is used, and for a linear layer \( v_l = \frac{1}{n_l} \) is used, where \( n_l \) is the width of layer \( l \).
2. **Loss landscape problem**:
   - The loss landscape is the shape of the loss function over the parameter space. The paper studies the loss landscapes of wide non-linear networks and of linear networks, and explores guarantees of local convergence.
   - By analyzing the loss landscape, the paper attempts to explain why gradient descent finds a global minimum in practice even though the loss function is non-convex.
3. **Generalization problem**:
   - Generalization ability refers to the performance of a model on unseen data. The paper discusses how to bound the generalization error of a model, including uniform bounds and PAC-Bayesian bounds.
   - The paper points out that traditional complexity measures such as the VC dimension may not adequately explain the generalization of modern deep neural networks, so other complexity measures are introduced.
4. **Neural Tangent Kernel (NTK) theory**:
   - NTK theory links the training of neural networks with kernel methods. In particular, in the infinite-width limit, training a neural network can be regarded as a kernel regression problem.
   - The paper explores the stability of the NTK and its influence on the convergence of gradient descent, revealing the training dynamics of wide neural networks.

In summary, this paper aims to provide a theoretical basis for understanding and optimizing deep learning models through an in-depth analysis of these four topics.
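The ReLU initialization rule quoted in item 1 can be checked numerically. The sketch below is an illustration and is not taken from the notes: it assumes a fully connected network with equal layer widths, zero biases, a standard Gaussian input, and i.i.d. Gaussian weights with variance `variance_scale / width`. The function name `relu_forward_second_moments` and all of its parameters are hypothetical. It records the mean squared pre-activation of each layer, so the choice \( v_l = \frac{2}{n_l} \) can be compared with \( v_l = \frac{1}{n_l} \).

```python
import math
import random


def relu_forward_second_moments(width, depth, variance_scale, seed=0):
    """Return the mean squared pre-activation of each layer of a random ReLU MLP.

    Assumptions (illustrative, not from the notes): all layers have the same
    width, biases are zero, and weights are i.i.d. N(0, variance_scale / width).
    """
    rng = random.Random(seed)
    std = math.sqrt(variance_scale / width)
    # Random input vector with unit second moment per coordinate.
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    moments = []
    for _ in range(depth):
        # Pre-activations h = W x, sampling each row of W on the fly.
        h = [sum(rng.gauss(0.0, std) * xj for xj in x) for _ in range(width)]
        moments.append(sum(v * v for v in h) / width)
        # ReLU activation feeds the next layer.
        x = [max(0.0, v) for v in h]
    return moments


if __name__ == "__main__":
    relu_scale = relu_forward_second_moments(200, 8, variance_scale=2.0)
    linear_scale = relu_forward_second_moments(200, 8, variance_scale=1.0)
    print("v_l = 2/n_l, last/first moment:", relu_scale[-1] / relu_scale[0])
    print("v_l = 1/n_l, last/first moment:", linear_scale[-1] / linear_scale[0])
```

Under these assumptions, ReLU zeroes roughly half of each pre-activation's second moment, so the factor 2 in \( v_l = \frac{2}{n_l} \) keeps the signal at about the same scale from layer to layer, while \( v_l = \frac{1}{n_l} \) roughly halves it at every ReLU layer. At finite width the measured ratios fluctuate around these values.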