Abstract: These are the notes for the lectures that I gave during Fall 2020 at the Moscow Institute of Physics and Technology (MIPT) and at the Yandex School of Data Analysis (YSDA). The notes cover some aspects of initialization, the loss landscape, generalization, and neural tangent kernel theory. While many other topics (e.g. expressivity, mean-field theory, the double descent phenomenon) are missing from the current version, we plan to add them in future revisions.

What problem does this paper attempt to address?

This paper addresses several key problems in deep learning theory: initialization, the loss landscape, generalization, and neural tangent kernel theory. Specifically:

1. **Initialization problem**:
   - The paper explores how to initialize the weights of a neural network so that the signal neither vanishes nor explodes during training. This involves keeping the variance stable in both forward and backward propagation.
   - The paper proposes different initialization strategies for different activation functions (such as ReLU and Tanh). For example, \( v_l = \frac{2}{n_l} \) is used for ReLU, and \( v_l = \frac{1}{n_l} \) is used for linear layers.
2. **Loss landscape problem**:
   - The loss landscape describes how the loss function varies over the parameter space. The paper studies the loss landscapes of wide non-linear networks and of linear networks, and explores guarantees of local convergence.
   - By analyzing the loss landscape, the paper attempts to explain why gradient descent finds a global minimum in practice even though the loss function is non-convex.
3. **Generalization problem**:
   - Generalization refers to the performance of a model on unseen data. The paper discusses how to evaluate and improve the generalization of a model, including uniform bounds and PAC-Bayesian bounds.
   - The paper points out that traditional complexity measures such as the VC dimension may not explain the generalization behavior of modern deep neural networks well, so it introduces new complexity measures.
4. **Neural Tangent Kernel (NTK) theory**:
   - NTK theory links the training of neural networks to kernel methods. In the infinite-width limit in particular, training a neural network can be regarded as a kernel regression problem.
   - The paper explores the stability of the NTK and its influence on the convergence of gradient descent, revealing the dynamics of wide neural networks during training.

In summary, this paper aims to provide a theoretical basis for understanding and optimizing deep learning models through an in-depth analysis of these aspects.
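The initialization rule in item 1 can be checked numerically. The following is a minimal sketch, not taken from the notes. It assumes that \( v_l \) is the variance of each weight in layer \( l \) and that \( n_l \) is the layer's input width; the function name `forward_second_moments`, the width, the depth, and the seed are all illustrative choices. It pushes a random input through a stack of fully connected layers and records the mean squared pre-activation at every layer.

```python
import numpy as np

def forward_second_moments(depth, width, weight_var_scale, activation, rng):
    """Return the mean squared pre-activation of each layer.

    Weights are drawn i.i.d. from N(0, v) with v = weight_var_scale / width
    (assumption: v plays the role of v_l and width the role of n_l).
    """
    x = rng.standard_normal(width)  # random input with unit second moment
    moments = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(weight_var_scale / width)
        h = W @ x                                # pre-activations of this layer
        moments.append(float(np.mean(h ** 2)))   # empirical second moment
        x = activation(h)                        # input to the next layer
    return moments

def relu(h):
    return np.maximum(h, 0.0)

def identity(h):
    return h

rng = np.random.default_rng(0)
relu_scaled = forward_second_moments(20, 1024, 2.0, relu, rng)      # v = 2 / n
relu_unscaled = forward_second_moments(20, 1024, 1.0, relu, rng)    # v = 1 / n
linear_scaled = forward_second_moments(20, 1024, 1.0, identity, rng)  # v = 1 / n

for name, m in [("ReLU, v=2/n", relu_scaled),
                ("ReLU, v=1/n", relu_unscaled),
                ("linear, v=1/n", linear_scaled)]:
    print(f"{name}: first layer {m[0]:.3e}, last layer {m[-1]:.3e}")
```

Under these assumptions, the second moment should stay roughly constant across layers for ReLU with \( v = 2/n \) and for linear layers with \( v = 1/n \), while ReLU with \( v = 1/n \) should lose about half of it per layer, since ReLU zeroes out the negative half of a symmetric pre-activation distribution.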

Notes on Deep Learning Theory

Lecture Notes on Linear Neural Networks: A Tale of Optimization and Generalization in Deep Learning

Notes on Deep Learning for NLP

TASI Lectures on Physics for Machine Learning

Deep Learning and Computational Physics (Lecture Notes)

Mathematics of Neural Networks (Lecture Notes Graduate Course)

Kernels, Data & Physics

Learning Curves for Deep Neural Networks: A Gaussian Field Theory Perspective

Artificial Neural Network and Deep Learning: Fundamentals and Theory

Mathematical theory of deep learning

Recent advances in deep learning theory

Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory

Applying statistical learning theory to deep learning

Lecture Notes: Optimization for Machine Learning

Optimization for deep learning: theory and algorithms

Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory

Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization

Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

Deep Learning and Geometric Deep Learning: an introduction for mathematicians and physicists