Abstract: Deep neural networks (DNNs) have achieved state-of-the-art performance in various learning tasks, such as computer vision, natural language processing, and speech recognition. However, the fundamental theory of generalization in deep learning remains obscure: why do DNN models generalize well even though they are heavily overparametrized in both depth and width? Recent work shows that the traditional theory for analyzing the generalization error of learning models fails to explain the generalization of DNNs. The failure stems mainly from one simple fact: a worst-case analysis of the generalization error is too loose for models with a large parameter space, such as DNNs. In this work, we propose a new analysis of generalization in DNNs from an optimal transport perspective. Unlike the traditional worst-case uniform convergence analysis in learning theory, our analysis of the generalization error depends on both the learning algorithm and the data distribution, and is an average-case analysis. Our theory can therefore describe the generalization behavior of DNNs more practically and accurately. More specifically, in this article, we try to answer a fundamental yet unsolved question in deep learning: why can deeper models generalize better than shallow models? The main contributions of this article can be summarized in four aspects. First, under a general learning framework, we derive upper bounds on the generalization error of learning algorithms in terms of their algorithmic transport cost: the expected Wasserstein distance between the distribution of the output hypothesis and the distribution of the output hypothesis conditioned on an input example. Second, we provide several upper bounds on the algorithmic transport cost in terms of total variation distance, relative entropy, and Vapnik-Chervonenkis (VC) dimension. Third, we study different conditions on loss functions under which the generalization error of a learning algorithm can be upper bounded by different probability metrics between distributions relating to the output hypothesis and/or the input data. Finally, under our established framework, we obtain our main results, showing that the generalization error in DNNs decreases exponentially to zero as the number of layers increases.

Information-Theoretic Local Minima Characterization and Regularization

Information-Theoretic Generalization Bounds for Deep Neural Networks

An Interpretable Regularization Method Based on Minimizing Mutual Information

Going Deeper, Generalizing Better: an Information-Theoretic View for Deep Learning.

Neighborhood Region Smoothing Regularization for Finding Flat Minima in Deep Neural Networks

Regularizing Neural Networks Via Retaining Confident Connections.

Towards Generalization Beyond Pointwise Learning: A Unified Information-theoretic Perspective

Generalize Deep Neural Networks with Adaptive Regularization for Classifying

An Information-Theoretic Regularizer for Lossy Neural Image Compression

An Optimal Transport Analysis on Generalization in Deep Learning

Theory IIIb: Generalization in Deep Networks

Penetrating the influence of regularizations on neural network based on information bottleneck theory

Understanding Generalization in Deep Learning via Tensor Methods

Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes.

Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes

The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent.

Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Topological Regularization for Representation Learning Via Persistent Homology

An Information-Theoretic Framework for Out-of-Distribution Generalization with Applications to Stochastic Gradient Langevin Dynamics

A Convex Relaxation Approach to Generalization Analysis for Parallel Positively Homogeneous Networks