Abstract: Representations of the world environment play a crucial role in artificial intelligence. It is often inefficient to conduct reasoning and inference directly in the space of raw sensory representations, such as pixel values of images. Representation learning allows us to automatically discover suitable representations from raw sensory data. For example, given raw sensory data, a deep neural network learns nonlinear representations at its hidden layers, which are subsequently used for classification (or regression) at its output layer. In common practical regimes of deep learning, unlike the neural tangent kernel (NTK) regime, this happens implicitly during training through the minimization of a supervised or unsupervised loss. In this paper, we study the dynamics of such implicit nonlinear representation learning, which lies beyond the NTK regime. We identify a new assumption and a novel condition: the common model structure assumption and the data-architecture alignment condition. Under the common model structure assumption, the data-architecture alignment condition is shown to be sufficient for global convergence and necessary for global optimality. Moreover, our theory explains how and when increasing the network size does and does not improve training behavior in the practical regime. Our results provide practical guidance for designing a model structure: e.g., the common model structure assumption can be used as a justification for using a particular model structure instead of others. We also derive a new training framework based on the theory. The proposed framework is empirically shown to maintain competitive (practical) test performance while providing global convergence guarantees for deep residual neural networks with convolutions, skip connections, and batch normalization on standard benchmark datasets, including CIFAR-10, CIFAR-100, and SVHN.

Understanding How Pretraining Regularizes Deep Learning Algorithms

On the Generalization Ability of Unsupervised Pretraining

Why does the unsupervised pretraining encourage moderate-sparseness?

An Analysis of Unsupervised Pre-training in Light of Recent Advances

Statistical-mechanical analysis of pre-training and fine tuning in deep learning

Traditional and Heavy-Tailed Self Regularization in Neural Network Models

Sparseness Analysis in the Pretraining of Deep Neural Networks

Preconditioning for Accelerated Gradient Descent Optimization and Regularization

Stabilizing RNN Gradients through Pre-training

Why does Deep Learning work? - A perspective from Group Theory

Why pre-training is beneficial for downstream classification tasks?

Understanding Training and Generalization in Deep Learning by Fourier Analysis

Early Stopping of Untrained Convolutional Neural Networks

Exploring the Limits of Weakly Supervised Pretraining

Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks

Understanding Dynamics of Nonlinear Representation Learning and Its Application

Pre-Trained Models: Past, Present and Future

Continual Learning with Pretrained Backbones by Tuning in the Input Space

Disentangling Trainability and Generalization in Deep Neural Networks

Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization

The Training Process of Many Deep Networks Explores the Same Low-Dimensional Manifold