Abstract: Representations of the world environment play a crucial role in artificial intelligence. It is often inefficient to conduct reasoning and inference directly in the space of raw sensory representations, such as pixel values of images. Representation learning allows us to automatically discover suitable representations from raw sensory data. For example, given raw sensory data, a deep neural network learns nonlinear representations at its hidden layers, which are subsequently used for classification (or regression) at its output layer. In common practical regimes of deep learning, unlike the neural tangent kernel (NTK) regime, this happens implicitly during training through the minimization of a supervised or unsupervised loss. In this paper, we study the dynamics of such implicit nonlinear representation learning, which lies beyond the NTK regime. We identify a new assumption and a novel condition, called the common model structure assumption and the data-architecture alignment condition, respectively. Under the common model structure assumption, the data-architecture alignment condition is shown to be sufficient for global convergence and necessary for global optimality. Moreover, our theory explains how and when increasing the network size does and does not improve training behavior in the practical regime. Our results provide practical guidance for designing a model structure: e.g., the common model structure assumption can be used to justify choosing a particular model structure over others. We also derive a new training framework based on the theory. The proposed framework is empirically shown to maintain competitive (practical) test performance while providing global convergence guarantees for deep residual neural networks with convolutions, skip connections, and batch normalization on standard benchmark datasets, including CIFAR-10, CIFAR-100, and SVHN.

Geometric Insights into the Convergence of Nonlinear TD Learning

Almost Sure Convergence of Linear Temporal Difference Learning with Arbitrary Features

Effective Multi-step Temporal-Difference Learning for Non-Linear Function Approximation

A Non-asymptotic Analysis of Non-parametric Temporal-Difference Learning

Finite-Time Analysis of Adaptive Temporal Difference Learning with Deep Neural Networks

A Convergent Off-Policy Temporal Difference Algorithm

A Simple Finite-Time Analysis of TD Learning with Linear Function Approximation

Almost Sure Convergence of Average Reward Temporal Difference Learning

Analysis of Off-Policy Multi-Step TD-Learning with Linear Function Approximation

An Improved Finite-time Analysis of Temporal Difference Learning with Deep Neural Networks

Gradient Descent Temporal Difference-Difference Learning

Finite-Sample Analysis of Decentralized Temporal-Difference Learning with Linear Function Approximation

Decentralized Adaptive TD($\lambda$) Learning with Linear Function Approximation: Nonasymptotic Analysis

Target-Based Temporal Difference Learning

Gauss-Newton Temporal Difference Learning with Nonlinear Function Approximation

Differentially Private Temporal Difference Learning with Stochastic Nonconvex-Strongly-Concave Optimization

Understanding Dynamics of Nonlinear Representation Learning and Its Application

Non-asymptotic and Accurate Learning of Nonlinear Dynamical Systems

Finite-Time Error Bounds For Linear Stochastic Approximation and TD Learning

Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory

Towards a Better Understanding of Representation Dynamics under TD-learning