Abstract: Neural networks with rectified linear unit (ReLU) activation functions (a.k.a. ReLU networks) have achieved great empirical success in various domains. Nonetheless, existing results for learning ReLU networks either impose assumptions on the underlying data distribution (e.g., that it is Gaussian) or require the network size and/or training size to be sufficiently large. In this context, the problem of learning a two-layer ReLU network is approached in a binary classification setting, where the data are linearly separable and a hinge loss criterion is adopted. Leveraging the power of random noise perturbation, this paper presents a novel stochastic gradient descent (SGD) algorithm, which can provably train any single-hidden-layer ReLU network to attain global optimality, despite the presence of infinitely many bad local minima, maxima, and saddle points in general. This result is the first of its kind, requiring no assumptions on the data distribution, training/network size, or initialization. Convergence of the resultant iterative algorithm to a global minimum is analyzed by establishing both an upper bound and a lower bound on the number of non-zero updates to be performed. Moreover, by leveraging classic compression bounds, generalization guarantees are developed for ReLU networks trained with the novel SGD. These guarantees highlight a key difference (at least in the worst case) between reliably learning a ReLU network and reliably learning a leaky ReLU network in terms of sample complexity. Numerical tests using both synthetic data and real images validate the effectiveness of the algorithm and the practical merits of the theory.
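To make the setting in the abstract concrete, the following is a minimal sketch of one noise-perturbed SGD step on the hinge loss of a single-hidden-layer ReLU network. It is an illustration under stated assumptions, not the paper's exact algorithm: the abstract does not say where the noise enters, so the sketch assumes it perturbs the pre-activations that decide which hidden units count as active, and it assumes the second-layer weights `v` are held fixed. The function names, `lr`, and `noise_scale` are all hypothetical.

```python
import numpy as np

def forward(W, v, x):
    """Output of a single-hidden-layer ReLU network: v^T max(0, W x)."""
    return float(v @ np.maximum(W @ x, 0.0))

def hinge_loss(W, v, x, y):
    """Hinge loss max(0, 1 - y * f(x)) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * forward(W, v, x))

def noisy_sgd_step(W, v, x, y, lr, noise_scale, rng):
    """One update of the hidden-layer weights W.

    Returns (new_W, updated), where `updated` is False when the
    sample already has zero hinge loss and W is left unchanged.
    """
    if hinge_loss(W, v, x, y) == 0.0:
        return W, False  # zero loss: this is a "zero update"
    # Assumption (not stated in the abstract): noise is added to the
    # pre-activations before deciding which hidden units are active.
    noise = noise_scale * rng.standard_normal(W.shape[0])
    active = (W @ x + noise >= 0.0).astype(W.dtype)
    # Hinge-loss subgradient w.r.t. W under the noisy activation pattern.
    grad = -y * np.outer(v * active, x)
    return W - lr * grad, True
```

With `noise_scale=0` the step reduces to a plain subgradient step in which units with a pre-activation of exactly zero are treated as active; a positive `noise_scale` lets inactive units occasionally receive an update, which is the intuition behind using random perturbation in this sketch.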

Complexity of Training ReLU Neural Network

The Computational Complexity of ReLU Network Training Parameterized by Data Dimensionality

Polynomial-Time Solutions for ReLU Network Training: A Complexity Classification via Max-Cut and Zonotopes

On the Principles of ReLU Networks with One Hidden Layer

Complexity of Deciding Injectivity and Surjectivity of ReLU Neural Networks

Complexity of Neural Network Training and ETR: Extensions with Effectively Continuous Functions

Learning Narrow One-Hidden-Layer ReLU Networks

Overparameterized ReLU Neural Networks Learn the Simplest Model: Neural Isometry and Phase Transitions

Neural networks with ReLU powers need less depth

Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training

On the Local Complexity of Linear Regions in Deep ReLU Networks

Topological obstruction to the training of shallow ReLU neural networks

Phase Diagram for Two-layer ReLU Neural Networks at Infinite-width Limit

On the Complexity of Learning Neural Networks

Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization

Towards Lower Bounds on the Depth of ReLU Neural Networks

On the Hardness of Training Deep Neural Networks Discretely

Deep ReLU Networks Have Surprisingly Simple Polytopes

Optimal function approximation with ReLU neural networks

Convex Formulations for Training Two-Layer ReLU Neural Networks

Understanding Multi-phase Optimization Dynamics and Rich Nonlinear Behaviors of ReLU Networks