Learning Two-Layer ReLU Networks Is Nearly as Easy as Learning Linear Classifiers on Separable Data

Qiuling Yang,Alireza Sadeghi,Gang Wang,Jian Sun
DOI: https://doi.org/10.1109/tsp.2021.3094911
IF: 4.875
2021-01-01
IEEE Transactions on Signal Processing
Abstract:Neural networks with non-linear rectified linear unit (ReLU) activation functions have demonstrated remarkable performance in many fields.It has been observed that a sufficiently wide and/or deep ReLU network can accurately fit the training data, with a small generalization error on the testing data.Nevertheless, existing analytical results on provably training ReLU networks are mostly limited to over-parameterized cases, or they require assumptions on the data distribution.In this paper, training a two-layer ReLU network for binary classification of linearly separable data is revisited.Adopting the hinge loss as classification criterion yields a non-convex objective function with infinite local minima and saddle points.Instead, a modified loss is proposed which enables (stochastic) gradient descent to attain a globally optimal solution.Enticingly, the solution found is globally optimal for the hinge loss too.In addition, an upper bound on the number of iterations required to find a global minimum is derived.To ensure generalization performance, a convex max-margin formulation for two-layer ReLU network classifiers is discussed.Connections between the sought max-margin ReLU network and the max-margin support vector machine are drawn.Finally, an algorithm-dependent theoretical quantification of the generalization performance is developed using classical compression bounds.Numerical tests using synthetic and real data validate the analytical results.
engineering, electrical & electronic
What problem does this paper attempt to address?