What problem does this paper attempt to address?

The paper asks how to construct explicit global minimizers for a specific type of deep ReLU neural network, in particular on "sequentially separable" data sets. The aim is to understand and explain the role of each layer of the network by constructing truncation maps that send each class of data to a single point, which yields zero-loss classification.

### Specific description of the problem

1. **Data types**: The paper considers two types of data configurations:
   - **Sufficiently small and well-separated clusters**: The data points of each class form a small cluster that is well separated from the others.
   - **Sequentially linearly separable data**: At each step, a hyperplane separates the data of one class from the data of the remaining classes.
2. **Objectives**:
   - Given training data \( X_0=\bigcup_{j = 1}^Q X_{0,j}\subset\mathbb{R}^M\), where \( Q\) is the number of classes and \( M\) is the input dimension, the objective is to find a ReLU neural network that classifies these data with zero loss.
   - Specifically, the paper seeks a ReLU neural network with \( Q + 1\) layers whose weights and biases can be written explicitly in terms of cumulative parameters, so that the network attains the global minimum on the training data.

### Solutions

1. **Truncation map**:
   - The paper introduces the truncation map \(\tau_{W,b}(x)=(W)^+(\sigma(Wx + b)-b)\), where \( W\) is the weight matrix, \( b\) is the bias vector, and \(\sigma\) is the ReLU activation function.
   - The truncation map collapses certain regions of the input space (such as the backward cone) to a point while leaving other regions (such as the forward cone) unchanged.
2. **Recursive construction of global minimizers**:
   - At each layer, the weight matrix \( W\) and bias vector \( b\) are chosen so that the data of one class are mapped to a point while the data of the other classes are left unchanged.
   - The last layer is an affine transformation that matches the class averages with the reference outputs.
3. **Theoretical results**:
   - For sufficiently small and well-separated clusters, the paper proves that a ReLU neural network can achieve zero-loss classification with \( Q(M+Q/2)\) parameters.
   - For sequentially linearly separable data, the paper proves that a ReLU neural network can achieve zero-loss classification with \( Q + 1\) layers, where the widths satisfy \( d_0=d_1=\cdots=d_Q = M\geq Q\) and the last layer has width \( d_{Q + 1}=Q\).

### Conclusion

By introducing the truncation map and cumulative parameters, the paper constructs explicit global minimizers for these specific types of data. The construction explains the role of each layer of the network and offers a new perspective on the optimization problem for deep ReLU neural networks.
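To make the truncation map in the summary above concrete, here is a minimal pure-Python sketch. It rests on assumptions that are not in the summary: \( W\) is taken to be a square, invertible \(2\times 2\) matrix, so that \((W)^+\) (read here as a generalized inverse) reduces to the ordinary inverse, and the helper names `relu`, `matvec`, `inverse_2x2`, and `truncation_map` are hypothetical. With \( W = I\), points where every entry of \( Wx + b\) is nonpositive collapse to the single point \(-W^{-1}b\), while points where every entry is nonnegative are returned unchanged.

```python
def relu(v):
    """Componentwise ReLU."""
    return [max(t, 0.0) for t in v]


def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * t for a, t in zip(row, v)) for row in A]


def inverse_2x2(A):
    """Inverse of a 2x2 matrix (assumption: W is square and invertible)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("W must be invertible in this sketch")
    return [[d / det, -b / det], [-c / det, a / det]]


def truncation_map(W, b, x):
    """tau_{W,b}(x) = W^{-1} (relu(W x + b) - b), for invertible 2x2 W."""
    pre = [p + bi for p, bi in zip(matvec(W, x), b)]
    post = [s - bi for s, bi in zip(relu(pre), b)]
    return matvec(inverse_2x2(W), post)


# Illustrative data (made up for this sketch, not taken from the paper).
W = [[1.0, 0.0], [0.0, 1.0]]
b = [-1.0, -1.0]  # with W = I, coordinates below 1 are clamped to 1

# Every coordinate <= 1, so W x + b <= 0 componentwise.
cluster = [[0.2, 0.5], [-0.3, 0.9], [0.8, -1.0]]
# Every coordinate >= 1, so W x + b >= 0 componentwise.
others = [[2.0, 3.0], [1.5, 4.0]]

collapsed = [truncation_map(W, b, x) for x in cluster]
kept = [truncation_map(W, b, x) for x in others]

print(collapsed)  # every point of the cluster lands on -W^{-1} b = [1.0, 1.0]
print(kept)       # these points are returned unchanged
```

Applying maps of this kind one after another, each with its own \( W\) and \( b\), is the recursive step the summary describes; the sketch shows only a single step in two dimensions.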

Interpretable global minima of deep ReLU neural networks on sequentially separable data

Geometric structure of Deep Learning networks and construction of global ${\mathcal L}^2$ minimizers

Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes

Learning Two-Layer ReLU Networks Is Nearly as Easy as Learning Linear Classifiers on Separable Data

Rethinking generalization of classifiers in separable classes scenarios and over-parameterized regimes

Elimination of All Bad Local Minima in Deep Learning

Mildly Overparameterized ReLU Networks Have a Favorable Loss Landscape

Dynamics in Deep Classifiers Trained with the Square Loss: Normalization, Low Rank, Neural Collapse, and Generalization Bounds

Universal Consistency of Wide and Deep ReLU Neural Networks and Minimax Optimal Convergence Rates for Kolmogorov-Donoho Optimal Function Classes

Deep Neural Networks: Multi-Classification and Universal Approximation

Loss Landscape of Shallow ReLU-like Neural Networks: Stationary Points, Saddle Escaping, and Network Embedding

Early Neuron Alignment in Two-layer ReLU Networks with Small Initialization

Hidden Minima in Two-Layer ReLU Networks

Locally linear attributes of ReLU neural networks

Are deep ResNets provably better than linear predictors?

Theory IIIb: Generalization in Deep Networks

Exact Solutions of a Deep Linear Network

How do Minimum-Norm Shallow Denoisers Look in Function Space?

Implicit Hypersurface Approximation Capacity in Deep ReLU Networks

Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization

Bridging the Gap Between Approximation and Learning via Optimal Approximation by ReLU MLPs of Maximal Regularity