Abstract: This paper studies the properties of solutions to multi-task shallow ReLU neural network learning problems, wherein the network is trained to fit a dataset with minimal sum of squared weights. Remarkably, the solutions learned for each individual task resemble those obtained by solving a kernel method, revealing a novel connection between neural networks and kernel methods. It is known that single-task neural network training problems are equivalent to minimum-norm interpolation problems in a non-Hilbertian Banach space, and that the solutions of such problems are generally non-unique. In contrast, we prove that the solutions to univariate-input, multi-task neural network interpolation problems are almost always unique, and coincide with the solution to a minimum-norm interpolation problem in a Sobolev (Reproducing Kernel) Hilbert Space. We also demonstrate a similar phenomenon in the multivariate-input case; specifically, we show that neural network learning problems with large numbers of diverse tasks are approximately equivalent to an $\ell^2$ (Hilbert space) minimization problem over a fixed kernel determined by the optimal neurons.

What problem does this paper attempt to address?

The core question of this paper is how multi-task learning changes the functions learned by ReLU neural networks: specifically, how the solution for each task differs from the single-task solution when the network is trained to minimize the sum of squared weights. The paper makes three main points:

1. **Uniqueness of multi-task learning solutions**: For univariate input ($d = 1$), the paper proves that the solutions of the multi-task interpolation problem are almost always unique, and it characterizes the special cases in which they are not.
2. **Equivalence between multi-task learning and kernel methods**: When the univariate solution is unique, it is the interpolant obtained by connecting consecutive data points with straight lines, which also solves the minimum-norm interpolation problem in the Sobolev space $H^1$. Each task's solution therefore coincides with the solution of a kernel method, whereas single-task solutions are generally non-unique and are minimum-norm interpolants in the non-Hilbertian Banach space $BV^2$.
3. **Insights into multivariate multi-task problems**: The paper provides experimental evidence and mathematical analysis indicating that similar conclusions hold in the multivariate setting. Specifically, when the tasks are numerous and diverse, the solution for each task is approximately the minimum-norm solution in a particular RKHS.

### Formula Summary

- The form of the ReLU neural network function:
\[ f_\theta(x)=\sum_{k = 1}^{K}v_k(w_k^{\top}x + b_k)_+ + Ax + c \]
where $(\cdot)_+=\max\{0,\cdot\}$, $w_k\in\mathbb{R}^d$, $v_k\in\mathbb{R}^T$, $b_k\in\mathbb{R}$, $A\in\mathbb{R}^{T\times d}$, $c\in\mathbb{R}^T$, and $T$ is the number of tasks.
- The weight-decay interpolation problem:
\[ \min_{\theta}\sum_{k = 1}^{K}\left(\|v_k\|_2^2+\|w_k\|_2^2\right)\quad\text{subject to}\quad f_\theta(x_i)=y_i,\quad i = 1,\ldots,N \]
- The equivalent optimization problem:
\[ \min_{\theta}\sum_{k = 1}^{K}\|v_k\|_2\quad\text{subject to}\quad\|w_k\|_2 = 1,\quad f_\theta(x_i)=y_i,\quad i = 1,\ldots,N \]
- The slope of the connect-the-points interpolant for task $t$ on the interval $[x_i, x_{i+1}]$:
\[ s_i^t=\frac{y_{i + 1,t}-y_{i,t}}{x_{i+1}-x_i} \]
with $s_i=(s_i^1,\ldots,s_i^T)\in\mathbb{R}^T$ the vector of slopes across all tasks.
- Condition for non-uniqueness (the special case mentioned in point 1): for some $i = 2,\ldots,N - 2$, the vectors
\[ s_i - s_{i-1}=\frac{y_{i+1}-y_i}{x_{i+1}-x_i}-\frac{y_i - y_{i-1}}{x_i - x_{i-1}} \]
and
\[ s_{i+1}-s_i=\frac{y_{i+2}-y_{i+1}}{x_{i+2}-x_{i+1}}-\frac{y_{i+1}-y_i}{x_{i+1}-x_i} \]
are both non-zero and aligned, where $y_i\in\mathbb{R}^T$ collects the labels $y_{i,t}$ of all tasks at $x_i$. Because this requires exact alignment of two vectors in $\mathbb{R}^T$, it fails for almost all datasets, which is why the solution is almost always unique.

Through these results, the paper shows how multi-task learning shapes neural network solutions and establishes a connection with traditional kernel methods.
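The slope and alignment formulas above can be checked numerically on a small univariate dataset. The following is a minimal sketch, not the paper's code: the function names `slopes`, `connect_the_dots`, and `nonunique_indices` are hypothetical, the inputs `x` are assumed sorted and distinct, and "aligned" is assumed to mean that the two vectors are positive scalar multiples of each other (cosine similarity equal to 1, up to a tolerance).

```python
import numpy as np

def slopes(x, Y):
    """Slope vectors s_i in R^T between consecutive points.

    x: sorted inputs, shape (N,). Y: labels, shape (N, T), one column per task.
    Returns an array of shape (N-1, T) whose row i-1 is s_i.
    """
    x = np.asarray(x, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return np.diff(Y, axis=0) / np.diff(x)[:, None]

def connect_the_dots(x, Y, x_query):
    """Evaluate the piecewise-linear interpolant of every task at x_query.

    Note: np.interp holds the end values constant outside [x_1, x_N],
    so only queries inside that interval are meaningful here.
    """
    Y = np.asarray(Y, dtype=float)
    cols = [np.interp(x_query, x, Y[:, t]) for t in range(Y.shape[1])]
    return np.stack(cols, axis=1)

def nonunique_indices(x, Y, tol=1e-9):
    """Return the 1-indexed i in {2, ..., N-2} for which s_i - s_{i-1} and
    s_{i+1} - s_i are both non-zero and aligned (assumed: same direction)."""
    D = np.diff(slopes(x, Y), axis=0)   # row j is s_{j+2} - s_{j+1}
    norms = np.linalg.norm(D, axis=1)
    hits = []
    for j in range(len(D) - 1):
        u, v = D[j], D[j + 1]
        nu, nv = norms[j], norms[j + 1]
        if nu > tol and nv > tol and np.dot(u, v) >= (1 - tol) * nu * nv:
            hits.append(j + 2)          # convert to the 1-indexed i
    return hits

x = [0.0, 1.0, 2.0, 3.0]
# Single task: consecutive slope changes share a sign, so the condition fires.
Y_single = np.array([[0.0], [0.0], [1.0], [3.0]])
# Two tasks: the second task breaks the alignment of the slope changes.
Y_multi = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [3.0, 0.0]])
print(nonunique_indices(x, Y_single))
print(nonunique_indices(x, Y_multi))
```

In this toy example, adding a second task with different slope changes is enough to break the alignment that the single-task data exhibits, which illustrates why the condition rarely holds once several diverse tasks share the same inputs.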

The Effects of Multi-Task Learning on ReLU Neural Network Functions
