Abstract: Despite the widespread success of deep learning in various applications, neural network theory has been lagging behind. The choice of activation function plays a critical role in the expressivity of a neural network, but for reasons that are not yet fully understood. While the rectified linear unit (ReLU) is currently one of the most popular activation functions, ReLU squared has only recently been shown empirically to be pivotal in producing consistently superior results for state-of-the-art deep learning tasks (So et al., 2021). To analyze the expressivity of neural networks with ReLU powers, we employ the novel framework of Gribonval et al. (2022), based on the classical concept of approximation spaces. We consider the class of functions for which the approximation error decays at a sufficiently fast rate as network complexity, measured by the number of weights, increases. We show that when approximating sufficiently smooth functions that cannot be represented by sufficiently low-degree polynomials, networks with ReLU powers need less depth than those with ReLU. Moreover, at the same depth, networks with ReLU powers can potentially achieve faster approximation rates. Lastly, our computational experiments on approximating the Rastrigin and Ackley functions with deep neural networks show that ReLU squared and ReLU cubed networks consistently outperform ReLU networks.
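
For concreteness, the objects named in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch: the abstract does not give the input dimension, the domain, or the constants used for the benchmark functions, so the textbook forms of the Rastrigin and Ackley functions and their usual default constants (`a`, `b`, `c`) are assumed here, and the helper names `relu_power`, `rastrigin`, and `ackley` are hypothetical.

```python
import math


def relu_power(x: float, k: int = 1) -> float:
    """ReLU raised to the k-th power: max(0, x)**k.

    k=1 is plain ReLU, k=2 is ReLU squared, k=3 is ReLU cubed.
    """
    return max(0.0, x) ** k


def rastrigin(xs, a: float = 10.0) -> float:
    """Textbook Rastrigin function (assumed form); minimum 0 at the origin."""
    return a * len(xs) + sum(x * x - a * math.cos(2 * math.pi * x) for x in xs)


def ackley(xs, a: float = 20.0, b: float = 0.2, c: float = 2 * math.pi) -> float:
    """Textbook Ackley function (assumed form); minimum 0 at the origin."""
    n = len(xs)
    mean_sq = sum(x * x for x in xs) / n
    mean_cos = sum(math.cos(c * x) for x in xs) / n
    return -a * math.exp(-b * math.sqrt(mean_sq)) - math.exp(mean_cos) + a + math.e
```

Both benchmark functions are smooth away from isolated points and highly oscillatory, which makes them natural targets for comparing how well networks with different activation functions approximate a function.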

A theoretical framework for deep locally connected ReLU network

DC is all you need: describing ReLU from a signal processing standpoint

Functional Network: A Novel Framework for Interpretability of Deep Neural Networks

On the Local Complexity of Linear Regions in Deep ReLU Networks

An Information-Theoretic Framework for Supervised Learning

Universal Consistency of Deep ReLU Neural Networks

Neural networks with ReLU powers need less depth

Understanding Multi-phase Optimization Dynamics and Rich Nonlinear Behaviors of ReLU Networks

Deep Representation with ReLU Neural Networks

On the Principles of ReLU Networks with One Hidden Layer

A Layer-Wise Theoretical Framework for Deep Learning of Convolutional Neural Networks

Neural Scaling Laws of Deep ReLU and Deep Operator Network: A Theoretical Study

Nonparametric regression using deep neural networks with ReLU activation function

A global convergence theory for deep ReLU implicit networks via over-parameterization

Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training

On the importance of network architecture in training very deep neural networks

Locally linear attributes of ReLU neural networks

The Evolution of the Interplay Between Input Distributions and Linear Regions in Networks

Unwrapping The Black Box of Deep ReLU Networks: Interpretability, Diagnostics, and Simplification

A note about why deep learning is deep: A discontinuous approximation perspective

Luck Matters: Understanding Training Dynamics of Deep ReLU Networks