Abstract: Nonlinearity is crucial to the performance of a deep (neural) network (DN). To date there has been little progress in understanding the menagerie of available nonlinearities, but recently progress has been made on understanding the rôle played by piecewise affine and convex nonlinearities like the ReLU and absolute value activation functions and max-pooling. In particular, DN layers constructed from these operations can be interpreted as {\em max-affine spline operators} (MASOs) that have an elegant link to vector quantization (VQ) and $K$-means. While this is good theoretical progress, the entire MASO approach is predicated on the requirement that the nonlinearities be piecewise affine and convex, which precludes important activation functions like the sigmoid, hyperbolic tangent, and softmax. {\em This paper extends the MASO framework to these and an infinitely large class of new nonlinearities by linking deterministic MASOs with probabilistic Gaussian Mixture Models (GMMs).} We show that, under a GMM, piecewise affine, convex nonlinearities like ReLU, absolute value, and max-pooling can be interpreted as solutions to certain natural "hard" VQ inference problems, while sigmoid, hyperbolic tangent, and softmax can be interpreted as solutions to corresponding "soft" VQ inference problems. We further extend the framework by hybridizing the hard and soft VQ optimizations to create a $\beta$-VQ inference that interpolates between hard, soft, and linear VQ inference. A prime example of a $\beta$-VQ DN nonlinearity is the {\em swish} nonlinearity, which offers state-of-the-art performance in a range of computer vision tasks but was developed ad hoc by experimentation. Finally, we validate experimentally an important assertion of our theory, namely that DN performance can be significantly improved by enforcing orthogonality in its linear filters.
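To make the hard/soft/$\beta$ distinction concrete, the following is a minimal scalar sketch for the two affine pieces $0$ and $x$. Hard selection keeps the larger piece (ReLU), while soft selection weights the pieces by a softmax, which puts weight $\sigma(x)$ on the piece $x$. The abstract does not state the $\beta$-VQ formula, so the $\beta/(1-\beta)$ scaling below is an assumption of this sketch, chosen only so that $\beta = 0$, $1/2$, and $1$ recover the linear, soft, and hard cases; the function names are likewise illustrative.

```python
import math


def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def hard_vq(x):
    """Hard selection between the affine pieces 0 and x: this is ReLU."""
    return max(0.0, x)


def soft_vq(x):
    """Soft selection: softmax weights over (0, x) put sigmoid(x) on x."""
    return sigmoid(x) * x


def beta_vq(x, beta):
    """Swish-like interpolation for beta in [0, 1].

    ASSUMPTION: the selection weight is sigmoid(eta * x) with
    eta = beta / (1 - beta). This parameterization is not given in the
    abstract; it is picked so that beta = 0 is linear (x / 2),
    beta = 1/2 is the soft case, and beta = 1 is the hard case (ReLU).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if beta == 1.0:
        return hard_vq(x)
    eta = beta / (1.0 - beta)
    return sigmoid(eta * x) * x


if __name__ == "__main__":
    for beta in (0.0, 0.5, 0.9, 1.0):
        print(beta, [round(beta_vq(x, beta), 4) for x in (-2.0, 0.0, 2.0)])
```

Under this assumed scaling, increasing $\beta$ sharpens the gate from a constant $1/2$ toward a step function, so a single scalar moves the nonlinearity from linear through swish-like to ReLU.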

Learning Neural Networks with Two Nonlinear Layers in Polynomial Time

Learning Polynomial Problems with $SL(2,\mathbb{R})$ Equivariance

Learning Narrow One-Hidden-Layer ReLU Networks

Scalable Nonlinear Learning with Adaptive Polynomial Expansions

Learning a Single Neuron for Non-monotonic Activation Functions

Learning Hierarchical Polynomials of Multiple Nonlinear Features with Three-Layer Networks

Learning Two-layer Neural Networks with Symmetric Inputs

Learning Algorithms via Neural Logic Networks

On the Complexity of Learning Neural Networks

Agnostic Learning of Arbitrary ReLU Activation under Gaussian Marginals

Online Learning with Gated Linear Networks

The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks

Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off

Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials

Layer Dynamics of Linearised Neural Nets

On the Learning Dynamics of Two-layer Nonlinear Convolutional Neural Networks

Linear approximability of two-layer neural networks: A comprehensive analysis based on spectral decay

Almost-Orthogonal Layers for Efficient General-Purpose Lipschitz Networks

NN2Poly: A polynomial representation for deep feed-forward artificial neural networks

A Unified Algebraic Perspective on Lipschitz Neural Networks

From Hard to Soft: Understanding Deep Network Nonlinearities via Vector Quantization and Statistical Inference