Abstract: The practical successes of deep neural networks have not been matched by theoretical progress that satisfyingly explains their behavior. In this work, we study the information bottleneck (IB) theory of deep learning, which makes three specific claims: first, that deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase; second, that the compression phase is causally related to the excellent generalization performance of deep networks; and third, that the compression phase occurs due to the diffusion-like behavior of stochastic gradient descent. Here we show that none of these claims hold true in the general case, and instead reflect assumptions made to compute a finite mutual information metric in deterministic networks. When computed using simple binning, we demonstrate through a combination of analytical results and simulation that the information plane trajectory observed in prior work is predominantly a function of the neural nonlinearity employed: double-sided saturating nonlinearities like tanh yield a compression phase as neural activations enter the saturation regime, but linear activation functions and single-sided saturating nonlinearities like the widely used ReLU in fact do not. Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent. Finally, we show that when an input domain consists of a subset of task-relevant and task-irrelevant information, hidden representations do compress the task-irrelevant information, although the overall information about the input may monotonically increase with training time, and that this compression happens concurrently with the fitting process rather than during a subsequent compression period.
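
To make the binning-based estimate concrete, the following is a minimal sketch of how mutual information between inputs and a deterministic hidden layer might be approximated. It is not the authors' code: the function name `binned_entropy_bits`, the bin count, the bin range, the toy Gaussian inputs, and the weight scales are all illustrative assumptions. It relies on the fact that, for a deterministic layer with inputs treated as equiprobable samples, the binned mutual information I(X; T) reduces to the entropy of the binned activity.

```python
import numpy as np


def binned_entropy_bits(activations, n_bins=30, lo=-1.0, hi=1.0):
    """Entropy in bits of layer activity after equal-width binning.

    `activations` has shape (n_samples, n_units). For a deterministic layer
    T = f(X) with equiprobable input samples, the binned estimate of I(X; T)
    equals this entropy. Bin count and range are assumptions for illustration.
    """
    edges = np.linspace(lo, hi, n_bins + 1)
    # Map every activation to a bin index in 0..n_bins-1 (interior edges only).
    codes = np.digitize(activations, edges[1:-1])
    # Each distinct row of bin indices is one discrete state of the layer.
    _, counts = np.unique(codes, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


rng = np.random.default_rng(0)
x = rng.normal(size=(4096, 1))  # toy scalar inputs (assumption)

# Larger weights push tanh units into saturation, so more samples fall
# into the two extreme bins and the binned entropy drops.
for w in (1.0, 10.0):
    t = np.tanh(w * x)
    print(f"weight scale {w}: binned I(X;T) ~ {binned_entropy_bits(t):.2f} bits")
```

In this toy setting the estimate falls as the weight scale grows, which mirrors the saturation mechanism the abstract attributes to double-sided saturating nonlinearities. The value depends on the chosen bin width, which is one reason the estimate should be read as a property of the measurement procedure as much as of the network.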

Penetrating the influence of regularizations on neural network based on information bottleneck theory

Consistency of Neural Networks with Regularization

The Efficacy of Regularization in Two Layer Neural Networks

On Regularization for Explaining Graph Neural Networks: An Information Theory Perspective

An Information-Theoretic Regularizer for Lossy Neural Image Compression

Network as Regularization for Training Deep Neural Networks: Framework, Model and Performance

Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations

Regularization theory in the study of generalization ability of a biological neural network model

A Comparative Study on Regularization Strategies for Embedding-based Neural Networks.

On the information bottleneck theory of deep learning

Information Bottleneck Theory on Convolutional Neural Networks

Towards Understanding Regularization in Batch Normalization

Information-Theoretic Local Minima Characterization and Regularization

Information-Theoretic Foundations for Neural Scaling Laws

Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization

How Does Information Bottleneck Help Deep Learning?

An Improving Framework of regularization for Network Compression

Effective Neural Network $L_0$ Regularization With BinMask

Implicit Regularization of Dropout

Singular Regularization with Information Bottleneck Improves Model's Adversarial Robustness

L0 Regularization Based Neural Network Design and Compression