Abstract: The practical successes of deep neural networks have not been matched by theoretical progress that satisfyingly explains their behavior. In this work, we study the information bottleneck (IB) theory of deep learning, which makes three specific claims: first, that deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase; second, that the compression phase is causally related to the excellent generalization performance of deep networks; and third, that the compression phase occurs due to the diffusion-like behavior of stochastic gradient descent. Here we show that none of these claims hold true in the general case, and instead reflect assumptions made to compute a finite mutual information metric in deterministic networks. When computed using simple binning, we demonstrate through a combination of analytical results and simulation that the information plane trajectory observed in prior work is predominantly a function of the neural nonlinearity employed: double-sided saturating nonlinearities like tanh yield a compression phase as neural activations enter the saturation regime, but linear activation functions and single-sided saturating nonlinearities like the widely used ReLU in fact do not. Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent.
Finally, we show that when an input domain consists of a subset of task-relevant and task-irrelevant information, hidden representations do compress the task-irrelevant information, although the overall information about the input may monotonically increase with training time, and that this compression happens concurrently with the fitting process rather than during a subsequent compression period.
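The binning argument in the abstract can be illustrated with a small sketch. This is not the authors' code: the function name `binned_entropy_bits`, the bin width of 0.1, the single-unit "layer", and the weight scales 0.5 and 50 are all assumptions chosen for illustration. For a deterministic layer, the binned representation T is a function of the input X, so under the empirical distribution I(X;T) equals the entropy H(T) of the binned activations. Scaling the weight up stands in for weight growth during training.

```python
import numpy as np

def binned_entropy_bits(activations, bin_width=0.1):
    """Entropy in bits of binned activation patterns (one row per input).

    Hypothetical helper: for a deterministic layer, this equals the
    binning-based estimate of I(X;T) under the empirical distribution.
    """
    # Assign every activation to a fixed-width bin (assumed width 0.1).
    codes = np.floor(activations / bin_width).astype(int)
    # Each distinct row of bin indices is one symbol of the discretized T.
    _, counts = np.unique(codes, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=(4096, 1))  # toy one-dimensional inputs

def relu(z):
    return np.maximum(z, 0.0)

# Small weight: activations sit in the near-linear regime.
h_tanh_small = binned_entropy_bits(np.tanh(0.5 * x))
h_relu_small = binned_entropy_bits(relu(0.5 * x))

# Large weight: tanh saturates on both sides, ReLU only on one side.
h_tanh_large = binned_entropy_bits(np.tanh(50.0 * x))
h_relu_large = binned_entropy_bits(relu(50.0 * x))

print(f"tanh: {h_tanh_small:.2f} -> {h_tanh_large:.2f} bits")
print(f"ReLU: {h_relu_small:.2f} -> {h_relu_large:.2f} bits")
```

In this toy setting, the tanh estimate drops as the weight grows, because most activations collapse into the bins near -1 and +1, while the ReLU estimate rises, because the positive half of the activations spreads over more bins. The sketch only mirrors the mechanism described in the abstract for a single unit; it says nothing about full networks or generalization.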
