Abstract: The practical successes of deep neural networks have not been matched by theoretical progress that satisfyingly explains their behavior. In this work, we study the information bottleneck (IB) theory of deep learning, which makes three specific claims: first, that deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase; second, that the compression phase is causally related to the excellent generalization performance of deep networks; and third, that the compression phase occurs due to the diffusion-like behavior of stochastic gradient descent. Here we show that none of these claims holds in the general case, and that they instead reflect assumptions made to compute a finite mutual information metric in deterministic networks. When mutual information is computed using simple binning, we demonstrate through a combination of analytical results and simulation that the information plane trajectory observed in prior work is predominantly a function of the neural nonlinearity employed: double-sided saturating nonlinearities like tanh yield a compression phase as neural activations enter the saturation regime, but linear activation functions and single-sided saturating nonlinearities like the widely used ReLU do not. Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training, by demonstrating that we can replicate the IB findings using full-batch gradient descent rather than stochastic gradient descent. Finally, we show that when an input domain consists of a subset of task-relevant and task-irrelevant information, hidden representations do compress the task-irrelevant information, although the overall information about the input may monotonically increase with training time, and that this compression happens concurrently with the fitting process rather than during a subsequent compression period.
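
The dependence of the binned estimate on the nonlinearity can be illustrated with a minimal sketch. It is not the experimental setup of this work: it assumes a single hidden unit h = f(w·x) with a standard Gaussian input, a fixed bin width of 0.1, and three hand-picked weight magnitudes standing in for growth during training. Because h is a deterministic function of x, the binned mutual information with the input is taken here to equal the entropy of the binned activation. The names `binned_entropy`, `relu`, `weights`, `tanh_entropy`, and `relu_entropy` are hypothetical.

```python
import numpy as np


def binned_entropy(values, bin_width=0.1):
    """Entropy in bits of `values` after discretizing into fixed-width bins."""
    bin_ids = np.floor(values / bin_width).astype(np.int64)
    _, counts = np.unique(bin_ids, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def relu(z):
    return np.maximum(z, 0.0)


# Assumed toy input: 10,000 samples from a standard Gaussian.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)

# Assumed weight magnitudes: small, moderate, large.
weights = [0.01, 1.0, 100.0]

# For a deterministic unit, the binned I(h; x) reduces to H(binned h).
tanh_entropy = [binned_entropy(np.tanh(w * x)) for w in weights]
relu_entropy = [binned_entropy(relu(w * x)) for w in weights]

for w, ht, hr in zip(weights, tanh_entropy, relu_entropy):
    print(f"w={w:>6}: tanh {ht:.2f} bits, ReLU {hr:.2f} bits")
```

Under these assumptions the tanh estimate rises and then falls as the weight grows, because saturated activations collapse into the outermost bins, while the ReLU estimate keeps increasing because its positive side never saturates.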
