Abstract: The practical successes of deep neural networks have not been matched by theoretical progress that satisfyingly explains their behavior. In this work, we study the information bottleneck (IB) theory of deep learning, which makes three specific claims: first, that deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase; second, that the compression phase is causally related to the excellent generalization performance of deep networks; and third, that the compression phase occurs due to the diffusion-like behavior of stochastic gradient descent. Here we show that none of these claims hold true in the general case, and instead reflect assumptions made to compute a finite mutual information metric in deterministic networks. When computed using simple binning, we demonstrate through a combination of analytical results and simulation that the information plane trajectory observed in prior work is predominantly a function of the neural nonlinearity employed: double-sided saturating nonlinearities like tanh yield a compression phase as neural activations enter the saturation regime, but linear activation functions and single-sided saturating nonlinearities like the widely used ReLU in fact do not. Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent.
Finally, we show that when an input domain consists of a subset of task-relevant and task-irrelevant information, hidden representations do compress the task-irrelevant information, although the overall information about the input may monotonically increase with training time, and that this compression happens concurrently with the fitting process rather than during a subsequent compression period.
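
The binning argument in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's experimental setup: it assumes a single neuron with a scalar Gaussian input, a hypothetical bin width of 2/30, and a handful of illustrative weight values. Because the neuron is deterministic, the mutual information between the input and the binned activation equals the entropy of the binned activation, so the code only estimates that entropy. Under these assumptions, the tanh neuron loses binned information as the weight grows and activations saturate, while the ReLU neuron does not.

```python
import numpy as np

BIN_WIDTH = 2.0 / 30.0  # assumed bin width (30 bins across tanh's range [-1, 1])


def entropy_bits(counts):
    """Shannon entropy in bits of a histogram given as raw bin counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())


def binned_entropy(h, low, high):
    """Entropy of activations h after binning [low, high] at BIN_WIDTH.

    For a deterministic neuron this equals I(X; binned h), since the
    binned activation is a function of the input.
    """
    n_bins = max(1, int(np.ceil((high - low) / BIN_WIDTH)))
    counts, _ = np.histogram(h, bins=n_bins, range=(low, low + n_bins * BIN_WIDTH))
    return entropy_bits(counts)


rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)      # toy scalar inputs (assumption)
weights = [0.1, 1.0, 10.0, 100.0]    # illustrative weight magnitudes

tanh_entropy = []
relu_entropy = []
for w in weights:
    h_tanh = np.tanh(w * x)
    h_relu = np.maximum(0.0, w * x)
    # tanh is bounded, so its bins always cover [-1, 1].
    tanh_entropy.append(binned_entropy(h_tanh, -1.0, 1.0))
    # ReLU is unbounded above, so bins of the same width extend to the maximum.
    relu_entropy.append(binned_entropy(h_relu, 0.0, float(h_relu.max())))

for w, ht, hr in zip(weights, tanh_entropy, relu_entropy):
    print(f"w={w:>6}: tanh {ht:.2f} bits, ReLU {hr:.2f} bits")
```

The fixed bin width for ReLU is a design choice made here for comparability with tanh; a different binning scheme would change the numbers, which is consistent with the abstract's point that the measured trajectory depends on how the finite mutual information estimate is constructed.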

Understanding Deep Learning by Revisiting Boltzmann Machines: an Information Geometry Approach

Understanding Boltzmann Machine and Deep Learning via A Confident Information First Principle

A Confident Information First Principle for Parameter Reduction and Model Selection of Boltzmann Machines.

Training Restricted Boltzmann Machines with Binary Synapses Using the Bayesian Learning Rule

Monotone deep Boltzmann machines

An Optimized Dimensionality Reduction Model for High-Dimensional Data Based on Restricted Boltzmann Machines

Generative and Discriminative Infinite Restricted Boltzmann Machine Training

Deep Variational Multivariate Information Bottleneck -- A Framework for Variational Losses

Restricted Boltzmann Machines: Introduction and Review

Extending the Extreme Physical Information to Universal Cognitive Models via a Confident Information First Principle.

On Training Deep Boltzmann Machines

A 3D Model Recognition Mechanism Based on Deep Boltzmann Machines

Mining Invariance in Restricted Boltzmann Machine Via Information Geometry

Simultaneous Dimensionality Reduction for Extracting Useful Representations of Large Empirical Multimodal Datasets

What does a deep neural network confidently perceive? The effective dimension of high certainty class manifolds and their low confidence boundaries

Weight Uncertainty in Boltzmann Machine.

Generalized Boltzmann Machine with Deep Neural Structure.

Multi-Scale Shape Boltzmann Machine: A Shape Model Based on Deep Learning Method

On the information bottleneck theory of deep learning

Boltzmann Machine And Its Applications In Image Recognition

Deep Narrow Boltzmann Machines are Universal Approximators