Abstract: The problem of finding a reduced dimensionality representation of categorical variables while preserving their most relevant characteristics is fundamental for the analysis of complex data. Specifically, given a co-occurrence matrix of two variables, one often seeks a compact representation of one variable which preserves information about the other variable. We have recently introduced "Sufficient Dimensionality Reduction" [GT-2003], a method that extracts continuous reduced dimensional features whose measurements (i.e., expectation values) capture maximal mutual information among the variables. However, such measurements often capture information that is irrelevant for a given task. Widely known examples are illumination conditions, which are irrelevant as features for face recognition; writing style, which is irrelevant as a feature for content classification; and intonation, which is irrelevant as a feature for speech recognition. Such irrelevance cannot be deduced a priori, since it depends on the details of the task, and is thus inherently ill-defined in the purely unsupervised case. Separating relevant from irrelevant features can be achieved using additional side data that contains such irrelevant structures. This approach was taken in [CT-2002], extending the information bottleneck method, which uses clustering to compress the data. Here we use this side-information framework to identify features whose measurements are maximally informative for the original data set, but carry as little information as possible on a side data set. In statistical terms this can be understood as extracting statistics which are maximally sufficient for the original data set, while simultaneously maximally ancillary for the side data set. We formulate this tradeoff as a constrained optimization problem and characterize its solutions. We then derive a gradient descent algorithm for this problem, which is based on the Generalized Iterative Scaling method for finding maximum entropy distributions. The method is demonstrated on synthetic data, as well as on real face recognition datasets, and is shown to outperform standard methods such as oriented PCA.
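As a concrete reading of the quantities named in the abstract, the following is a minimal sketch of how the mutual information of a co-occurrence matrix can be computed, and how information on the original data might be weighed against information on the side data. The function names `mutual_information` and `tradeoff_score`, the weight `lam`, and the weighted-difference form of the score are illustrative assumptions; they are not taken from the abstract and do not implement the paper's feature-extraction algorithm.

```python
import math


def mutual_information(counts):
    """Mutual information (in nats) of the empirical joint distribution
    defined by a co-occurrence matrix given as a list of rows."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            if c > 0:  # zero cells contribute nothing (0 * log 0 = 0)
                p_xy = c / total
                # p(x,y) / (p(x) p(y)) simplifies to c * total / (row * col)
                mi += p_xy * math.log(c * total / (row_sums[i] * col_sums[j]))
    return mi


def tradeoff_score(info_main, info_side, lam):
    """Hypothetical scalar tradeoff (an assumption, not the paper's objective):
    reward information on the original data, penalize information on the
    side data with weight lam."""
    return info_main - lam * info_side


# Toy co-occurrence matrices (made up for illustration).
main_counts = [[10, 0], [0, 10]]   # perfectly dependent variables
side_counts = [[5, 5], [5, 5]]     # independent variables
score = tradeoff_score(
    mutual_information(main_counts),
    mutual_information(side_counts),
    lam=1.0,
)
```

In this toy case the main matrix carries log 2 nats of mutual information and the side matrix carries none, so the score equals the mutual information of the main data alone.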

Relevant sparse codes with variational information bottleneck

Generalized Information Bottleneck for Gaussian Variables

Opportunistic Information-Bottleneck for Goal-oriented Feature Extraction and Communication

Flexible Variational Information Bottleneck: Achieving Diverse Compression with a Single Training

Tighter Bounds on the Information Bottleneck with Application to Deep Learning

Information Bottleneck Revisited: Posterior Probability Perspective with Optimal Transport

Deep Variational Multivariate Information Bottleneck -- A Framework for Variational Losses

A Variance Minimization Criterion to Feature Selection Using Laplacian Regularization

Sufficient Dimensionality Reduction with Irrelevant Statistics

Information bottleneck theory of high-dimensional regression: relevancy, efficiency and optimality

Variational Predictive Information Bottleneck

Sparse Orthogonal Variational Inference for Gaussian Processes

Cauchy-Schwarz Divergence Information Bottleneck for Regression

Information Geometry and Beta Link for Optimizing Sparse Variational Student-t Processes

A variational Bayes approach to debiased inference for low-dimensional parameters in high-dimensional linear regression

Exploring the Trade-Off in the Variational Information Bottleneck for Regression with a Single Training Run

Differentiable Information Bottleneck for Deterministic Multi-view Clustering

Caveats for information bottleneck in deterministic scenarios

An Achievable and Analytic Solution to Information Bottleneck for Gaussian Mixtures

Sparse bayesian inference with regularized gaussian distributions

Uncertainty in the Variational Information Bottleneck