Abstract: In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. In particular, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression.
Code is available at: https://ma-lab-berkeley.github.io/CRATE
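To make the two alternating steps named in the abstract concrete, the following is a minimal NumPy sketch, not the CRATE implementation. It assumes the standard lossy coding rate R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T) for a feature matrix Z of shape (d, n), takes one plain gradient descent step on it as a stand-in for the attention-based compression step, and uses elementwise soft-thresholding as a stand-in for the sparsifying step. The function names, step size, `eps`, and `lam` are all hypothetical choices made for illustration.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Assumed coding rate: 0.5 * logdet(I + d/(n*eps^2) * Z Z^T), Z of shape (d, n)."""
    d, n = Z.shape
    alpha = d / (n * eps**2)
    _, logdet = np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)
    return 0.5 * logdet

def coding_rate_grad(Z, eps=0.5):
    """Gradient of coding_rate with respect to Z: alpha * (I + alpha Z Z^T)^{-1} Z."""
    d, n = Z.shape
    alpha = d / (n * eps**2)
    return alpha * np.linalg.solve(np.eye(d) + alpha * Z @ Z.T, Z)

def compress_step(Z, step=0.1, eps=0.5):
    """One gradient descent step on the coding rate (illustrative compression step)."""
    return Z - step * coding_rate_grad(Z, eps)

def soft_threshold(Z, lam=0.1):
    """Elementwise soft-thresholding (illustrative sparsification step)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z = rng.standard_normal((8, 64))      # 8-dimensional features for 64 tokens
    Z_compressed = compress_step(Z)       # lowers the coding rate
    Z_sparse = soft_threshold(Z_compressed)  # zeroes out small entries
    print("coding rate before:", coding_rate(Z))
    print("coding rate after: ", coding_rate(Z_compressed))
    print("nonzero entries:   ", np.count_nonzero(Z_sparse), "of", Z_sparse.size)
```

In this simplified setting the gradient step shrinks every nonzero singular value of Z, so the coding rate decreases for any sufficiently small step size. The operators derived in the paper act with respect to learned subspaces and a learned dictionary, which this sketch does not model.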

White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?