Abstract:In vision-language models (VLMs), visual tokens usually consume a significant amount of computational overhead, despite their sparser information density compared to text tokens. To address this, most existing methods learn a network to prune redundant visual tokens and require additional training data. Differently, we propose an efficient training-free token optimization mechanism dubbed SparseVLM without extra parameters or fine-tuning costs. Concretely, given that visual tokens complement text tokens in VLMs for linguistic reasoning, we select visual-relevant text tokens to rate the significance of vision tokens within the self-attention matrix extracted from the VLMs. Then we progressively prune irrelevant tokens. To maximize sparsity while retaining essential information, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that our SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, LLaVA equipped with SparseVLM reduces 61% to 67% FLOPs with a compression ratio of 78% while maintaining 93% of the accuracy. Our code is available at <a class="link-external link-https" href="https://github.com/Gumpest/SparseVLMs" rel="external noopener nofollow">this https URL</a>.

Learning Sparse Overcomplete Word Vectors Without Intermediate Dense Representations

Privacy-Preserving Collaborative Model Learning: the Case of Word Vector Training

Using Context-to-Vector with Graph Retrofitting to Improve Word Embeddings

Compressing Neural Language Models by Sparse Word Representations

Sparse word embeddings using l1 regularized online learning

Sparse Lifting of Dense Vectors: Unifying Word and Sentence Representations

Word Equations: Inherently Interpretable Sparse Word Embeddingsthrough Sparse Coding

Learning Word Representations with Hierarchical Sparse Coding

Biologically Plausible Sparse Temporal Word Representations

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Interpretable Neural Embeddings with Sparse Self-Representation

Deriving Word Vectors from Contextualized Language Models using Topic-Aware Mention Selection

SPINE: SParse Interpretable Neural Embeddings

Enriching Word Vectors with Subword Information

Learning Sparse Latent Representations for Generator Model

Word Vector Enrichment of Low Frequency Words in the Bag-of-Words Model for Short Text Multi-class Classification Problems

SparseGAN: Sparse Generative Adversarial Network for Text Generation

Learning Regularized Noise Contrastive Estimation for Robust Network Embedding.

Non-distributional Word Vector Representations

SparseVSR: Lightweight and Noise Robust Visual Speech Recognition

Learning Word Embeddings from Intrinsic and Extrinsic Views