Abstract: Recent Multimodal Large Language Models (MLLMs) often use a large number of image tokens to compensate for their visual shortcomings, which not only introduces obvious redundancy but also greatly exacerbates the already high computational cost. Token pruning is an effective way to speed up MLLMs, but when and how to drop tokens remains a challenge. In this paper, we propose a novel, training-free approach for effective visual token pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for an MLLM according to a pre-defined budget. Specifically, FitPrune treats token pruning as a statistical problem of the MLLM, and its objective is to find an optimal pruning scheme that minimizes the divergence between the attention distributions before and after pruning. In practice, FitPrune can be carried out quickly based on the attention statistics from a small batch of inference data, avoiding expensive trials on MLLMs. According to the pruning recipe, an MLLM can directly remove the redundant visual tokens of different examples during inference. To validate FitPrune, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that FitPrune can reduce the computational complexity to a large extent while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT with only a 0.5% accuracy drop. Notably, the pruning recipe can be obtained in about 5 minutes. Our code is available at https://github.com/ywh187/FitPrune.
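
The idea of fitting a per-layer pruning recipe to a budget from attention statistics can be illustrated with a minimal sketch. This is not the released FitPrune code: the function names `keep_counts` and `fit_recipe`, the input layout (one list of averaged attention mass per visual token for each layer), the use of the removed attention share as a stand-in for the divergence, and the cost proxy (mean fraction of tokens kept across layers instead of FLOPs) are all assumptions made for illustration.

```python
def keep_counts(attn_stats, threshold):
    """Return how many visual tokens to keep in each layer.

    attn_stats: list of layers, each a list of positive attention masses
                (hypothetical statistics averaged over a small batch).
    threshold:  largest share of a layer's attention mass that may be removed,
                used here as a simple proxy for the attention divergence.
    """
    counts = []
    alive = len(attn_stats[0])  # tokens still present when entering a layer
    for layer in attn_stats:
        # Assumption: the surviving tokens are the `alive` strongest ones.
        masses = sorted(layer, reverse=True)[:alive]
        total = sum(masses)
        dropped = 0.0
        # Drop the weakest tokens while the removed share stays within bounds.
        while alive > 1 and (dropped + masses[alive - 1]) / total <= threshold:
            dropped += masses[alive - 1]
            alive -= 1
        counts.append(alive)  # pruned tokens stay pruned in later layers
    return counts


def fit_recipe(attn_stats, budget, steps=30):
    """Binary-search the smallest threshold whose recipe meets the budget.

    budget: target mean fraction of visual tokens kept across layers
            (an assumed cost proxy, not a FLOPs model).
    If even the most aggressive recipe misses the budget, that recipe is returned.
    """
    n_tokens = len(attn_stats[0])

    def cost(counts):
        return sum(counts) / (len(counts) * n_tokens)

    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if cost(keep_counts(attn_stats, mid)) <= budget:
            hi = mid  # budget met: try a gentler threshold
        else:
            lo = mid  # too many tokens kept: prune harder
    return keep_counts(attn_stats, hi)


if __name__ == "__main__":
    # Toy statistics for 2 layers and 4 visual tokens (made-up numbers).
    stats = [[0.4, 0.3, 0.2, 0.1], [0.7, 0.2, 0.05, 0.05]]
    print(fit_recipe(stats, budget=0.6))
```

Because the recipe is a list of per-layer keep counts fitted once from the statistics, applying it at inference only requires selecting that many tokens in each layer, for example by ranking the tokens of each input by attention; no retraining or repeated trial runs are involved in this sketch.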

Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
