Abstract: Recent Multimodal Large Language Models (MLLMs) often use a large number of image tokens to compensate for their visual shortcomings, which not only introduces obvious redundancy but also greatly exacerbates the already high computational cost. Token pruning is an effective way to speed up MLLMs, but when and how to drop tokens remains a challenge. In this paper, we propose a novel, training-free approach for effective visual token pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for an MLLM according to a pre-defined budget. Specifically, FitPrune treats token pruning as a statistical problem of the MLLM, and its objective is to find an optimal pruning scheme that minimizes the divergence of the attention distributions before and after pruning. In practice, FitPrune can be accomplished quickly based on the attention statistics from a small batch of inference data, avoiding expensive trials on MLLMs. According to the pruning recipe, an MLLM can directly remove the redundant visual tokens of different examples during inference. To validate FitPrune, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that FitPrune can reduce computational complexity to a large extent while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT with only a 0.5% accuracy drop. Notably, the pruning recipe can be obtained in about 5 minutes. Our code is available at https://github.com/ywh187/FitPrune.
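
As a rough illustration of the idea of fitting a pruning recipe to a budget from attention statistics, the sketch below may help. It is not the authors' implementation, and every name in it (`recipe_for_threshold`, `cost_ratio`, `fit_recipe`) is hypothetical. It assumes that the divergence caused by pruning can be approximated by the fraction of attention mass removed at each layer, that the budget is given as the fraction of visual tokens processed across all layers (a crude proxy for FLOPs), and that cost decreases as the threshold grows, so that a binary search applies.

```python
from typing import List


def recipe_for_threshold(layer_attn: List[List[float]], alpha: float) -> List[int]:
    """Number of visual tokens kept at each layer for a given threshold.

    layer_attn[l][i] is the average attention received by visual token i at
    layer l (assumed to come from a small calibration batch). At each layer
    the lowest-attention surviving tokens are dropped while the removed
    attention mass stays within alpha times the surviving total.
    """
    alive = list(range(len(layer_attn[0])))
    kept_per_layer = []
    for scores in layer_attn:
        order = sorted(alive, key=lambda i: scores[i])  # ascending attention
        total = sum(scores[i] for i in alive)
        dropped_mass = 0.0
        n_drop = 0
        for i in order[:-1]:  # always keep at least one token
            if dropped_mass + scores[i] > alpha * total:
                break
            dropped_mass += scores[i]
            n_drop += 1
        alive = order[n_drop:]  # pruned tokens stay pruned in later layers
        kept_per_layer.append(len(alive))
    return kept_per_layer


def cost_ratio(kept_per_layer: List[int], n_tokens: int) -> float:
    """Fraction of (layer, token) pairs still processed after pruning."""
    return sum(kept_per_layer) / (n_tokens * len(kept_per_layer))


def fit_recipe(layer_attn: List[List[float]], budget: float, iters: int = 30) -> List[int]:
    """Binary-search the smallest threshold whose recipe meets the budget."""
    n_tokens = len(layer_attn[0])
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if cost_ratio(recipe_for_threshold(layer_attn, mid), n_tokens) <= budget:
            hi = mid  # budget met, try a gentler threshold
        else:
            lo = mid  # too expensive, prune more aggressively
    return recipe_for_threshold(layer_attn, hi)


if __name__ == "__main__":
    # Toy statistics: 3 layers, 4 visual tokens (made-up numbers).
    attn = [
        [0.05, 0.15, 0.30, 0.50],
        [0.10, 0.10, 0.30, 0.50],
        [0.25, 0.25, 0.25, 0.25],
    ]
    print(fit_recipe(attn, budget=0.5))
```

The returned list gives a per-layer count of visual tokens to keep. Under the assumptions above, it could be applied at inference time by keeping the top-scoring tokens at each layer for each input.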

PruneVid: Visual Token Pruning for Efficient Video Large Language Models

Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

FoPru: Focal Pruning for Efficient Large Vision-Language Models

[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction

freePruner: A Training-free Approach for Large Multimodal Model Acceleration

Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs

VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation

Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models

ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

Rethinking Pruning for Vision-Language Models: Strategies for Effective Sparsity and Performance Restoration