Abstract: Large Multimodal Models (LMMs) have demonstrated impressive capabilities in visual-language tasks but face significant deployment challenges due to their high computational demands. While recent token reduction methods show promise for accelerating LMMs, they typically require extensive retraining or fine-tuning, making them impractical for many state-of-the-art models, especially those with proprietary training data. We propose freePruner, a training-free token reduction approach that can be directly applied to any open-source LMM without additional training. Unlike existing methods that rely heavily on token merging operations, freePruner employs a two-stage token selection strategy: (1) identifying pivotal tokens that capture high-level semantic information using our designed contribution degree metric, and (2) selecting complementary tokens that preserve essential low-level visual details through attention pattern analysis. Extensive experiments demonstrate that freePruner achieves 2x acceleration while maintaining comparable performance across mainstream visual question-answering benchmarks in the training-free setting. Moreover, freePruner is orthogonal to and can be combined with other post-training acceleration techniques, such as post-training quantization, providing a practical solution for efficient LMM deployment.

What problem does this paper attempt to address?

This paper addresses the high computational cost that large multimodal models (LMMs) incur in practical deployment. Although these models perform well on vision-language tasks, their computational demands limit their use, especially in applications that require fast response times and resource efficiency. Existing acceleration methods usually require extensive retraining or fine-tuning, which is time-consuming and costly, and is impractical for many state-of-the-art models, especially those that rely on proprietary training data. The paper therefore proposes **freePruner**, a training-free acceleration method that can be applied directly to any open-source LMM without additional training or fine-tuning. The aim is to reduce the computational overhead of LMMs while maintaining model performance.

The main contributions of the paper are as follows:

1. **A training-free LMM acceleration paradigm**: The method applies to any open-source LMM without access to training data or further fine-tuning.
2. **A two-stage token selection strategy**: The strategy balances high-level semantic features and low-level visual details, achieving approximately 2x acceleration while maintaining performance.
3. **Orthogonality to existing post-training acceleration methods**: The method can be combined with techniques such as quantization to further improve the efficiency of LMMs.

### Method Overview

**freePruner** adopts a pure token selection strategy and avoids token merging operations. The method has two key components:

1. **Critical (pivotal) token selection**: A purpose-designed token contribution metric identifies tokens that capture high-level semantic information. These tokens are extracted from multiple Transformer layers, concentrated in the middle layers.
2. **Complementary token selection**: Based on the attention pattern of the last layer, tokens that are highly correlated with the critical tokens are selected to preserve important low-level visual details.

### Experimental Results

According to the experiments, **freePruner** performs on par with the original LLaVA-1.5 model across six visual question-answering and reasoning benchmarks, and surpasses it on some of them. It performs particularly well on POPE and ScienceQA. The method also scales and generalizes well: performance improves gradually as the number of selected tokens increases.

In conclusion, by proposing **freePruner**, the paper offers an efficient and practical way to reduce the computational requirements of deploying LMMs, paving the way for wider application of multimodal models.
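As a rough illustration of the two-stage selection described under Method Overview, the following Python sketch picks critical tokens first and complementary tokens second. It is not the paper's implementation. The summary does not define the contribution metric, so the sketch assumes a stand-in: the total attention a token receives, summed over the supplied layers. The function name `select_tokens`, its parameters, and the toy attention matrices are all hypothetical.

```python
def select_tokens(layer_attn, last_attn, n_critical, n_complementary):
    """Return sorted indices of the visual tokens to keep.

    layer_attn: list of attention matrices (one per chosen layer), where
                attn[i][j] is the attention token i pays to token j.
    last_attn:  attention matrix of the last layer, same layout.
    """
    n = len(last_attn)

    # Stage 1 (assumed metric): score each token by the attention it
    # receives, accumulated over the supplied layers.
    score = [0.0] * n
    for attn in layer_attn:
        for row in attn:
            for j, weight in enumerate(row):
                score[j] += weight
    critical = sorted(range(n), key=lambda j: score[j], reverse=True)[:n_critical]

    # Stage 2: among the remaining tokens, keep those that the critical
    # tokens attend to most strongly in the last layer.
    chosen = set(critical)
    rest = [j for j in range(n) if j not in chosen]
    affinity = {j: sum(last_attn[p][j] for p in critical) for j in rest}
    complementary = sorted(rest, key=lambda j: affinity[j], reverse=True)[:n_complementary]

    # Preserve the original token order for the downstream language model.
    return sorted(critical + complementary)


# Toy example with 4 tokens and a single intermediate layer.
mid_layer = [
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.6, 0.1, 0.2],
]
last_layer = [
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.5, 0.1, 0.3],
    [0.3, 0.3, 0.3, 0.1],
    [0.2, 0.2, 0.2, 0.4],
]
kept = select_tokens([mid_layer], last_layer, n_critical=1, n_complementary=1)
print(kept)  # [1, 3]
```

In the toy example, token 1 receives the most attention and becomes the critical token, and token 3 is kept because token 1 attends to it most strongly in the last layer. No tokens are merged, which matches the pure-selection design the summary describes. A real pipeline would read the attention maps from the model itself and would likely use tensor operations rather than Python lists.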

freePruner: A Training-free Approach for Large Multimodal Model Acceleration

Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction

Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster

PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models

FoPru: Focal Pruning for Efficient Large Vision-Language Models

SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models

PuMer: Pruning and Merging Tokens for Efficient Vision Language Models

LLM-Pruner: On the Structural Pruning of Large Language Models

MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer