freePruner: A Training-free Approach for Large Multimodal Model Acceleration

Bingxin Xu, Yuzhang Shang, Yunhao Ge, Qian Lou, Yan Yan
2024-11-23
Abstract: Large Multimodal Models (LMMs) have demonstrated impressive capabilities in visual-language tasks but face significant deployment challenges due to their high computational demands. While recent token reduction methods show promise for accelerating LMMs, they typically require extensive retraining or fine-tuning, making them impractical for many state-of-the-art models, especially those with proprietary training data. We propose freePruner, a training-free token reduction approach that can be directly applied to any open-source LMM without additional training. Unlike existing methods that rely heavily on token merging operations, freePruner employs a two-stage token selection strategy: (1) identifying pivotal tokens that capture high-level semantic information using our designed contribution degree metric, and (2) selecting complementary tokens that preserve essential low-level visual details through attention pattern analysis. Extensive experiments demonstrate that freePruner achieves 2x acceleration while maintaining comparable performance across mainstream visual question-answering benchmarks in the training-free setting. Moreover, freePruner is orthogonal to and can be combined with other post-training acceleration techniques, such as post-training quantization, providing a practical solution for efficient LMM deployment.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the high computational cost that large multimodal models (LMMs) incur in practical deployment. Although these models perform well on vision-language tasks, their compute requirements limit their use, especially in applications that need fast response times and resource efficiency. Existing acceleration methods usually require extensive retraining or fine-tuning, which is time-consuming and costly, and is impractical for many state-of-the-art models, especially those that rely on proprietary training data. The paper therefore proposes **freePruner**, a training-free acceleration method that can be applied directly to any open-source LMM without additional training or fine-tuning. The goal is to reduce the computational overhead of LMMs while maintaining model performance.

The main contributions of the paper are:

1. **A training-free LMM acceleration paradigm**: The method applies to any open-source LMM without access to training data or further fine-tuning.
2. **A two-stage token selection strategy**: The strategy balances high-level semantic features and low-level visual details, achieving approximately 2x acceleration while maintaining performance.
3. **Orthogonality to existing post-training acceleration methods**: freePruner can be combined with methods such as quantization, giving additional options for improving LMM efficiency.

### Method Overview

**freePruner** uses a pure token selection strategy and avoids token merging operations. The method has two key components:

1. **Pivotal token selection**: A purpose-designed token contribution metric identifies the tokens that capture high-level semantic information. These tokens are extracted from multiple Transformer layers and are concentrated in the middle layers.
2. **Complementary token selection**: Based on the attention pattern of the last layer, the method selects tokens that are highly correlated with the pivotal tokens, preserving important low-level visual details.

### Experimental Results

On six visual question-answering and reasoning benchmarks, **freePruner** is on par with the original LLaVA-1.5 model and surpasses it on some benchmarks, with particularly strong results on POPE and ScienceQA. The method also scales and generalizes well: performance improves gradually as the number of selected tokens increases.

In conclusion, **freePruner** offers an efficient and practical way to reduce the computational cost of deploying LMMs, which supports wider use of multimodal models.
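
The two-stage selection described in the Method Overview can be illustrated with a minimal Python sketch. It is not the paper's implementation: the summary does not define the contribution metric, so the sketch substitutes a simple stand-in (the total attention a token receives, averaged over a set of layers), and it measures "correlation with the selected tokens" as the last-layer attention that the stage-one tokens pay to each remaining token. All function names, the parameters `k` and `m`, and the toy attention matrices are hypothetical.

```python
from typing import List, Sequence

# attn[i][j] is the attention weight from token i to token j.
Matrix = Sequence[Sequence[float]]


def received_attention(attn: Matrix) -> List[float]:
    """Stand-in contribution score (an assumption): attention each token receives."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(n)) for j in range(n)]


def select_pivotal(layer_attns: Sequence[Matrix], k: int) -> List[int]:
    """Stage 1: keep the k tokens with the highest score averaged over layers."""
    n = len(layer_attns[0])
    avg = [0.0] * n
    for attn in layer_attns:
        for j, score in enumerate(received_attention(attn)):
            avg[j] += score / len(layer_attns)
    # Sort by descending score; break ties by token index for determinism.
    ranked = sorted(range(n), key=lambda j: (-avg[j], j))
    return sorted(ranked[:k])


def select_complementary(last_attn: Matrix, pivotal: Sequence[int], m: int) -> List[int]:
    """Stage 2: keep the m remaining tokens that the stage-1 tokens attend to most."""
    chosen = set(pivotal)
    rest = [j for j in range(len(last_attn)) if j not in chosen]
    relevance = {j: sum(last_attn[p][j] for p in pivotal) for j in rest}
    ranked = sorted(rest, key=lambda j: (-relevance[j], j))
    return sorted(ranked[:m])


def prune_tokens(layer_attns: Sequence[Matrix], last_attn: Matrix, k: int, m: int) -> List[int]:
    """Return the sorted indices of the tokens kept after both stages."""
    pivotal = select_pivotal(layer_attns, k)
    complementary = select_complementary(last_attn, pivotal, m)
    return sorted(pivotal + complementary)


# Toy example with 4 visual tokens (values are made up for illustration).
mid_layer = [
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.6, 0.1, 0.2],
]
last_layer = [
    [0.25, 0.25, 0.25, 0.25],
    [0.1, 0.2, 0.1, 0.6],
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
]
kept = prune_tokens([mid_layer], last_layer, k=1, m=1)
print(kept)  # [1, 3]
```

In the toy example, token 1 receives the most attention in the middle layer and is kept in stage one; token 3 is the remaining token that token 1 attends to most in the last layer, so it is kept in stage two. Because tokens are only selected and never merged, the kept tokens retain their original positions in the sequence.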