Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

Weihao Ye, Qiong Wu, Wenhao Lin, Yiyi Zhou
2024-09-16
Abstract: Recent Multimodal Large Language Models (MLLMs) often use a large number of image tokens to compensate for their visual shortcomings, which not only introduces obvious redundancy but also greatly exacerbates the already high computation. Token pruning is an effective solution for speeding up MLLMs, but when and how to drop tokens remains a challenge. In this paper, we propose a novel and training-free approach for the effective visual token pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for MLLMs according to a pre-defined budget. Specifically, FitPrune treats token pruning as a statistical problem of the MLLM, and its objective is to find an optimal pruning scheme that minimizes the divergence of the attention distributions before and after pruning. In practice, FitPrune can be quickly accomplished based on the attention statistics from a small batch of inference data, avoiding expensive trials on MLLMs. According to the pruning recipe, an MLLM can directly remove the redundant visual tokens of different examples during inference. To validate FitPrune, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that FitPrune can greatly reduce computational complexity while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT with only a 0.5% accuracy drop. Notably, the pruning recipe can be obtained in about 5 minutes. Our code is available at https://github.com/ywh187/FitPrune.
Computer Vision and Pattern Recognition, Computation and Language, Multimedia
What problem does this paper attempt to address?
This paper addresses the high computational cost and visual token redundancy of multimodal large language models (MLLMs) on visual tasks. Existing MLLMs compensate for their weaknesses in visual understanding by using a large number of image tokens, which greatly increases the computational cost. For example, when the LLaVA model processes the ScienceQA dataset, it uses 576 image patches as visual tokens, and the required computation is 6.2 times that of processing the text alone.

These visual tokens are also largely redundant. In deeper layers of the model, the attention between visual tokens and text gradually concentrates on a few tokens, which indicates that many visual tokens are effectively inactive. In theory, removing these inactive tokens should have a limited impact on model performance.

The paper therefore proposes FitPrune, a novel and training-free visual token pruning method that quickly generates a complete pruning strategy meeting a predefined computational budget. The core idea of FitPrune is to choose the pruning scheme that minimizes the difference between the attention distributions before and after pruning, thereby reducing computational complexity while maintaining high model performance. Experimental results show that FitPrune can significantly reduce the computation of MLLMs. For example, it reduces FLOPs by 54.9% on the LLaVA-NEXT model while degrading performance by only 0.5%. In addition, the pruning strategy can be obtained in about 5 minutes.
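
The idea of fitting a pruning recipe to a budget can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes, for illustration only, that each layer provides one attention weight per visual token, that the "difference in attention distribution" is approximated by the fraction of attention mass discarded, that layers are treated independently, and that the budget is a total count of kept tokens rather than FLOPs. The names `prune_layer` and `fit_threshold` are hypothetical.

```python
def prune_layer(attn, max_dropped_mass):
    """Return the indices of visual tokens to keep in one layer.

    attn: non-negative attention weights, one per visual token.
    max_dropped_mass: largest fraction of attention mass we may discard
                      (a stand-in for the divergence bound; an assumption).
    """
    total = sum(attn)
    if total <= 0:
        return list(range(len(attn)))
    # Visit tokens from least to most attended and drop them greedily
    # until discarding one more would exceed the allowed mass.
    order = sorted(range(len(attn)), key=lambda i: attn[i])
    dropped_mass = 0.0
    dropped = set()
    for i in order:
        share = attn[i] / total
        if dropped_mass + share > max_dropped_mass:
            break
        dropped_mass += share
        dropped.add(i)
    return [i for i in range(len(attn)) if i not in dropped]


def fit_threshold(layer_attn, token_budget, steps=30):
    """Binary-search the smallest threshold whose kept-token count fits the budget.

    layer_attn: list of per-layer attention weight lists (e.g. averaged
                over a small batch of inference data).
    token_budget: maximum total number of kept tokens across all layers
                  (a simplified proxy for a FLOPs budget).
    """
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        kept = sum(len(prune_layer(a, mid)) for a in layer_attn)
        if kept > token_budget:
            lo = mid   # too many tokens kept: allow more mass to be dropped
        else:
            hi = mid   # budget met: try a stricter threshold
    return hi


if __name__ == "__main__":
    # Toy statistics for two layers with four visual tokens each.
    stats = [[50, 30, 10, 10], [70, 10, 10, 10]]
    threshold = fit_threshold(stats, token_budget=4)
    recipe = [prune_layer(a, threshold) for a in stats]
    print(threshold, recipe)
```

Because the search only needs attention statistics and no model re-evaluation, each candidate threshold is cheap to check, which is consistent with the summary's point that a recipe can be produced quickly. A faithful version would carry pruned tokens forward across layers and measure the budget in FLOPs.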