Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu
2024-12-08
Abstract: Prevailing Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone, similar to how Large Language Models (LLMs) process text tokens. However, the number of vision tokens increases quadratically with image resolution, leading to huge computational costs. In this paper, we consider improving MLLM efficiency in two scenarios: (I) reducing computational cost without degrading performance, and (II) improving performance under a given budget. We start with our main finding that the ranking of each vision token sorted by attention scores is similar in each layer except the first layer. Based on it, we assume that the number of essential top vision tokens does not increase along layers. Accordingly, for Scenario I, we propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer, from shallow to deep. Interestingly, G-Search is able to reach the optimal reduction strategy based on our assumption. For Scenario II, based on the reduction strategy from G-Search, we design a parametric sigmoid function (P-Sigmoid) to guide the reduction at each layer of the MLLM, whose parameters are optimized by Bayesian Optimization. Extensive experiments demonstrate that our approach can significantly accelerate popular MLLMs, e.g., LLaVA and InternVL2 models, by more than $2 \times$ without performance drops. Our approach also far outperforms other token reduction methods when budgets are limited, achieving a better trade-off between efficiency and effectiveness.
Computer Vision and Pattern Recognition
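The greedy search described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `g_search_sketch`, `evaluate`, `tolerance`, and `toy_evaluate` are hypothetical. The sketch assumes that performance does not improve when fewer tokens are kept (so a binary search per layer is valid), and that while one layer is being searched, all deeper layers inherit the same budget, which is the largest value the non-increasing assumption allows.

```python
from typing import Callable, List


def g_search_sketch(
    num_layers: int,
    num_tokens: int,
    evaluate: Callable[[List[int]], float],
    tolerance: float = 0.0,
) -> List[int]:
    """Greedily pick a non-increasing per-layer vision-token budget (illustrative)."""
    # Reference score with every vision token kept at every layer.
    baseline = evaluate([num_tokens] * num_layers)
    keep = [num_tokens] * num_layers

    for layer in range(num_layers):  # shallow to deep
        # Non-increasing assumption: a layer never keeps more than the one before it.
        upper = keep[layer - 1] if layer > 0 else num_tokens
        lo, hi = 0, upper  # `hi` is always a budget already known to be acceptable
        while lo < hi:
            mid = (lo + hi) // 2
            # Earlier layers stay fixed; this layer and all deeper ones keep `mid`.
            trial = keep[:layer] + [mid] * (num_layers - layer)
            if evaluate(trial) >= baseline - tolerance:
                hi = mid
            else:
                lo = mid + 1
        keep[layer:] = [hi] * (num_layers - layer)
    return keep


# Hypothetical stand-in for a benchmark run: each layer has a hidden minimum
# number of tokens it needs, and the score drops if any layer falls below it.
TOY_NEEDS = [576, 300, 120, 60]


def toy_evaluate(keep: List[int]) -> float:
    return 1.0 if all(k >= n for k, n in zip(keep, TOY_NEEDS)) else 0.5


if __name__ == "__main__":
    print(g_search_sketch(len(TOY_NEEDS), 576, toy_evaluate))
```

In practice `evaluate` would run the MLLM on a validation set with the given per-layer budgets, which makes each call expensive. That cost is the reason this sketch uses a logarithmic number of trials per layer.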
What problem does this paper attempt to address?
This paper attempts to solve the problem of excessively high computational costs faced by multimodal large language models (MLLMs) when processing high-resolution images. Specifically, the paper focuses on two scenarios:

1. **Reducing computational costs without degrading performance**: Optimize the vision token reduction strategy so that the model uses far less computation while maintaining its performance.
2. **Improving performance within a given budget**: When computational resources are limited, optimize the vision token reduction strategy to get the best performance the budget allows.

### Main problems and solutions

#### Problem description

MLLMs usually encode input images into vision tokens and feed them into a pre-trained language model for processing. However, the number of vision tokens grows quadratically with image resolution, resulting in huge computational costs that limit the efficiency of MLLMs in practical applications.

#### Solutions

To meet this challenge, the authors propose two methods:

1. **G-Search (Greedy Search Algorithm)**:
   - **Assumption**: The ranking of vision tokens by attention score is similar in every layer except the first, so the number of essential vision tokens does not increase as the layers get deeper.
   - **Method**: Rank the vision tokens based on the attention scores of the previous layer, then greedily search, from shallow layers to deep ones, for the least number of vision tokens to keep at each layer. Under the assumption above, this greedy search reaches the optimal reduction strategy.
   - **Result**: G-Search accelerates MLLMs such as the LLaVA and InternVL2 models by more than 2× without performance drops.
2. **P-Sigmoid (Parameterized Sigmoid Function)**:
   - **Assumption**: The retention rates of vision tokens in different layers can be fitted to an S-shaped curve.
   - **Method**: Based on the results of G-Search, design a parameterized sigmoid function to guide the vision token reduction in each layer, and optimize its parameters through Bayesian Optimization to maximize performance within a given budget.
   - **Result**: P-Sigmoid significantly outperforms other token reduction methods under a limited budget, achieving a better trade-off between efficiency and effectiveness.

### Experimental verification

The authors conducted extensive experiments on multiple popular benchmark datasets to verify the effectiveness of these two methods. The results show that G-Search and P-Sigmoid can not only significantly reduce computational costs but also improve model performance in some cases.

### Summary

This paper proposes a method for automatically searching for the optimal vision token reduction strategy, aiming to improve the efficiency of MLLMs. With G-Search and P-Sigmoid, the authors address the excessively high computational cost of MLLMs on high-resolution images and demonstrate that the methods apply across different models and tasks.
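As a rough illustration of the P-Sigmoid idea, the sketch below maps a few curve parameters to a per-layer token budget. The exact parameterization used in the paper is not given in this summary, so the form here (a decreasing logistic curve in layer depth with hypothetical parameters `midpoint`, `steepness`, and `floor_ratio`) is an assumption, as are the function names. The Bayesian Optimization step is not shown; any optimizer would tune these parameters against validation performance under a cost constraint.

```python
import math
from typing import List


def p_sigmoid_schedule(
    num_layers: int,
    num_tokens: int,
    midpoint: float,
    steepness: float,
    floor_ratio: float = 0.0,
) -> List[int]:
    """Per-layer vision-token budgets following a decreasing sigmoid in depth.

    Assumed form (hypothetical, not taken from the paper):
        ratio(l) = floor_ratio + (1 - floor_ratio) / (1 + exp(steepness * (l - midpoint)))
    """
    budgets = []
    for layer in range(num_layers):
        # Clip the exponent so math.exp cannot overflow for extreme parameters.
        z = max(-60.0, min(60.0, steepness * (layer - midpoint)))
        # Logistic gate: near 1 well before `midpoint`, near 0 well after it.
        gate = 1.0 / (1.0 + math.exp(z))
        ratio = floor_ratio + (1.0 - floor_ratio) * gate
        budgets.append(round(ratio * num_tokens))
    return budgets


def mean_keep_ratio(budgets: List[int], num_tokens: int) -> float:
    """Average fraction of vision tokens kept per layer, a rough cost proxy."""
    return sum(budgets) / (len(budgets) * num_tokens)


if __name__ == "__main__":
    schedule = p_sigmoid_schedule(
        num_layers=32, num_tokens=576, midpoint=16.0, steepness=0.5
    )
    print(schedule)
    print(mean_keep_ratio(schedule, 576))
```

With a positive `steepness` the budgets never increase with depth, which matches the non-increasing assumption behind G-Search. Describing the whole schedule with three numbers also keeps the search space small enough for a sample-efficient optimizer.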