Abstract: Prevailing Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone, similar to how Large Language Models (LLMs) process text tokens. However, the number of vision tokens increases quadratically with image resolution, leading to huge computational costs. In this paper, we consider improving MLLM efficiency in two scenarios: (I) reducing computational cost without degrading performance, and (II) improving performance under a given budget. We start with our main finding that the ranking of vision tokens sorted by attention scores is similar in each layer except the first. Based on this, we assume that the number of essential top vision tokens does not increase along layers. Accordingly, for Scenario I, we propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer, from shallow to deep. Interestingly, G-Search is able to reach the optimal reduction strategy under our assumption. For Scenario II, based on the reduction strategy from G-Search, we design a parametric sigmoid function (P-Sigmoid) to guide the reduction at each layer of the MLLM, whose parameters are optimized by Bayesian Optimization. Extensive experiments demonstrate that our approach can significantly accelerate popular MLLMs, e.g., LLaVA and InternVL2 models, by more than $2 \times$ without performance drops. Our approach also far outperforms other token reduction methods when budgets are limited, achieving a better trade-off between efficiency and effectiveness.
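To make the quadratic growth concrete, here is a minimal sketch that counts patch tokens for a square image. The patch size of 14 and the image sides are assumed values chosen for illustration, and `num_vision_tokens` is a hypothetical helper; neither is taken from the paper.

```python
def num_vision_tokens(image_side: int, patch_size: int = 14) -> int:
    """Count tokens for a square image cut into non-overlapping square patches.

    Assumption: one vision token per patch, no extra tokens (e.g. [CLS]).
    """
    if image_side % patch_size != 0:
        raise ValueError("image_side must be a multiple of patch_size")
    patches_per_side = image_side // patch_size
    return patches_per_side * patches_per_side


# Doubling the side length quadruples the number of vision tokens.
for side in (336, 672, 1344):
    print(side, num_vision_tokens(side))
# 336 576
# 672 2304
# 1344 9216
```

Under these assumptions, each doubling of the resolution multiplies the token count by four, which is the cost growth that the token reduction strategies target.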

What problem does this paper attempt to address?

This paper addresses the high computational cost that multimodal large language models (MLLMs) incur when processing high-resolution images. It considers two scenarios:

1. **Reducing computational cost without degrading performance**: Optimize the vision token reduction strategy so that far fewer computational resources are consumed while model performance is maintained.
2. **Improving performance within a given budget**: When computational resources are limited, optimize the vision token reduction strategy to obtain the best performance the budget allows.

### Main problems and solutions

#### Problem description

MLLMs usually encode input images into vision tokens and feed them into a pre-trained language model. As the image resolution increases, the number of vision tokens grows quadratically, resulting in huge computational costs that limit the efficiency of MLLMs in practical applications.

#### Solutions

The authors propose two methods:

1. **G-Search (Greedy Search Algorithm)**:
   - **Assumption**: The ranking of vision tokens by attention score is similar in every layer except the first, so the number of essential top-ranked vision tokens does not increase with depth.
   - **Method**: Rank the vision tokens by attention score, then search greedily from the shallow layers to the deep layers for the least number of vision tokens to keep at each layer. Under the assumption above, this greedy search reaches the optimal reduction strategy.
   - **Result**: G-Search accelerates MLLMs such as the LLaVA and InternVL2 models by more than 2 times without performance drops.
2. **P-Sigmoid (Parametric Sigmoid Function)**:
   - **Assumption**: The retention rates of vision tokens across layers can be fitted by an S-shaped curve.
   - **Method**: Based on the reduction strategy found by G-Search, design a parametric sigmoid function that guides the vision token reduction at each layer, and optimize its parameters with Bayesian Optimization to maximize performance within a given budget.
   - **Result**: P-Sigmoid far outperforms other token reduction methods under a limited budget, achieving a better trade-off between efficiency and effectiveness.

### Experimental verification

The authors conducted extensive experiments on multiple popular benchmark datasets to verify the effectiveness of the two methods. The results show that G-Search and P-Sigmoid can not only significantly reduce computational costs but also improve model performance in some cases.

### Summary

This paper proposes a method for automatically searching for the optimal vision token reduction strategy in order to improve the efficiency of MLLMs. With G-Search and P-Sigmoid, the authors reduce the computational cost of MLLMs on high-resolution images and show that the methods apply across different models and tasks.
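The two methods described above can be illustrated with a small sketch. It is not the authors' implementation: `greedy_keep_schedule`, `sigmoid_keep_ratios`, `toy_score`, and `REQUIRED` are hypothetical names, the inner binary search is one possible way to find the smallest passing count, and the toy scoring function stands in for a real benchmark evaluation. The sketch also assumes that keeping all tokens meets the target score and that the score never improves when fewer tokens are kept.

```python
import math
from typing import Callable, List, Sequence


def greedy_keep_schedule(
    num_layers: int,
    num_tokens: int,
    evaluate: Callable[[Sequence[int]], float],
    min_score: float,
) -> List[int]:
    """Greedily pick, from shallow to deep, the fewest tokens to keep per layer."""
    keep = [num_tokens] * num_layers
    for layer in range(num_layers):
        # Non-increasing assumption: a layer keeps at most what the previous one kept.
        upper = keep[layer - 1] if layer > 0 else num_tokens
        lo, hi = 0, upper
        # Binary search for the smallest count that still reaches min_score.
        # Deeper layers are provisionally set to the same count as this layer.
        while lo < hi:
            mid = (lo + hi) // 2
            trial = keep[:layer] + [mid] * (num_layers - layer)
            if evaluate(trial) >= min_score:
                hi = mid
            else:
                lo = mid + 1
        keep[layer:] = [lo] * (num_layers - layer)
    return keep


def sigmoid_keep_ratios(num_layers: int, steepness: float, midpoint: float) -> List[float]:
    """Per-layer keep ratio from a decreasing sigmoid (hypothetical parametrization)."""
    return [
        1.0 / (1.0 + math.exp(steepness * (layer - midpoint)))
        for layer in range(num_layers)
    ]


# Toy stand-in for a benchmark: each layer "needs" a minimum number of tokens.
REQUIRED = [400, 250, 250, 80, 10, 0]


def toy_score(keep: Sequence[int]) -> float:
    return 1.0 if all(k >= r for k, r in zip(keep, REQUIRED)) else 0.0


schedule = greedy_keep_schedule(len(REQUIRED), 576, toy_score, min_score=1.0)
print(schedule)  # [400, 250, 250, 80, 10, 0]

ratios = sigmoid_keep_ratios(num_layers=6, steepness=1.0, midpoint=3.0)
print([round(r, 3) for r in ratios])
```

In this sketch the greedy pass recovers the per-layer minimum exactly because the toy requirements are non-increasing, which mirrors the paper's assumption. In the budget-constrained scenario, the `steepness` and `midpoint` values would be the quantities tuned by an optimizer such as Bayesian Optimization; that tuning step is omitted here.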

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
