Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model

Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang

2024-11-16

Abstract: The vision tokens in multimodal large language models usually exhibit significant spatial and temporal redundancy and take up most of the input tokens, which harms their inference efficiency. To solve this problem, some recent works were introduced to drop the unimportant tokens during inference where the importance of each token is decided only by the information in either the vision encoding stage or the prefilling stage. In this paper, we propose Multi-stage Token Dropping (MustDrop) to measure the importance of each token from the whole lifecycle, including the vision encoding stage, prefilling stage, and decoding stage. Concretely, in the visual encoding stage, MustDrop merges spatially adjacent tokens with high similarity, and establishes a key token set to retain the most vision-critical tokens, preventing them from being discarded in later stages. In the prefilling stage, MustDrop further compresses vision tokens by the guidance of text semantics, with a dual-attention filtering strategy. In the decoding stage, an output-aware cache policy is proposed to further reduce the size of the KV cache. By leveraging tailored strategies in the multi-stage process, MustDrop can more precisely recognize the important and redundant tokens, thus achieving an optimal balance between performance and efficiency. For instance, MustDrop reduces about 88.5% FLOPs on LLaVA with a compression ratio of 92.2% while maintaining comparable accuracy. Our codes are available at https://github.com/liuting20/MustDrop.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the significant spatial and temporal redundancy of visual tokens in multimodal large language models (MLLMs). Visual tokens make up most of the input tokens, so this redundancy directly harms inference efficiency. Tokens that are close to each other in space or time usually carry similar information, which means much of the input is repeated. The effect is most pronounced for high-resolution images and long videos, where it increases both memory consumption and computational cost. To address this, the paper proposes Multi-stage Token Dropping (MustDrop), which evaluates the importance of each token over its entire life cycle (the visual encoding stage, the prefilling stage, and the decoding stage). Using information from all three stages lets the method identify important and redundant tokens more accurately and aim for an optimal balance between performance and efficiency. For example, MustDrop reduces floating-point operations (FLOPs) by approximately 88.5% on the LLaVA model while maintaining comparable accuracy.
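The idea of merging spatially adjacent tokens that carry similar information can be illustrated with a minimal sketch. This is not the MustDrop implementation: the function names, the non-overlapping 2x2 window, the cosine-similarity threshold of 0.9, and the choice to average merged tokens are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def merge_adjacent_tokens(grid, window=2, threshold=0.9):
    """Hypothetical helper: merge similar tokens inside each window of a token grid.

    grid is an H x W nested list of feature vectors. Each non-overlapping
    window is collapsed to its mean if every token in it is similar enough
    to the window's first token; otherwise all of its tokens are kept.
    Returns a flat list of the surviving tokens.
    """
    height, width = len(grid), len(grid[0])
    merged = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            block = [
                grid[row][col]
                for row in range(top, min(top + window, height))
                for col in range(left, min(left + window, width))
            ]
            anchor = block[0]
            if all(cosine(anchor, token) >= threshold for token in block[1:]):
                merged.append(mean_vector(block))  # redundant window: one token
            else:
                merged.extend(block)  # dissimilar window: keep every token
    return merged


# Toy 2 x 4 grid: the left 2x2 window is nearly uniform, the right one is not.
example_grid = [
    [[1.0, 0.0], [1.0, 0.1], [1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.0], [1.0, 0.05], [-1.0, 0.0], [0.5, 0.5]],
]
kept_tokens = merge_adjacent_tokens(example_grid)
print(len(kept_tokens), "tokens kept out of 8")
```

In this toy grid the uniform window collapses to a single averaged token while the dissimilar window is left untouched, which is the kind of redundancy the summary above describes. A real system would operate on encoder feature tensors and, per the abstract, would also protect a key token set from later dropping.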

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

Efficient Multi-modal Large Language Models via Visual Token Grouping

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs

Multi-Scale And Token Mergence: Make Your ViT More Efficient

MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

FoPru: Focal Pruning for Efficient Large Vision-Language Models

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

Multimodal Token Fusion for Vision Transformers

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs

LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation