Abstract: Multimodal large language models (MLLMs) demonstrate strong performance across visual tasks, but their efficiency is hindered by significant computational and memory demands from processing long contexts in multimodal inputs. To address this, we introduce PAR (Prompt-Aware Token Reduction), a novel and plug-and-play approach that reduces visual tokens efficiently without compromising model performance. Unlike previous methods that rely heavily on attention mechanisms and overlook cross-modal interactions, we use a prompt-aware strategy to adaptively identify and cluster essential visual tokens. PAR categorizes visual context redundancy into two types: external and internal. External redundancy is minimized through semantic retrieval, while internal redundancy is addressed using a token routing mechanism. This method substantially reduces computational load without requiring additional training or complex architectural modifications. Experimental results demonstrate that across various visual question answering tasks, PAR reduces FLOPs by 83% with a compression ratio of 89%, while retaining 97% of baseline accuracy. The adaptive design of PAR achieves a 2x token reduction ratio compared to prior approaches, enabling a better balance between performance and efficiency.


### What problem does this paper attempt to solve?

This paper addresses the excessive computational and memory requirements that multimodal large language models (MLLMs) incur when handling long-context multimodal inputs. MLLMs perform strongly on visual tasks, but they must process a large number of visual tokens, which consumes substantial computational resources. The key challenge is therefore to reduce the number of visual tokens without significantly degrading generation performance.

#### Main problems

1. **Excessive computational and memory requirements**: When MLLMs handle long-context multimodal inputs, especially high-dimensional data such as images, the computational and memory overhead is very large.
2. **Redundancy problem**: Existing token reduction methods usually rely on the attention mechanism. They fail to fully account for cross-modal interaction and tend to retain too much redundant information, which hurts task accuracy.

#### Solutions

To solve these problems, the authors propose PAR (Prompt-Aware Token Reduction), a novel plug-and-play method that reduces visual tokens in the following ways:

- **External redundancy elimination**: Minimize task-irrelevant visual tokens through semantic retrieval.
- **Internal redundancy elimination**: Use the token routing mechanism to simplify the retained tokens and remove duplicate or similar tokens.
- **Prompt-aware strategy**: Use prompts to adaptively identify and cluster important visual tokens to ensure task relevance.

#### Experimental results

Across multiple visual question answering tasks, PAR substantially reduces FLOPs (floating-point operations) while retaining 97% of baseline accuracy. Specifically:

- FLOPs are reduced by 83%.
- The compression ratio reaches 89%.
- The token reduction ratio is 2x that of prior approaches.

These improvements enable PAR to achieve a better balance between efficiency and performance, especially in handling the hallucination phenomenon.

### Summary

The core objective of the paper is to improve the computational and memory efficiency of multimodal large models on visual tasks by proposing the PAR method, while maintaining a high level of task performance.
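The two-stage idea described under Solutions (first discard visual tokens unrelated to the prompt, then thin out near-duplicates among the survivors) can be illustrated with a small sketch. This is not the authors' implementation. The cosine-similarity scoring, the top-k cutoff `keep`, the greedy duplicate filter standing in for the token routing mechanism, the `threshold` value, and all function names are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_relevant(visual_tokens, prompt_embedding, keep):
    """Stage 1 (external redundancy, assumed form): keep the `keep` tokens
    most similar to the prompt embedding, returned in original order."""
    ranked = sorted(
        range(len(visual_tokens)),
        key=lambda i: cosine(visual_tokens[i], prompt_embedding),
        reverse=True,
    )
    return sorted(ranked[:keep])


def drop_near_duplicates(visual_tokens, indices, threshold):
    """Stage 2 (internal redundancy, assumed form): greedily skip any token
    whose similarity to an already kept token reaches `threshold`."""
    kept = []
    for i in indices:
        if all(cosine(visual_tokens[i], visual_tokens[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def reduce_tokens(visual_tokens, prompt_embedding, keep, threshold=0.95):
    """Run both stages and return the indices of the surviving tokens."""
    relevant = retrieve_relevant(visual_tokens, prompt_embedding, keep)
    return drop_near_duplicates(visual_tokens, relevant, threshold)


# Toy data: tokens 0 and 1 are near-duplicates, token 2 is unrelated to the prompt.
tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.7, 0.7]]
prompt = [1.0, 0.2]
result = reduce_tokens(tokens, prompt, keep=3)
print(result)  # prints [0, 3]
```

In this toy run, stage 1 drops token 2 because it is the least similar to the prompt, and stage 2 drops token 1 because it nearly duplicates token 0. A real model would operate on high-dimensional embeddings in batched tensor form, and the paper's actual retrieval and routing steps may differ substantially from this stand-in.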

PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models
