Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has led to remarkable performances across various domains. However, this progress is accompanied by a substantial surge in the resource consumption of these models. We address this pressing issue by introducing a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance. Inspired by human attention patterns in Visual Question Answering (VQA) tasks, TRIM presents a fresh perspective on the selection and reduction of image tokens. The TRIM method has been extensively tested across 12 datasets, and the results demonstrate a significant reduction in computational overhead while maintaining a consistent level of performance. This research marks a critical stride in efficient MLLM development, promoting greater accessibility and sustainability of high-performing models.

What problem does this paper attempt to address?

This paper addresses the sharp rise in resource consumption that has accompanied the performance gains of multimodal large language models (MLLMs). Although MLLMs have shown excellent, even superhuman, performance in multiple fields, their computational and memory costs have grown significantly. This limits how widely the models can be applied and poses a challenge to environmental sustainability.

To solve this problem, the paper proposes a new method, Token Reduction using CLIP Metric (TRIM). TRIM aims to improve the efficiency of MLLMs by reducing the number of image tokens without sacrificing model performance. Inspired by human attention patterns in visual question-answering tasks, the method uses the CLIP model to calculate the similarity between the text and each image patch, and uses that similarity to select and reduce image tokens. Experimental results on 12 datasets show that TRIM significantly reduces computational overhead while maintaining a consistent level of performance.

The core steps of the TRIM method are:

1. **Evaluate the Importance of Tokens**:
   - Use the CLIP model to calculate the cosine similarity between each image token \( v_i \) and the pooled text representation \( u_{\text{pooled}} \):
     \[ S(v_i, u_{\text{pooled}}) = \frac{v_i \cdot u_{\text{pooled}}}{\|v_i\| \|u_{\text{pooled}}\|} \]
   - Apply the softmax function to convert the similarities into a probability distribution over the image tokens:
     \[ S_{\text{softmax}}(v_i, u_{\text{pooled}}) = \frac{e^{S(v_i, u_{\text{pooled}})}}{\sum_j e^{S(v_j, u_{\text{pooled}})}} \]
2. **Select Important Tokens**:
   - Use the interquartile range (IQR) method to determine the number of image tokens to retain. From the lower and upper quartiles (\( Q1 \) and \( Q3 \)) of the similarity scores, compute \( IQR = Q3 - Q1 \), set a strict similarity threshold of \( Q3 + 1.5 \times IQR \), and retain only the tokens whose similarity scores exceed this threshold.
3. **Aggregate Unselected Tokens**:
   - Calculate the average representation of the unselected tokens and append it to the selected token sequence as a single aggregated token, so that the remaining image information is preserved.

Through these steps, TRIM reduces the number of image tokens by about 79%, which cuts computation time and memory usage by 67% and 30% respectively, while maintaining performance comparable to the baseline model.
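The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not the paper's implementation. The names `cosine`, `percentile`, and `trim_tokens` are hypothetical. The inputs are ordinary lists of floats standing in for CLIP image-patch and pooled text features. The linear-interpolation quartile convention is an assumption, since the summary does not say how Q1 and Q3 are computed.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def percentile(sorted_vals, q):
    """Quantile q in [0, 1] with linear interpolation (assumed convention)."""
    pos = q * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def trim_tokens(image_tokens, text_pooled):
    """Return (reduced_tokens, keep_mask) following the three steps above."""
    # Step 1: cosine similarity of each image token to the pooled text
    # vector, then softmax (shifted by the max for numerical stability).
    sims = [cosine(v, text_pooled) for v in image_tokens]
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    scores = [e / total for e in exps]

    # Step 2: keep only tokens whose score exceeds Q3 + 1.5 * IQR.
    ordered = sorted(scores)
    q1, q3 = percentile(ordered, 0.25), percentile(ordered, 0.75)
    threshold = q3 + 1.5 * (q3 - q1)
    keep = [s > threshold for s in scores]

    # Step 3: average the unselected tokens into one aggregated token
    # and append it to the selected sequence.
    reduced = [v for v, k in zip(image_tokens, keep) if k]
    dropped = [v for v, k in zip(image_tokens, keep) if not k]
    if dropped:
        reduced.append([sum(col) / len(dropped) for col in zip(*dropped)])
    return reduced, keep
```

Calling `trim_tokens(tokens, text)` on a list of patch vectors returns the shortened sequence plus a boolean mask showing which original tokens were kept. In a real pipeline the vectors would come from CLIP's image and text encoders, and the operations would be batched tensor operations and not Python loops.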

Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration

[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

Efficient Multi-modal Large Language Models via Visual Token Grouping

freePruner: A Training-free Approach for Large Multimodal Model Acceleration

Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction

LIME: Less Is More for MLLM Evaluation

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models

Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference

[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster

Efficient Large Multi-modal Models via Visual Context Compression

TokenPacker: Efficient Visual Projector for Multimodal LLM

Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving