Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs

Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang
2024-09-28
Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has led to remarkable performances across various domains. However, this progress is accompanied by a substantial surge in the resource consumption of these models. We address this pressing issue by introducing a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance. Inspired by human attention patterns in Visual Question Answering (VQA) tasks, TRIM presents a fresh perspective on the selection and reduction of image tokens. The TRIM method has been extensively tested across 12 datasets, and the results demonstrate a significant reduction in computational overhead while maintaining a consistent level of performance. This research marks a critical stride in efficient MLLM development, promoting greater accessibility and sustainability of high-performing models.
Computation and Language, Artificial Intelligence, Multimedia
What problem does this paper attempt to address?
This paper addresses the sharp growth in resource consumption that has accompanied the performance gains of multimodal large language models (MLLMs). Although MLLMs now show excellent, in some cases superhuman, performance in multiple fields, their compute and memory requirements have grown significantly as well. This limits how widely the models can be applied and poses a challenge to environmental sustainability.

To address this, the authors propose Token Reduction using CLIP Metric (TRIM). TRIM aims to improve the efficiency of MLLMs by reducing the number of image tokens without sacrificing model performance. Inspired by human attention patterns in visual question answering (VQA) tasks, the method uses the CLIP model to compute the similarity between the text and each image patch, and uses those scores to select and reduce image tokens. Experiments on 12 datasets show that TRIM significantly reduces computational overhead while maintaining a consistent level of performance.

The core steps of the TRIM method are:

1. **Evaluate the Importance of Tokens**:
   - Use the CLIP model to calculate the cosine similarity between each image token \( v_i \) and the pooled text representation \( u_{\text{pooled}} \):
     \[ S(v_i, u_{\text{pooled}}) = \frac{v_i \cdot u_{\text{pooled}}}{\|v_i\| \|u_{\text{pooled}}\|} \]
   - Apply the softmax function over all image tokens to convert the similarities into a probability distribution:
     \[ S_{\text{softmax}}(v_i, u_{\text{pooled}}) = \frac{e^{S(v_i, u_{\text{pooled}})}}{\sum_j e^{S(v_j, u_{\text{pooled}})}} \]
2. **Select Important Tokens**:
   - Use the interquartile range (IQR) method to determine the number of image tokens to be retained. From the lower and upper quartiles (Q1 and Q3) of the similarity scores, with \( IQR = Q3 - Q1 \), set a strict similarity threshold of \( Q3 + 1.5\times IQR \) and retain only the tokens whose similarity scores exceed this threshold.
3. **Aggregate Unselected Tokens**:
   - Calculate the average representation of the unselected tokens and append it as a single aggregated token to the selected token sequence, so that the remaining image information is preserved.

Through these steps, TRIM reduces the number of image tokens by about 79%, which lowers computation time and memory usage by 67% and 30% respectively, while maintaining performance comparable to the baseline model.
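
The three steps can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names (`cosine`, `softmax`, `trim_tokens`) are hypothetical, the embeddings are small hand-made lists rather than real CLIP outputs, and the quartile interpolation rule (`method="inclusive"`) is an assumption, since the summary does not say how Q1 and Q3 are computed.

```python
import math
import statistics


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def trim_tokens(image_tokens, text_pooled):
    """Hypothetical sketch of the three steps described above.

    Returns (kept_indices, reduced_tokens). Needs at least two image
    tokens, because quartiles are undefined for a single value.
    """
    # Step 1: cosine similarity to the pooled text vector, then softmax.
    sims = [cosine(v, text_pooled) for v in image_tokens]
    scores = softmax(sims)

    # Step 2: keep tokens scoring above Q3 + 1.5 * IQR.
    # Assumption: "inclusive" quartile interpolation.
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    threshold = q3 + 1.5 * (q3 - q1)
    kept = [i for i, s in enumerate(scores) if s > threshold]

    # Step 3: average the unselected tokens into one aggregated token
    # and append it to the selected sequence.
    kept_set = set(kept)
    dropped = [i for i in range(len(image_tokens)) if i not in kept_set]
    reduced = [image_tokens[i] for i in kept]
    if dropped:
        dim = len(image_tokens[0])
        reduced.append(
            [sum(image_tokens[i][d] for i in dropped) / len(dropped)
             for d in range(dim)]
        )
    return kept, reduced


# Toy input: seven tokens nearly orthogonal to the text, one aligned with it.
tokens = [[0.01 * k, 1.0] for k in range(7)] + [[1.0, 0.0]]
text = [1.0, 0.0]
kept, reduced = trim_tokens(tokens, text)
```

In this toy input only the aligned token (index 7) clears the threshold, so `reduced` holds two vectors: that token and the mean of the seven dropped ones. Note that if no score exceeds the threshold, this sketch returns only the aggregated token; how the original method handles that case is not stated in the summary.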