Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, Chenliang Xu
2024-11-16
Abstract: By treating visual tokens from visual encoders as text tokens, Multimodal Large Language Models (MLLMs) have achieved remarkable progress across diverse visual understanding tasks, leveraging the robust architectures of Large Language Models (LLMs). However, as token counts grow, the quadratic scaling of computation in LLMs introduces a significant efficiency bottleneck, impeding further scalability. Although recent approaches have explored pruning visual tokens or employing lighter LLM architectures, the computational overhead from an increasing number of visual tokens remains a substantial challenge. In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA, a representative MLLM, and introduce a suite of streamlined strategies to enhance efficiency. These include neighbor-aware visual token attention, pruning of inactive visual attention heads, and selective layer dropping for visual computations. By implementing these strategies in LLaVA, we achieve a reduction in computational demands of 88% while maintaining model performance across key benchmarks. Additionally, we validate the existence of visual computational redundancy in other MLLMs, such as Qwen2-VL-7B and InternVL-2.0-4B/8B/26B. These results present a novel pathway for MLLMs to handle dense visual tokens with minimal computational costs. Code and model checkpoints will be released to support further research.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the computational efficiency bottleneck that multimodal large language models (MLLMs) face when processing visual tokens. Because computation in the underlying LLM scales quadratically with the token count, adding more visual tokens quickly limits how far MLLMs can scale. Existing methods prune visual tokens or adopt a lighter LLM architecture, but the overhead from a growing number of visual tokens remains substantial. The authors study redundancy in visual computation in LLaVA, a representative MLLM, at both the parameter level and the computational-pattern level, and propose three strategies: neighbor-aware visual token attention, pruning of inactive visual attention heads, and selective layer dropping for visual computations. Applied to LLaVA, these strategies reduce computational demands by 88% while maintaining performance across key benchmarks. The authors also validate that similar visual computational redundancy exists in other MLLMs, such as Qwen2-VL-7B and InternVL-2.0-4B/8B/26B, which suggests that the strategies may apply more broadly.
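To make the idea of neighbor-aware visual token attention more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that visual tokens lie on a 2D patch grid in row-major order and that each token attends only to tokens within a fixed window around it. The function names `neighbor_mask` and `masked_attention`, the `radius` parameter, and the square-window definition of "neighbor" are all hypothetical choices made for illustration; the paper's exact formulation is not given in this summary.

```python
import numpy as np


def neighbor_mask(grid_h: int, grid_w: int, radius: int) -> np.ndarray:
    """Return a boolean (N, N) mask, N = grid_h * grid_w.

    mask[i, j] is True when token j lies within `radius` grid cells of
    token i in both directions (a square window; assumed definition).
    """
    rows, cols = np.divmod(np.arange(grid_h * grid_w), grid_w)
    d_row = np.abs(rows[:, None] - rows[None, :])
    d_col = np.abs(cols[:, None] - cols[None, :])
    return np.maximum(d_row, d_col) <= radius


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by `mask`."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # block non-neighbors
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    # Hypothetical 4x4 grid of visual tokens with 8-dimensional features.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((16, 8))
    mask = neighbor_mask(4, 4, radius=1)
    out = masked_attention(x, x, x, mask)
    print(out.shape, int(mask.sum()), "of", mask.size, "pairs kept")
```

The dense mask here only illustrates which token pairs are kept. An actual reduction in computation would require a kernel that skips the masked pairs rather than computing and discarding them.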