FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

Bo Tong, Bokai Lai, Yiyi Zhou, Gen Luo, Yunhang Shen, Ke Li, Xiaoshuai Sun, Rongrong Ji
2024-12-06
Abstract: Despite a big leap forward in capability, multimodal large language models (MLLMs) tend to behave like a sloth in practical use, i.e., slow response and large latency. Recent efforts are devoted to building tiny MLLMs for better efficiency, but the plethora of visual tokens they still use limits their actual speedup. In this paper, we propose a powerful and fast tiny MLLM called FlashSloth. Different from previous efforts, FlashSloth focuses on improving the descriptive power of visual tokens in the process of compressing their redundant semantics. In particular, FlashSloth introduces embedded visual compression designs to capture both visually salient and instruction-related image information, so as to achieve superior multimodal performance with fewer visual tokens. Extensive experiments are conducted to validate the proposed FlashSloth, and a range of tiny but strong MLLMs are also comprehensively compared, e.g., InternVL2, MiniCPM-V2 and Qwen2-VL. The experimental results show that, compared with these advanced tiny MLLMs, FlashSloth can greatly reduce the number of visual tokens, training memory and computation complexity while retaining high performance on various VL tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the slow response and high computational cost of multimodal large language models (MLLMs) in practical applications. Although MLLMs have improved significantly in capability, they behave like "sloths" in actual use, with slow responses and high latency. To address this, the authors propose FlashSloth, a fast and powerful small-scale MLLM.

### Main Problems

1. **Slow Response Speed**: Existing MLLMs respond slowly at inference time because they process a large number of visual tokens.
2. **High Consumption of Computational Resources**: A large number of visual tokens increases both GPU memory usage and computational complexity, which significantly raises the cost of training and inference.

### Solutions

FlashSloth addresses these problems as follows:

- **Embedded Visual Compression**: The Spatial Attention Pooling (SAP) module captures visually salient semantics and compresses redundant visual tokens, while the Embedded Query (EmbQ) module extracts instruction-related image information.
- **Reduce the Number of Visual Tokens**: With these compression techniques, FlashSloth significantly reduces the number of input visual tokens, which increases inference speed and lowers computational complexity.
- **Efficient and Lightweight Design**: The overall architecture is compact and requires no additional language modeling or special alignment pre-training, which further improves efficiency.

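To make the idea of attention-based token compression concrete, here is a minimal NumPy sketch of pooling each local window of visual tokens into a single token using softmax saliency weights. It is an illustration under stated assumptions, not the authors' SAP module: the function name `spatial_attention_pool`, the scoring vector `score_vec`, the 2x2 window, and the 24x24 grid are all hypothetical choices made for this example.

```python
import numpy as np

def spatial_attention_pool(tokens, grid_size, window, score_vec):
    """Pool each window x window region of visual tokens into one token.

    tokens:    (grid_size * grid_size, dim) array, row-major over the image grid
    score_vec: (dim,) hypothetical learned vector that scores token saliency
    """
    n, dim = tokens.shape
    assert n == grid_size * grid_size and grid_size % window == 0
    out = grid_size // window
    # Group tokens by local region: (out, out, window * window, dim).
    x = tokens.reshape(out, window, out, window, dim)
    x = x.transpose(0, 2, 1, 3, 4).reshape(out, out, window * window, dim)
    # One saliency score per token, normalized by softmax within each region.
    scores = x @ score_vec
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Attention-weighted sum of the tokens in each region.
    pooled = (weights[..., None] * x).sum(axis=2)
    return pooled.reshape(out * out, dim)

# Illustrative sizes only: a 24x24 grid of 64-dimensional tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(24 * 24, 64))
score_vec = rng.normal(size=64)  # stands in for a trained parameter
pooled = spatial_attention_pool(tokens, grid_size=24, window=2, score_vec=score_vec)
print(tokens.shape, "->", pooled.shape)
```

With a 2x2 window the token count drops by a factor of 4, so the quadratic self-attention term over those visual tokens alone would shrink by roughly 16x. The end-to-end speedup depends on the text length and the rest of the model, so this arithmetic should not be read as the paper's measured result.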
### Experimental Results

Compared with other advanced small-scale MLLMs (such as InternVL2, MiniCPM-V2, and Qwen2-VL), FlashSloth significantly reduces the number of visual tokens, training memory, and inference computational complexity while maintaining high performance, and responds roughly 2 to 5 times faster in practice.

### Summary

The main contributions of FlashSloth are as follows:

1. It proposes a fast and powerful small-scale MLLM that balances performance and efficiency well.
2. It introduces an embedded visual compression design that efficiently captures visually salient and instruction-related semantics.
3. It verifies the strong multimodal capabilities and higher efficiency of FlashSloth through extensive experiments.

These improvements make FlashSloth more practical and competitive in real-world applications.