Abstract: Despite a big leap forward in capability, multimodal large language models (MLLMs) tend to behave like a sloth in practical use, i.e., slow response and large latency. Recent efforts are devoted to building tiny MLLMs for better efficiency, but the plethora of visual tokens they still use limits their actual speedup. In this paper, we propose a powerful and fast tiny MLLM called FlashSloth. Different from previous efforts, FlashSloth focuses on improving the descriptive power of visual tokens in the process of compressing their redundant semantics. In particular, FlashSloth introduces embedded visual compression designs to capture both visually salient and instruction-related image information, so as to achieve superior multimodal performance with fewer visual tokens. Extensive experiments are conducted to validate the proposed FlashSloth, and a number of tiny but strong MLLMs are also comprehensively compared, e.g., InternVL2, MiniCPM-V2 and Qwen2-VL. The experimental results show that, compared with these advanced tiny MLLMs, FlashSloth can greatly reduce the number of visual tokens, training memory and computation complexity while retaining high performance on various VL tasks.

What problem does this paper attempt to address?

This paper addresses the slow response and high computational cost of multimodal large language models (MLLMs) in practical applications. Although MLLMs have improved significantly in capability, they behave like "sloths" in actual use, with slow responses and high latency. To address this, the authors propose FlashSloth, a fast and powerful small-scale MLLM.

### Main Problems

1. **Slow Response Speed**: Existing MLLMs respond slowly at inference time because they process a large number of visual tokens.
2. **High Consumption of Computational Resources**: A large number of visual tokens increases both GPU memory usage and computational complexity, which significantly raises the cost of training and inference.

### Solutions

FlashSloth addresses these problems as follows:

- **Embedded Visual Compression**: The Spatial Attention Pooling (SAP) module captures visually salient semantics and compresses redundant visual tokens, while the Embedded Query (EmbQ) module extracts instruction-related image information.
- **Reduce the Number of Visual Tokens**: With these compression techniques, FlashSloth significantly reduces the number of input visual tokens, which increases inference speed and lowers computational complexity.
- **Efficient and Lightweight Design**: The overall architecture of FlashSloth is compact and requires no additional language modeling or special alignment pre-training, which further improves efficiency.
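To make the idea of compressing visual tokens by attention pooling concrete, here is a minimal sketch in plain Python. It is not the authors' SAP module: the function `attention_pool`, the `scorer` vector (a stand-in for learned saliency parameters), and the non-overlapping window layout are all assumptions made for illustration. Each window of tokens is replaced by a softmax-weighted average, so tokens that score higher contribute more to the single token that survives.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(grid: List[List[Vector]], scorer: Vector, window: int) -> List[Vector]:
    """Compress an H x W grid of token vectors into (H/window) * (W/window) tokens.

    Each non-overlapping window is replaced by a softmax-weighted average of its
    tokens. A token's score is its dot product with `scorer`, a hypothetical
    stand-in for learned saliency parameters.
    """
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid size must be divisible by window")
    dim = len(scorer)
    pooled = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            # Gather the tokens that fall inside this window.
            tokens = [grid[r][c]
                      for r in range(top, top + window)
                      for c in range(left, left + window)]
            # Score each token, then normalize the scores into weights.
            weights = softmax([sum(t[i] * scorer[i] for i in range(dim)) for t in tokens])
            # The window collapses into one weighted-average token.
            pooled.append([sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)])
    return pooled


# Toy input: a 4 x 4 grid of 2-dimensional tokens (16 tokens in total).
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
pooled = attention_pool(grid, scorer=[0.0, 0.0], window=2)
print(len(pooled))  # 4 tokens remain after 2 x 2 pooling
print(pooled[0])    # a zero scorer gives uniform weights: [0.5, 0.5]
```

In this toy setting a 2 x 2 window cuts the token count by a factor of four. With a non-zero `scorer`, the pooled token shifts toward the higher-scoring tokens in each window, which is the general intuition behind keeping salient content while discarding redundancy.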
### Experimental Results

The experimental results show that, compared with other advanced small-scale MLLMs (such as InternVL2, MiniCPM-V2, and Qwen2-VL), FlashSloth significantly reduces the number of visual tokens, training memory, and inference computational complexity while maintaining high performance, and shortens actual response time by a factor of about 2 to 5.

### Summary

The main contributions of FlashSloth are as follows:

1. It proposes a fast and powerful small-scale MLLM that demonstrates a good balance between performance and efficiency.
2. It introduces an embedded visual compression design that efficiently captures visually salient and instruction-related semantics.
3. It verifies the strong multimodal capabilities and higher efficiency of FlashSloth through extensive experiments.

These improvements make FlashSloth more practical and competitive in real-world applications.

FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

Efficient Large Multi-modal Models via Visual Context Compression

[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid

Efficient Multi-modal Large Language Models via Visual Token Grouping

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Inference Optimal VLMs Need Only One Visual Token but Larger Models

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs

InfMLLM: A Unified Framework for Visual-Language Tasks

TokenPacker: Efficient Visual Projector for Multimodal LLM

FlashDecoding++: Faster Large Language Model Inference on GPUs