BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

Xudong Lu, Yinghao Chen, Cheng Chen, Hui Tan, Boheng Chen, Yina Xie, Rui Hu, Guanxin Tan, Renshou Wu, Yan Hu, Yi Zeng, Lei Wu, Liuyang Bian, Zhaoxiong Wang, Long Liu, Yanzhou Yang, Han Xiao, Aojun Zhou, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
2024-11-16
Abstract: The emergence and growing popularity of multimodal large language models (MLLMs) have significant potential to enhance various aspects of daily life, from improving communication to facilitating learning and problem-solving. Mobile phones, as essential daily companions, represent the most effective and accessible deployment platform for MLLMs, enabling seamless integration into everyday tasks. However, deploying MLLMs on mobile phones presents challenges due to limitations in memory size and computational capability, making it difficult to achieve smooth and real-time processing without extensive optimization. In this paper, we present BlueLM-V-3B, an algorithm and system co-design approach specifically tailored for the efficient deployment of MLLMs on mobile platforms. To be specific, we redesign the dynamic resolution scheme adopted by mainstream MLLMs and implement system optimization for hardware-aware deployment to optimize model inference on mobile phones. BlueLM-V-3B boasts the following key highlights: (1) Small Size: BlueLM-V-3B features a language model with 2.7B parameters and a vision encoder with 400M parameters. (2) Fast Speed: BlueLM-V-3B achieves a generation speed of 24.4 token/s on the MediaTek Dimensity 9300 processor with 4-bit LLM weight quantization. (3) Strong Performance: BlueLM-V-3B has attained the highest average score of 66.1 on the OpenCompass benchmark among models with $\leq$ 4B parameters and surpassed a series of models with much larger parameter sizes (e.g., MiniCPM-V-2.6, InternVL2-8B).
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The main problem that this paper attempts to solve is the efficient deployment of multimodal large language models (MLLMs) on mobile devices. Specifically, the paper focuses on the following challenges:

1. **Memory Limitations**: The memory capacity of mobile devices is limited, which restricts the deployment of large-parameter models. For example, a 4-bit quantized LLaMA 7B model requires approximately 4.5 GB of memory, which may affect the smooth operation of the system.
2. **Computing Power Limitations**: The computing power of mobile processors is limited, which restricts the inference speed. For example, on a MediaTek Dimensity 9300 processor, a 4-bit quantized LLaMA 7B model can only generate about 10-15 tokens per second, which restricts its applicability in real-time applications.
3. **Problems with Dynamic Image Resolution Strategies**: Mainstream MLLMs usually use dynamic image resolution strategies to enhance the understanding of high-resolution images, but this leads to multiple ViT inferences and excessive image tokens, thus affecting the image processing speed and overall latency.

To solve these problems, the paper proposes BlueLM-V-3B, an algorithm-and-system co-design method aimed at improving the deployment efficiency of MLLMs on mobile devices. Specific improvements include:

- **Algorithm Design**:
  - **Relaxed Aspect Ratio Matching Method**: By introducing a threshold to prevent always choosing a larger resolution, the number of image tokens is reduced, thereby reducing the complexity of training and deployment.
  - **Batch Image Block Encoding**: During the training process, GPU parallelism is utilized by batch-processing image blocks to accelerate the processing.
  - **Pipeline Parallelism**: During the inference process, the encoding of image blocks is optimized by designing parallel pipelines of convolutional layers and vision transformer blocks.
- **System Design**:
  - **Token Downsampling**: The number of image tokens is reduced through the downsampling module, making the model more suitable for deployment on resource-constrained hardware.
  - **Block-wise Computation of Input Tokens**: During the inference process, the input tokens are processed in blocks to balance parallel processing and the computing resources of the NPU.
  - **Mixed-Precision Quantization**: Memory usage is reduced and inference speed is increased through mixed-precision quantization while maintaining the robustness of model performance.
  - **Decoupling of Image Encoding and Instruction Processing**: By processing images and user instructions in parallel, the waiting time is reduced, the overall response speed is increased, and the peak memory usage is limited.

These improvements enable BlueLM-V-3B to achieve efficient deployment on mobile devices with higher performance and lower resource consumption.
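To make the relaxed aspect ratio matching idea more concrete, here is a minimal Python sketch. It is an illustration only and not the paper's exact criterion: the tile-grid candidates, the `max_tiles` limit, the `threshold` value, and the rule "take the smallest grid whose aspect-ratio error is within `threshold` of the best" are all assumptions chosen to show how a tolerance can stop the selector from always drifting toward the largest resolution.

```python
from typing import List, Tuple

Grid = Tuple[int, int]  # (columns, rows) of fixed-size image tiles


def candidate_grids(max_tiles: int) -> List[Grid]:
    """Enumerate every tile grid that uses at most max_tiles tiles."""
    return [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]


def closest_ratio_grid(width: int, height: int, max_tiles: int = 9) -> Grid:
    """Baseline (assumed): closest aspect ratio wins, ties go to the larger grid."""
    target = width / height
    return min(
        candidate_grids(max_tiles),
        key=lambda g: (abs(g[0] / g[1] - target), -(g[0] * g[1])),
    )


def relaxed_ratio_grid(
    width: int, height: int, max_tiles: int = 9, threshold: float = 0.1
) -> Grid:
    """Relaxed variant (assumed rule): among grids whose aspect-ratio error is
    within `threshold` of the best error, prefer the one with the fewest tiles."""
    target = width / height
    grids = candidate_grids(max_tiles)
    best_error = min(abs(c / r - target) for c, r in grids)
    tolerable = [
        g for g in grids if abs(g[0] / g[1] - target) <= best_error + threshold
    ]
    # Fewest tiles first; the aspect-ratio error only breaks ties.
    return min(tolerable, key=lambda g: (g[0] * g[1], abs(g[0] / g[1] - target)))


if __name__ == "__main__":
    for size in [(800, 800), (1600, 800)]:
        print(size, closest_ratio_grid(*size), relaxed_ratio_grid(*size))
```

For a square 800x800 input, the assumed baseline selects a 3x3 grid (9 tiles) while the relaxed rule selects 1x1, so the vision encoder runs once instead of nine times. A real selector would also need to weigh the image's native resolution so that large images are not shrunk too aggressively; this sketch leaves that out.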
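The block-wise computation of input tokens can be sketched as a simple loop that feeds the prompt to the model in fixed-size blocks while carrying state forward. The block size of 128 and the `process_block` callback below are hypothetical stand-ins (a real runtime would update a KV cache on the NPU); the sketch only shows the control flow, under those assumptions.

```python
from typing import Callable, Iterator, List, Optional, Sequence


def iter_token_blocks(tokens: Sequence[int], block_size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most block_size tokens."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, len(tokens), block_size):
        yield tokens[start:start + block_size]


def prefill_in_blocks(
    tokens: Sequence[int],
    block_size: int,
    process_block: Callable[[Sequence[int], Optional[int]], int],
) -> Optional[int]:
    """Run process_block over each block in order, threading state through.

    `state` stands in for whatever the runtime keeps between blocks
    (for example a KV cache); here it is just an opaque value.
    """
    state: Optional[int] = None
    for block in iter_token_blocks(tokens, block_size):
        state = process_block(block, state)
    return state


def count_tokens(block: Sequence[int], state: Optional[int]) -> int:
    """Toy callback (hypothetical): the state is the number of tokens seen so far."""
    return (state or 0) + len(block)


if __name__ == "__main__":
    prompt: List[int] = list(range(300))  # placeholder token ids
    sizes = [len(b) for b in iter_token_blocks(prompt, 128)]
    print(sizes, prefill_in_blocks(prompt, 128, count_tokens))
```

Each block is small enough to be processed in parallel on the accelerator, while the sequential loop bounds how much work is in flight at once. The right block size depends on the hardware and is not given in the text above.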
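As a rough check on the memory figures, the following sketch estimates the storage needed for quantized weights alone. It assumes uniform bit width across all parameters and ignores activations, the KV cache, quantization scales, and runtime overhead, so it gives only a lower bound; the function names are illustrative.

```python
def quantized_weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store num_params weights at bits_per_weight bits each.

    Counts weights only: no activations, KV cache, scales, or runtime overhead.
    """
    if num_params < 0 or bits_per_weight <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_weight > 0")
    total_bits = num_params * bits_per_weight
    return (total_bits + 7) // 8  # round up to whole bytes


def to_gigabytes(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


if __name__ == "__main__":
    llama_7b = quantized_weight_bytes(7_000_000_000, 4)
    bluelm_llm = quantized_weight_bytes(2_700_000_000, 4)
    print(f"7B params at 4-bit:   {to_gigabytes(llama_7b):.2f} GB of weights")
    print(f"2.7B params at 4-bit: {to_gigabytes(bluelm_llm):.2f} GB of weights")
```

Under these assumptions, a 7B-parameter model at 4 bits needs about 3.5 GB for weights alone, which is consistent with the roughly 4.5 GB total quoted above once other memory is included, while a 2.7B-parameter language model needs about 1.35 GB.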