Abstract: The emergence and growing popularity of multimodal large language models (MLLMs) have significant potential to enhance various aspects of daily life, from improving communication to facilitating learning and problem-solving. Mobile phones, as essential daily companions, represent the most effective and accessible deployment platform for MLLMs, enabling seamless integration into everyday tasks. However, deploying MLLMs on mobile phones presents challenges due to limitations in memory size and computational capability, making it difficult to achieve smooth and real-time processing without extensive optimization. In this paper, we present BlueLM-V-3B, an algorithm and system co-design approach specifically tailored for the efficient deployment of MLLMs on mobile platforms. To be specific, we redesign the dynamic resolution scheme adopted by mainstream MLLMs and implement system optimization for hardware-aware deployment to optimize model inference on mobile phones. BlueLM-V-3B boasts the following key highlights: (1) Small Size: BlueLM-V-3B features a language model with 2.7B parameters and a vision encoder with 400M parameters. (2) Fast Speed: BlueLM-V-3B achieves a generation speed of 24.4 token/s on the MediaTek Dimensity 9300 processor with 4-bit LLM weight quantization. (3) Strong Performance: BlueLM-V-3B has attained the highest average score of 66.1 on the OpenCompass benchmark among models with $\leq$ 4B parameters and surpassed a series of models with much larger parameter sizes (e.g., MiniCPM-V-2.6, InternVL2-8B).

What problem does this paper attempt to address?

The main problem that this paper attempts to solve is the efficient deployment of multimodal large language models (MLLMs) on mobile devices. Specifically, the paper focuses on the following challenges:

1. **Memory Limitations**: The memory capacity of mobile devices is limited, which restricts the deployment of large-parameter models. For example, a 4-bit quantized LLaMA 7B model requires approximately 4.5 GB of memory, which may affect the smooth operation of the system.
2. **Computing Power Limitations**: The computing power of mobile processors is limited, which restricts the inference speed. For example, on a MediaTek Dimensity 9300 processor, a 4-bit quantized LLaMA 7B model can only generate about 10-15 tokens per second, which restricts its applicability in real-time applications.
3. **Problems with Dynamic Image Resolution Strategies**: Mainstream MLLMs usually use dynamic image resolution strategies to enhance the understanding of high-resolution images, but this leads to multiple ViT inferences and excessive image tokens, thus affecting the image processing speed and overall latency.

To solve these problems, the paper proposes BlueLM-V-3B, an algorithm-and-system co-design method aimed at improving the deployment efficiency of MLLMs on mobile devices. Specific improvements include:

- **Algorithm Design**:
  - **Relaxed Aspect Ratio Matching Method**: By introducing a threshold to prevent always choosing a larger resolution, the number of image tokens is reduced, thereby reducing the complexity of training and deployment.
  - **Batch Image Block Encoding**: During the training process, the GPU parallelism is utilized by batch-processing image blocks to accelerate the processing.
  - **Pipeline Parallelism**: During the inference process, the encoding of image blocks is optimized by designing parallel pipelines of convolutional layers and vision transformer blocks.
- **System Design**:
  - **Token Downsampling**: The number of image tokens is reduced through the downsampling module, making it more suitable for deployment on resource-constrained hardware.
  - **Block-wise Computation of Input Tokens**: During the inference process, the input tokens are processed in blocks to balance parallel processing and the computing resources of the NPU.
  - **Mixed-Precision Quantization**: Memory usage is reduced and inference speed is increased through mixed-precision quantization while maintaining the robustness of model performance.
  - **Decoupling of Image Encoding and Instruction Processing**: By processing images and user instructions in parallel, the waiting time is reduced, the overall response speed is increased, and the peak memory usage is limited.

These improvements enable BlueLM-V-3B to achieve efficient deployment on mobile devices with higher performance and lower resource consumption.
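The idea behind relaxed aspect ratio matching, a threshold that stops the tiling step from always picking a larger resolution, can be illustrated with a small sketch. This is not the paper's exact rule: the tile side `TILE = 384`, the tile budget `max_tiles`, the threshold name `alpha`, and the function names are all assumptions chosen for illustration. The sketch visits candidate tile grids from smallest to largest and only moves to a larger grid when it improves the usable image area by more than a factor of `1 + alpha`.

```python
from itertools import product

TILE = 384  # assumed tile side in pixels (hypothetical value)

def candidate_grids(max_tiles=9):
    """All (cols, rows) tile grids that use at most max_tiles tiles."""
    return [(c, r) for c, r in product(range(1, max_tiles + 1), repeat=2)
            if c * r <= max_tiles]

def effective_area(width, height, cols, rows):
    """Image area after an aspect-preserving fit into the grid canvas,
    capped at the original area (upscaling adds no information)."""
    scale = min(cols * TILE / width, rows * TILE / height)
    return min(width * height * scale * scale, width * height)

def choose_grid(width, height, alpha=0.1, max_tiles=9):
    """Pick a tile grid, accepting a larger grid only if it raises the
    effective area by more than a relative margin alpha (hypothetical rule)."""
    best, best_eff = None, 0.0
    # Visit grids in order of tile count so larger grids must earn their cost.
    for cols, rows in sorted(candidate_grids(max_tiles),
                             key=lambda g: (g[0] * g[1], g)):
        eff = effective_area(width, height, cols, rows)
        if best is None or eff > best_eff * (1 + alpha):
            best, best_eff = (cols, rows), eff
    return best

if __name__ == "__main__":
    for alpha in (0.0, 0.1):
        cols, rows = choose_grid(400, 400, alpha=alpha)
        print(f"alpha={alpha}: grid {cols}x{rows}, {cols * rows} tile(s)")
```

For a 400×400 image, the unrelaxed setting (`alpha=0.0`) selects a 2×2 grid to recover the last few percent of resolution, while `alpha=0.1` stays with a single tile. Fewer tiles mean fewer ViT passes and fewer image tokens, which is the trade-off the summary describes.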

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Imp: Highly Capable Large Multimodal Models for Mobile Devices

PhoneLM: An Efficient and Capable Small Language Model Family through Principled Pre-training

Large Language Models on Mobile Devices: Measurements, Analysis, and Insights

Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation

LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

ELMS: Elasticized Large Language Models On Mobile Devices

MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance

Cloud-Device Collaborative Learning for Multimodal Large Language Models

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

PLMM: Personal Large Language Models on Mobile Devices

SlimLM: An Efficient Small Language Model for On-Device Document Assistance

PowerInfer-2: Fast Large Language Model Inference on a Smartphone

LLMCad: Fast and Scalable On-device Large Language Model Inference

MindLLM: Lightweight Large Language Model Pre-Training, Evaluation and Domain Application