OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference

Wei Chen, Zhiyuan Li, Shuo Xin
2024-12-16
Abstract: We present OmniVLM, a sub-billion-parameter vision-language model for efficient on-device inference. OmniVLM introduces a token compression mechanism that reduces visual token sequence length from 729 to 81 tokens, significantly reducing computational overhead while preserving visual-semantic fidelity. Through a multi-stage training pipeline of pretraining, supervised fine-tuning, and minimal-edit Direct Preference Optimization (DPO), OmniVLM matches the performance of larger models. On multiple benchmarks including ScienceQA, POPE, and MMMU, OmniVLM outperforms existing baselines like nanoLLAVA within a 968M-parameter footprint. Empirical results on the same laptop demonstrate 9.1x faster time-to-first-token (0.75s vs 6.82s) and 1.5x higher decoding speed (29.41 vs 19.20 tokens/s) compared to nanoLLAVA, enabling efficient deployment on edge devices. The model weights are available on Hugging Face at https://huggingface.co/NexaAIDev/OmniVLM-968M, and inference examples can be found in Appendix B.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses three major challenges in efficiently deploying vision-language models (VLMs) on edge devices such as smartphones, laptops, and embedded systems:

1. **Large computational overhead from visual input tokenization**: According to OpenAI's analysis, processing a 1,024×1,024-pixel image requires 765 tokens. A token count this high imposes a significant computational burden, especially in real-time applications.
2. **Power consumption limits**: Edge devices usually have strict energy budgets. For example, a 7B-parameter model consumes about 0.7 J of energy per token processed. A 1,024×1,024-pixel image (765 tokens) therefore costs approximately 536 J, which is more than 1% of an iPhone's battery capacity, before any text processing is counted.
3. **Weak visual understanding in small VLMs**: Existing VLMs with fewer than 2B parameters perform poorly in visual understanding. For example, nanoLLAVA, with 1B parameters, reaches only 28.6% accuracy on the MMMU benchmark, far below the 92.3% of OpenAI's o1 model.

To address these problems, the authors propose OmniVLM, a 968M-parameter multimodal model optimized for deployment on edge devices. Its main contributions include:

- **A new token compression mechanism**: The number of image tokens is reduced from 729 to 81, a 9x reduction, while maintaining visual-semantic fidelity.
- **Direct Preference Optimization (DPO) to enhance output quality**: Fine-tuning with minimal edits to the base model's outputs improves response accuracy and reduces hallucinations.

Through these improvements, OmniVLM performs well on multiple benchmarks and achieves efficient inference and deployment on resource-constrained devices.
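The summary gives the token counts (729 in, 81 out) but not the exact operation used to compress them. As a minimal sketch, assuming the compression simply folds every 9 consecutive token vectors into one longer vector, the shape change could look like the following. The function name `compress_tokens`, the grouping rule, and the toy hidden size are illustrative assumptions, not the authors' implementation.

```python
def compress_tokens(tokens, group=9):
    """Fold every `group` consecutive token vectors into one longer vector.

    Hypothetical helper: the grouping rule is an assumption made for
    illustration, not the mechanism confirmed by the paper summary.
    """
    if len(tokens) % group != 0:
        raise ValueError("token count must be divisible by the group size")
    compressed = []
    for start in range(0, len(tokens), group):
        merged = []
        for vector in tokens[start:start + group]:
            merged.extend(vector)  # concatenate along the feature dimension
        compressed.append(merged)
    return compressed


hidden_size = 4  # toy value; real visual encoders use far wider embeddings
tokens = [[float(i)] * hidden_size for i in range(729)]
compressed = compress_tokens(tokens)

# Sequence length shrinks 9x while each token becomes 9x wider,
# so no values are discarded at this step.
print(len(tokens), "->", len(compressed), "tokens")
print(hidden_size, "->", len(compressed[0]), "features per token")
```

Under this assumption the information is rearranged rather than dropped; a projection layer after the fold would be needed to map the wider vectors back to the language model's embedding size.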
Specifically, OmniVLM outperforms the baseline nanoLLAVA on benchmarks such as ScienceQA, POPE, and MMMU, and on the same laptop it delivers 9.1x faster time-to-first-token (0.75s vs 6.82s) and 1.5x higher decoding speed (29.41 vs 19.20 tokens/s).
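The headline figures quoted in the abstract and in the challenges above can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
# Figures taken from the text above.
image_tokens = 765        # tokens for a 1,024x1,024-pixel image
joules_per_token = 0.7    # energy per token for a 7B-parameter model

# Energy to process one image: about 535.5 J, which the text rounds to 536 J.
energy_joules = image_tokens * joules_per_token

# Token compression: 729 visual tokens reduced to 81.
token_reduction = 729 / 81

# Laptop measurements, OmniVLM vs nanoLLAVA.
ttft_speedup = 6.82 / 0.75       # time-to-first-token, seconds
decode_speedup = 29.41 / 19.20   # decoding speed, tokens per second

print(f"energy per image: {energy_joules:.1f} J")
print(f"token reduction:  {token_reduction:.0f}x")
print(f"TTFT speedup:     {ttft_speedup:.1f}x")
print(f"decode speedup:   {decode_speedup:.1f}x")
```

The computed ratios agree with the 9x, 9.1x, and 1.5x figures reported in the text.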