Abstract: We present OmniVLM, a sub-billion-parameter vision-language model for efficient on-device inference. OmniVLM introduces a token compression mechanism that reduces the visual token sequence length from 729 to 81 tokens, significantly reducing computational overhead while preserving visual-semantic fidelity. Through a multi-stage training pipeline of pretraining, supervised fine-tuning, and minimal-edit Direct Preference Optimization (DPO), OmniVLM matches the performance of larger models. On multiple benchmarks, including ScienceQA, POPE, and MMMU, OmniVLM outperforms existing baselines such as nanoLLAVA within a 968M-parameter footprint. Empirical results on the same laptop demonstrate 9.1x faster time-to-first-token (0.75s vs 6.82s) and 1.5x higher decoding speed (29.41 vs 19.20 tokens/s) compared to nanoLLAVA, enabling efficient deployment on edge devices. The model weights can be accessed on Hugging Face: https://huggingface.co/NexaAIDev/OmniVLM-968M, and the inference examples can be found in Appendix B.
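The speedup factors quoted in the abstract follow directly from the raw timings it reports. As a quick sanity check, the short sketch below recomputes them; the helper name `ratio` is illustrative and not taken from the paper.

```python
def ratio(larger: float, smaller: float) -> float:
    """Return how many times `larger` exceeds `smaller`."""
    if smaller <= 0:
        raise ValueError("smaller must be positive")
    return larger / smaller


# Time-to-first-token in seconds (lower is better), as reported in the abstract.
TTFT_NANOLLAVA_S = 6.82
TTFT_OMNIVLM_S = 0.75

# Decoding speed in tokens per second (higher is better), as reported in the abstract.
DECODE_OMNIVLM_TPS = 29.41
DECODE_NANOLLAVA_TPS = 19.20

ttft_speedup = ratio(TTFT_NANOLLAVA_S, TTFT_OMNIVLM_S)
decode_speedup = ratio(DECODE_OMNIVLM_TPS, DECODE_NANOLLAVA_TPS)

if __name__ == "__main__":
    print(f"time-to-first-token speedup: {ttft_speedup:.1f}x")
    print(f"decoding speedup: {decode_speedup:.1f}x")
```

Both ratios round to the 9.1x and 1.5x figures stated above.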

What problem does this paper attempt to address?

This paper addresses three major challenges in efficiently deploying vision-language models (VLMs) on edge devices such as smartphones, laptops, and embedded systems:

1. **The large computational overhead introduced by visual input tokenization**: According to OpenAI's analysis, processing a 1,024×1,024-pixel image requires 765 tokens. A token count this high imposes a significant computational burden, especially in real-time applications.
2. **Power consumption limitations**: Edge devices usually have strict energy budgets. For example, a model with 7B parameters consumes about 0.7 J of energy per token processed. For a 1,024×1,024-pixel image (765 tokens), this amounts to approximately 536 J, which is more than 1% of an iPhone's battery capacity, not including the energy consumed by text processing.
3. **Limitations of existing VLMs in understanding visual content**: Existing VLMs with fewer than 2B parameters perform poorly in visual understanding. For example, nanoLLAVA with 1B parameters reaches an accuracy of only 28.6% on the MMMU benchmark, far below the 92.3% of OpenAI's o1 model.

To solve these problems, the authors propose OmniVLM, a 968M-parameter multimodal model specifically optimized for deployment on edge devices. Its main contributions include:

- **Introducing a new token compression mechanism**: The number of image tokens is reduced from 729 to 81, a 9x reduction, while maintaining the fidelity of visual semantics.
- **Using Direct Preference Optimization (DPO) to enhance output quality**: The base model's outputs are fine-tuned with minimal edits to improve response accuracy and reduce hallucinations.

With these improvements, OmniVLM performs well on multiple benchmarks and achieves efficient inference and deployment on resource-constrained devices. Specifically, OmniVLM outperforms the existing baseline nanoLLAVA on benchmarks such as ScienceQA, POPE, and MMMU, and its inference speed on actual hardware is also significantly higher.
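The token and energy figures above can be made concrete with a small sketch. The summary does not describe how the compression from 729 to 81 tokens is implemented, so the code below assumes one simple possibility purely for illustration: every 9 consecutive visual tokens are concatenated into a single wider token. The function names, the grouping order, and the toy embedding width are all hypothetical. The energy helper only multiplies a token count by the 0.7 J per token quoted for a 7B-parameter model.

```python
from typing import List

# Energy per token quoted above for a 7B-parameter model (joules).
JOULES_PER_TOKEN_7B = 0.7


def image_energy_joules(num_tokens: int,
                        joules_per_token: float = JOULES_PER_TOKEN_7B) -> float:
    """Estimate the energy spent on the visual tokens of one image."""
    return num_tokens * joules_per_token


def compress_tokens(tokens: List[List[float]],
                    group_size: int = 9) -> List[List[float]]:
    """Merge each run of `group_size` tokens into one token by concatenation.

    Assumption: this grouping scheme is illustrative only and is not
    confirmed by the text as the paper's actual mechanism.
    """
    if len(tokens) % group_size != 0:
        raise ValueError("token count must be divisible by group_size")
    compressed = []
    for start in range(0, len(tokens), group_size):
        merged: List[float] = []
        for token in tokens[start:start + group_size]:
            merged.extend(token)  # widen the embedding instead of dropping values
        compressed.append(merged)
    return compressed


if __name__ == "__main__":
    hidden = 4  # toy embedding width, chosen only to keep the example small
    visual_tokens = [[float(i)] * hidden for i in range(729)]
    compressed = compress_tokens(visual_tokens)
    print(len(visual_tokens), "->", len(compressed), "tokens")
    print("embedding width:", hidden, "->", len(compressed[0]))
    print("energy for 765 tokens (J):", round(image_energy_joules(765), 1))
```

Under this assumption the sequence length shrinks by a factor of 9 while each token becomes 9 times wider, so no values are discarded before any later projection. A real model would typically follow such a step with a learned projection back to the language model's embedding width.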

OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference

VoCo-LLaMA: Towards Vision Compression with Large Language Models

MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via Omniverse Computation Balance

Inference Optimal VLMs Need Only One Visual Token but Larger Models

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

VOLO: Vision Outlooker for Visual Recognition

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning

TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens

OmniFusion Technical Report

EVLM: An Efficient Vision-Language Model for Visual Understanding

NVLM: Open Frontier-Class Multimodal LLMs

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks