Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
Abstract: We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses how to build a general-purpose large multimodal model (LMM) that handles single-image, multi-image, and video inputs within one model. It proposes LLaVA-OneVision, which pursues the following goals through task transfer across modalities and scenarios:
1. **Improve Performance Boundaries**: Push the performance limits of open-source LMMs in the three important scenarios of single images, multiple images, and videos.
2. **Achieve Task Transfer**: Design the model and its visual representations so that capabilities learned in one scenario transfer to another, yielding new capabilities. In particular, demonstrate strong video understanding through transfer from image tasks to video tasks.
3. **Open Source**: To promote the development of general vision assistants, the authors release the generated multimodal instruction data, codebase, model checkpoints, and visual chat demonstrations.
### Main Contributions
- **Large Multimodal Model**: Developed LLaVA-OneVision, a family of open large multimodal models that pushes performance boundaries in single-image, multi-image, and video scenarios.
- **Emerging Capabilities from Task Transfer**: Demonstrated new capabilities that arise from cross-scenario task transfer, especially in video understanding.
- **Open Source**: Released multimodal instruction data, codebase, model checkpoints, and visual chat demonstrations to promote the development of the research community.
### Background and Motivation
Most existing multimodal models are optimized for one specific scenario: single images, multiple images, or videos. Few open-source models perform well in all three. LLaVA-OneVision aims to fill this gap by delivering state-of-the-art performance across a wide range of tasks through cross-scenario task transfer and composition, and by showcasing interesting emerging capabilities.
### Methodology
- **Network Architecture**: Inherited the minimalist design of the LLaVA series, combining a pre-trained large language model (LLM) with a pre-trained visual encoder and mapping image features into the word embedding space through a projection layer.
- **Visual Representation**: Optimized the representation of visual signals by adjusting resolution and the number of features, proposing the AnyRes strategy to balance performance and cost.
- **Datasets**: Emphasized the principle of "quality over quantity," collecting high-quality knowledge learning data and visual instruction tuning data, including regenerated detailed description data, document/OCR data, and Chinese and text-only language data.
- **Training Strategy**: Systematically divided training into three stages, each targeting a distinct functionality, so that the stages can be studied separately in ablations.
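
The cost side of the visual-representation trade-off above can be illustrated with a small token-budget calculation. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes an image is encoded as one base view plus a grid of crops, that every view yields the same number of visual tokens, and that the per-view token count is reduced whenever the total exceeds a budget. The names `plan_visual_tokens` and `VisualTokenPlan`, and the numbers 729 and 7290 in the usage lines, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualTokenPlan:
    views: int            # base view plus all grid crops
    tokens_per_view: int  # tokens kept per view after any reduction
    total_tokens: int     # views * tokens_per_view


def plan_visual_tokens(grid_cols: int, grid_rows: int,
                       tokens_per_view: int, max_tokens: int) -> VisualTokenPlan:
    """Return how many visual tokens an image costs under a token budget.

    Assumption (not from the summary): the image is split into
    grid_cols x grid_rows crops plus one downsized base view, and each
    view is encoded into `tokens_per_view` tokens.
    """
    if min(grid_cols, grid_rows, tokens_per_view, max_tokens) < 1:
        raise ValueError("all arguments must be positive integers")

    views = grid_cols * grid_rows + 1
    total = views * tokens_per_view

    # Over budget: shrink every view equally so the total fits.
    if total > max_tokens:
        tokens_per_view = max_tokens // views
        total = views * tokens_per_view

    return VisualTokenPlan(views, tokens_per_view, total)


# A 2x2 grid fits the budget, so the per-view token count is unchanged.
small = plan_visual_tokens(2, 2, tokens_per_view=729, max_tokens=7290)

# A 6x6 grid exceeds the budget, so each view keeps fewer tokens.
large = plan_visual_tokens(6, 6, tokens_per_view=729, max_tokens=7290)
```

Raising the grid size increases effective resolution, while the budget caps the sequence length passed to the LLM; the sketch only makes that balance between performance and cost concrete.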
### Conclusion
LLaVA-OneVision demonstrates strong performance and emerging capabilities across single-image, multi-image, and video tasks through cross-modal task transfer, offering new ideas and methods for building general vision assistants.