Abstract: We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

What problem does this paper attempt to address?

This paper addresses how to build a general large multimodal model (LMM) that can handle single images, multiple images, and videos within one model. Specifically, it proposes LLaVA-OneVision, which pursues the following goals through cross-modal/cross-scenario task transfer:

1. **Improve Performance Boundaries**: Push the performance limits of open-source LMMs in three important scenarios: single images, multiple images, and videos.
2. **Achieve Task Transfer**: Design the model and data representations so that tasks learned in one scenario transfer to another, yielding new capabilities. In particular, demonstrate strong video understanding by transferring from image tasks to video tasks.
3. **Open Source**: To promote the development of general vision assistants, the authors release the generated multimodal instruction data, codebase, model checkpoints, and a visual chat demo.

### Main Contributions

- **Large Multimodal Model**: Developed LLaVA-OneVision, a series of open large multimodal models that improves performance boundaries in single-image, multi-image, and video scenarios.
- **Emerging Capabilities from Task Transfer**: Demonstrated new capabilities through cross-scenario task transfer, especially in video understanding.
- **Open Source**: Released multimodal instruction data, codebase, model checkpoints, and a visual chat demo to support the research community.

### Background and Motivation

Most existing multimodal models are optimized for one specific scenario: single images, multiple images, or videos. Few open-source models perform well in all three. LLaVA-OneVision aims to fill this gap by demonstrating state-of-the-art performance across a wide range of tasks through cross-scenario task transfer and composition, and by showcasing interesting emerging capabilities.

### Methodology

- **Network Architecture**: Inherits the minimalist design of the LLaVA series: a pre-trained large language model (LLM) and a pre-trained visual encoder, with a projection layer that maps image features into the word embedding space.
- **Visual Representation**: Optimizes the representation of visual signals by adjusting the resolution and the number of features, and proposes the AnyRes strategy to balance performance and cost.
- **Datasets**: Follows the principle of "quality over quantity," collecting high-quality knowledge learning data and visual instruction tuning data, including regenerated detailed description data, document/OCR data, and Chinese and language data.
- **Training Strategy**: Divides training into three stages, each corresponding to a different functionality, so that the stages can be studied through ablations.

### Conclusion

LLaVA-OneVision demonstrates strong performance and emerging capabilities across a variety of computer vision tasks through cross-modal task transfer, offering new ideas and methods for building general vision assistants.
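The network architecture described under Methodology (a visual encoder whose features are mapped into the LLM's word embedding space by a projection layer) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the two-layer MLP with GELU, the tiny dimensions, the random weights, and all names (`project_visual_features`, `D_VISION`, `D_LLM`, and so on) are hypothetical. The summary above says only that a projection layer is used.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = x @ weight + bias to a list of row vectors.

    weight has shape (d_in, d_out); bias has length d_out.
    """
    columns = list(zip(*weight))  # d_out columns, each of length d_in
    return [
        [sum(xi * wi for xi, wi in zip(row, col)) + b
         for col, b in zip(columns, bias)]
        for row in x
    ]


def gelu(v):
    """Exact GELU activation (assumed here; the source does not name one)."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def init_matrix(rows, cols, rng):
    """Small random matrix standing in for trained weights or features."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def project_visual_features(features, params):
    """Map visual features (num_patches, d_vision) to (num_patches, d_llm)."""
    hidden = linear(features, params["w1"], params["b1"])
    hidden = [[gelu(v) for v in row] for row in hidden]
    return linear(hidden, params["w2"], params["b2"])


rng = random.Random(0)

# Hypothetical toy sizes; real encoders and LLMs use far larger widths.
D_VISION, D_LLM = 8, 16
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3

params = {
    "w1": init_matrix(D_VISION, D_LLM, rng),
    "b1": [0.0] * D_LLM,
    "w2": init_matrix(D_LLM, D_LLM, rng),
    "b2": [0.0] * D_LLM,
}

# Stand-ins for the visual encoder output and the embedded text prompt.
visual_features = init_matrix(NUM_PATCHES, D_VISION, rng)
text_embeddings = init_matrix(NUM_TEXT_TOKENS, D_LLM, rng)

# After projection, visual tokens live in the same space as word embeddings,
# so they can be concatenated into one input sequence for the LLM.
visual_tokens = project_visual_features(visual_features, params)
llm_input = visual_tokens + text_embeddings

print(len(llm_input), len(llm_input[0]))  # prints "7 16"
```

The point of the sketch is the shape contract: once projected, each visual token has the same width as a word embedding, which is what lets a single LLM consume image, multi-image, or video tokens alongside text.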

LLaVA-OneVision: Easy Visual Task Transfer

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity

VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

EVLM: An Efficient Vision-Language Model for Visual Understanding

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models