Abstract: Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, and their applications to multi-image scenarios remain less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with newly emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT

What problem does this paper attempt to address?

This paper addresses the limited ability of existing large multimodal models (LMMs) to handle multi-image scenarios. Although open-source LMMs have made remarkable progress on single-image tasks, their application to more complex multi-image scenarios remains underexplored. In addition, previous studies usually train separate task-specific models for different application scenarios, which leaves the methods fragmented, inefficient, and hard to extend. The paper therefore proposes a unified framework that lets one LMM operate across multiple visual scenarios, namely multi-image, video (multi-frame), 3D (multi-view), and single-image (multi-patch), through a general data template: the interleaved image-text format. The main contributions of the paper are:

- **Unifying different tasks with an interleaved data format**: Convert multi-image, video, 3D, and single-image data into one interleaved training format, thereby unifying the different tasks in a single LMM.
- **New dataset and benchmark**: Compile M4-Instruct, a high-quality training dataset containing 1,177.6k (about 1.18 million) samples that cover 14 tasks and 41 datasets across 4 primary domains (multi-image, video, 3D, and single-image). The paper also curates LLaVA-Interleave Bench, a diverse set of benchmarks for evaluating multi-image performance, comprising 7 newly collected and 13 existing in-domain and out-of-domain benchmarks.
- **State-of-the-art performance**: With a single model, LLaVA-NeXT-Interleave achieves leading results on multi-image, video, and 3D benchmarks while maintaining its performance on single-image tasks.
- **Emerging cross-task transfer**: By training jointly on diverse tasks, the model shows an emerging ability to transfer tasks across settings and modalities, for example from image to video.

These contributions address the limitations of current multimodal models in multi-image scenarios and provide a foundation for future research on multimodal artificial intelligence.
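To make the idea of a general interleaved template concrete, here is a minimal sketch of how samples from several scenarios could be flattened into one prompt-plus-image-list representation. It is an illustration under stated assumptions, not the paper's actual data pipeline: the `<image>` placeholder string, the `Image` class, the `build_interleaved` helper, and all file names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Assumed placeholder string marking where a visual input sits in the text.
IMAGE_TOKEN = "<image>"


@dataclass(frozen=True)
class Image:
    """Hypothetical reference to one visual input (image, video frame, 3D view, or patch)."""
    path: str


Segment = Union[str, Image]


def build_interleaved(segments: List[Segment]) -> Tuple[str, List[str]]:
    """Turn a mixed sequence of text and images into (prompt, image_paths).

    Each Image becomes one placeholder in the prompt, and its path is
    appended to the list in the same order, so the i-th placeholder
    corresponds to the i-th path.
    """
    parts: List[str] = []
    paths: List[str] = []
    for seg in segments:
        if isinstance(seg, Image):
            parts.append(IMAGE_TOKEN)
            paths.append(seg.path)
        else:
            parts.append(seg)
    return " ".join(parts), paths


# Multi-image: placeholders are interleaved with the text.
multi_image = build_interleaved(
    [Image("a.jpg"), "shows a cat, while", Image("b.jpg"), "shows what?"]
)

# Multi-frame (video): sampled frames are treated as a sequence of images.
video = build_interleaved(
    [Image(f"frame_{i}.jpg") for i in range(4)] + ["Describe the video."]
)

# Multi-view (3D): several views of one scene share a single question.
multi_view = build_interleaved(
    [Image("front.jpg"), Image("side.jpg"), "How many chairs are in the room?"]
)

# Multi-patch (single image): the full image plus its crops.
multi_patch = build_interleaved(
    [Image("full.jpg"), Image("patch_0.jpg"), Image("patch_1.jpg"), "Read the sign."]
)
```

Because every scenario reduces to the same pair of prompt and ordered image list, a single training loop can consume all four kinds of data without scenario-specific branches, which is the property the unified template is meant to provide.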

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

LLaVA-OneVision: Easy Visual Task Transfer

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

Improved Baselines with Visual Instruction Tuning

LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval

LLMs Meet Long Video: Advancing Long Video Comprehension with an Interactive Visual Adapter in LLMs

InfMLLM: A Unified Framework for Visual-Language Tasks

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding