Abstract: Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, and their application to multi-image scenarios remains less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. In addition, our model exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the limited ability of existing large multimodal models (LMMs) to handle multi-image scenarios. Although existing open-source LMMs have made remarkable progress on single-image tasks, their application to more complex multi-image scenarios remains underexplored. In addition, previous studies usually train separate task-specific models for different application scenarios, which leaves the methods fragmented, inefficient, and hard to extend. The paper therefore proposes a unified framework that lets a single LMM operate across multiple visual scenarios, namely multi-image, video (multi-frame), 3D (multi-view), and single-image (multi-patch), by using the interleaved image-text format as a general data template.
Specifically, the main contributions of the paper include:
- **Unifying different tasks with an interleaved data format**: Multi-image, video, 3D, and single-image data are converted into one interleaved training format, which unifies the different tasks in a single LMM.
- **New datasets and benchmarks**: The authors compile M4-Instruct, a high-quality training dataset with 1,177.6k (about 1.18 million) samples, covering 14 tasks and 41 datasets across 4 primary domains (multi-image, video, 3D, and single-image). They also curate the LLaVA-Interleave Bench, a diverse set of benchmarks for evaluating multi-image performance, including 7 newly collected and 13 existing in-domain and out-of-domain benchmarks.
- **State-of-the-art performance**: With a single model, LLaVA-NeXT-Interleave achieves leading results on multi-image, video, and 3D benchmarks while maintaining performance on single-image tasks.
- **Emerging cross-task transfer**: Joint training on diverse tasks gives the model an emerging ability to transfer tasks across settings and modalities, for example from image to video.
These contributions address the limitations of current multimodal models in multi-image scenarios and provide a foundation for future research on multimodal artificial intelligence.
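To make the idea of an interleaved template more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical `<image>` placeholder token and a simple list-of-segments representation; the names `IMAGE_TOKEN`, `build_interleaved`, and `video_to_segments` are illustrative only and are not taken from the paper or its code. The sketch shows how text and images from different scenarios might be flattened into the same shape: one prompt string plus an ordered list of images.

```python
from typing import List, Tuple

# Hypothetical placeholder marking where an image is spliced into the text.
# The summary above does not specify the actual token used by the model.
IMAGE_TOKEN = "<image>"

# A segment is either ("text", content) or ("image", identifier).
Segment = Tuple[str, str]


def build_interleaved(segments: List[Segment]) -> Tuple[str, List[str]]:
    """Flatten mixed text/image segments into a prompt and an ordered image list."""
    parts: List[str] = []
    images: List[str] = []
    for kind, value in segments:
        if kind == "image":
            parts.append(IMAGE_TOKEN)  # keep the position of the image in the text
            images.append(value)       # keep the image itself, in the same order
        elif kind == "text":
            parts.append(value)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return " ".join(parts), images


def video_to_segments(frame_ids: List[str], question: str) -> List[Segment]:
    """Treat sampled video frames as an ordered run of images, then the question."""
    segments: List[Segment] = [("image", frame_id) for frame_id in frame_ids]
    segments.append(("text", question))
    return segments


if __name__ == "__main__":
    # A multi-frame (video) sample and a multi-image sample share one format.
    print(build_interleaved(video_to_segments(["f0", "f1"], "What happens?")))
    print(build_interleaved([
        ("text", "Compare"), ("image", "a.png"),
        ("text", "with"), ("image", "b.png"),
    ]))
```

Under these assumptions, multi-view (3D) and multi-patch (single-image) inputs would follow the same pattern, with views or patches taking the place of frames, so a single training loop can consume all four scenarios.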