Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li

2024-10-04

Abstract: The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

Computer Vision and Pattern Recognition, Computation and Language

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses a key obstacle in the development of video large multimodal models (LMMs): the difficulty of obtaining large amounts of high-quality raw data. Specifically, existing video-language instruction-following datasets have the following shortcomings:

1. **Low Video Quality**: Most videos in existing datasets are relatively static, with few significant temporal changes, and thus provide little rich knowledge.
2. **Simplified Plot**: Videos in existing datasets are often clipped at scene changes, which yields simplified plots that are insufficient for models to understand complex narratives.
3. **Low Frame Sampling Rate**: Existing datasets annotate from very sparsely sampled frames, for example only 2 frames every 30 seconds. Detailed actions and changes in the video go uncaptured, which leads to hallucinations when detailed descriptions are needed.

To overcome these shortcomings, the authors propose a new high-quality synthetic dataset, **LLaVA-Video-178K**. The dataset contains 178,510 videos, each ranging from 0 to 3 minutes in length, accompanied by detailed annotations, open-ended questions, and multiple-choice questions. By training on this dataset, combined with existing visual instruction-tuning data, the authors developed a new video LMM, **LLaVA-Video**. Experimental results show that **LLaVA-Video** achieves strong performance across various video benchmarks, highlighting the effectiveness of the dataset. The authors plan to release the dataset, its generation pipeline, and the model checkpoints.
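
To make the frame-sampling shortcoming concrete, the following is a minimal sketch of the arithmetic only, not the authors' pipeline. The function name `sampled_timestamps` and the denser comparison rate of 1 frame per second are assumptions chosen for illustration; only the "2 frames every 30 seconds" figure and the 3-minute upper bound come from the summary above.

```python
def sampled_timestamps(duration_s: float, frames_per_window: int, window_s: float) -> list[float]:
    """Return evenly spaced timestamps (in seconds) obtained by sampling
    `frames_per_window` frames from every `window_s` seconds of video.

    Hypothetical helper for illustration; not part of any released code.
    """
    if duration_s < 0 or frames_per_window <= 0 or window_s <= 0:
        raise ValueError("duration must be >= 0; window and frame count must be > 0")
    step = window_s / frames_per_window      # seconds between consecutive samples
    count = int(duration_s // step)          # number of samples that fit in the video
    return [i * step for i in range(count)]


# A 3-minute video, the upper bound on video length quoted above.
duration = 180

# Sparse rate quoted above: 2 frames every 30 seconds.
sparse = sampled_timestamps(duration, frames_per_window=2, window_s=30)

# Assumed denser rate for comparison: 1 frame every second.
dense = sampled_timestamps(duration, frames_per_window=1, window_s=1)

print(len(sparse), "frames, one every", sparse[1] - sparse[0], "seconds")
print(len(dense), "frames, one every", dense[1] - dense[0], "seconds")
```

At the sparse rate, any action that starts and ends between two consecutive samples is invisible to the annotator, which is the gap the summary links to hallucinated descriptions.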

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Improved Baselines with Visual Instruction Tuning

Generative Visual Instruction Tuning

SVIT: Scaling up Visual Instruction Tuning

Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Aligning Large Multi-Modal Model with Robust Instruction Tuning

COCO is "ALL" You Need for Visual Instruction Fine-tuning

Vision-Language Instruction Tuning: A Review and Analysis

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

VIGC: Visual Instruction Generation and Correction

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs

Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback

Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks