Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
2024-10-04
Abstract: The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses a key obstacle in the development of video large multimodal models (LMMs): the difficulty of obtaining large amounts of high-quality raw data. Specifically, existing video-language instruction-following datasets have the following shortcomings:

1. **Low Video Quality**: Most videos in existing datasets are relatively static and lack significant temporal changes, so they provide little for a model to learn from.
2. **Simplified Plot**: Videos in existing datasets are often clipped at scene changes, which yields short, simplified plots that are insufficient for models to learn complex narratives.
3. **Low Frame Sampling Rate**: Existing datasets annotate from very sparsely sampled frames, for example only 2 frames every 30 seconds. Detailed actions and changes between the sampled frames are missed, which leads to hallucinations when detailed descriptions are required.

To overcome these shortcomings, the authors propose a new high-quality synthetic dataset, **LLaVA-Video-178K**. The dataset contains 178,510 videos, each ranging from 0 to 3 minutes in length, accompanied by detailed captions, open-ended questions, and multiple-choice questions. By training on this dataset in combination with existing visual instruction-tuning data, the authors developed a new video LMM, **LLaVA-Video**. Experimental results show that **LLaVA-Video** achieves strong performance across various video benchmarks, highlighting the effectiveness of the dataset. The authors plan to release the dataset, its generation pipeline, and the model checkpoints.
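
To make the low frame sampling rate shortcoming described above concrete, the following minimal sketch counts how many frames an annotator would see for a 3-minute clip at the sparse rate quoted in the text (2 frames every 30 seconds). The helper `sampled_timestamps` and the 1-frame-per-second comparison rate are illustrative assumptions, not taken from the paper's pipeline.

```python
def sampled_timestamps(duration_s: float, frames_per_window: int, window_s: float) -> list:
    """Return evenly spaced sample times (in seconds) for a clip.

    Hypothetical helper for illustration: frames are taken at a fixed
    interval of window_s / frames_per_window, starting at t = 0.
    """
    interval = window_s / frames_per_window
    count = int(duration_s // interval)
    return [i * interval for i in range(count)]


clip_s = 180  # a 3-minute clip, the upper end of the length range in the text

# Sparse rate quoted in the text: 2 frames every 30 seconds.
sparse = sampled_timestamps(clip_s, frames_per_window=2, window_s=30)

# Denser rate chosen only for comparison (assumption): 1 frame per second.
dense = sampled_timestamps(clip_s, frames_per_window=1, window_s=1)

print(f"sparse: {len(sparse)} frames, {sparse[1] - sparse[0]:.0f} s apart")
print(f"dense:  {len(dense)} frames, {dense[1] - dense[0]:.0f} s apart")
```

At the sparse rate, consecutive frames are 15 seconds apart, so any action that starts and ends inside that gap is invisible to the annotator.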