VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Zhiyu Tan,Xiaomeng Yang,Luozheng Qin,Hao Li
2024-08-06
Abstract:The quality of video-text pairs fundamentally determines the upper bound of text-to-video models. Currently, the datasets used for training these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which depends on image models for tagging and manual rule-based curation, leads to a high computational load and leaves behind unclean data. As a result, there is a lack of appropriate training datasets for text-to-video models. To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. When used to train the video generation model, this dataset has led to experimental results that surpass those obtained with other models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main goal of this paper is to address the issues related to the quality of datasets used in current text-to-video generation model training. Specifically, the existing datasets have the following major problems: 1. **Low-quality subtitles**: The subtitles in existing datasets have poor consistency with the video content and lack detailed descriptions. The average subtitle length is less than 15 words, which is insufficient to capture dynamic information such as actions, object movements, and camera transitions in the video. 2. **Low-quality videos**: The quality of videos in existing datasets is not high, making it difficult for models trained on these datasets to generate high-quality videos. 3. **Temporal inconsistency**: Video scene segmentation algorithms fail to accurately detect scene transitions within videos, leading to unstable model training. 4. **Data imbalance**: Current datasets are mainly composed of videos collected from the internet, with indoor human activity scenes dominating, resulting in an imbalanced data distribution. To address these issues, the authors propose a large-scale dataset named VidGen-1M, which contains 1 million video clips and their corresponding detailed Descriptive Synthetic Captions (DSC). These captions not only ensure stronger text-video alignment but also accurately capture dynamic elements in the videos. Additionally, the paper introduces a multi-stage data refinement method, including coarse-grained filtering, video caption generation, and fine-grained filtering steps, to ensure the high quality of the dataset. By training text-to-video generation models on the VidGen-1M dataset, experimental results show that the model's performance significantly surpasses other existing methods. In summary, the main contributions of this paper are the introduction of a high-quality video dataset, an effective multi-stage data refinement method, and a high-performance text-to-video generation model trained on this dataset.