Abstract:The quality of video-text pairs fundamentally determines the upper bound of text-to-video models. Currently, the datasets used for training these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which depends on image models for tagging and manual rule-based curation, leads to a high computational load and leaves behind unclean data. As a result, there is a lack of appropriate training datasets for text-to-video models. To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. When used to train the video generation model, this dataset has led to experimental results that surpass those obtained with other models.

What problem does this paper attempt to address?

The main goal of this paper is to address the issues related to the quality of datasets used in current text-to-video generation model training. Specifically, the existing datasets have the following major problems: 1. **Low-quality subtitles**: The subtitles in existing datasets have poor consistency with the video content and lack detailed descriptions. The average subtitle length is less than 15 words, which is insufficient to capture dynamic information such as actions, object movements, and camera transitions in the video. 2. **Low-quality videos**: The quality of videos in existing datasets is not high, making it difficult for models trained on these datasets to generate high-quality videos. 3. **Temporal inconsistency**: Video scene segmentation algorithms fail to accurately detect scene transitions within videos, leading to unstable model training. 4. **Data imbalance**: Current datasets are mainly composed of videos collected from the internet, with indoor human activity scenes dominating, resulting in an imbalanced data distribution. To address these issues, the authors propose a large-scale dataset named VidGen-1M, which contains 1 million video clips and their corresponding detailed Descriptive Synthetic Captions (DSC). These captions not only ensure stronger text-video alignment but also accurately capture dynamic elements in the videos. Additionally, the paper introduces a multi-stage data refinement method, including coarse-grained filtering, video caption generation, and fine-grained filtering steps, to ensure the high quality of the dataset. By training text-to-video generation models on the VidGen-1M dataset, experimental results show that the model's performance significantly surpasses other existing methods. In summary, the main contributions of this paper are the introduction of a high-quality video dataset, an effective multi-stage data refinement method, and a high-performance text-to-video generation model trained on this dataset.

VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

LVD-2M: A Long-take Video Dataset with Temporally Dense Captions

OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

FaceVid-1K: A Large-Scale High-Quality Multiracial Human Face Video Dataset

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content

Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

Distilling Vision-Language Models on Millions of Videos

A dataset of text prompts, videos and video quality metrics from generative text-to-video AI models

DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions