Technical Report: Competition Solution For Modelscope-Sora

Shengfu Chen, Hailong Liu, Wenzhao Wei
2024-09-24
Abstract: This report presents the approach adopted in the Modelscope-Sora challenge, which focuses on fine-tuning data for video generation models. The challenge evaluates participants' ability to analyze, clean, and generate high-quality datasets for video-based text-to-video tasks under specific computational constraints. The provided methodology involves data processing techniques such as video description generation, filtering, and acceleration. This report outlines the procedures and tools utilized to enhance the quality of training data, ensuring improved performance in text-to-video generation models.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The main goal of this paper is to propose a data processing method for the Modelscope-Sora challenge that optimizes the training dataset used for text-to-video generation. Specifically, the paper addresses the following key issues:

1. **Construction of a High-Quality Dataset**: Improving the performance of text-to-video generation models requires a high-quality dataset. The paper details how to raise dataset quality through video description generation, data filtering (e.g., character redundancy filtering, frame and text similarity filtering, and video aesthetics filtering), and video acceleration.
2. **Data Preprocessing and Cleaning**: The original dataset contains some low-quality or irrelevant video material. The authors therefore apply a series of preprocessing steps to clean the data so that the final training samples are both accurate and relevant.
3. **Model Fine-Tuning Under Computational Resource Constraints**: The challenge requires participants to fine-tune Sora-like models under specific computational constraints (e.g., pixel count limits under different resolution settings). The paper discusses how to use these limited resources effectively to optimize model performance.
4. **Ensuring Consistency Between Text and Video Content**: By optimizing the text descriptions in the training data, the paper aims to keep the generated video content semantically consistent and visually coherent with the input text.

In summary, this paper aims to improve model performance on text-to-video generation through a series of data processing techniques, with the goal of generating high-quality and contextually relevant video outputs.
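The character redundancy filtering mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `char_repetition_ratio` and `filter_captions`, the n-gram length of 5, and the 0.3 threshold are all assumptions chosen for illustration. The idea is to measure how much of a caption consists of repeated character n-grams and to drop captions that exceed a threshold, since generated video descriptions sometimes degenerate into loops.

```python
from collections import Counter


def char_repetition_ratio(text: str, n: int = 5) -> float:
    """Share of character n-gram occurrences whose n-gram appears more than once.

    Hypothetical helper for illustration; n=5 is an assumed default.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        # Text shorter than n characters has no n-grams, so nothing repeats.
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


def filter_captions(samples, max_ratio: float = 0.3, n: int = 5):
    """Keep samples whose caption repetition ratio is at most max_ratio.

    Each sample is assumed to be a dict with a "text" field holding the caption.
    The 0.3 threshold is an assumption, not a value taken from the report.
    """
    return [s for s in samples if char_repetition_ratio(s["text"], n) <= max_ratio]


samples = [
    {"video": "clip_001.mp4", "text": "A brown dog runs across a sunny beach."},
    {"video": "clip_002.mp4", "text": "a dog a dog a dog a dog"},
]
kept = filter_captions(samples)
```

In this example the second caption is made up entirely of a repeating pattern, so it is filtered out while the first is kept. In practice the threshold would need tuning against a sample of real captions.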