Technical Report: Competition Solution For Modelscope-Sora

Shengfu Chen, Hailong Liu, Wenzhao Wei

2024-09-24

Abstract: This report presents the approach adopted in the Modelscope-Sora challenge, which focuses on fine-tuning data for video generation models. The challenge evaluates participants' ability to analyze, clean, and generate high-quality datasets for video-based text-to-video tasks under specific computational constraints. The provided methodology involves data processing techniques such as video description generation, filtering, and acceleration. This report outlines the procedures and tools utilized to enhance the quality of training data, ensuring improved performance in text-to-video generation models.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The main goal of this paper is to propose a data processing method in the Modelscope-Sora challenge to optimize the training dataset used for text-to-video generation tasks. Specifically, this paper aims to address the following key issues:

1. **Construction of High-Quality Dataset**: To improve the performance of text-to-video generation models, a high-quality dataset needs to be constructed. The paper details how to enhance the quality of the dataset through techniques such as video description generation, data filtering (e.g., character redundancy filtering, frame and text similarity filtering, video aesthetics filtering, etc.), and video acceleration.
2. **Data Preprocessing and Cleaning**: In the actual training process, the original dataset contains some low-quality or irrelevant video materials. Therefore, researchers adopted a series of preprocessing steps to clean these data, ensuring that the final training samples are both accurate and relevant.
3. **Model Fine-Tuning Under Computational Resource Constraints**: The challenge requires participants to fine-tune Sora-like models under specific computational constraints (e.g., pixel count limits under different resolution settings). The paper discusses how to effectively utilize limited computational resources to optimize model performance under these constraints.
4. **Ensuring Consistency Between Text and Video Content**: By optimizing the text descriptions in the training data, the paper ensures that the generated video content maintains semantic consistency and visual coordination with the input text.

In summary, this paper aims to improve the performance of models in text-to-video generation tasks through a series of data processing techniques and methods, achieving significant results in generating high-quality and contextually relevant video outputs.
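The character redundancy filtering mentioned in the summary can be illustrated with a minimal sketch. The summary does not describe how the authors compute redundancy, so the n-gram repetition ratio, the function names `char_repetition_ratio` and `filter_captions`, the `"text"` field, and the default values `n=5` and `max_ratio=0.3` are all assumptions chosen for illustration, not the report's actual settings.

```python
from collections import Counter


def char_repetition_ratio(text: str, n: int = 5) -> float:
    """Share of character n-grams whose content occurs more than once.

    Hypothetical metric: 0.0 means no n-gram repeats, 1.0 means every
    n-gram in the text also appears somewhere else in it.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:  # text shorter than n characters
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


def filter_captions(samples, max_ratio: float = 0.3, n: int = 5):
    """Keep samples whose caption repetition ratio is at most max_ratio.

    Each sample is assumed to be a dict with a "text" caption field.
    """
    return [s for s in samples
            if char_repetition_ratio(s["text"], n) <= max_ratio]
```

A caption such as `"abcabcabcabcabcabc"` would be dropped under these assumed defaults, while a caption with no repeated 5-grams would be kept. In practice the threshold would need tuning against the actual caption distribution.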

Technical Report: Competition Solution For BetterMixture

ModelScope Text-to-Video Technical Report

The Dawn of Video Generation: Preliminary Explorations with SORA-like Models

Ultra Light OCR Competition Technical Report

Team SPEEDY Multi Moments in Time Challenge 2019 Technical Report

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

From Sora What We Can See: A Survey of Text-to-Video Generation

Open-Sora Plan: Open-Source Large Video Generation Model

Mora: Enabling Generalist Video Generation via A Multi-Agent Framework

Technical Report of 2023 ABO Fine-grained Semantic Segmentation Competition

Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

Vatex Video Captioning Challenge 2020: Multi-View Features and Hybrid Reward Strategies for Video Captioning

Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation

Pre-training for Video Captioning Challenge 2020 Summary

Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification

Proposal Report for the 2nd SciCAP Competition 2024

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

The Solution for the CVPR2024 NICE Image Captioning Challenge

LSVOS Challenge Report: Large-scale Complex and Long Video Object Segmentation