VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing

Jiahao Hu, Tianxiong Zhong, Xuebo Wang, Boyuan Jiang, Xingye Tian, Fei Yang, Pengfei Wan, Di Zhang
2024-11-22
Abstract: Diffusion-based image editing models have made remarkable progress in recent years. However, achieving high-quality video editing remains a significant challenge. One major hurdle is the absence of open-source, large-scale video editing datasets based on real-world data, as constructing such datasets is both time-consuming and costly. Moreover, video data requires a significantly larger number of tokens for representation, which substantially increases the training costs for video editing models. Lastly, current video editing models offer limited interactivity, often making it difficult for users to express their editing requirements effectively in a single attempt. To address these challenges, this paper introduces a dataset, VIVID-10M, and a baseline model, VIVID. VIVID-10M is the first large-scale hybrid image-video local editing dataset aimed at reducing data construction and model training costs. It comprises 9.7M samples that encompass a wide range of video editing tasks. VIVID is a Versatile and Interactive VIdeo local eDiting model trained on VIVID-10M, which supports entity addition, modification, and deletion. At its core, a keyframe-guided interactive video editing mechanism is proposed, enabling users to iteratively edit keyframes and propagate the edits to other frames, thereby reducing latency in achieving desired outcomes. Extensive experimental evaluations show that our approach achieves state-of-the-art performance in video local editing, surpassing baseline methods in both automated metrics and user studies. The VIVID-10M dataset and the VIVID editing model will be available at https://inkosizhong.github.io/VIVID/.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses three main problems:

1. **Lack of large-scale, high-quality video editing datasets**:
   - Building a large-scale, open-source video editing dataset from real-world data is both time-consuming and expensive. Existing training methods often rely on synthetic data, which limits model performance.
   - To address this, the paper introduces VIVID-10M, the first large-scale hybrid image-video local editing dataset, containing 9.7 million samples and covering a wide range of video editing tasks.
2. **High training cost**:
   - Video requires significantly more tokens to represent than images, which substantially increases training cost and makes video editing models less efficient to train than image editing models.
   - The paper reduces this overhead by training jointly on image and video data.
3. **Limited interactivity**:
   - Users can rarely express their editing requirements accurately in a single attempt, so they need several rounds of adjustment, which increases inference time and resource consumption.
   - To address this, the paper proposes a keyframe-guided interactive video editing mechanism (KIVE), which lets users edit keyframes quickly and then propagate the results to the other frames.

### Overview of solutions

To address the above challenges, the paper proposes the following:

- **VIVID-10M dataset**: A large-scale hybrid image-video local editing dataset containing 9.7 million samples, designed to reduce the cost of data construction and model training.
- **VIVID model**: A versatile and interactive video local editing model that supports entity addition, modification, and deletion. It is trained on VIVID-10M and uses the keyframe-guided interactive editing mechanism (KIVE).
- **KIVE mechanism**: Users iterate on keyframe edits and propagate the accepted result to the remaining frames, which reduces the time and resources needed to reach a satisfactory outcome.

With these contributions, the paper reports state-of-the-art performance in video local editing, surpassing existing methods in both automated metrics and user studies.
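
The keyframe-guided workflow can be illustrated with a toy control-flow sketch. It is not the VIVID model or its API: the functions `edit_keyframe`, `propagate`, and `interactive_edit`, the flat-list "frames", and the fill-value "edit" are all hypothetical stand-ins chosen for illustration. The sketch only shows why iterating on a single keyframe and propagating once can be cheaper than re-editing the whole video on every attempt.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[int]  # toy stand-in for an image: a flat list of pixel values
Mask = List[int]   # 1 marks the local region to edit, 0 leaves the pixel alone


def edit_keyframe(frame: Frame, mask: Mask, value: int) -> Frame:
    """Hypothetical image-level local edit: overwrite the masked pixels."""
    return [value if m else p for p, m in zip(frame, mask)]


def propagate(video: Sequence[Frame], masks: Sequence[Mask],
              edited_key: Frame, key_idx: int) -> List[Frame]:
    """Hypothetical propagation step: reuse the keyframe's edited content
    inside each frame's own mask. A real model would synthesize this."""
    # Take the edited content from the first masked pixel of the keyframe
    # (assumes the keyframe mask is non-empty).
    fill = next(p for p, m in zip(edited_key, masks[key_idx]) if m)
    out = []
    for i, (frame, mask) in enumerate(zip(video, masks)):
        if i == key_idx:
            out.append(list(edited_key))
        else:
            out.append([fill if m else p for p, m in zip(frame, mask)])
    return out


def interactive_edit(video: Sequence[Frame], masks: Sequence[Mask],
                     key_idx: int, attempts: Sequence[int],
                     accept: Callable[[Frame], bool]
                     ) -> Tuple[List[Frame], int, int]:
    """Try cheap keyframe edits until one is accepted, then propagate once.

    Returns (edited video, number of keyframe edits, number of video passes).
    """
    keyframe_calls = 0
    for value in attempts:
        candidate = edit_keyframe(video[key_idx], masks[key_idx], value)
        keyframe_calls += 1
        if accept(candidate):
            # Only the accepted keyframe triggers the expensive video pass.
            return propagate(video, masks, candidate, key_idx), keyframe_calls, 1
    # No attempt was accepted: return an unchanged copy, no video pass run.
    return [list(f) for f in video], keyframe_calls, 0
```

In this toy setting, a user who needs several attempts pays for several single-frame edits but only one pass over the full video, which is the latency argument made above.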