EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models

Yupeng Chen, Penglin Chen, Xiaoyu Zhang, Yixian Huang, Qian Xie
2024-09-15
Abstract: The rapid development of diffusion models has significantly advanced AI-generated content (AIGC), particularly in Text-to-Image (T2I) and Text-to-Video (T2V) generation. Text-based video editing, leveraging these generative capabilities, has emerged as a promising field, enabling precise modifications to videos based on text prompts. Despite the proliferation of innovative video editing models, there is a conspicuous lack of comprehensive evaluation benchmarks that holistically assess these models' performance across various dimensions. Existing evaluations are limited and inconsistent, typically summarizing overall performance with a single score, which obscures models' effectiveness on individual editing tasks. To address this gap, we propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models. EditBoard encompasses nine automatic metrics across four dimensions, evaluating models on four task categories and introducing three new metrics to assess fidelity. This task-oriented benchmark facilitates objective evaluation by detailing model performance and providing insights into each model's strengths and weaknesses. By open-sourcing EditBoard, we aim to standardize evaluation and advance the development of robust video editing models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper addresses the lack of comprehensive evaluation benchmarks for text-based video editing models. Diffusion models have driven rapid progress in AI-generated content (AIGC) in recent years, particularly in text-to-image (T2I) and text-to-video (T2V) generation, yet text-based video editing still has no benchmark that evaluates these models comprehensively. Existing evaluations typically rely on a handful of automatic metrics and often summarize overall performance with a single score, which obscures how models perform on specific editing tasks. To fill this gap, the authors propose **EditBoard**, the first comprehensive evaluation benchmark for text-based video editing models. EditBoard includes 9 automatic metrics covering 4 dimensions, evaluates models on 4 task categories, and introduces 3 new metrics to assess fidelity. This task-oriented benchmark supports objective evaluation and reveals each model's strengths and weaknesses, with the aim of standardizing evaluation and advancing the development of video editing models.
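To illustrate why a single aggregate score can hide per-task differences, the following minimal Python sketch compares two hypothetical models whose mean scores are equal but whose per-task profiles differ. The model names, task names (`task_1` to `task_4`), and scores are illustrative placeholders, not EditBoard's task categories, metrics, or results, and the plain mean is an assumed aggregation rule rather than one taken from the paper.

```python
from statistics import mean

# Hypothetical per-task scores in [0, 1] for two hypothetical models.
# Task names and numbers are illustrative placeholders, NOT EditBoard results.
scores = {
    "model_a": {"task_1": 0.90, "task_2": 0.90, "task_3": 0.30, "task_4": 0.30},
    "model_b": {"task_1": 0.60, "task_2": 0.60, "task_3": 0.60, "task_4": 0.60},
}


def single_score(per_task):
    """Collapse all tasks into one number (assumed: a plain mean)."""
    return mean(per_task.values())


def weakest_task(per_task):
    """Return the lowest-scoring task; ties resolve to the first one listed."""
    return min(per_task, key=per_task.get)


for name, per_task in scores.items():
    # The single score is identical for both models; the breakdown is not.
    print(name, round(single_score(per_task), 2), per_task, weakest_task(per_task))
```

Both models receive the same single score, yet only the per-task breakdown shows that `model_a` is strong on two tasks and weak on the other two, which is the kind of detail a task-oriented benchmark is meant to surface.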