EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models

Yupeng Chen, Penglin Chen, Xiaoyu Zhang, Yixian Huang, Qian Xie

2024-09-15

Abstract: The rapid development of diffusion models has significantly advanced AI-generated content (AIGC), particularly in Text-to-Image (T2I) and Text-to-Video (T2V) generation. Text-based video editing, leveraging these generative capabilities, has emerged as a promising field, enabling precise modifications to videos based on text prompts. Despite the proliferation of innovative video editing models, there is a conspicuous lack of comprehensive evaluation benchmarks that holistically assess these models' performance across various dimensions. Existing evaluations are limited and inconsistent, typically summarizing overall performance with a single score, which obscures models' effectiveness on individual editing tasks. To address this gap, we propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models. EditBoard encompasses nine automatic metrics across four dimensions, evaluating models on four task categories and introducing three new metrics to assess fidelity. This task-oriented benchmark facilitates objective evaluation by detailing model performance and providing insights into each model's strengths and weaknesses. By open-sourcing EditBoard, we aim to standardize evaluation and advance the development of robust video editing models.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper addresses the lack of comprehensive evaluation benchmarks for text-based video editing models. Diffusion models have driven significant recent advances in AI-generated content (AIGC), particularly in text-to-image (T2I) and text-to-video (T2V) generation, yet text-based video editing still lacks a benchmark that evaluates model performance comprehensively. Existing evaluations are usually limited to a few automatic metrics and often summarize overall performance with a single score, which obscures how a model performs on specific editing tasks.

To fill this gap, the authors propose **EditBoard**, which they describe as the first comprehensive evaluation benchmark for text-based video editing models. EditBoard includes 9 automatic metrics covering 4 dimensions, evaluates models on 4 task categories, and introduces 3 new metrics to assess fidelity. This task-oriented benchmark supports objective evaluation and details each model's strengths and weaknesses, with the aim of standardizing evaluation and advancing the development of video editing models.

EditEval: An Instruction-Based Benchmark for Text Improvements

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment

EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods

FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation

Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model

VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model

The Anatomy of Video Editing: A Dataset and Benchmark Suite for AI-Assisted Video Editing

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

VBench: Comprehensive Benchmark Suite for Video Generative Models

Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

CVPR 2023 Text Guided Video Editing Competition

TVBench: Redesigning Video-Language Evaluation

I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing

EffiVED: Efficient Video Editing via Text-instruction Diffusion Models

A Benchmark for Controllable Text-Image-to-video Generation

CCEdit: Creative and Controllable Video Editing via Diffusion Models

Pioneering Reliable Assessment in Text-to-Image Knowledge Editing: Leveraging a Fine-Grained Dataset and an Innovative Criterion