Abstract: Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness and the honesty of video content to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models. For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video. Based on the new benchmark and the dynamics scores, we assess T2V models with the design of three metrics: dynamics range, dynamics controllability, and dynamics-based quality. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models. Code is available at https://github.com/MingXiangL/DEVIL.

What problem does this paper attempt to address?

The paper addresses the insufficient attention paid to video dynamics in existing evaluation protocols for text-to-video (T2V) generation models. Current protocols concentrate on temporal consistency and content continuity and largely neglect dynamics, i.e., the degree to which video content changes over time, including object motion, action diversity, and scene transitions. Dynamics are crucial for measuring both the visual vividness of a video and how faithfully its content follows the text prompt. The paper therefore proposes a new evaluation protocol, DEVIL (Dynamics Evaluation of Video Inference and Learning), designed to assess T2V models in terms of dynamics. It does so in four steps:

1. **Establishing a new benchmark**: Constructing a benchmark of text prompts that span multiple dynamics grades, from static to highly dynamic.
2. **Defining dynamics scores**: Proposing dynamics scores at multiple temporal granularities (inter-frame, inter-segment, and video-level) to comprehensively evaluate the dynamics of each generated video.
3. **Designing evaluation metrics**: Based on the new benchmark and the dynamics scores, designing three evaluation metrics: Dynamics Range, Dynamics Controllability, and Dynamics-based Quality. These measure, respectively, the range of dynamics a model can generate, its ability to control video dynamics according to the text prompt, and the visual quality of its videos at different dynamics levels.
4. **Experimental validation**: Validating the DEVIL protocol through experiments, which show a Pearson correlation exceeding 90% with human ratings and indicate its potential to advance T2V generation models.

Together, these contributions aim to fill the gap in dynamics evaluation left by existing protocols and to promote the further development of T2V generation models.
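
To make the idea of scoring dynamics at several temporal granularities more concrete, here is a minimal Python sketch. It is not the DEVIL implementation: this summary does not say how the paper computes its scores, so the sketch substitutes a simple mean absolute pixel difference as a stand-in measure of change. The names `Frame`, `mean_abs_diff`, `average_frame`, `inter_frame_score`, `inter_segment_score`, `video_level_score`, and `pearson` are all hypothetical, and videos are assumed to be short lists of small grayscale frames.

```python
from math import sqrt
from typing import List, Sequence

# Assumption: a frame is a grid of grayscale intensities in [0, 1].
Frame = List[List[float]]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two equally sized frames."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def average_frame(frames: Sequence[Frame]) -> Frame:
    """Pixel-wise mean of a non-empty sequence of frames."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(f[y][x] for f in frames) / n for x in range(width)]
            for y in range(height)]


def inter_frame_score(video: Sequence[Frame]) -> float:
    """Finest granularity: average change between consecutive frames.

    Requires at least two frames.
    """
    diffs = [mean_abs_diff(video[i], video[i + 1])
             for i in range(len(video) - 1)]
    return sum(diffs) / len(diffs)


def inter_segment_score(video: Sequence[Frame], segment_len: int) -> float:
    """Middle granularity: change between mean frames of consecutive segments.

    Requires the video to split into at least two segments.
    """
    segments = [video[i:i + segment_len]
                for i in range(0, len(video), segment_len)]
    means = [average_frame(s) for s in segments]
    diffs = [mean_abs_diff(means[i], means[i + 1])
             for i in range(len(means) - 1)]
    return sum(diffs) / len(diffs)


def video_level_score(video: Sequence[Frame]) -> float:
    """Coarsest granularity: average deviation from the whole-video mean frame."""
    mean = average_frame(video)
    return sum(mean_abs_diff(f, mean) for f in video) / len(video)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation, e.g. between automatic scores and human ratings."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

Under these assumptions, a completely static video scores zero at every granularity, while a video that changes steadily scores above zero at all three. Passing a list of per-video scores and a matching list of human ratings to `pearson` gives the kind of agreement figure reported in the experimental validation.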

Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

Towards A Better Metric for Text-to-Video Generation

FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Interactive Visual Assessment for Text-to-Image Generation Models

EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models

Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs

Motion Control for Enhanced Complex Action Video Generation

PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

A dataset of text prompts, videos and video quality metrics from generative text-to-video AI models