Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

Mingxiang Liao, Hannan Lu, Xinyu Zhang, Fang Wan, Tianyu Wang, Yuzhong Zhao, Wangmeng Zuo, Qixiang Ye, Jingdong Wang
2024-07-01
Abstract: Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness and the honesty of video content to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models. For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video. Based on the new benchmark and the dynamics scores, we assess T2V models with the design of three metrics: dynamics range, dynamics controllability, and dynamics-based quality. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models. Code is available at https://github.com/MingXiangL/DEVIL.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the lack of attention to video dynamics in existing evaluation protocols for text-to-video (T2V) generation models. Current protocols concentrate on temporal consistency and content continuity while neglecting the dynamics of video content, that is, the degree to which the content changes over time through object motion, action diversity, scene transitions, and similar effects. Dynamics are crucial for measuring both the visual vividness of a video and how faithfully its content follows the text prompt. The paper therefore proposes a new evaluation protocol, DEVIL (Dynamics Evaluation of Video Inference and Learning), designed specifically to assess T2V models in terms of dynamics. It does so through the following steps:
1. **Establishing a new benchmark**: Constructing a benchmark of text prompts that covers multiple dynamics grades, from static to highly dynamic.
2. **Defining dynamics scores**: Proposing dynamics scores at multiple temporal granularities (inter-frame, inter-segment, and video-level) to comprehensively evaluate the dynamics of each generated video.
3. **Designing evaluation metrics**: Building three metrics on the benchmark and the dynamics scores: Dynamics Range, Dynamics Controllability, and Dynamics-based Quality. Respectively, they measure the range of dynamics a model can generate, how well the dynamics of a video can be controlled through the text prompt, and the visual quality of videos at different dynamics levels.
4. **Experimental validation**: Validating the DEVIL protocol through experiments that show a high correlation between DEVIL and human ratings, indicating its potential to advance T2V generation models.
Through these contributions, the paper aims to fill the gap in dynamics evaluation left by existing protocols and to support the further development of T2V generation models.
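To make the idea of scores at different temporal granularities more concrete, the following is a minimal sketch in plain Python. It does not reproduce DEVIL's actual scores, which this summary does not define. It assumes a simple stand-in proxy (mean absolute pixel difference) and hypothetical function names (`inter_frame_score`, `inter_segment_score`, `pearson`), and it treats a video as a list of small grayscale frames with intensities in [0, 1]. The `pearson` helper illustrates how agreement between automatic scores and human ratings could be measured.

```python
from math import sqrt
from typing import List, Sequence

# Assumption: a frame is a grid of grayscale intensities in [0, 1].
Frame = List[List[float]]


def frame_difference(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two equally sized frames."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def mean_frame(frames: Sequence[Frame]) -> Frame:
    """Per-pixel average of a group of frames."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


def inter_frame_score(video: Sequence[Frame]) -> float:
    """Illustrative proxy: average change between consecutive frames."""
    if len(video) < 2:
        raise ValueError("need at least two frames")
    diffs = [frame_difference(video[i], video[i + 1])
             for i in range(len(video) - 1)]
    return sum(diffs) / len(diffs)


def inter_segment_score(video: Sequence[Frame], segment_len: int) -> float:
    """Illustrative proxy: average change between the mean frames of
    consecutive fixed-length segments (a trailing partial segment is dropped)."""
    segments = [video[i:i + segment_len]
                for i in range(0, len(video) - segment_len + 1, segment_len)]
    if len(segments) < 2:
        raise ValueError("need at least two full segments")
    means = [mean_frame(seg) for seg in segments]
    diffs = [frame_difference(means[i], means[i + 1])
             for i in range(len(means) - 1)]
    return sum(diffs) / len(diffs)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # A flickering clip: frames alternate between all-black and all-white.
    flicker = [[[float(i % 2)] * 2 for _ in range(2)] for i in range(4)]
    print(inter_frame_score(flicker))       # frame-to-frame change
    print(inter_segment_score(flicker, 2))  # change between 2-frame segments
```

The flickering clip shows why more than one granularity is useful: it changes a great deal from frame to frame, yet its two-frame segments look identical on average, so a single score would misjudge it one way or the other.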