T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback

Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, William Yang Wang
2024-10-11
Abstract: Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by the slow sampling speed of their iterative sampling processes. To address the challenge, consistency models have been proposed to facilitate fast inference, albeit at the cost of sample quality. In this work, we aim to break the quality bottleneck of a video consistency model (VCM) to achieve **both fast and high-quality video generation**. We introduce T2V-Turbo, which integrates feedback from a mixture of differentiable reward models into the consistency distillation (CD) process of a pre-trained T2V model. Notably, we directly optimize rewards associated with single-step generations that arise naturally from computing the CD loss, effectively bypassing the memory constraints imposed by backpropagating gradients through an iterative sampling process. Remarkably, the 4-step generations from our T2V-Turbo achieve the highest total score on VBench, even surpassing Gen-2 and Pika. We further conduct human evaluations to corroborate the results, validating that the 4-step generations from our T2V-Turbo are preferred over the 50-step DDIM samples from their teacher models, representing more than a tenfold acceleration while improving video generation quality.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the quality bottleneck of video consistency models (VCMs). Diffusion-based text-to-video (T2V) models have achieved significant success, but their iterative sampling process is slow, which limits their use in real-time applications. Consistency distillation (CD) can distill a VCM from a teacher T2V model for fast inference, but this acceleration comes at the cost of sample quality. To break this bottleneck, the paper introduces T2V-Turbo, which integrates feedback from a mixture of differentiable reward models (RMs) into the consistency distillation process. The rewards are applied to the single-step generations produced during distillation, which improves video quality while keeping inference fast.

### Main Contributions

1. **Integration of Feedback from Multiple Reward Models**: T2V-Turbo is described as the first approach to integrate feedback from both image-text and video-text reward models.
2. **New Best VBench Results in 4 Inference Steps**: T2V-Turbo sets a new best result on the video evaluation benchmark VBench with only 4 inference steps, surpassing proprietary models trained with extensive resources.
3. **Excellent Human Evaluation Results**: In human evaluation, the 4-step generations from T2V-Turbo are preferred over the 50-step generations of its teacher T2V model in both quality and text alignment. This corresponds to a more than tenfold inference acceleration together with a quality improvement.

### Method Overview

- **Consistency Distillation (CD)**: A VCM is distilled from a pre-trained T2V model by minimizing the CD loss.
- **Mixed Reward Feedback**: During the CD process, gradients from multiple differentiable reward models are backpropagated through the single-step generated video, so that it aligns better with human preferences.
  - **Image-Text Reward Model**: Used to align each individual video frame with human preferences.
  - **Video-Text Reward Model**: Used to evaluate the temporal dynamics and inter-frame transitions of the generated video.

### Experimental Results

- **Automatic Evaluation**: On the standard video evaluation benchmark VBench, the 4-step generations from T2V-Turbo outperform all baseline methods, including the proprietary systems Gen-2 and Pika, in overall score, quality score, and semantic score.
- **Human Evaluation**: In a human evaluation with 700 EvalCrafter benchmark prompts, the 4-step generations from T2V-Turbo outperform the 50-step generations of its teacher T2V model in visual quality, text-video alignment, and overall preference, while also accelerating inference.

### Conclusion

T2V-Turbo breaks the quality bottleneck of VCMs by integrating feedback from multiple reward models, achieving fast, high-quality video generation. The method performs well in automatic evaluations and is corroborated by human evaluations, which suggests potential for practical applications.
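The mixed reward feedback described in the Method Overview can be illustrated with a small sketch. It is not the authors' implementation: it uses plain Python lists as toy stand-ins for video frames, a squared-error distance as a stand-in for the CD loss, and hypothetical weights `beta_img` and `beta_vid` for the two rewards. The point is only to show how a per-frame image-text reward and a whole-clip video-text reward could be combined with the distillation loss computed on a single-step generation.

```python
from typing import Callable, Sequence

# Toy stand-ins (assumptions): a frame is a flat list of floats,
# a video is a list of frames. Real models operate on latent tensors.
Frame = Sequence[float]
Video = Sequence[Frame]


def cd_loss(student_pred: Video, target_pred: Video) -> float:
    """Squared-error stand-in for the consistency distillation loss:
    mean squared distance between the student's single-step generation
    and the target network's generation."""
    total, count = 0.0, 0
    for frame_s, frame_t in zip(student_pred, target_pred):
        for a, b in zip(frame_s, frame_t):
            total += (a - b) ** 2
            count += 1
    return total / count


def mixed_reward_loss(
    student_pred: Video,
    target_pred: Video,
    prompt: str,
    image_reward: Callable[[Frame, str], float],
    video_reward: Callable[[Video, str], float],
    beta_img: float = 1.0,  # hypothetical weight, not from the source
    beta_vid: float = 1.0,  # hypothetical weight, not from the source
) -> float:
    """CD loss minus weighted rewards, so minimizing the total
    lowers the distillation error while raising both rewards."""
    l_cd = cd_loss(student_pred, target_pred)
    # The image-text reward scores each frame; average over frames.
    r_img = sum(image_reward(f, prompt) for f in student_pred) / len(student_pred)
    # The video-text reward scores the clip as a whole.
    r_vid = video_reward(student_pred, prompt)
    return l_cd - beta_img * r_img - beta_vid * r_vid


# Toy usage with made-up reward functions.
student = [[0.0, 1.0], [1.0, 1.0]]
target = [[0.0, 0.0], [1.0, 1.0]]
loss = mixed_reward_loss(
    student,
    target,
    prompt="a cat playing piano",
    image_reward=lambda frame, _p: sum(frame) / len(frame),
    video_reward=lambda _video, _p: 0.5,
    beta_img=1.0,
    beta_vid=2.0,
)
print(loss)  # -1.5
```

In an actual training loop the two reward functions would be differentiable networks, and the gradient of this total would flow back through the single-step generation only. According to the abstract, this is what avoids the memory cost of backpropagating through an iterative sampling process.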