Abstract: Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by the slow sampling speed of their iterative sampling processes. To address the challenge, consistency models have been proposed to facilitate fast inference, albeit at the cost of sample quality. In this work, we aim to break the quality bottleneck of a video consistency model (VCM) to achieve **both fast and high-quality video generation**. We introduce T2V-Turbo, which integrates feedback from a mixture of differentiable reward models into the consistency distillation (CD) process of a pre-trained T2V model. Notably, we directly optimize rewards associated with single-step generations that arise naturally from computing the CD loss, effectively bypassing the memory constraints imposed by backpropagating gradients through an iterative sampling process. Remarkably, the 4-step generations from our T2V-Turbo achieve the highest total score on VBench, even surpassing Gen-2 and Pika. We further conduct human evaluations to corroborate the results, validating that the 4-step generations from our T2V-Turbo are preferred over the 50-step DDIM samples from their teacher models, representing more than a tenfold acceleration while improving video generation quality.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the quality bottleneck of video consistency models (VCMs). Diffusion-based text-to-video (T2V) models have achieved significant success, but their iterative sampling process is slow, which limits real-time applications. Consistency distillation (CD) can distill a VCM from a teacher T2V model for fast inference, but this acceleration comes at the cost of sample quality.

To break this quality bottleneck, the paper introduces T2V-Turbo. The method integrates feedback from a mixture of differentiable reward models (RMs) into the consistency distillation process and directly optimizes rewards on the single-step generations that arise when computing the CD loss, improving video generation quality while keeping inference fast.

### Main Contributions

1. **Integration of Feedback from Multiple Reward Models**: For the first time, T2V-Turbo integrates feedback from both image-text and video-text reward models.
2. **New Best VBench Score with 4 Inference Steps**: T2V-Turbo establishes a new best result on the video evaluation benchmark VBench with only 4 inference steps, surpassing proprietary models trained with extensive resources.
3. **Excellent Human Evaluation Results**: In human evaluation, the 4-step generations from T2V-Turbo are preferred over the 50-step generations of its teacher T2V model in both quality and text alignment, which corresponds to more than a tenfold inference acceleration together with a quality improvement.

### Method Overview

- **Consistency Distillation (CD)**: A VCM is distilled from a pre-trained T2V model by minimizing the CD loss.
- **Mixed Reward Feedback**: During the CD process, gradients from multiple differentiable reward models are backpropagated through the single-step generation so that it better aligns with human preferences. Because only the single-step generation is involved, this avoids the memory cost of backpropagating through an iterative sampling process.
  - **Image-Text Reward Model**: Used to align each individual video frame with human preferences.
  - **Video-Text Reward Model**: Used to evaluate the temporal dynamics and inter-frame transitions of the generated video.

### Experimental Results

- **Automatic Evaluation**: On the standard video evaluation benchmark VBench, the 4-step generations from T2V-Turbo outperform all baseline methods, including the proprietary systems Gen-2 and Pika, in overall score, quality score, and semantic score.
- **Human Evaluation**: In a human evaluation with 700 EvalCrafter benchmark prompts, the 4-step generations from T2V-Turbo are preferred over the 50-step generations of its teacher T2V model in visual quality, text-video alignment, and overall preference.

### Conclusion

T2V-Turbo breaks the quality bottleneck of VCMs by integrating feedback from multiple reward models, achieving both fast and high-quality video generation. The method performs well in automatic evaluations and is also validated by human evaluations, demonstrating its potential for practical applications.
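
To make the Method Overview more concrete, the sketch below shows one plausible way the CD loss and the two reward signals could be combined into a single training objective. It is a toy illustration, not the paper's implementation: the weights `beta_img` and `beta_vid`, the function `mixed_reward_loss`, and both toy reward models are hypothetical names introduced here, and the summary above does not specify the exact form of the combination. A real system would operate on tensors with automatic differentiation so that reward gradients can flow back into the single-step generation; plain floats are used here only to show the arithmetic.

```python
from typing import Callable, Sequence

Frame = Sequence[float]   # toy stand-in for one decoded video frame
Video = Sequence[Frame]   # a clip is an ordered sequence of frames


def toy_image_text_rm(frame: Frame, prompt: str) -> float:
    """Hypothetical per-frame reward: mean pixel value (ignores the prompt)."""
    return sum(frame) / len(frame)


def toy_video_text_rm(video: Video, prompt: str) -> float:
    """Hypothetical clip-level reward: negative mean change between frames."""
    diffs = [
        sum(abs(a - b) for a, b in zip(prev, nxt)) / len(prev)
        for prev, nxt in zip(video, video[1:])
    ]
    if not diffs:  # a single-frame clip has no transitions to score
        return 0.0
    return -sum(diffs) / len(diffs)


def mixed_reward_loss(
    cd_loss: float,
    video: Video,
    prompt: str,
    image_text_rm: Callable[[Frame, str], float],
    video_text_rm: Callable[[Video, str], float],
    beta_img: float = 1.0,  # assumed weight, not taken from the paper
    beta_vid: float = 1.0,  # assumed weight, not taken from the paper
) -> float:
    """Assumed objective: CD loss minus weighted rewards.

    `video` plays the role of the single-step generation that is already
    produced while computing the CD loss, so no iterative sampling is needed.
    """
    # Image-text reward is scored per frame, then averaged over the clip.
    frame_reward = sum(image_text_rm(f, prompt) for f in video) / len(video)
    # Video-text reward is scored once on the whole clip.
    clip_reward = video_text_rm(video, prompt)
    # Minimizing this value lowers the CD loss and raises both rewards.
    return cd_loss - beta_img * frame_reward - beta_vid * clip_reward


video = [[0.5, 0.5], [1.0, 1.0]]  # two toy frames with two "pixels" each
loss = mixed_reward_loss(2.0, video, "a cat", toy_image_text_rm, toy_video_text_rm)
print(loss)  # 1.75
```

Setting both weights to zero recovers plain consistency distillation, which is a convenient way to check the baseline behaviour in such a setup.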

T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback

T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models

OSV: One Step is Enough for High-Quality Image to Video Generation

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Zero-Shot Video Editing through Adaptive Sliding Score Distillation

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

STIV: Scalable Text and Image Conditioned Video Generation

VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

InstructVideo: Instructing Video Diffusion Models with Human Feedback

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

IV-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis