InstructVideo: Instructing Video Diffusion Models with Human Feedback

Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni

2023-12-20

Abstract: Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2. To this end, we propose Segmental Video Reward, a mechanism to provide reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning. Extensive experiments, both qualitative and quantitative, validate the practicality and efficacy of using image reward models in InstructVideo, significantly enhancing the visual quality of generated videos without compromising generalization capabilities. Code and models will be made publicly available.
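The "reward fine-tuning as editing" idea can be illustrated with a minimal NumPy sketch. It shows the standard forward-diffusion corruption of a sample up to an intermediate timestep, and how starting from that timestep shortens the DDIM chain. The linear beta schedule, the 1000 training steps, the 20-step DDIM chain, the starting timestep of 500, the latent shape, and all function names are assumptions made for illustration; none of them are taken from the paper.

```python
import numpy as np

def make_alpha_bars(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)

def corrupt(x0, t, alpha_bars, rng):
    """Forward diffusion: noise a clean sample x0 up to timestep t."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def ddim_timesteps(num_train_steps=1000, num_ddim_steps=20, start_t=None):
    """Descending DDIM timesteps; start_t truncates the chain to a partial one."""
    stride = num_train_steps // num_ddim_steps
    steps = np.arange(0, num_train_steps, stride)[::-1]
    if start_t is not None:
        steps = steps[steps <= start_t]
    return steps

rng = np.random.default_rng(0)
# Hypothetical video latent: (frames, channels, height, width).
video = rng.standard_normal((16, 4, 32, 32))
alpha_bars = make_alpha_bars()

# Corrupt the sampled video to an intermediate timestep instead of pure noise.
noisy = corrupt(video, t=500, alpha_bars=alpha_bars, rng=rng)

full_chain = ddim_timesteps()                # every DDIM step
partial_chain = ddim_timesteps(start_t=500)  # only the steps at or below t=500
print(len(full_chain), len(partial_chain))
```

Under these assumed settings the partial chain contains 11 of the 20 DDIM steps, which is the source of the cost saving: the denoiser runs only on the truncated chain before the reward is computed.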

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Multimedia

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses two weaknesses of diffusion-based video generation models: poor visual quality and misalignment with text prompts.

1. **Poor Visual Quality**: Existing video generation models are pre-trained on large-scale internet data of uneven quality, which may even contain toxic or inappropriate content, so the videos they produce are often visually unappealing.
2. **Mismatch with Text Prompts**: The generated videos often do not align with the text prompts provided by users, which hurts the model's usability and user experience.

To tackle these issues, the paper proposes **InstructVideo**, a method that uses human feedback to fine-tune text-to-video diffusion models through reward fine-tuning. Its main contributions are:

1. **Recasting Reward Fine-Tuning as Editing**: A sampled video is corrupted with the diffusion process and then denoised, so only part of the DDIM sampling chain has to be run rather than the full chain, which lowers the computational cost of fine-tuning.
2. **Introducing Segmental Video Reward (SegVR) and Temporally Attenuated Reward (TAR)**: Existing image reward models such as HPSv2 are repurposed to score videos. SegVR provides reward signals through segmental sparse sampling of frames, and TAR mitigates the degradation of temporal modeling during fine-tuning.

Through these methods, InstructVideo significantly improves the visual quality of generated videos while maintaining the model's generalization capability.
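A rough sketch of how an image reward model might be applied to a video through segmental sparse sampling with temporally attenuated weights is given below. This is an illustration under stated assumptions, not the paper's implementation: the summary does not give the attenuation formula, so the exponential decay with distance from the central frame is assumed, and `segment_sample_indices`, `attenuation_weights`, `video_reward`, and the `image_reward_fn` callback (a stand-in for a model such as HPSv2) are hypothetical names.

```python
import numpy as np

def segment_sample_indices(num_frames, num_segments):
    """Split the frames into equal segments and take the middle frame of each."""
    assert num_segments <= num_frames
    edges = np.linspace(0, num_frames, num_segments + 1).astype(int)
    return [int((lo + hi - 1) // 2) for lo, hi in zip(edges[:-1], edges[1:])]

def attenuation_weights(indices, num_frames, decay=2.0):
    """Assumed scheme: weights decay exponentially away from the central frame."""
    center = (num_frames - 1) / 2.0
    dist = np.abs(np.asarray(indices, dtype=float) - center) / num_frames
    weights = np.exp(-decay * dist)
    return weights / weights.sum()  # normalize so the weights sum to 1

def video_reward(frames, prompt, image_reward_fn, num_segments=4, decay=2.0):
    """Weighted average of per-frame image rewards over sparsely sampled frames."""
    idx = segment_sample_indices(len(frames), num_segments)
    weights = attenuation_weights(idx, len(frames), decay)
    scores = np.array([image_reward_fn(frames[i], prompt) for i in idx])
    return float(np.dot(weights, scores))

# Toy usage: 16 dummy frames and a placeholder scorer (mean pixel value).
frames = [np.full((8, 8, 3), i / 15.0) for i in range(16)]
dummy_reward = lambda frame, prompt: float(frame.mean())

idx = segment_sample_indices(16, 4)
weights = attenuation_weights(idx, 16)
reward = video_reward(frames, "a dog running on the beach", dummy_reward)
```

Scoring only one frame per segment keeps the number of reward-model calls small, while the attenuated weights avoid pushing every frame toward the same reward with equal strength.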

VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models

InstructVid2Vid: Controllable Video Editing with Natural Language Instructions

Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

Video Diffusion Alignment via Reward Gradients

Video Diffusion Models

Imagen Video: High Definition Video Generation with Diffusion Models

I4VGen: Image as Free Stepping Stone for Text-to-Video Generation

AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

From Slow Bidirectional to Fast Causal Video Generators

OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization

HARIVO: Harnessing Text-to-Image Models for Video Generation

ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

Diffusion Reward: Learning Rewards via Conditional Video Diffusion

Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation

DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Video Diffusion Transformers are In-Context Learners