InstructVideo: Instructing Video Diffusion Models with Human Feedback

Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni
2023-12-20
Abstract: Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2. To this end, we propose Segmental Video Reward, a mechanism to provide reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning. Extensive experiments, both qualitative and quantitative, validate the practicality and efficacy of using image reward models in InstructVideo, significantly enhancing the visual quality of generated videos without compromising generalization capabilities. Code and models will be made publicly available.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Multimedia
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses two weaknesses of video generation models, especially those based on diffusion models: poor visual quality and misalignment with text prompts. Specifically:

1. **Poor Visual Quality**: Existing video generation models often produce visually unappealing videos because they are pre-trained on large-scale internet data of uneven quality, which can even contain toxic or inappropriate content.
2. **Mismatch with Text Prompts**: The generated videos often do not align with the text prompts provided by users, which hurts the model's usability and the user experience.

To tackle these issues, the paper proposes **InstructVideo**, a method that uses human feedback to instruct text-to-video diffusion models through reward fine-tuning. Its main contributions are:

1. **Recasting Reward Fine-Tuning as Editing**: Instead of generating a video through the full DDIM sampling chain, InstructVideo corrupts a sampled video with the diffusion process and runs only part of the chain. This reduces the cost of fine-tuning and improves its efficiency.
2. **Introducing Segmental Video Reward (SegVR) and Temporally Attenuated Reward (TAR)**: Because no dedicated video reward model for human preferences is available, the method repurposes established image reward models such as HPSv2. SegVR provides reward signals based on segmental sparse sampling of frames, and TAR mitigates the degradation of temporal modeling during fine-tuning.

With these methods, InstructVideo significantly improves the visual quality of generated videos without compromising the model's generalization capability.
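
To make the idea of scoring a video with an image reward model more concrete, here is a minimal Python sketch of segmental sparse sampling combined with temporally attenuated weighting. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `image_reward` callable (a stand-in for a model such as HPSv2, whose real API is not used here), the exponential decay around the central sampled frame, and the default values of `num_segments` and `decay` are all hypothetical choices made for this example.

```python
import math
import random
from typing import Any, Callable, List, Sequence


def sample_segment_frames(num_frames: int, num_segments: int,
                          rng: random.Random) -> List[int]:
    """Split [0, num_frames) into contiguous segments; pick one index per segment."""
    if not 1 <= num_segments <= num_frames:
        raise ValueError("need 1 <= num_segments <= num_frames")
    # Integer boundaries guarantee every segment holds at least one frame.
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def attenuated_weights(num_samples: int, decay: float) -> List[float]:
    """Weights that shrink with distance from the central sample (assumed form).

    The exponential shape is an assumption of this sketch; the weights are
    normalized so that they sum to 1.
    """
    center = (num_samples - 1) / 2
    raw = [math.exp(-decay * abs(i - center)) for i in range(num_samples)]
    total = sum(raw)
    return [w / total for w in raw]


def video_reward(frames: Sequence[Any], prompt: str,
                 image_reward: Callable[[Any, str], float],
                 num_segments: int = 4, decay: float = 0.5,
                 seed: int = 0) -> float:
    """Score a video by applying an image reward model to sparsely sampled frames."""
    rng = random.Random(seed)
    indices = sample_segment_frames(len(frames), num_segments, rng)
    weights = attenuated_weights(len(indices), decay)
    scores = [image_reward(frames[i], prompt) for i in indices]
    return sum(w * s for w, s in zip(weights, scores))
```

Calling `video_reward(frames, prompt, image_reward)` evaluates the reward model on only `num_segments` frames rather than on every frame, which is the point of sparse sampling. Setting `decay` to 0 gives every sampled frame the same weight, so the attenuation can be switched off for comparison.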