Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy
DOI: https://doi.org/10.48550/arXiv.2306.07954
2023-06-13
Computer Vision and Pattern Recognition
Abstract: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the difficulty of applying existing text-to-image diffusion models to the video domain, in particular keeping the output consistent across video frames. It proposes a zero-shot, text-guided video-to-video translation framework that adapts image models to video. The framework consists of two parts:

1. **Key Frame Translation**: Uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints enforcing coherence in shape, texture, and color.
2. **Full Video Translation**: Propagates the translated key frames to the remaining frames using temporal-aware patch matching and frame blending.

The method requires no retraining or optimization and generates coherent video frames efficiently without sacrificing quality. The framework is also compatible with existing image diffusion techniques, such as LoRA for customizing a specific subject and ControlNet for extra spatial guidance, so it can take advantage of them.

### Main Contributions

1. A novel zero-shot framework for text-guided video-to-video translation that achieves both global and local temporal consistency, requires no training, and is compatible with pre-trained image diffusion models.
2. Hierarchical cross-frame consistency constraints that enforce temporal consistency in shape, texture, and color, allowing image diffusion models to adapt better to video.
3. A combination of diffusion-based generation and patch-based propagation that balances quality and efficiency.

Experimental results validate the effectiveness of the framework in generating high-quality, temporally coherent videos.
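The two-stage structure described above (translate a sparse set of key frames, then propagate them to the remaining frames) can be illustrated with a toy sketch. This is not the authors' implementation. The uniform key frame interval, the function names `select_key_frames`, `translate_key_frames`, and `propagate`, and the `stylize` callback are all assumptions made for illustration. The diffusion model with cross-frame constraints is replaced by an arbitrary per-frame function, and temporal-aware patch matching is replaced by plain linear blending between the two nearest key frames.

```python
from typing import Callable, Dict, List

Frame = List[float]  # toy "frame": a flat list of pixel intensities


def select_key_frames(num_frames: int, interval: int) -> List[int]:
    """Pick every `interval`-th frame index, always including the last frame.

    Uniform sampling is an assumption of this sketch.
    """
    if num_frames <= 0:
        return []
    if interval < 1:
        raise ValueError("interval must be >= 1")
    keys = list(range(0, num_frames, interval))
    if keys[-1] != num_frames - 1:
        keys.append(num_frames - 1)
    return keys


def translate_key_frames(
    frames: List[Frame], keys: List[int], stylize: Callable[[Frame], Frame]
) -> Dict[int, Frame]:
    """Stage 1 stand-in: apply a hypothetical `stylize` function per key frame.

    The paper uses an adapted diffusion model with hierarchical cross-frame
    constraints here; none of that is modeled in this placeholder.
    """
    return {k: stylize(frames[k]) for k in keys}


def propagate(num_frames: int, translated: Dict[int, Frame]) -> List[Frame]:
    """Stage 2 stand-in: fill each frame by linearly blending the two nearest
    translated key frames (a simplification of patch matching plus blending).
    """
    keys = sorted(translated)
    output: List[Frame] = []
    for t in range(num_frames):
        left = max(k for k in keys if k <= t)
        right = min(k for k in keys if k >= t)
        if left == right:  # t is itself a key frame
            output.append(list(translated[left]))
            continue
        w = (t - left) / (right - left)  # blend weight toward the right key
        output.append(
            [(1 - w) * a + w * b for a, b in zip(translated[left], translated[right])]
        )
    return output


# Usage: 7 toy frames of 4 pixels each, key frames every 3 frames.
video = [[float(t)] * 4 for t in range(7)]
keys = select_key_frames(len(video), interval=3)
styled = translate_key_frames(video, keys, lambda f: [2 * p for p in f])
result = propagate(len(video), styled)
```

The sketch only shows where the cost saving comes from: the expensive step runs on the key frames alone, and every other frame is derived from its neighbors.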