Abstract: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally coherent videos.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the difficulty of applying existing text-to-image diffusion models to the video domain, in particular the difficulty of keeping video frames temporally consistent. It proposes a novel zero-shot text-guided video-to-video translation framework that adapts image models to video. The framework consists of two parts:

1. **Key Frame Translation**: Uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints enforcing coherence in shape, texture, and color.
2. **Full Video Translation**: Propagates the generated key frames to the remaining frames using temporal-aware patch matching and frame blending.

The method requires no retraining or optimization, so it produces coherent video frames at low cost without sacrificing quality. The framework is also compatible with existing image diffusion techniques, such as LoRA for customizing a specific subject and ControlNet for extra spatial guidance.

### Main Contributions

1. Proposes a novel zero-shot framework for text-guided video-to-video translation that achieves both global and local temporal consistency without training and is compatible with pre-trained image diffusion models.
2. Introduces hierarchical cross-frame consistency constraints that enforce temporal consistency in shape, texture, and color, enabling image diffusion models to better adapt to video applications.
3. Combines diffusion-based generation with patch-based propagation to balance quality and efficiency.

Experimental results validate the effectiveness of the framework in generating high-quality and temporally coherent videos.
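The two-part structure described above (translate a sparse set of key frames, then propagate them to the frames in between) can be illustrated with a minimal sketch. This is not the paper's implementation: the uniform sampling interval, the function names `select_key_frames`, `blend_weights`, and `propagate`, and the flat-list frame representation are all assumptions made for illustration. In particular, the plain linear cross-fade used here is a simple stand-in for the temporal-aware patch matching and frame blending that the paper describes.

```python
from typing import Dict, List, Tuple

Frame = List[float]  # hypothetical stand-in for an image: a flat list of pixel values


def select_key_frames(num_frames: int, interval: int) -> List[int]:
    """Pick every `interval`-th frame index, always including the last frame."""
    if num_frames <= 0 or interval <= 0:
        raise ValueError("num_frames and interval must be positive")
    keys = list(range(0, num_frames, interval))
    if keys[-1] != num_frames - 1:
        keys.append(num_frames - 1)
    return keys


def blend_weights(frame: int, keys: List[int]) -> Tuple[int, int, float]:
    """Return (previous key, next key, weight of the next key) for a frame index."""
    for prev_key, next_key in zip(keys, keys[1:]):
        if prev_key <= frame <= next_key:
            alpha = (frame - prev_key) / (next_key - prev_key)
            return prev_key, next_key, alpha
    raise ValueError("frame lies outside the key frame range")


def propagate(translated: Dict[int, Frame], num_frames: int, keys: List[int]) -> List[Frame]:
    """Fill in non-key frames by cross-fading the two neighbouring key frames.

    Assumption: a linear cross-fade replaces the patch-based propagation
    used in the paper; it only shows where propagation sits in the pipeline.
    """
    video: List[Frame] = []
    for t in range(num_frames):
        if t in translated:
            # Key frames are taken directly from the (already translated) set.
            video.append(list(translated[t]))
            continue
        prev_key, next_key, alpha = blend_weights(t, keys)
        a, b = translated[prev_key], translated[next_key]
        video.append([(1 - alpha) * x + alpha * y for x, y in zip(a, b)])
    return video


keys = select_key_frames(num_frames=10, interval=4)
# Pretend these one-pixel frames came out of the key frame translation stage.
translated = {k: [float(k)] for k in keys}
video = propagate(translated, num_frames=10, keys=keys)
```

In this sketch only the key frames would pass through the diffusion model, which reflects the efficiency argument in the summary: the costly generation step runs on a subset of frames, and a cheaper propagation step covers the rest.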

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation

AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation

Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion Models

Efficient and consistent zero-shot video generation with diffusion models

Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation

TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Video-to-Video Translation with Global Temporal Consistency

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

Pix2Video: Video Editing using Image Diffusion

Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation

ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models

VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing