TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks

2024-04-25

Abstract: Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses text-conditioned image-to-video (TI2V) generation: synthesizing a realistic video from a given image and a text description. Existing TI2V frameworks typically require costly training on video-text datasets and model designs specific to text and image conditioning. The paper proposes TI2V-Zero, a zero-shot, tuning-free approach that conditions a pretrained text-to-video (T2V) diffusion model on a provided image. The main features of TI2V-Zero are as follows:

1. **Zero-shot generation**: The pretrained T2V diffusion model is used directly for image-conditioned video generation, without any optimization, fine-tuning, or external modules.
2. **"Repeat-and-slide" strategy**: The input image is embedded into the latent codes during the reverse denoising process, so that the frozen diffusion model synthesizes the video frame by frame, starting from the provided image.
3. **DDPM inversion strategy**: To ensure temporal continuity, the initial noise for each newly synthesized frame is obtained with an inversion strategy based on the DDPM forward process rather than drawn as unrelated Gaussian noise.
4. **Resampling technique**: A resampling technique is applied in the video diffusion model to help preserve visual details.

Experiments on both domain-specific and open-domain datasets show that TI2V-Zero consistently outperforms a recent open-domain TI2V model. When provided with more images, it also extends to other tasks such as video infilling and prediction, and its autoregressive design supports long video generation.
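The frame-by-frame loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `window` parameter, and the `denoise_new_frame` callable (a stand-in for the frozen T2V model's reverse process) are hypothetical. Noising the most recent frame to initialize the next one is likewise an assumption made for illustration. The only standard piece is the DDPM forward-process formula x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε.

```python
import numpy as np


def ddpm_forward_noise(x0, alpha_bar_t, rng):
    """Sample x_t ~ q(x_t | x_0) with the standard DDPM forward process."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def repeat_and_slide(first_frame, num_new_frames, window,
                     denoise_new_frame, alpha_bar_T, rng):
    """Generate frames autoregressively from a single conditioning frame.

    denoise_new_frame(known, init) is a hypothetical callable standing in
    for the frozen T2V model: it receives the `window` known frames and an
    initial noisy latent, and returns one new frame.
    """
    # Repeat: fill the conditioning queue with copies of the given image.
    queue = [first_frame.copy() for _ in range(window)]
    video = [first_frame]
    for _ in range(num_new_frames):
        known = np.stack(queue)
        # Assumption: initialize the new frame by forward-noising the most
        # recent known frame instead of sampling unrelated Gaussian noise.
        init = ddpm_forward_noise(queue[-1], alpha_bar_T, rng)
        new_frame = denoise_new_frame(known, init)
        video.append(new_frame)
        # Slide: drop the oldest frame and append the newly generated one.
        queue = queue[1:] + [new_frame]
    return np.stack(video)
```

Calling `repeat_and_slide(frame, 3, 2, model_fn, alpha_bar_T, np.random.default_rng(0))` with a real model wrapped as `model_fn` would return the input frame followed by three generated frames. The resampling step mentioned in the summary is omitted here.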

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

ED-T2V: an Efficient Training Framework for Diffusion-based Text-to-Video Generation

STIV: Scalable Text and Image Conditioned Video Generation

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Zero-Shot Video Editing through Adaptive Sliding Score Distillation

I4VGen: Image as Free Stepping Stone for Text-to-Video Generation

Diffusion Self-Distillation for Zero-Shot Customized Image Generation

Fine-gained Zero-shot Video Sampling

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

Text to Video Generation Via Knowledge Distillation

HARIVO: Harnessing Text-to-Image Models for Video Generation

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models