BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei Zhang, Limin Wang

2024-04-09

Abstract: Diffusion models have made tremendous progress in text-driven image and video generation. Text-to-image foundation models are now widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, while downstream video synthesis tasks are less explored for several reasons. First, training a video generation foundation model requires huge memory and computation overhead. Even with video foundation models, additional costly training is still required for downstream video synthesis tasks. Second, although some works extend image diffusion models to video in a training-free manner, temporal consistency cannot be well preserved. Finally, these adaptation methods are each designed for one task and fail to generalize to different tasks. To mitigate these issues, we propose a training-free general-purpose video synthesis framework, coined BIVDiff, which bridges specific image diffusion models and general text-to-video foundation diffusion models. Specifically, we first use a specific image diffusion model (e.g., ControlNet and Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion models (e.g., VidRD and ZeroScope) for temporal smoothing. This decoupled framework enables flexible image model selection for different purposes with strong task generalization and high efficiency. To validate the effectiveness and general use of BIVDiff, we perform a wide range of video synthesis tasks, including controllable video generation, video editing, video inpainting, and outpainting.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses how to perform general-purpose video synthesis without any additional training. Although text-driven image and video generation models have made progress, downstream video synthesis still faces three obstacles: training video models demands large amounts of memory and computation, training-free extensions of image models struggle to preserve temporal consistency, and existing adaptation methods are each tied to a single task. To address these obstacles, the paper proposes BIVDiff, a framework that bridges task-specific image diffusion models and general text-to-video diffusion models to achieve training-free video generation. The framework first uses an image diffusion model to generate the video frame by frame, then performs Mixed Inversion on the generated video, and finally feeds the inverted latents into a video diffusion model for temporal smoothing, yielding a temporally coherent video. Because the image model and the video model are decoupled, the image model can be swapped to suit the task, which gives the approach both task generalization and efficiency.
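The three stages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: every function and parameter name here (`mix_latents`, `bivdiff_sketch`, `image_invert`, `video_invert`, `ratio`, and so on) is hypothetical, the models are passed in as plain callables, and a latent is a toy list of floats rather than a tensor. The summary above only names the Mixed Inversion step, so the sketch assumes it amounts to a weighted blend of latents obtained by inverting the frames with the image model and with the video model.

```python
from typing import Callable, List, Sequence

Latent = List[float]  # toy stand-in for one frame's latent tensor


def mix_latents(image_latents: Sequence[Latent],
                video_latents: Sequence[Latent],
                ratio: float) -> List[Latent]:
    """Blend per-frame latents as ratio * video + (1 - ratio) * image.

    ASSUMPTION: the weighted-sum form and the name `ratio` are illustrative;
    the source text does not define Mixed Inversion.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    if len(image_latents) != len(video_latents):
        raise ValueError("both inversions must cover the same frames")
    return [
        [ratio * v + (1.0 - ratio) * i for i, v in zip(img, vid)]
        for img, vid in zip(image_latents, video_latents)
    ]


def bivdiff_sketch(conditions: Sequence,
                   image_model: Callable,
                   image_invert: Callable,
                   video_invert: Callable,
                   video_denoise: Callable,
                   ratio: float = 0.5):
    """Hypothetical three-stage pipeline with pluggable model callables."""
    # Stage 1: frame-wise generation with a task-specific image model.
    frames = [image_model(c) for c in conditions]
    # Stage 2: invert the frames with both models, then mix the latents.
    image_latents = [image_invert(f) for f in frames]  # one frame at a time
    video_latents = video_invert(frames)               # whole clip at once
    mixed = mix_latents(image_latents, video_latents, ratio)
    # Stage 3: the video model denoises the mixed latents (temporal smoothing).
    return video_denoise(mixed)
```

With stub callables the data flow can be checked end to end, for example `bivdiff_sketch([1, 2], lambda c: [float(c)], lambda f: [2.0 * f[0]], lambda fs: [[0.0] for _ in fs], lambda z: z)`. Passing the models as arguments mirrors the decoupling described above: swapping the image model changes the task while the inversion and smoothing stages stay the same.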
