Abstract: High computational cost and slow inference are major obstacles to deploying video diffusion models (VDMs) in practical applications. To overcome this, we introduce a new video diffusion model compression approach that combines pruning that preserves individual content and motion dynamics with a consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of \textbf{motion dynamics}, e.g., the coherence of the entire video, while shallower layers focus more on \textbf{individual content}, e.g., individual frames. We therefore prune redundant blocks from the shallower layers while preserving more of the deeper layers, resulting in a lightweight VDM variant called VDMini. Additionally, we propose an \textbf{Individual Content and Motion Dynamics (ICMD)} Consistency Loss so that VDMini (the student) achieves generation performance comparable to that of the larger VDM (the teacher). Specifically, we first use the Individual Content Distillation (ICD) Loss to ensure consistency between the teacher's and the student's features for each generated frame. Next, we introduce a Multi-frame Content Adversarial (MCA) Loss to enhance the motion dynamics of the generated video as a whole. This method significantly accelerates inference while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V): we achieve average speedups of 2.5$\times$ for the I2V method SF-V and 1.4$\times$ for the T2V method T2V-Turbo-v2, while maintaining the quality of the generated videos on two benchmarks, UCF101 and VBench.
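
To make the two main ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical per-block importance score, a simple split of the blocks into a shallow half and a deep half, and a plain per-frame mean squared error as a stand-in for the ICD Loss; the abstract does not specify any of these details. The function names, the keep ratios, and the feature layout are all illustrative assumptions. The adversarial MCA Loss is not covered here.

```python
from typing import List, Sequence


def select_blocks_to_keep(
    importance: Sequence[float],
    shallow_keep_ratio: float = 0.5,
    deep_keep_ratio: float = 1.0,
) -> List[int]:
    """Return sorted indices of blocks to keep (hypothetical helper).

    Blocks are split by index into a shallow half and a deep half.
    Within each half the highest-scoring blocks are kept, with a
    smaller budget for the shallow half. The scores and both ratios
    are assumed inputs, not values taken from the paper.
    """
    n = len(importance)
    split = n // 2
    kept: List[int] = []
    groups = ((0, split, shallow_keep_ratio), (split, n, deep_keep_ratio))
    for start, stop, ratio in groups:
        group = list(range(start, stop))
        budget = int(len(group) * ratio)
        # Rank by importance, highest first; ties go to the lower index.
        ranked = sorted(group, key=lambda i: (-importance[i], i))
        kept.extend(ranked[:budget])
    return sorted(kept)


def per_frame_feature_mse(
    teacher: Sequence[Sequence[float]],
    student: Sequence[Sequence[float]],
) -> float:
    """Average over frames of the MSE between teacher and student features.

    teacher[f] and student[f] are flat feature vectors for frame f.
    This is an assumed stand-in for a per-frame distillation loss.
    """
    if len(teacher) != len(student) or not teacher:
        raise ValueError("teacher and student need the same nonzero frame count")
    total = 0.0
    for t_frame, s_frame in zip(teacher, student):
        if len(t_frame) != len(s_frame) or not t_frame:
            raise ValueError("feature vectors must match in length and be nonempty")
        total += sum((t - s) ** 2 for t, s in zip(t_frame, s_frame)) / len(t_frame)
    return total / len(teacher)


if __name__ == "__main__":
    # Eight blocks with made-up importance scores.
    scores = [0.9, 0.1, 0.4, 0.2, 0.3, 0.8, 0.5, 0.7]
    print(select_blocks_to_keep(scores))  # [0, 2, 4, 5, 6, 7]
```

With these assumed settings, half of the shallow blocks are dropped and every deep block is retained, which mirrors the abstract's statement that more of the deeper layers are preserved. In practice the student would be fine-tuned after pruning, with the per-frame term computed on intermediate features of both models.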

Mobile Video Diffusion

MoViE: Mobile Diffusion for Video Editing

Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

Video Diffusion Models

Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models

Efficiency-optimized Video Diffusion Models

MV-Diffusion: Motion-aware Video Diffusion Model

SF-V: Single Forward Video Generation Model

Progressive Autoregressive Video Diffusion Models

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Video Diffusion Models with Local-Global Context Guidance

MobileNVC: Real-time 1080p Neural Video Compression on a Mobile Device

SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model

Video Diffusion Models are Training-free Motion Interpreter and Controller

Squeezing Large-Scale Diffusion Models for Mobile

MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices

Diffusion Models for Video Prediction and Infilling

SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device