Abstract: We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community have wielded the power to generate cinematic and high-resolution videos with smooth motions from arbitrary input prompts. However, as a supertask of image generation, video generation models require more computation and are thus hosted mostly on cloud servers, limiting broader adoption among content creators. In this work, we propose a comprehensive acceleration framework to bring the power of the large-scale video diffusion model to the hands of edge users. From the network architecture scope, we initialize from a compact image backbone and search out the design and arrangement of temporal layers to maximize hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the denoising steps to 4. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 Pro Max within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate the generation by orders of magnitude while delivering on-par quality.

What problem does this paper attempt to address?

The core problem this paper addresses is deploying large-scale video diffusion models on mobile devices to achieve real-time text-to-video generation. Specifically, the authors propose a comprehensive acceleration framework that makes video generation run efficiently on mobile devices, broadening its reach among content creators.

### Main Problems

1. **High Computational and Memory Requirements**: Current text-to-video diffusion models require more computational resources and memory than most mobile devices can provide.
2. **Slow Generation Speed**: Existing video generation models usually take tens of seconds or even several minutes to generate a video, which rules out real-time applications.

### Solutions

To address these challenges, the authors propose a three-stage acceleration framework:

1. **Pruning a Pretrained Text-to-Image Model to Obtain an Efficient Spatio-Temporal Backbone Network**:
   - By pruning a pretrained text-to-image diffusion model (such as Stable Diffusion v1.5), the authors obtain a smaller and faster backbone for the spatio-temporal network.
2. **Introducing a New Temporal-Layer Design and Conducting Joint Architecture Search**:
   - The authors systematically study different types of temporal layers (such as 1D convolution, 3D convolution, and self-attention) and determine the optimal spatio-temporal architecture through latency- and memory-guided architecture search.
3. **Adversarial Fine-Tuning to Further Accelerate the Generation Process**:
   - Through adversarial fine-tuning, the authors reduce the denoising steps from 25 to 4 and eliminate classifier-free guidance, significantly improving generation speed.

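
To make the effect of the third stage concrete, the following minimal sketch counts denoiser forward passes before and after fine-tuning. It rests on two assumptions that are not stated in the summary: that the 25-step baseline uses classifier-free guidance, and that guidance costs two passes per step (one conditional, one unconditional). The function name `network_evaluations` is hypothetical and used only for illustration.

```python
def network_evaluations(steps: int, use_cfg: bool) -> int:
    """Count denoiser forward passes for one sampling run.

    Assumption: classifier-free guidance costs two passes per step
    (one conditional, one unconditional); without it, one pass per step.
    """
    passes_per_step = 2 if use_cfg else 1
    return steps * passes_per_step


# Assumed baseline: 25 denoising steps with classifier-free guidance.
baseline = network_evaluations(steps=25, use_cfg=True)

# After adversarial fine-tuning: 4 steps, no classifier-free guidance.
distilled = network_evaluations(steps=4, use_cfg=False)

# Ratio of denoiser passes; this is not an end-to-end latency measurement.
reduction = baseline / distilled

print(baseline, distilled, reduction)  # 50 4 12.5
```

Under these assumptions the denoiser runs 12.5 times less often. The count covers only the denoising network, so it says nothing about the text encoder or the latent decoder, and the end-to-end speedup on a device may differ.
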
### Experimental Results

- **Performance Comparison**: The model generates a 5-second high-quality video on an iPhone 16 Pro Max within 5 seconds, whereas existing server-side models require several minutes.
- **Quantitative Evaluation**: On the VBench benchmark, the model achieves a higher total score while remaining compact, performing especially well in dynamism, motion smoothness, and aesthetic quality.

### Contribution Summary

- Proposed a comprehensive mobile acceleration framework for efficient text-to-video diffusion models.
- Developed an adversarial fine-tuning technique for video diffusion models, enabling them to generate high-quality videos within 4 denoising steps.
- Demonstrated, for the first time, the possibility of real-time text-to-video generation on mobile devices, paving the way for large-scale application of video diffusion models.

Together, these contributions address the deployment of video generation models on mobile devices and substantially improve generation efficiency while keeping quality on par with server-side models.

SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Efficient and consistent zero-shot video generation with diffusion models

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Mobile Video Diffusion

SF-V: Single Forward Video Generation Model

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Accelerating Video Diffusion Models via Distribution Matching

OSV: One Step is Enough for High-Quality Image to Video Generation

VEnhancer: Generative Space-Time Enhancement for Video Generation

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis

Imagen Video: High Definition Video Generation with Diffusion Models

Video Diffusion Models