Abstract: Video-to-Video synthesis (Vid2Vid) achieves remarkable performance in generating a photo-realistic video from a sequence of semantic maps, such as segmentation, sketch, and pose. However, this pipeline is heavily limited by high computational cost and long inference latency, mainly attributable to two factors: 1) network architecture parameters and 2) the sequential data stream. Recently, the parameters of image-based generative models have been significantly reduced via more efficient network architectures. Existing methods mainly focus on slimming network architectures but ignore the size of the sequential data stream. Moreover, because it lacks temporal coherence, image-based compression is not sufficient for video tasks. In this paper, we present a spatial-temporal hybrid distillation compression framework, Fast-Vid2Vid++, which focuses on knowledge distillation of the teacher network and on the data stream of generative models in both space and time. Fast-Vid2Vid++ makes the first attempt in the time dimension to transfer hierarchical features and temporal coherence knowledge, reducing computational resources and accelerating inference. Specifically, we compress the data stream spatially and reduce its temporal redundancy. We distill the knowledge of the hierarchical features and the final response from the teacher network to the student network in the high-resolution and full-time domains. We also transfer the long-term dependencies of the features and video frames to the student model. After the proposed spatial-temporal hybrid knowledge distillation (Spatial-Temporal-HKD), our model can synthesize high-resolution key-frames from the low-resolution data stream. Finally, Fast-Vid2Vid++ interpolates intermediate frames by motion compensation with slight latency and generates full-length sequences with motion-aware inference (MAI).
On standard benchmarks, Fast-Vid2Vid++ achieves real-time performance of 30-59 FPS and reduces computational cost by 28-35× on a single V100 GPU. Code and models are publicly available.
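
The data-stream side of the pipeline described above (spatial compression, key-frame selection, and filling in intermediate frames) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `downsample_spatial`, `select_keyframes`, and `interpolate_linear` are hypothetical, average pooling is assumed as the spatial compressor, and plain linear blending stands in for the motion-compensated interpolation used by MAI.

```python
import numpy as np


def downsample_spatial(frames, factor):
    """Average-pool a (T, H, W, C) clip by `factor` along height and width.

    Assumption: average pooling is used only for illustration.
    """
    t, h, w, c = frames.shape
    if h % factor or w % factor:
        raise ValueError("height and width must be divisible by factor")
    blocks = frames.reshape(t, h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(2, 4))


def select_keyframes(num_frames, stride):
    """Return indices of every `stride`-th frame, always including the last."""
    idx = list(range(0, num_frames, stride))
    if idx[-1] != num_frames - 1:
        idx.append(num_frames - 1)
    return idx


def interpolate_linear(key_frames, key_idx, num_frames):
    """Fill intermediate frames by blending neighboring key-frames.

    Assumption: linear blending is a placeholder for motion compensation.
    """
    out = np.empty((num_frames,) + key_frames.shape[1:], dtype=np.float64)
    out[key_idx[0]] = key_frames[0]
    for k in range(len(key_idx) - 1):
        i0, i1 = key_idx[k], key_idx[k + 1]
        for t in range(i0, i1 + 1):
            a = (t - i0) / (i1 - i0)  # blend weight toward the later key-frame
            out[t] = (1.0 - a) * key_frames[k] + a * key_frames[k + 1]
    return out
```

In this sketch, a student generator would run only on the low-resolution inputs at the indices returned by `select_keyframes`, and the remaining frames would be reconstructed from its outputs, which is where the savings in the sequential data stream would come from.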

Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis

FastBlend: a Powerful Model-Free Toolkit Making Video Stylization Easier

FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

Fast-Vid2Vid++: Spatial-Temporal Distillation for Real-Time Video-to-Video Synthesis

DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis

I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models

FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing

Object-Centric Diffusion for Efficient Video Editing

Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis

Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion

FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline

IV-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis

ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

LADDER: An Efficient Framework for Video Frame Interpolation

Streaming Video Diffusion: Online Video Editing with Diffusion Models

DeepFaceVideoEditing

Pix2Video: Video Editing using Image Diffusion

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler

VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping