Abstract: Video-to-Video synthesis (Vid2Vid) achieves remarkable performance in generating photo-realistic video from a sequence of semantic maps, such as segmentation masks, sketches, and poses. However, this pipeline is heavily limited by high computational cost and long inference latency, which stem mainly from two factors: 1) the parameters of the network architecture and 2) the sequential data stream. Recently, the parameter counts of image-based generative models have been significantly reduced via more efficient network architectures. However, existing methods mainly focus on slimming network architectures and ignore the size of the sequential data stream. Moreover, because it lacks temporal coherence, image-based compression is not sufficient for video tasks. In this paper, we present a spatial-temporal hybrid distillation compression framework, Fast-Vid2Vid++, which focuses on distilling knowledge from the teacher network and on the data stream of generative models in both space and time. Fast-Vid2Vid++ makes the first attempt to transfer hierarchical features and temporal-coherence knowledge along the time dimension in order to reduce computational resources and accelerate inference. Specifically, we compress the data stream spatially and reduce its temporal redundancy. We distill the knowledge of the hierarchical features and the final response from the teacher network to the student network in the high-resolution and full-time domains. We also transfer the long-term dependencies of the features and video frames to the student model. After the proposed spatial-temporal hybrid knowledge distillation (Spatial-Temporal-HKD), our model can synthesize high-resolution key-frames from the low-resolution data stream. Finally, Fast-Vid2Vid++ interpolates intermediate frames by motion compensation with slight latency and generates full-length sequences with motion-aware inference (MAI).
On standard benchmarks, Fast-Vid2Vid++ achieves real-time performance of 30-59 FPS and reduces computational cost by 28-35× on a single V100 GPU. Code and models are publicly available.
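
The data-stream side of the idea can be illustrated with a toy sketch in plain Python. It is not the Fast-Vid2Vid++ implementation: the names `downsample`, `select_keyframes`, `toy_student`, `blend`, `synthesize`, and `response_distillation_loss` are hypothetical, the student is a nearest-neighbour upsampler rather than a trained generator, and linear blending stands in for the motion compensation used in motion-aware inference. It only shows the flow the abstract describes: compress the input spatially, synthesize key-frames only, fill in the intermediate frames, and compare a student output with a teacher output.

```python
from typing import Callable, List

Frame = List[List[float]]  # a 2-D grid of values standing in for an image


def downsample(frame: Frame, factor: int) -> Frame:
    """Spatial compression: keep every `factor`-th pixel in each dimension."""
    return [row[::factor] for row in frame[::factor]]


def select_keyframes(frames: List, stride: int) -> List[int]:
    """Temporal compression: key-frame indices (assumes a non-empty sequence).

    The last frame is always included so every frame lies between two keys.
    """
    idx = list(range(0, len(frames), stride))
    if idx[-1] != len(frames) - 1:
        idx.append(len(frames) - 1)
    return idx


def toy_student(lowres: Frame, factor: int = 2) -> Frame:
    """Hypothetical student: nearest-neighbour upsampling of a low-res input."""
    return [[v for v in row for _ in range(factor)]
            for row in lowres for _ in range(factor)]


def blend(a: Frame, b: Frame, t: float) -> Frame:
    """Linear blend of two frames; a crude stand-in for motion compensation."""
    return [[(1 - t) * x + t * y for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def synthesize(frames: List[Frame], student: Callable[[Frame], Frame],
               stride: int, factor: int) -> List[Frame]:
    """Run the student on key-frames only and interpolate the frames between."""
    keys = select_keyframes(frames, stride)
    key_out = {i: student(downsample(frames[i], factor)) for i in keys}
    out = []
    for lo, hi in zip(keys, keys[1:]):
        for i in range(lo, hi):
            t = (i - lo) / (hi - lo)
            out.append(blend(key_out[lo], key_out[hi], t))
    out.append(key_out[keys[-1]])
    return out


def response_distillation_loss(student_out: Frame, teacher_out: Frame) -> float:
    """Mean squared error between student and teacher outputs."""
    diffs = [(s - t) ** 2
             for rs, rt in zip(student_out, teacher_out)
             for s, t in zip(rs, rt)]
    return sum(diffs) / len(diffs)
```

In this sketch the student runs on only a subset of the frames, and on inputs with `factor * factor` fewer pixels, which is where the savings would come from. A real system would replace `toy_student` with a trained generator and `blend` with motion-compensated warping.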

Fast-Vid2Vid++: Spatial-Temporal Distillation for Real-Time Video-to-Video Synthesis

Channel Distillation: Channel-Wise Attention for Knowledge Distillation

Channel-wise Knowledge Distillation for Dense Prediction

Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation

Triple-level Model Inferred Collaborative Network Architecture for Video Deraining

Channel-wise Distillation for Semantic Segmentation.

Pixel Distillation: A New Knowledge Distillation Scheme for Low-Resolution Image Recognition

Pixel Distillation: Cost-flexible Distillation Across Image Sizes and Heterogeneous Networks

Structured Knowledge Distillation for Dense Prediction

Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval

Collaborative Knowledge Distillation