Abstract: Video-to-video synthesis (Vid2Vid) achieves remarkable performance in generating photo-realistic video from a sequence of semantic maps, such as segmentation, sketch, and pose maps. However, this pipeline suffers from high computational cost and long inference latency, mainly attributable to two factors: 1) the network architecture parameters and 2) the sequential data stream. Recently, the parameters of image-based generative models have been significantly reduced via more efficient network architectures. Existing methods mainly focus on slimming network architectures but ignore the size of the sequential data stream. Moreover, because it lacks temporal coherence, image-based compression is not sufficient for compressing the video task. In this paper, we present a spatial-temporal hybrid distillation compression framework, Fast-Vid2Vid++, which focuses on knowledge distillation of the teacher network and the data stream of generative models in both space and time. Fast-Vid2Vid++ makes the first attempt in the time dimension to transfer hierarchical features and temporal coherence knowledge in order to reduce computational resources and accelerate inference. Specifically, we compress the data stream spatially and reduce the temporal redundancy. We distill the knowledge of the hierarchical features and the final response from the teacher network to the student network in the high-resolution and full-time domains. We transfer the long-term dependencies of the features and video frames to the student model. After the proposed spatial-temporal hybrid knowledge distillation (Spatial-Temporal-HKD), our model can synthesize high-resolution key frames from the low-resolution data stream. Finally, Fast-Vid2Vid++ interpolates intermediate frames by motion compensation with slight latency and generates full-length sequences with motion-aware inference (MAI).
On standard benchmarks, Fast-Vid2Vid++ achieves real-time performance of 30-59 FPS and reduces computational cost by 28-35× on a single V100 GPU. Code and models are publicly available.
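
The data-stream side of the pipeline described in the abstract (spatial downsampling, key-frame selection, filling in intermediate frames, and matching a student output to a teacher output) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the function names, the 2× average pooling, the fixed key-frame stride, and the mean-squared-error loss are hypothetical, and a simple linear blend stands in for the motion compensation used by motion-aware inference. It is not the Fast-Vid2Vid++ implementation.

```python
# Illustrative sketch only. All names and choices below are assumptions,
# not taken from the Fast-Vid2Vid++ code.

def downsample2x(frame):
    """Average-pool a 2D frame (list of rows) by 2 in each dimension.

    A trailing odd row or column is dropped.
    """
    h, w = len(frame), len(frame[0])
    return [
        [
            (frame[y][x] + frame[y][x + 1]
             + frame[y + 1][x] + frame[y + 1][x + 1]) / 4.0
            for x in range(0, w - 1, 2)
        ]
        for y in range(0, h - 1, 2)
    ]


def key_frame_indices(num_frames, stride):
    """Pick every `stride`-th frame, always including the last frame.

    Assumes num_frames >= 1 and stride >= 1.
    """
    idx = list(range(0, num_frames, stride))
    if idx[-1] != num_frames - 1:
        idx.append(num_frames - 1)
    return idx


def blend(a, b, t):
    """Pixel-wise linear blend of two equally sized frames, 0 <= t <= 1."""
    return [
        [(1.0 - t) * pa + t * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def fill_intermediate(key_frames, indices):
    """Rebuild a full-length sequence from key frames.

    Linear blending is a stand-in (assumption) for motion compensation.
    """
    out = []
    for k in range(len(indices) - 1):
        i0, i1 = indices[k], indices[k + 1]
        for i in range(i0, i1):
            t = (i - i0) / (i1 - i0)
            out.append(blend(key_frames[k], key_frames[k + 1], t))
    out.append(key_frames[-1])
    return out


def distill_loss(student_out, teacher_out):
    """Mean squared error between student and teacher frames.

    A generic response-distillation loss, used here as an assumption.
    """
    diffs = [
        (s - t) ** 2
        for row_s, row_t in zip(student_out, teacher_out)
        for s, t in zip(row_s, row_t)
    ]
    return sum(diffs) / len(diffs)
```

In this sketch, a student would see only `downsample2x` inputs at the positions returned by `key_frame_indices`, be trained with `distill_loss` against full-resolution teacher outputs, and have the remaining frames filled by `fill_intermediate`. The stride controls the trade-off: a larger stride means fewer generator calls but more frames produced by interpolation.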

Video Frame Synthesis using Deep Voxel Flow

Video Frame Interpolation Using Recurrent Convolutional Layers

Frame Interpolation via Refined Deep Voxel Flow

Multiframe Interpolation for Video Using Phase Features

Video Frame Synthesis Via Plug-and-Play Deep Locally Temporal Embedding

Flow-based Frame Interpolation Networks Combined with Occlusion-Aware Mask Estimation

Neighbor Correspondence Matching for Flow-based Video Frame Synthesis

FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

Flow Video Synthesis from an Image

Video Frame Interpolation with Densely Queried Bilateral Correlation

Video Frame Interpolation without Temporal Priors

Mixed Neural Voxels for Fast Multi-view Video Synthesis

Splatting-based Synthesis for Video Frame Interpolation

Depth-Aware Video Frame Interpolation

Motion-Aware Video Frame Interpolation

Video Frame Interpolation Via Residue Refinement

Frame Interpolation with Consecutive Brownian Bridge Diffusion

Fast-Vid2Vid++: Spatial-Temporal Distillation for Real-Time Video-to-Video Synthesis

Dynamic Frame Interpolation in Wavelet Domain

A Dynamic Multi-Scale Voxel Flow Network for Video Prediction

Frame-Recurrent Video Inpainting by Robust Optical Flow Inference