Abstract: Learned video compression (LVC) has witnessed remarkable advancements in recent years. Like traditional video coding, LVC inherits motion estimation/compensation, residual coding, and other modules, all of which are implemented with neural networks (NNs). However, within the framework of NNs and their training mechanism based on gradient backpropagation, most existing works struggle to consistently generate stable motion information, which takes the form of geometric features, from the input color features. Moreover, modules such as inter-prediction and residual coding are independent of each other, which makes it difficult to fully reduce the spatial-temporal redundancy. To address these problems, we propose a novel Spatial-Temporal Transformer based Video Compression (STT-VC) framework. It contains a Relaxed Deformable Transformer (RDT) with Uformer-based offset estimation for motion estimation and compensation, a Multi-Granularity Prediction (MGP) module based on multi-reference frames for prediction refinement, and a Spatial Feature Distribution prior based Transformer (SFD-T) for efficient temporal-spatial joint residual compression. Specifically, RDT is developed to stably estimate the motion information between frames by thoroughly investigating the relationship between similarity-based geometric motion feature extraction and self-attention. MGP is designed to fuse multi-reference frame information by effectively exploring the coarse-grained prediction feature generated with the coded motion information. SFD-T compresses the residual information by jointly exploring the spatial feature distributions in both the residual and the temporal prediction to further reduce the spatial-temporal redundancy. Experimental results demonstrate that our method achieves the best result, with a 13.5% BD-Rate saving over VTM.

Compact Temporal Trajectory Representation for Talking Face Video Compression

Beyond Keypoint Coding: Temporal Evolution Inference with Compact Feature Representation for Talking Face Video Compression

Generative Human Video Compression with Multi-granularity Temporal Trajectory Factorization

Compressing Video Calls using Synthetic Talking Heads

Audio-driven Talking Face Video Generation with Natural Head Pose

Temporal context video compression with flow-guided feature prediction

Interactive Face Video Coding: A Generative Compression Framework

Compact Representation for Dynamic Texture Video Coding Using Tensor Method

Toward Fine-Grained Talking Face Generation

DAVD-Net: Deep Audio-Aided Video Decompression of Talking Heads

Model-based portrait video compression with spatial constraint and adaptive pose processing

Dynamic Multi-Reference Generative Prediction for Face Video Compression

From Visual Search to Video Compression: A Compact Representation Framework for Video Feature Descriptors

Disentangled Visual Representations for Extreme Human Body Video Compression

Towards Analysis-Friendly Face Representation with Scalable Feature and Texture Compression

Hierarchical Coding for Talking-Head Video

Spatial-Temporal Transformer based Video Compression Framework

Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation

Face Region Based Conversational Video Coding

Extreme Generative Human-Oriented Video Coding via Motion Representation Compression

Beyond GFVC: A Progressive Face Video Compression Framework with Adaptive Visual Tokens