Abstract: Learned video compression (LVC) has advanced remarkably in recent years. Like traditional video coding, LVC inherits motion estimation/compensation, residual coding, and other modules, all of which are implemented with neural networks (NNs). However, within the NN framework and its training mechanism based on gradient backpropagation, most existing works struggle to consistently generate stable motion information, which takes the form of geometric features, from the input color features. Moreover, modules such as inter-prediction and residual coding are independent of each other, which makes it difficult to fully reduce the spatial-temporal redundancy. To address these problems, we propose a novel Spatial-Temporal Transformer based Video Compression (STT-VC) framework. It contains a Relaxed Deformable Transformer (RDT) with Uformer-based offset estimation for motion estimation and compensation, a Multi-Granularity Prediction (MGP) module based on multi-reference frames for prediction refinement, and a Spatial Feature Distribution prior based Transformer (SFD-T) for efficient temporal-spatial joint residual compression. Specifically, RDT is developed to stably estimate the motion information between frames by thoroughly investigating the relationship between similarity-based geometric motion feature extraction and self-attention. MGP is designed to fuse multi-reference frame information by effectively exploiting the coarse-grained prediction feature generated with the coded motion information. SFD-T compresses the residual information by jointly exploring the spatial feature distributions in both the residual and the temporal prediction to further reduce the spatial-temporal redundancy. Experimental results demonstrate that our method achieves the best result, with a 13.5% BD-Rate saving over VTM.

Improving Learned Video Compression by Exploring Spatial Redundancy

Foreground-Background Parallel Compression with Residual Encoding for Surveillance Video

Spatial Decomposition and Temporal Fusion based Inter Prediction for Learned Video Compression

Learned Video Compression with Adaptive Temporal Prior and Decoded Motion-aided Quality Enhancement

Temporal context video compression with flow-guided feature prediction

Spatial-Temporal Transformer based Video Compression Framework

Learning-Based Video Compression Framework With Implicit Spatial Transform for Applications in the Internet of Things

Adaptive Prediction Structure for Learned Video Compression

Enhancing Temporal Context for Learned Video Compression

Exploring Long- and Short-Range Temporal Information for Learned Video Compression

Learning Image and Video Compression through Spatial-Temporal Energy Compaction

Accelerating Learned Video Compression via Low-Resolution Representation Learning

Neural Video Compression using Spatio-Temporal Priors

Exploring Spatiotemporal Relationships for Improving Compressed Video Quality

A New Framework Based on Spatio-Temporal Information for Enhancing Compressed Video

Learned Video Compression With Efficient Temporal Context Learning

Video Compression Artifact Reduction via Spatio-Temporal Multi-Hypothesis Prediction

Learned Video Compression Via Joint Spatial-Temporal Correlation Exploration

High Efficiency Deep-learning Based Video Compression

Learning-Based End-to-End Video Compression with Spatial-Temporal Adaptation