Abstract: Learned video compression (LVC) has advanced remarkably in recent years. Like traditional video coding, LVC retains motion estimation/compensation, residual coding, and other modules, all of which are implemented with neural networks (NNs). However, within the NN framework and its gradient-backpropagation training, most existing works struggle to consistently generate stable motion information, which takes the form of geometric features, from the input color features. Moreover, modules such as inter-prediction and residual coding are independent of each other, which makes it difficult to fully remove spatial-temporal redundancy. To address these problems, we propose a novel Spatial-Temporal Transformer based Video Compression (STT-VC) framework. It contains a Relaxed Deformable Transformer (RDT) with Uformer-based offset estimation for motion estimation and compensation, a Multi-Granularity Prediction (MGP) module based on multiple reference frames for prediction refinement, and a Spatial Feature Distribution prior based Transformer (SFD-T) for efficient joint spatial-temporal residual compression. Specifically, RDT stably estimates the motion information between frames by thoroughly investigating the relationship between similarity-based geometric motion feature extraction and self-attention. MGP fuses the multi-reference frame information by effectively exploring the coarse-grained prediction feature generated with the coded motion information. SFD-T compresses the residual information by jointly exploring the spatial feature distributions of both the residual and the temporal prediction to further reduce spatial-temporal redundancy. Experimental results demonstrate that our method achieves the best result, with a 13.5% BD-Rate saving over VTM.
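
For readers unfamiliar with the BD-Rate figure quoted in the abstract, the following is a minimal sketch of the commonly used Bjøntegaard delta-rate calculation: fit log-bitrate as a cubic polynomial of PSNR for each codec, integrate both fits over the overlapping PSNR range, and convert the average gap into a percentage. The function name `bd_rate` and the rate/PSNR points are illustrative assumptions, not the paper's evaluation code or data; a negative value indicates a bitrate saving relative to the anchor.

```python
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of the test codec vs. the anchor
    at equal quality. Hypothetical helper, not taken from the paper."""
    log_ra = np.log10(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log10(np.asarray(rate_test, dtype=float))
    pa = np.asarray(psnr_anchor, dtype=float)
    pt = np.asarray(psnr_test, dtype=float)

    # Fit log-rate as a cubic polynomial of quality for each codec.
    poly_a = np.polyfit(pa, log_ra, 3)
    poly_t = np.polyfit(pt, log_rt, 3)

    # Compare only over the quality range covered by both curves.
    lo = max(pa.min(), pt.min())
    hi = min(pa.max(), pt.max())

    # Mean log-rate of each fit over [lo, hi] via the antiderivative.
    int_a = np.polyint(poly_a)
    int_t = np.polyint(poly_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)

    # Convert the average log-rate gap back to a percentage.
    return (10 ** (avg_t - avg_a) - 1) * 100


# Synthetic example (made-up numbers): the test codec needs 20% fewer
# bits than the anchor at every quality level.
anchor_rate = [100.0, 200.0, 400.0, 800.0]   # kbps, assumed
anchor_psnr = [32.0, 35.0, 38.0, 41.0]       # dB, assumed
test_rate = [0.8 * r for r in anchor_rate]
print(f"BD-Rate: {bd_rate(anchor_rate, anchor_psnr, test_rate, anchor_psnr):.2f}%")
```

Because the synthetic test curve is the anchor curve scaled by a constant factor of 0.8, the computed BD-Rate is about -20% by construction. Real evaluations use measured rate-distortion points from several quantization settings per sequence.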

Temporal Enhanced Hybrid Neural Representation for Video Compression

HNeRV: A Hybrid Neural Representation for Videos

DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes

Towards Scalable Neural Representation for Diverse Videos

High-Frequency Enhanced Hybrid Neural Representation for Video Compression

NeRV: Neural Representations for Videos

Boosting Neural Representations for Videos with a Conditional Decoder

VQ-NeRV: A Vector Quantized Neural Representation for Videos

NERV++: An Enhanced Implicit Neural Video Representation

DNeRV: Modeling Inherent Dynamics Via Difference Neural Representation for Videos

VQNeRV: Vector Quantization Neural Representation for Video Compression

E-NeRV: Expedite Neural Video Representation with Disentangled Spatial-Temporal Context

MNeRV: A Multilayer Neural Representation for Videos

Fast Encoding and Decoding for Implicit Video Representation

HiNeRV: Video Compression with Hierarchical Encoding-based Neural Representation

End-to-end Neural Video Coding Using a Compound Spatiotemporal Representation

Immersive Video Compression using Implicit Neural Representations

SSNVC: Single Stream Neural Video Compression with Implicit Temporal Information

PNeRV: A Polynomial Neural Representation for Videos

FFNeRV: Flow-Guided Frame-Wise Neural Representations for Videos

Spatial-Temporal Transformer based Video Compression Framework