Abstract: Learned video compression (LVC) has advanced remarkably in recent years. Like traditional video coding, LVC retains motion estimation/compensation, residual coding, and other modules, all of which are implemented with neural networks (NNs). However, within the NN framework and its gradient-backpropagation training mechanism, most existing works struggle to consistently generate stable motion information, which takes the form of geometric features, from the input color features. Moreover, modules such as inter-prediction and residual coding are independent of each other, which makes it difficult to fully remove the spatial-temporal redundancy. To address these problems, we propose a novel Spatial-Temporal Transformer based Video Compression (STT-VC) framework. It contains a Relaxed Deformable Transformer (RDT) with Uformer-based offset estimation for motion estimation and compensation, a Multi-Granularity Prediction (MGP) module based on multiple reference frames for prediction refinement, and a Spatial Feature Distribution prior based Transformer (SFD-T) for efficient temporal-spatial joint residual compression. Specifically, RDT stably estimates the motion information between frames by thoroughly investigating the relationship between similarity-based geometric motion feature extraction and self-attention. MGP fuses the multi-reference frame information by exploiting the coarse-grained prediction feature generated with the coded motion information. SFD-T compresses the residual information by jointly exploring the spatial feature distributions in both the residual and the temporal prediction to further reduce the spatial-temporal redundancy. Experimental results demonstrate that our method achieves the best result, with a 13.5% BD-Rate saving over VTM.
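
For readers unfamiliar with the BD-Rate metric quoted in the abstract, the following is a minimal sketch of how a Bjøntegaard-style rate difference is commonly computed from two rate-distortion curves. It is not taken from the paper: the function name `bd_rate`, the cubic fit of log-rate against PSNR, and the sample rate/PSNR points are all illustrative assumptions, and the authors' evaluation script may differ in detail.

```python
import numpy as np

def bd_rate(anchor_rates, anchor_psnrs, test_rates, test_psnrs):
    """Average bitrate change (%) of the test codec relative to the anchor
    at equal quality. Negative values mean the test codec saves bitrate.
    Hypothetical helper for illustration; needs at least 4 RD points per curve."""
    log_ra = np.log(np.asarray(anchor_rates, dtype=float))
    log_rt = np.log(np.asarray(test_rates, dtype=float))
    pa = np.asarray(anchor_psnrs, dtype=float)
    pt = np.asarray(test_psnrs, dtype=float)

    # Only the PSNR range covered by both curves is compared.
    lo = max(pa.min(), pt.min())
    hi = min(pa.max(), pt.max())

    # Fit log-rate as a cubic polynomial of PSNR. PSNR is shifted by `lo`
    # so the fit stays numerically well conditioned.
    fit_a = np.polyfit(pa - lo, log_ra, 3)
    fit_t = np.polyfit(pt - lo, log_rt, 3)

    # Mean log-rate of each curve over the shared interval [0, hi - lo].
    span = hi - lo
    avg_a = np.polyval(np.polyint(fit_a), span) / span
    avg_t = np.polyval(np.polyint(fit_t), span) / span

    # Convert the mean log-rate gap back to a percentage.
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0

# Made-up RD points: the test codec uses half the anchor's bitrate
# at every PSNR level, so the result should be close to -50.
anchor_bpp = [0.05, 0.10, 0.20, 0.40]
test_bpp = [0.025, 0.05, 0.10, 0.20]
psnr_db = [32.0, 35.0, 38.0, 41.0]
print(bd_rate(anchor_bpp, psnr_db, test_bpp, psnr_db))
```

Under this convention, a 13.5% BD-Rate saving corresponds to a return value of about -13.5 when VTM is passed as the anchor.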

Light Field Image Compression Using Multi-branch Spatial Transformer Networks Based View Synthesis

Light Field Image Compression Using Generative Adversarial Network-Based View Synthesis

Light Field Image Compression Based on Deep Learning

Light Field Image Compression with Sub-apertures Reordering and Adaptive Reconstruction

Light Field Compression Based on Implicit Neural Representation

High Efficiency Light Field Compression Via Virtual Reference And Hierarchical MV-HEVC

Light Field Image Super-Resolution Network Based on Angular Difference Enhancement

View Position Prior-Supervised Light Field Angular Super-Resolution Network with Asymmetric Feature Extraction and Spatial-Angular Interaction

Shearlet Transform based Light Field Compression Under Low Bitrates

Learning Kernel-Modulated Neural Representation for Efficient Light Field Compression

Light Field Video Compression and Real Time Rendering

Light-field view synthesis using convolutional block attention module

Light field image coding using a residual channel attention network–based view synthesis

Light field angular super-resolution based on intrinsic and geometric information

Data Compression of Light Field Using Multiscale Edges

Light Field Image Compression Using Depth-Based CNN In Intra Prediction

Light Field All-in-focus Image Fusion Based on Spatially-Guided Angular Information

Surface Light Field Compression Using a Point Cloud Codec

Light Field Compression With Disparity-Guided Sparse Coding Based on Structural Key Views

Spatial-Temporal Transformer based Video Compression Framework

Pseudo-Sequence-Based Light Field Image Compression