Abstract: Learned video compression (LVC) has witnessed remarkable advancements in recent years. Similar to traditional video coding, LVC inherits motion estimation/compensation, residual coding, and other modules, all of which are implemented with neural networks (NNs). However, within the framework of NNs and their training mechanism based on gradient backpropagation, most existing works struggle to consistently generate stable motion information, which takes the form of geometric features, from the input color features. Moreover, modules such as inter-prediction and residual coding are independent of each other, which limits how fully the spatial-temporal redundancy can be reduced. To address these problems, we propose a novel Spatial-Temporal Transformer based Video Compression (STT-VC) framework. It contains a Relaxed Deformable Transformer (RDT) with Uformer-based offset estimation for motion estimation and compensation, a Multi-Granularity Prediction (MGP) module based on multi-reference frames for prediction refinement, and a Spatial Feature Distribution prior based Transformer (SFD-T) for efficient temporal-spatial joint residual compression. Specifically, RDT is developed to stably estimate the motion information between frames by thoroughly investigating the relationship between similarity-based geometric motion feature extraction and self-attention. MGP is designed to fuse the multi-reference frame information by effectively exploring the coarse-grained prediction feature generated with the coded motion information. SFD-T compresses the residual information by jointly exploring the spatial feature distributions in both the residual and the temporal prediction to further reduce the spatial-temporal redundancy. Experimental results demonstrate that our method achieves the best result, with a 13.5% BD-Rate saving over VTM.
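
For readers unfamiliar with the BD-Rate figure quoted in the abstract, the following is a minimal sketch of how such a number can be computed from rate-distortion points. It is not the authors' evaluation code. As a simplifying assumption, it uses piecewise-linear interpolation of log-bitrate over quality in place of the cubic polynomial fit used in the standard Bjøntegaard calculation. The function name `bd_rate_sketch`, the helper `_interp`, and all sample numbers are hypothetical.

```python
import math

def _interp(x, xs, ys):
    """Piecewise-linear interpolation at x; xs must be strictly increasing."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x lies outside the sampled range")

def bd_rate_sketch(anchor, test, steps=1000):
    """Average bitrate change (%) of `test` relative to `anchor` at equal quality.

    anchor, test: lists of (bitrate, psnr) pairs, one pair per operating point.
    A negative result means the test codec needs fewer bits on average.
    """
    # Work in the log-bitrate domain, indexed by quality (PSNR).
    a = sorted((q, math.log(r)) for r, q in anchor)
    t = sorted((q, math.log(r)) for r, q in test)
    aq, al = zip(*a)
    tq, tl = zip(*t)

    # Only the quality range covered by both curves is compared.
    lo, hi = max(aq[0], tq[0]), min(aq[-1], tq[-1])
    if hi <= lo:
        raise ValueError("the two curves have no overlapping quality range")

    # Mean log-rate difference over the overlap (midpoint rule).
    total = 0.0
    for i in range(steps):
        q = lo + (i + 0.5) * (hi - lo) / steps
        total += _interp(q, tq, tl) - _interp(q, aq, al)
    return (math.exp(total / steps) - 1.0) * 100.0

# Hypothetical anchor curve, and a test curve using 13.5% fewer bits
# at every quality level.
anchor_points = [(1000.0, 32.0), (2000.0, 35.0), (4000.0, 38.0), (8000.0, 40.0)]
test_points = [(r * 0.865, q) for r, q in anchor_points]
saving = bd_rate_sketch(anchor_points, test_points)
```

In this constructed example the test curve is a uniform 13.5% bitrate reduction of the anchor, so `saving` comes out at approximately -13.5. Real curves differ by varying amounts across the quality range, which is why the metric averages over the overlapping interval.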

A Compressed Video Quality Enhancement Algorithm Based on CNN and Transformer Hybrid Network

Spatio-Temporal Deformable Convolution for Compressed Video Quality Enhancement

RT-VENet: A Convolutional Network for Real-time Video Enhancement

Deep Convolutional Neural Network For Decompressed Video Enhancement

FastCNN: Towards Fast and Accurate Spatiotemporal Network for HEVC Compressed Video Enhancement

A CNN-based Prediction-Aware Quality Enhancement Framework for VVC

Exploring Spatiotemporal Relationships for Improving Compressed Video Quality

Valid Information Guidance Network for Compressed Video Quality Enhancement

LEARNING-BASED MULTI-FRAME VIDEO QUALITY ENHANCEMENT

Spatial-Temporal Transformer based Video Compression Framework

Improving Compressed Video Using Single Lightweight Model with Temporal Fusion Module

Compression-Realized Deep Structural Network for Video Quality Enhancement

Compressed Video Quality Enhancement with Motion Approximation and Blended Attention

SC-HVPPNet: Spatial and Channel Hybrid-Attention Video Post-Processing Network with CNN and Transformer

Spatial-Temporal Adaptive Compressed Screen Content Video Quality Enhancement

GENERALIZED COMPRESSED VIDEO RESTORATION BY MULTI-SCALE TEMPORAL FUSION AND HIERARCHICAL QUALITY SCORE ESTIMATION

High Efficiency Deep-learning Based Video Compression

Multi-Frame Quality Enhancement for Compressed Video

Compressed Video Quality Enhancement With Temporal Group Alignment and Fusion

Hybrid CNN-Transformer Architecture for Efficient Large-Scale Video Snapshot Compressive Imaging