A Compressed Video Quality Enhancement Algorithm Based on CNN and Transformer Hybrid Network

Hao Li, Xiaohai He, Shuhua Xiong, Haibo He, Honggang Chen
DOI: https://doi.org/10.1007/s11227-024-06654-0
IF: 3.3
2024-01-01
The Journal of Supercomputing
Abstract: Convolutional neural network (CNN)-based algorithms perform well at enhancing video quality by removing artifacts from compressed videos. Existing state-of-the-art approaches concentrate primarily on exploiting spatiotemporal details from neighboring frames through deformable convolution. However, the offset fields in deformable convolution are difficult to train: their instability during training frequently results in offset overflow, which reduces the efficiency of correlation modeling. In addition, convolution alone is insufficient for effectively modeling long-range dependencies. We introduce a CNN and transformer-based compressed video quality enhancement (CTVE) method, which comprises three essential modules: the feature initial processing (FIP) module, the feature further processing (FFP) module, and the reconstruction module. The FIP module is built on deformable convolution (DCN), enabling it to perform an initial extraction of spatiotemporal information from neighboring frames. The FFP module is based on Swinv2-transformer, which can accurately model the relevant contextual information and adapt well to image content. Extensive experiments on JCT-VC test sequences demonstrate that our method achieves outstanding average performance in both subjective and objective quality assessments.
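The offset-overflow issue mentioned in the abstract can be illustrated with a small, dependency-free sketch of deformable sampling. This is not the authors' implementation: the function names `bilinear_sample` and `deformable_sample_3x3`, the single-channel 5x5 frame, and the zero-padding boundary rule are all assumptions made for illustration. The sketch computes one output value of a 3x3 deformable convolution, where each kernel tap reads the frame at its regular grid position plus a learned offset.

```python
import math

def bilinear_sample(frame, y, x):
    """Read `frame` (a list of rows) at fractional position (y, x).
    Assumption: positions outside the frame contribute 0 (zero padding)."""
    h, w = len(frame), len(frame[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            # Bilinear weight of the integer neighbor (yy, xx).
            weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
            if 0 <= yy < h and 0 <= xx < w:
                value += weight * frame[yy][xx]
    return value

def deformable_sample_3x3(frame, cy, cx, offsets, weights):
    """One output value of a 3x3 deformable convolution centered at (cy, cx).
    offsets: nine (dy, dx) pairs, one per kernel tap (hypothetical values here;
    in a real network they are predicted by a separate convolution).
    weights: nine kernel weights in row-major order."""
    grid = [(gy, gx) for gy in (-1, 0, 1) for gx in (-1, 0, 1)]
    out = 0.0
    for (gy, gx), (oy, ox), wgt in zip(grid, offsets, weights):
        out += wgt * bilinear_sample(frame, cy + gy + oy, cx + gx + ox)
    return out

# A constant 5x5 frame and an averaging kernel.
frame = [[1.0] * 5 for _ in range(5)]
weights = [1.0 / 9] * 9

# Zero offsets reduce to an ordinary 3x3 convolution.
regular = deformable_sample_3x3(frame, 2, 2, [(0.0, 0.0)] * 9, weights)

# Offsets far larger than the frame push every tap outside it, so the
# output carries no information from the neighboring frame at all.
overflow = deformable_sample_3x3(frame, 2, 2, [(100.0, 100.0)] * 9, weights)
```

Under these assumptions, `regular` recovers the local average of the frame while `overflow` collapses to zero, which is one way to picture why unstable offsets make correlation modeling less efficient.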