Convformer: Dual-Stream Vision Transformers and Convolutional Networks for Image Restoration

Changzhi Yang, Huihui Pan, Jue Wang
DOI: https://doi.org/10.1109/tim.2024.3413168
IF: 5.6
2024-01-01
IEEE Transactions on Instrumentation and Measurement
Abstract: Vision Transformers (VITs) and convolutional neural networks (CNNs) have achieved impressive success in computer vision tasks. VITs perceive local and global semantic information in both deep and shallow network layers, while CNNs strictly follow a stepwise perception process of refining global features from local features. In this work, we propose Convformer, a dual-stream encoder-decoder network for image restoration that combines the advantages of both VITs and CNNs. The core of our model is a Convformer block containing three key components. First, we present a VIT block, in which we propose a sampling multihead attention (SMHA) mechanism and a gated-sampling feedforward network (GSFN), which encourage the capture of long-distance dependencies and significantly reduce computational complexity. Second, we present a CNN block, which can be cascaded like Transformers. Third, we introduce a dual mutual attention (DMA) mechanism to share semantic information between the two streams. Our DMA mechanism is composed of a local feature attention (LFA) mechanism, through which local features flow from the CNN stream into the VIT stream, and a global feature attention (GFA) mechanism, through which global features flow from the VIT stream into the CNN stream. We evaluate Convformer on several image restoration tasks, including image denoising and motion deblurring. Extensive experiments demonstrate that Convformer outperforms previous state-of-the-art methods in terms of both peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), with lower computational complexity.
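The abstract reports its results in PSNR. As a reminder of how that metric is commonly computed, the following is a minimal sketch assuming the standard definition PSNR = 10 * log10(MAX^2 / MSE). The function name psnr, the flat-list image representation, and the default max_value of 255 are illustrative assumptions and are not taken from the paper.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio (in dB) between two equally sized images.

    Both images are flat sequences of pixel intensities (an assumption made
    for this sketch); max_value is the largest possible pixel intensity.
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")

    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)

    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the restored image is closer to the reference. For images normalized to the range [0, 1], pass max_value=1.0.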