EViTIB: Efficient Vision Transformer Via Inductive Bias Exploration for Image Super-Resolution

Anni Yu, Zhong-Han Niu, Jia-Xin Xie, Qing-Long Zhang, Yu-Bin Yang
DOI: https://doi.org/10.1109/ijcnn60899.2024.10651235
2024-01-01
Abstract: Transformers have exhibited considerable promise in image super-resolution (SR) owing to their capability of establishing long-range dependencies. Nonetheless, vision transformers approach an image as a 1D token sequence, lacking inductive biases to model local visual patterns and scale invariance, which are essential for recovering local details. To address these challenges, we introduce EViTIB, a transformer-based image super-resolution network that integrates the inherent inductive biases of CNNs. EViTIB adopts a concurrent structure where each transformer layer incorporates a convolution branch in parallel with the multi-head self-attention branch. The features from these two branches are subsequently aggregated via a Hybrid Feature Coupling (HFC) module. Consequently, EViTIB takes advantage of locality inductive biases while maintaining the capacity to encompass global dependencies. Extensive experiments demonstrate that, under comparable parameter complexity and FLOPs, EViTIB outperforms recent state-of-the-art SR methods.
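The concurrent structure described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify the internals of the HFC module, so the fusion step here (concatenation followed by a linear projection) is a hypothetical stand-in. The single attention head, the depthwise 3x3 convolution, and all names (`parallel_block`, `w_fuse`, and so on) are likewise assumptions made for illustration.

```python
import numpy as np

def self_attention(tokens, wq, wk, wv):
    """Single-head self-attention over an (N, C) token sequence (global branch)."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def depthwise_conv3x3(fmap, kernel):
    """Zero-padded 3x3 depthwise convolution on an (H, W, C) map (local branch)."""
    h, w, _ = fmap.shape
    padded = np.pad(fmap, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros(fmap.shape, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernel[dy, dx, :]
    return out

def parallel_block(fmap, params):
    """Run both branches on the same input, then fuse their features."""
    h, w, c = fmap.shape
    tokens = fmap.reshape(h * w, c)  # image treated as a 1D token sequence
    attn = self_attention(tokens, params["wq"], params["wk"], params["wv"])
    conv = depthwise_conv3x3(fmap, params["kernel"]).reshape(h * w, c)
    # Hypothetical stand-in for HFC: concatenate and project back to C channels.
    fused = np.concatenate([attn, conv], axis=-1) @ params["w_fuse"]
    return fused.reshape(h, w, c)

rng = np.random.default_rng(0)
c = 8
params = {
    "wq": rng.standard_normal((c, c)) * 0.1,
    "wk": rng.standard_normal((c, c)) * 0.1,
    "wv": rng.standard_normal((c, c)) * 0.1,
    "kernel": rng.standard_normal((3, 3, c)) * 0.1,
    "w_fuse": rng.standard_normal((2 * c, c)) * 0.1,
}
x = rng.standard_normal((6, 6, c))
y = parallel_block(x, params)
print(y.shape)
```

The point of the sketch is only that both branches see the same input: the attention branch mixes information across all positions, while the convolution branch mixes only within a 3x3 neighbourhood, and the fusion step combines the two.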