MxT: Mamba x Transformer for Image Inpainting

Shuang Chen, Amir Atapour-Abarghouei, Haozheng Zhang, Hubert P. H. Shum
2024-08-16
Abstract: Image inpainting, or image completion, is a crucial task in computer vision that aims to restore missing or damaged regions of images with semantically coherent content. This technique requires a precise balance of local texture replication and global contextual understanding to ensure the restored image integrates seamlessly with its surroundings. Traditional methods using Convolutional Neural Networks (CNNs) are effective at capturing local patterns but often struggle with broader contextual relationships due to their limited receptive fields. Recent advancements have incorporated transformers, leveraging their ability to understand global interactions. However, these methods face computational inefficiencies and struggle to maintain fine-grained details. To overcome these challenges, we introduce MxT, composed of the proposed Hybrid Module (HM), which combines Mamba with the transformer in a synergistic manner. Mamba is adept at efficiently processing long sequences with linear computational costs, making it an ideal complement to the transformer for handling long-range data interactions. Our HM facilitates dual-level interaction learning at both pixel and patch levels, greatly enhancing the model's ability to reconstruct images with high quality and contextual accuracy. We evaluate MxT on the widely used CelebA-HQ and Places2-standard datasets, where it consistently outperforms existing state-of-the-art methods. The code will be released: https://github.com/ChrisChen1023/MxT
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two key issues in image inpainting:

1. **Balancing local texture replication and global context understanding**: Image inpainting requires a precise balance between replicating local textures and understanding the global context, so that the repaired region integrates seamlessly with its surroundings. Traditional methods based on Convolutional Neural Networks (CNNs) capture local patterns well, but their limited receptive fields make it hard to model broader contextual relationships.
2. **Limitations of existing methods**: Recent work has introduced transformers into image inpainting to capture global interactions, but these methods are computationally inefficient and have difficulty preserving fine-grained detail.

To address these problems, the authors propose MxT, which combines the strengths of Mamba and the transformer. Mamba handles long sequences efficiently at linear computational cost, while the transformer excels at capturing global interactions between local regions. This combination allows the model to perform dual-level interaction learning at both the pixel and patch levels, significantly improving the quality and contextual accuracy of the inpainted image.

Experimental results show that MxT outperforms existing state-of-the-art methods on two widely used datasets, CelebA-HQ and Places2. Additionally, MxT demonstrates strong performance in high-resolution image inpainting tasks, indicating its potential for practical applications.
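To make the cost argument concrete, the following minimal sketch compares how many tokens an image yields at the pixel level and at the patch level, and how the work grows for full self-attention (quadratic in sequence length) versus a recurrent state-space scan (linear in sequence length). The 256x256 resolution, the patch size of 8, and all function names are illustrative assumptions, not values or code from the paper. The `linear_scan` function is a toy scalar recurrence with fixed coefficients. It is not Mamba's selective scan, whose parameters depend on the input; it only illustrates the single-pass, linear-time structure.

```python
def sequence_lengths(height, width, patch):
    """Return (pixel_tokens, patch_tokens) for an image split into square patches."""
    if height % patch or width % patch:
        raise ValueError("patch size must divide both image dimensions")
    return height * width, (height // patch) * (width // patch)


def attention_pairs(n_tokens):
    """Token pairs scored by full self-attention: grows quadratically."""
    return n_tokens * n_tokens


def scan_steps(n_tokens):
    """State updates made by a recurrent scan: grows linearly."""
    return n_tokens


def linear_scan(xs, a=0.9, b=0.1):
    """Toy scalar recurrence h_t = a * h_{t-1} + b * x_t, computed in one pass.

    The coefficients a and b are fixed, hypothetical values chosen for illustration.
    """
    h, out = 0.0, []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


# Assumed example: a 256x256 image with 8x8 patches.
pixels, patches = sequence_lengths(256, 256, patch=8)
print(pixels, patches)            # 65536 1024
print(attention_pairs(pixels))    # 4294967296
print(attention_pairs(patches))   # 1048576
print(scan_steps(pixels))         # 65536
```

Under these assumed sizes, full attention over individual pixels would score billions of token pairs, while attention over patches or a linear scan over pixels stays tractable. This is consistent with the summary's motivation for pairing a linear-cost sequence model with a transformer, although the sketch says nothing about how the paper's Hybrid Module actually assigns the two levels.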