MxT: Mamba x Transformer for Image Inpainting

Shuang Chen, Amir Atapour-Abarghouei, Haozheng Zhang, Hubert P. H. Shum

2024-08-16

Abstract: Image inpainting, or image completion, is a crucial task in computer vision that aims to restore missing or damaged regions of images with semantically coherent content. This technique requires a precise balance of local texture replication and global contextual understanding to ensure the restored image integrates seamlessly with its surroundings. Traditional methods using Convolutional Neural Networks (CNNs) are effective at capturing local patterns but often struggle with broader contextual relationships due to their limited receptive fields. Recent advancements have incorporated transformers, leveraging their ability to understand global interactions. However, these methods face computational inefficiencies and struggle to maintain fine-grained details. To overcome these challenges, we introduce MxT, composed of the proposed Hybrid Module (HM), which combines Mamba with the transformer in a synergistic manner. Mamba is adept at efficiently processing long sequences with linear computational costs, making it an ideal complement to the transformer for handling long-scale data interactions. Our HM facilitates dual-level interaction learning at both pixel and patch levels, greatly enhancing the model's ability to reconstruct images with high quality and contextual accuracy. We evaluate MxT on the widely used CelebA-HQ and Places2-standard datasets, where it consistently outperforms existing state-of-the-art methods. The code will be released: https://github.com/ChrisChen1023/MxT.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses two key issues in image inpainting:

1. **Balancing local texture replication and global context understanding**: Image inpainting requires a precise balance between replicating local textures and understanding the global context, so that the restored region integrates seamlessly with its surroundings. Convolutional Neural Networks (CNNs) capture local patterns well, but their limited receptive fields make it hard to model broader contextual relationships.
2. **Limitations of existing methods**: Recent work has introduced transformers into image inpainting to capture global interactions, but these methods are computationally inefficient and struggle to preserve fine-grained details.

To address these problems, the authors propose MxT, which is built from a Hybrid Module (HM) that combines Mamba with the transformer. Mamba processes long sequences efficiently at linear computational cost, while the transformer excels at capturing global interactions between local regions. This combination lets the model learn interactions at two levels, pixel and patch, which improves the quality and contextual accuracy of the inpainted image.

Experimental results show that MxT outperforms existing state-of-the-art methods on two widely used datasets, CelebA-HQ and Places2. MxT also performs well on high-resolution image inpainting, indicating its potential for practical applications.
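The cost argument above can be made concrete with a small sketch. This is not the authors' Hybrid Module: it only contrasts a single-pass linear recurrence (a heavily simplified, non-selective stand-in for a state-space scan) with the number of token pairs that full self-attention scores. The function names, the 256x256 image size, the patch size of 8, and the constants `a`, `b`, `c` are all assumptions chosen for illustration.

```python
def ssm_scan(xs, a=0.5, b=1.0, c=1.0):
    """Scalar linear recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One pass over the sequence, so the work grows linearly with len(xs).
    The fixed constants a, b, c are illustrative assumptions; Mamba's
    selective mechanism (input-dependent parameters) is omitted here.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def attention_pairs(n_tokens):
    """Number of query-key pairs scored by full self-attention."""
    return n_tokens * n_tokens


def token_counts(height, width, patch):
    """Sequence lengths when an image is tokenized per pixel vs. per patch."""
    pixel_tokens = height * width
    patch_tokens = (height // patch) * (width // patch)
    return pixel_tokens, patch_tokens


# Hypothetical 256x256 image with 8x8 patches.
pixel_tokens, patch_tokens = token_counts(256, 256, patch=8)
print(pixel_tokens, patch_tokens)       # 65536 1024
print(attention_pairs(patch_tokens))    # 1048576
print(attention_pairs(pixel_tokens))    # 4294967296
print(ssm_scan([1.0, 0.0, 0.0]))        # [1.0, 0.5, 0.25]
```

Under these assumed sizes, attention over 1,024 patch tokens scores about a million pairs, while attention over all 65,536 pixels would score over four billion. A linear-time scan over the pixel sequence takes 65,536 steps, which is one way to read the abstract's claim that Mamba complements the transformer on long sequences.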

