CTNet: hybrid architecture based on CNN and transformer for image inpainting detection

Fengjun Xiao,Zhuxi Zhang,Ye Yao
DOI: https://doi.org/10.1007/s00530-023-01184-w
IF: 3.9
2023-09-19
Multimedia Systems
Abstract:Digital image inpainting technology has increasingly gained popularity as a result of the development of image processing and machine vision. However, digital image inpainting can be used not only to repair damaged photographs, but also to remove specific people or distort the semantic content of images. To address the issue of image inpainting forgeries, a hybrid CNN-Transformer Network (CTNet), which is composed of the hybrid CNN-Transformer encoder, the feature enhancement module, and the decoder module, is proposed for image inpainting detection and localization. Different from existing inpainting detection methods that rely on hand-crafted attention mechanisms, the hybrid CNN-Transformer encoder employs CNN as a feature extractor to build feature maps tokenized as the input patches of the Transformer encoder. The hybrid structure exploits the innate global self-attention mechanisms of Transformer and can effectively capture the long-term dependency of the image. Since inpainting traces mainly exist in the high-frequency components of digital images, the feature enhancement module performs feature extraction in the frequency domain. The decoder regularizes the upsampling process of the predicted masks with the assistance of high-frequency features. We investigate the generalization capacity of our CTNet on datasets generated by ten commonly used inpainting methods. The experimental results show that the proposed model can detect a variety of unknown inpainting operations after being trained on the datasets generated by a single inpainting method.
computer science, information systems, theory & methods
What problem does this paper attempt to address?