Causal Fusion of Convolutional Neural Network and Vision Transformer for Image Anomaly Detection and Localization

Shuo Zhang, Xiongpeng Hu, Jing Liu
DOI: https://doi.org/10.1109/icme57554.2024.10687979
2024-01-01
Abstract: This paper addresses the challenge of visual anomaly detection under complex background interference. First, we construct a structural causal model for anomaly detection under complex background interference and propose an intervention strategy to block background feature interference. Then, based on this causal intervention strategy, we build an anomaly feature-sensitive neural network (AFSNN) containing two feature extraction modules. Given the limitations of convolutional neural networks in capturing global features associated with spatial location dependence, and the substantial data requirements of vision transformers, we use an enhanced Swin Transformer module and a deformable convolutional network encoder module to extract global features and local details, respectively. We also design a cross-attention mechanism to fuse these two scales of feature representation. Finally, we introduce a causality-sensitive learning module that differentiates between the outputs of the two feature extraction modules and constructs a causality-sensitive loss function by maximizing the output differences. This approach blocks background features and enhances sensitivity to anomaly features during training. Experiments show that AFSNN can effectively attenuate the confusing interference of background patterns.
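The abstract gives no implementation details for the fusion step or the loss, so the following is only a minimal illustrative sketch and not the authors' method. It shows generic scaled dot-product cross-attention, in which tokens from one branch (queries) attend to tokens from the other branch (keys and values), plus a toy loss whose minimization maximizes the difference between two outputs. The function names, the absence of learned projections, the choice of which branch supplies the queries, and the squared-difference form of the loss are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention without learned projections.

    queries:     n vectors (assumed here: local tokens from a conv branch)
    keys_values: m vectors (assumed here: global tokens from a transformer branch)
    Returns n vectors, each a convex combination of the key/value vectors.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, keys_values))
            for j in range(d)
        ])
    return fused


def difference_loss(out_a, out_b):
    """Hypothetical loss: negative mean squared difference of two outputs.

    Minimizing this value maximizes the gap between out_a and out_b.
    """
    n = sum(len(row) for row in out_a)
    sq = sum((a - b) ** 2 for ra, rb in zip(out_a, out_b) for a, b in zip(ra, rb))
    return -sq / n


# Toy inputs: three "local" tokens and two "global" tokens of dimension 2.
local_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
global_tokens = [[2.0, 0.0], [0.0, 2.0]]

fused = cross_attention(local_tokens, global_tokens)
loss = difference_loss(local_tokens, fused)
```

In this toy setup each fused token is a weighted average of the global tokens, with weights determined by its similarity to each of them. A real model would add learned query, key, and value projections, multiple heads, and a loss defined on the actual module outputs.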