PTMNet: Pixel-Text Matching Network for Zero-Shot Anomaly Detection

Huilin Deng, Yanming Guo, Zhenyi Xu, Yu Kang
DOI: https://doi.org/10.1109/BigDIA60676.2023.10429521
2023-01-01
Abstract: Visual anomaly detection is crucial for automating industrial quality inspection. However, prior research has primarily trained custom models for specific scenarios, which limits flexibility and scalability. Zero-shot anomaly detection, built on vision-language models, leverages robust natural language supervision to detect defects in new categories without dedicated training. CLIP has recently exhibited impressive transferability on zero-shot classification tasks. However, its effectiveness in zero-shot anomaly detection is limited by its lack of defect-related knowledge and by the difficulty of extending image-text pair matching to per-pixel prediction. In this study, we propose a novel framework that independently learns text representations for normal and anomalous cases and enhances modal fusion to capture pixel-text associations within the text-image joint space. Specifically, we transform image-text matching into pixel-text matching to generate pixel-text score maps. These score maps are further refined by a transformer fusion module, which strengthens modal fusion and improves the utilization of pretrained knowledge. Additionally, to enhance the learning of visual representations, we introduce an extra linear layer following the image encoder. Extensive experiments demonstrate the superior performance of our method.
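A minimal sketch of the pixel-text matching idea described in the abstract, not the authors' implementation. It assumes per-pixel visual features and two text embeddings ("normal" and "anomalous") already live in a shared embedding space, and it scores each pixel by temperature-scaled cosine similarity followed by a softmax over the two text classes. The function name `pixel_text_score_map`, the feature shapes, the random inputs, and the temperature value are all illustrative assumptions; the transformer fusion module and the extra linear layer are omitted.

```python
import numpy as np

def pixel_text_score_map(pixel_feats, text_feats, temperature=0.07):
    """Hypothetical helper: per-pixel softmax scores over text classes.

    pixel_feats: (H, W, C) per-pixel visual embeddings (assumed given).
    text_feats:  (K, C) text embeddings, one per class (assumed given).
    Returns an (H, W, K) array whose last axis sums to 1.
    """
    # L2-normalise both modalities so the dot product is a cosine similarity.
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)

    # Pixel-text similarity logits, shape (H, W, K).
    logits = (p @ t.T) / temperature

    # Numerically stable softmax over the K text classes.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# Toy inputs standing in for encoder outputs (random, for illustration only).
rng = np.random.default_rng(0)
pixel_feats = rng.normal(size=(4, 4, 8))   # 4x4 feature map, 8-dim embeddings
text_feats = rng.normal(size=(2, 8))       # row 0: "normal", row 1: "anomalous"

scores = pixel_text_score_map(pixel_feats, text_feats)
anomaly_map = scores[..., 1]               # per-pixel anomaly probability
```

In this sketch the anomaly map is simply the softmax probability assigned to the "anomalous" text embedding at each pixel; a refinement stage such as the fusion module mentioned in the abstract would operate on these score maps.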