SOWA: Adapting Hierarchical Frozen Window Self-Attention to Visual-Language Models for Better Anomaly Detection

Zongxiang Hu, Zhaosheng Zhang
2024-07-30
Abstract: Visual anomaly detection is critical in industrial manufacturing, but traditional methods often rely on extensive normal datasets and custom models, limiting scalability. Recent advancements in large-scale visual-language models have significantly improved zero/few-shot anomaly detection. However, these approaches may not fully utilize hierarchical features, potentially missing nuanced details. We introduce a window self-attention mechanism based on the CLIP model, combined with learnable prompts to process multi-level features within a Soldier-Officer Window self-Attention (SOWA) framework. Our method has been tested on five benchmark datasets, demonstrating superior performance by leading in 18 out of 20 metrics compared to existing state-of-the-art techniques.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the challenges facing Visual Anomaly Detection (VAD) in industrial manufacturing. Traditional methods rely on large datasets of normal samples and on customized models, which limits their scalability. Moreover, although existing large-scale vision-language models (such as CLIP) have made significant progress in zero-shot/few-shot anomaly detection, these methods may not fully utilize hierarchical features and may therefore miss subtle anomalous details.

### Specific description of the problem

1. **Data scarcity and task diversity**:
   - In industrial manufacturing, defective samples are scarce and vary widely, resulting in insufficient training data.
   - Existing methods usually require a large number of normal samples, and each task requires a customized model, making it difficult to scale across tasks.
2. **Limitations of existing methods**:
   - Although large-scale vision-language models perform well in zero-shot/few-shot anomaly detection, they do not fully utilize multi-scale features, especially local features.
   - Fixed-encoded text prompts are mainly matched against global features and cannot effectively account for shallow features, which weakens the ability to detect anomalies at different scales.

### Solution

To solve the above problems, the paper proposes the Soldier-Officer Window self-Attention (SOWA) framework, which aims to improve anomaly detection in the following ways:

1. **Introducing a window self-attention mechanism**:
   - Combined with the CLIP model, window self-attention processes multi-level features and enhances the ability to capture anomalies at different scales.
   - The attention weights of CLIP are frozen and injected into the window self-attention, so that it inherits CLIP's feature extraction ability while incorporating broader context information.
2. **Learnable text prompts**:
   - Learnable text prompts adapt to features at different levels and enhance the ability to distinguish between abnormal and normal states.
   - The "abnormal [cls]" template serves as a general abnormal prompt, avoiding the complexity of manually defining specific abnormal templates.
3. **Multi-scale feature fusion**:
   - Visual features at different levels are fused with text features to form the final feature representation, thereby better capturing multi-level information.

### Experimental verification

The paper reports experiments on five benchmark datasets. The results show that the SOWA framework outperforms existing state-of-the-art methods on most evaluation metrics (18 of 20, per the abstract), especially in zero-shot/few-shot settings.

### Summary

The main contribution of this paper is the SOWA framework. By combining a window self-attention mechanism with learnable text prompts, it addresses the shortcomings of existing methods in multi-scale feature utilization and improves the accuracy and robustness of visual anomaly detection.
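
The window self-attention and multi-level fusion ideas described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`window_self_attention`, `anomaly_map`, `fuse_levels`), the identity projection matrices standing in for frozen CLIP attention weights, the hand-written text embeddings standing in for learned prompts, the temperature value, and fusion by simple averaging are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(w, x):
    # w is a list of rows; returns the matrix-vector product w @ x.
    return [dot(row, x) for row in w]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def window_self_attention(tokens, grid, window, wq, wk, wv):
    """Self-attention restricted to non-overlapping window x window blocks.

    tokens: grid*grid patch features in row-major order.
    wq, wk, wv: projection matrices treated as frozen (never updated here).
    """
    assert grid % window == 0, "grid must be divisible by window"
    dim = len(tokens[0])
    out = [None] * len(tokens)
    for top in range(0, grid, window):
        for left in range(0, grid, window):
            idx = [r * grid + c
                   for r in range(top, top + window)
                   for c in range(left, left + window)]
            q = [matvec(wq, tokens[i]) for i in idx]
            k = [matvec(wk, tokens[i]) for i in idx]
            v = [matvec(wv, tokens[i]) for i in idx]
            for pos, i in enumerate(idx):
                # Scaled dot-product attention over patches in this window only.
                weights = softmax([dot(q[pos], kj) / math.sqrt(dim) for kj in k])
                out[i] = [sum(w * vj[d] for w, vj in zip(weights, v))
                          for d in range(dim)]
    return out

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)

def anomaly_map(tokens, normal_text, abnormal_text, temperature=0.07):
    # Per-patch probability of "abnormal" from a two-way softmax over
    # similarities to the normal and abnormal text embeddings.
    scores = []
    for t in tokens:
        logits = [cosine(t, normal_text) / temperature,
                  cosine(t, abnormal_text) / temperature]
        scores.append(softmax(logits)[1])
    return scores

def fuse_levels(maps):
    # Assumed fusion rule: average the per-level anomaly maps.
    return [sum(vals) / len(vals) for vals in zip(*maps)]

# Toy data: a 4x4 grid of 2-D patch features with one odd patch (index 15).
GRID, WINDOW = 4, 2
tokens = [[1.0, 0.0] for _ in range(GRID * GRID)]
tokens[15] = [0.0, 1.0]

identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical stand-in for frozen weights
attended = window_self_attention(tokens, GRID, WINDOW, identity, identity, identity)

# Hypothetical per-level (normal, abnormal) text embeddings; in the paper
# these would come from learnable prompts passed through a text encoder.
level_prompts = [([1.0, 0.0], [0.0, 1.0]), ([1.0, 0.0], [0.0, 1.0])]
level_feats = [tokens, attended]

maps = [anomaly_map(f, n, a) for f, (n, a) in zip(level_feats, level_prompts)]
fused = fuse_levels(maps)
print("most anomalous patch index:", fused.index(max(fused)))
```

Because attention is confined to each window, the odd patch only influences its own 2x2 block; patches in the other windows keep their original features, which is the locality property that windowed attention is meant to provide.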