Abstract: Visual anomaly detection is critical in industrial manufacturing, but traditional methods often rely on extensive normal datasets and custom models, limiting scalability. Recent advancements in large-scale visual-language models have significantly improved zero/few-shot anomaly detection. However, these approaches may not fully utilize hierarchical features, potentially missing nuanced details. We introduce a window self-attention mechanism based on the CLIP model, combined with learnable prompts to process multi-level features within a Soldier-Officer Window self-Attention (SOWA) framework. Our method has been tested on five benchmark datasets, demonstrating superior performance by leading in 18 out of 20 metrics compared to existing state-of-the-art techniques.

What problem does this paper attempt to address?

This paper addresses the challenges facing Visual Anomaly Detection (VAD) in industrial manufacturing. Traditional methods rely on large sets of normal samples and on customized models, which limits their scalability. Existing large-scale vision-language models (such as CLIP) have made significant progress in zero-shot/few-shot anomaly detection, but they may not fully utilize hierarchical features and may therefore miss subtle anomalies.

### Specific description of the problem

1. **Data scarcity and task diversity**:
   - In industrial manufacturing, defective samples are scarce and vary widely, resulting in insufficient training data.
   - Existing methods usually require a large number of normal samples, and each task requires a customized model, making it difficult to scale across tasks.
2. **Limitations of existing methods**:
   - Although large-scale vision-language models perform well in zero-shot/few-shot anomaly detection, they do not fully utilize multi-scale features, especially local features.
   - Fixed-encoded text prompts are mainly used for global features and cannot effectively balance shallow features, which affects the ability to detect anomalies at different scales.

### Solution

To solve these problems, the paper proposes the Soldier-Officer Window self-Attention (SOWA) framework, which aims to improve anomaly detection in the following ways:

1. **Window self-attention mechanism**:
   - Built on the CLIP model, the window self-attention mechanism processes multi-level features and enhances the ability to capture anomalies at different scales.
   - CLIP's attention weights are frozen and injected into the window self-attention, which inherits CLIP's feature extraction ability while incorporating broader context information.
2. **Learnable text prompts**:
   - Learnable text prompts adapt to features at different levels and enhance the ability to distinguish between abnormal and normal states.
   - The "abnormal [cls]" template serves as a general abnormal prompt, avoiding the complexity of manually defining specific abnormal templates.
3. **Multi-scale feature fusion**:
   - Visual features at different levels are fused with text features to form the final feature representation, thereby better capturing multi-level information.

### Experimental verification

The paper reports experiments on five benchmark datasets. The results show that the SOWA framework outperforms existing state-of-the-art methods on most evaluation metrics (18 of 20, per the abstract), especially in zero-shot/few-shot settings.

### Summary

The main contribution of this paper is the SOWA framework. By combining the window self-attention mechanism with learnable text prompts, it addresses the shortcomings of existing methods in multi-scale feature utilization and improves the accuracy and robustness of visual anomaly detection.
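The three ideas in the Solution section (attention restricted to local windows, scoring features against "normal" and "abnormal" text embeddings, and fusing several feature levels) can be illustrated with a small pure-Python sketch. This is not the paper's implementation and rests on several assumptions: identity query/key/value projections stand in for the frozen CLIP attention weights, fixed vectors stand in for the learnable prompt embeddings, the levels are fused by plain averaging, and the temperature of 0.07 is only illustrative. All function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom else 0.0

def window_self_attention(grid, window):
    """Self-attention applied independently inside each non-overlapping
    window x window block of an H x W grid of feature vectors.
    Assumption: identity Q/K/V projections replace frozen CLIP weights."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    out = [[None] * w for _ in range(h)]
    for r0 in range(0, h, window):
        for c0 in range(0, w, window):
            cells = [(r, c)
                     for r in range(r0, min(r0 + window, h))
                     for c in range(c0, min(c0 + window, w))]
            tokens = [grid[r][c] for r, c in cells]
            for (r, c), q in zip(cells, tokens):
                # Scaled dot-product attention over the tokens of this window only.
                weights = softmax([dot(q, k) / math.sqrt(d) for k in tokens])
                out[r][c] = [sum(wt * v[i] for wt, v in zip(weights, tokens))
                             for i in range(d)]
    return out

def abnormal_probability(feature, normal_text, abnormal_text, temperature=0.07):
    """Two-way softmax over cosine similarities; returns P(abnormal)."""
    logits = [cosine(feature, normal_text) / temperature,
              cosine(feature, abnormal_text) / temperature]
    return softmax(logits)[1]

def fused_anomaly_map(levels, window=2):
    """levels: list of (grid, normal_text, abnormal_text), one per feature
    level, all grids the same size. Assumption: levels are fused by averaging."""
    maps = []
    for grid, normal_text, abnormal_text in levels:
        attended = window_self_attention(grid, window)
        maps.append([[abnormal_probability(f, normal_text, abnormal_text)
                      for f in row] for row in attended])
    n, h, w = len(maps), len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / n for c in range(w)] for r in range(h)]

# Toy input: the left 2x2 window looks "normal", the right one "abnormal".
normal, abnormal = [1.0, 0.0], [0.0, 1.0]
grid = [[normal, normal, abnormal, abnormal],
        [normal, normal, abnormal, abnormal]]
amap = fused_anomaly_map([(grid, normal, abnormal), (grid, normal, abnormal)], window=2)
print([[round(p, 3) for p in row] for row in amap])
```

With a window of 2 the two halves of the toy grid never attend to each other, so the left cells score near zero and the right cells near one. Widening the window to cover the whole grid mixes the two regions and raises the scores on the left, which is the trade-off between local detail and broader context that the summary alludes to.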

SOWA: Adapting Hierarchical Frozen Window Self-Attention to Visual-Language Models for Better Anomaly Detection

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

Vision-Language Models Assisted Unsupervised Video Anomaly Detection

Anomaly Detection by Adapting a pre-trained Vision Language Model

Self-Attention Memory-Augmented Wavelet-CNN for Anomaly Detection

Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection

VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection

Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection

Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection

Open-Vocabulary Video Anomaly Detection

Configurable Spatial-Temporal Hierarchical Analysis for Flexible Video Anomaly Detection

VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection

Hawk: Learning to Understand Open-World Video Anomalies

Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity

Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images

CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection

FADE: Few-shot/zero-shot Anomaly Detection Engine using Large Vision-Language Model

Customizing Visual-Language Foundation Models for Multi-modal Anomaly Detection and Reasoning

A Lightweight Video Anomaly Detection Model with Weak Supervision and Adaptive Instance Selection

AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection

Hierarchical Semantic Contrast for Scene-aware Video Anomaly Detection