Abstract: Industrial anomaly classification (AC) is an indispensable task in industrial manufacturing, which guarantees the quality and safety of various products. To address the scarcity of data in industrial scenarios, many few-shot anomaly detection methods have emerged recently. In this paper, we propose an effective few-shot anomaly classification (FSAC) framework with one-stage training, dubbed CLIP-FSAC++. Specifically, we introduce a cross-modality interaction module named Anomaly Descriptor following the image and text encoders, which enhances the correlation between visual and text embeddings and adapts the representations of CLIP from the pre-training data to the target data. In the Anomaly Descriptor, an image-to-text cross-attention module is used to obtain image-specific text embeddings, and a text-to-image cross-attention module is used to obtain text-specific visual embeddings. These modality-specific embeddings are then used to enhance the original representations of CLIP for better matching ability. Comprehensive experimental results are provided to evaluate our method on few-normal-shot anomaly classification on VisA and MVTEC-AD in the 1-, 2-, 4-, and 8-shot settings. The source code is available at https://github.com/Jay-zzcoder/clip-fsac-pp
What problem does this paper attempt to address?
This paper addresses the data scarcity problem in the anomaly classification (AC) task in industrial manufacturing. Specifically, the authors propose an effective few-shot anomaly classification framework, CLIP-FSAC++, to cope with the scarcity of abnormal samples in industrial scenarios. The specific problems the paper attempts to solve are:
1. **Data Scarcity Problem**:
- In the industrial manufacturing environment, the appearance of abnormal samples is very rare, making it difficult to collect sufficient abnormal data for model training.
- Meanwhile, the data labeling process is time-consuming and labor-intensive, so traditional supervised learning methods are difficult to apply in these scenarios.
2. **Limitations of Existing Methods**:
- Unsupervised anomaly detection methods can be trained without abnormal samples, but their performance cannot meet all requirements.
- Existing few-shot anomaly detection (FSAD) methods can work with a small number of normal samples, but they still have deficiencies in practical applications, such as high computational cost and limited generalization ability.
3. **Cross-Modal Matching Problem**:
- In the anomaly classification task, the matching between visual and text descriptions is crucial. However, finding accurate text prompts to describe normal and abnormal situations is a challenge, which leads to visual-language mismatch.
- The image distribution in industrial scenarios is quite different from the natural image distribution in the pre-training dataset, resulting in insufficient visual representation.
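To make the matching problem in item 3 concrete, the following is a minimal sketch of how a CLIP-style model can turn visual-text matching into an anomaly score: the image embedding is compared with the embeddings of a "normal" prompt and an "abnormal" prompt, and a softmax over the two cosine similarities gives the probability of the abnormal class. This is a generic illustration, not the paper's implementation; the function names, the toy 3-dimensional embeddings, and the temperature value are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anomaly_score(image_emb, normal_text_emb, abnormal_text_emb, temperature=0.07):
    """Softmax probability of the 'abnormal' prompt for one image embedding.

    The temperature is an assumed value chosen for illustration.
    """
    logits = [
        cosine(image_emb, normal_text_emb) / temperature,
        cosine(image_emb, abnormal_text_emb) / temperature,
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return exps[1] / sum(exps)

# Toy embeddings (hypothetical values, not real CLIP outputs).
normal_text = [1.0, 0.0, 0.0]
abnormal_text = [0.0, 1.0, 0.0]

print(anomaly_score([0.9, 0.1, 0.0], normal_text, abnormal_text))  # close to 0
print(anomaly_score([0.1, 0.9, 0.0], normal_text, abnormal_text))  # close to 1
```

If the text prompts describe the normal and abnormal states poorly, or the image embedding is unreliable on industrial images, the two similarities stop being informative and the score degrades, which is the mismatch described above.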
To solve the above problems, the authors propose the CLIP-FSAC++ framework. By introducing lightweight image and text adapters and a cross-modal interaction module (Anomaly Descriptor), it enhances the matching and generalization abilities of CLIP in few-shot anomaly classification. Specific improvements include:
- **Introducing Lightweight Adapters**: Adjust the prior representation of CLIP through image and text adapters to make it more suitable for the industrial field.
- **Designing a Cross-Modal Interaction Module**: Through cross-attention from image to text and from text to image, enhance the correlation between visual and text features, thereby improving classification performance.
- **Simplifying the Training Strategy**: Adopt a joint training strategy instead of a two-stage training strategy, simplifying the training process and saving computational cost.
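The two cross-attention directions in the interaction module can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: it uses single-head scaled dot-product attention with no learned projections, it assumes that text embeddings act as queries in the image-to-text direction (and image tokens as queries in the text-to-image direction), and the `enhance` helper with its `alpha` blending weight is a hypothetical stand-in for however the original CLIP representations are actually enhanced.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention without learned projections.

    Each query row attends over all rows of keys_values and returns their
    attention-weighted average.
    """
    dim = len(keys_values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(dim)])
    return out

def enhance(original, modality_specific, alpha=0.5):
    """Blend original and modality-specific embeddings (alpha is a hypothetical weight)."""
    return [[(1 - alpha) * o + alpha * s for o, s in zip(orow, srow)]
            for orow, srow in zip(original, modality_specific)]

# Toy inputs: 2 text embeddings (normal / abnormal prompt) and 3 image tokens.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

# Assumed image-to-text direction: text queries attend over image tokens,
# giving image-specific text embeddings.
image_specific_text = cross_attention(text_emb, image_tokens)
# Assumed text-to-image direction: image queries attend over text embeddings,
# giving text-specific visual embeddings.
text_specific_visual = cross_attention(image_tokens, text_emb)

enhanced_text = enhance(text_emb, image_specific_text)
enhanced_visual = enhance(image_tokens, text_specific_visual)
print(len(enhanced_text), len(enhanced_visual))  # 2 3
```

In a trained model, the queries, keys, and values would pass through learned projection layers, and these layers would be optimized jointly with the adapters in the single training stage described above.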
With these improvements, the experimental results of CLIP-FSAC++ on the VisA and MVTEC-AD datasets show that it outperforms existing few-shot anomaly detection methods in the 1-shot, 2-shot, 4-shot, and 8-shot settings, and even exceeds some full-sample anomaly detection methods.