Automatic Prompt Generation and Grounding Object Detection for Zero-Shot Image Anomaly Detection

Tsun-Hin Cheung, Ka-Chun Fung, Songjiang Lai, Kwan-Ho Lin, Vincent Ng, Kin-Man Lam
2024-11-28
Abstract: Identifying defects and anomalies in industrial products is a critical quality control task. Traditional manual inspection methods are slow, subjective, and error-prone. In this work, we propose a novel zero-shot training-free approach for automated industrial image anomaly detection using a multimodal machine learning pipeline, consisting of three foundation models. Our method first uses a large language model, i.e., GPT-3, to generate text prompts describing the expected appearances of normal and abnormal products. We then use a grounding object detection model, called Grounding DINO, to locate the product in the image. Finally, we compare the cropped product image patches to the generated prompts using a zero-shot image-text matching model, called CLIP, to identify any anomalies. Our experiments on two datasets of industrial product images, namely MVTec-AD and VisA, demonstrate the effectiveness of this method, achieving high accuracy in detecting various types of defects and anomalies without the need for model training. Our proposed model enables efficient, scalable, and objective quality control in industrial manufacturing settings.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper addresses defect and anomaly detection in industrial product images. Traditional manual inspection is slow, highly subjective, and error-prone. To address these issues, the authors propose a novel zero-shot, training-free method for automated industrial image anomaly detection. The method is implemented as a multimodal machine-learning pipeline built from three foundation models:

1. **Text Prompt Generation**: Use a large language model (such as GPT-3) to generate text prompts describing normal and abnormal products.
2. **Object Localization**: Use the Grounding DINO model to locate the product in the image, which reduces the influence of background noise and helps handle multi-resolution challenges.
3. **Zero-shot Image-Text Matching**: Use the pre-trained CLIP model to compare the cropped product image with the generated text prompts to identify any anomalies.

### Formula Representation

The key steps can be written mathematically as follows:

1. **Text Prompt Generation**:
\[ P_{\text{normal}} = \text{GPT-3}(x_{\text{normal}}) \]
\[ P_{\text{anomaly}} = \text{GPT-3}(x_{\text{anomaly}}) \]
where \( P_{\text{normal}} \) and \( P_{\text{anomaly}} \) are two sets of text prompts describing normal and abnormal products respectively.
2. **Object Localization**:
\[ I_{\text{object}} = I[y:y + h,\; x:x + w] \]
where \( I \) is the input image, \( b = [x, y, w, h] \) are the bounding box coordinates output by Grounding DINO, and \( I_{\text{object}} \) is the cropped image area containing the product.
3. **Zero-shot Anomaly Detection**:
\[ s = \frac{e_{\text{fused}} \cdot t_{\text{anomaly}}}{e_{\text{fused}} \cdot t_{\text{anomaly}} + e_{\text{fused}} \cdot t_{\text{normal}}} \]
where \( e_{\text{fused}} \) is the fused feature vector, \( t_{\text{normal}} \) and \( t_{\text{anomaly}} \) are the text embedding vectors of the "normal" and "anomaly" prompts respectively, and \( s \) is the anomaly score.

### Summary

With this method, the authors aim to improve the efficiency, scalability, and objectivity of quality control in industrial manufacturing without any model training. Experimental results show that the method performs well on two industrial product image datasets (MVTec-AD and VisA) and can accurately detect various types of defects and anomalies.
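
The cropping and scoring formulas above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it makes no calls to Grounding DINO or CLIP, and plain Python lists stand in for the image and the embedding vectors. The helper names `crop_object`, `dot`, and `anomaly_score` are hypothetical, and the toy vectors are made-up values chosen only to make the arithmetic easy to follow.

```python
from typing import List, Sequence


def crop_object(image: List[List[int]], box: Sequence[int]) -> List[List[int]]:
    """Crop I[y:y+h, x:x+w] from a row-major image, given b = [x, y, w, h]."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def anomaly_score(e_fused: Sequence[float],
                  t_anomaly: Sequence[float],
                  t_normal: Sequence[float]) -> float:
    """s = (e.t_anomaly) / (e.t_anomaly + e.t_normal), as in the formula above."""
    sim_anomaly = dot(e_fused, t_anomaly)
    sim_normal = dot(e_fused, t_normal)
    return sim_anomaly / (sim_anomaly + sim_normal)


# Toy 4x5 "image" whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(5)] for r in range(4)]
patch = crop_object(image, (1, 2, 3, 2))  # x=1, y=2, w=3, h=2
print(patch)

# Toy vectors (assumed values, not real CLIP embeddings).
e_fused = [1.0, 0.0]
t_anomaly = [3.0, 4.0]   # e_fused . t_anomaly = 3.0
t_normal = [1.0, 2.0]    # e_fused . t_normal  = 1.0
print(anomaly_score(e_fused, t_anomaly, t_normal))
```

As written, the ratio lies between 0 and 1 only when both dot products are positive, so a practical implementation would need to decide how to handle negative or near-zero similarities.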