Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection

Yuanze Li, Haolin Wang, Shihao Yuan, Ming Liu, Debin Zhao, Yiwen Guo, Chen Xu, Guangming Shi, Wangmeng Zuo

2023-11-01

Abstract: Existing industrial anomaly detection (IAD) methods predict anomaly scores for both anomaly detection and localization. However, they struggle to support multi-turn dialog or to give detailed descriptions of anomaly regions, e.g., the color, shape, and category of industrial anomalies. Recently, large multimodal (i.e., vision and language) models (LMMs) have shown strong perception abilities on multiple vision tasks such as image captioning, visual understanding, and visual reasoning, making them a competitive choice for more comprehensible anomaly detection. However, existing general LMMs lack knowledge about anomaly detection, while training a dedicated LMM for anomaly detection requires a tremendous amount of annotated data and massive computation resources. In this paper, we propose a novel large multimodal model that applies vision experts to industrial anomaly detection (dubbed Myriad), which yields definite anomaly detection and high-quality anomaly descriptions. Specifically, we adopt MiniGPT-4 as the base LMM and design an Expert Perception module to embed the prior knowledge from vision experts as tokens that are intelligible to Large Language Models (LLMs). To compensate for the errors and confusions of vision experts, we introduce a domain adapter to bridge the visual representation gaps between generic and industrial images. Furthermore, we propose a Vision Expert Instructor, which enables the Q-Former to generate IAD domain vision-language tokens according to the vision expert prior. Extensive experiments on the MVTec-AD and VisA benchmarks demonstrate that our proposed method not only performs favorably against state-of-the-art methods under the 1-class and few-shot settings, but also provides definite anomaly predictions along with detailed descriptions in the IAD domain.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the lack of multi-turn dialogue and detailed description in Industrial Anomaly Detection (IAD). Existing IAD methods focus on anomaly scoring and anomaly region localization, but they do not provide specific information about an anomaly such as its color, shape, and category. Moreover, these methods require independent modeling for different anomaly scenarios, which weakens practical deployment and consumes excessive resources. To solve these problems, the paper proposes Myriad, a new large multimodal model that incorporates the knowledge of vision experts. The model not only gives definite anomaly predictions but also generates high-quality anomaly descriptions. Specifically, the authors use MiniGPT-4 as the base model and design an Expert Perception module that embeds the prior knowledge of vision experts as tokens that large language models can understand. To compensate for potential errors and confusions from the vision experts, a domain adapter is introduced to bridge the representation gap between generic images and industrial images. Additionally, a Vision Expert Instructor is proposed, which enables the Q-Former to generate IAD-domain vision-language tokens based on the vision expert prior. Experimental results on the MVTec-AD and VisA benchmarks show that Myriad performs favorably against state-of-the-art methods under the 1-class and few-shot settings while also providing definite anomaly predictions and detailed descriptions.
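The idea of embedding a vision expert's prior as tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify how the Expert Perception module is built, so the example simply assumes the vision expert outputs a 2D anomaly map, splits it into patches, and projects each patch to a token with a random matrix standing in for a learned projection. The function names, patch size, token dimension, and token ordering are all hypothetical.

```python
import numpy as np


def anomaly_map_to_tokens(anomaly_map, patch=4, dim=8, seed=0):
    """Turn an H x W anomaly map into a sequence of `dim`-sized tokens.

    Hypothetical stand-in for an expert-perception step: the map is cut
    into non-overlapping patch x patch tiles, and each flattened tile is
    linearly projected. Random weights replace a learned projection.
    """
    h, w = anomaly_map.shape
    assert h % patch == 0 and w % patch == 0, "map must tile evenly"
    # (h/p, p, w/p, p) -> (h/p, w/p, p, p) -> (num_patches, p*p)
    patches = (
        anomaly_map.reshape(h // patch, patch, w // patch, patch)
        .transpose(0, 2, 1, 3)
        .reshape(-1, patch * patch)
    )
    rng = np.random.default_rng(seed)
    weight = rng.standard_normal((patch * patch, dim)) / np.sqrt(patch * patch)
    return patches @ weight


def build_llm_input(expert_tokens, image_tokens, text_tokens):
    """Concatenate token groups along the sequence axis (assumed ordering)."""
    return np.concatenate([expert_tokens, image_tokens, text_tokens], axis=0)


# Toy anomaly map with one high-scoring region, as a vision expert might emit.
anomaly_map = np.zeros((16, 16))
anomaly_map[4:8, 8:12] = 1.0

expert_tokens = anomaly_map_to_tokens(anomaly_map)  # 16 patches, 8 dims each
image_tokens = np.zeros((32, 8))  # placeholder for Q-Former output tokens
text_tokens = np.zeros((5, 8))    # placeholder for embedded prompt tokens
sequence = build_llm_input(expert_tokens, image_tokens, text_tokens)
print(sequence.shape)
```

In this toy setup only the patch covering the anomalous region yields a nonzero token, which is the sense in which the expert's localization prior becomes visible to the language model as part of its input sequence.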

MMAD: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection

VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection

Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection

ADAGENT: Anomaly Detection Agent with Multimodal Large Models in Adverse Environments

Incomplete Multimodal Industrial Anomaly Detection via Cross-Modal Distillation

Customizing Visual-Language Foundation Models for Multi-modal Anomaly Detection and Reasoning

See it, Think it, Sorted: Large Multimodal Models are Few-shot Time Series Anomaly Analyzers

Anomaly Detection by Adapting a pre-trained Vision Language Model

Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead

Multimodal Industrial Anomaly Detection via Hybrid Fusion

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Large Language Models can Deliver Accurate and Interpretable Time Series Anomaly Detection

Multi-modal Auto-regressive Modeling via Visual Words

Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

Vision-Language Models Assisted Unsupervised Video Anomaly Detection

Improving Vision Anomaly Detection with the Guidance of Language Modality

Effectiveness Assessment of Recent Large Vision-Language Models

Large-Scale Visual Language Model Boosted by Contrast Domain Adaptation for Intelligent Industrial Visual Monitoring