VDC: Versatile Data Cleanser based on Visual-Linguistic Inconsistency by Multimodal Large Language Models

Zihao Zhu, Mingda Zhang, Shaokui Wei, Bingzhe Wu, Baoyuan Wu
2024-04-01
Abstract: The role of data in building AI systems has recently been emphasized by the emerging concept of data-centric AI. Unfortunately, in the real world, datasets may contain dirty samples, such as poisoned samples from backdoor attacks, noisy labels from crowdsourcing, and even hybrids of them. The presence of such dirty samples makes DNNs vulnerable and unreliable. Hence, it is critical to detect dirty samples to improve the quality and reliability of datasets. Existing detectors focus only on detecting poisoned samples or noisy labels, and are often prone to weak generalization when dealing with dirty samples from other
Artificial Intelligence
What problem does this paper attempt to address?
The paper attempts to address the problem of how to effectively detect and remove various types of "dirty samples" present in datasets within data-driven AI systems. Specifically, these dirty samples include:

1. **Poisoned Samples**: These samples are typically generated by malicious attackers who intentionally embed triggers in the training dataset and alter the true labels to target labels. Such samples cause deep neural networks (DNNs) to predict any poisoned sample as the target label during the inference phase while maintaining accuracy on clean samples.
2. **Noisy Labels**: These samples usually result from labeling errors caused by human annotators or automated labeling bots in scenarios such as crowdsourcing or web crawling. Training DNNs with datasets containing noisy labels significantly reduces overall performance.
3. **Hybrid Dirty Samples**: A more severe situation occurs when attackers inject poisoned samples into a dataset that already contains noisy labels, resulting in the coexistence of poisoned samples and noisy labels. In this case, the trained model faces both malicious backdoor attacks and performance degradation.

Existing detection methods typically can only detect one type of dirty sample and have weak generalization capabilities across different types of dirty samples. Therefore, developing a general framework capable of detecting multiple types of dirty samples simultaneously is of great significance for improving data quality and model reliability.

The paper proposes a general detection framework based on Multimodal Large Language Models (MLLM) called Versatile Data Cleanser (VDC), which detects dirty samples by capturing inconsistencies between vision and language, thereby improving detection accuracy and generalization capability.
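To make the three kinds of dirty samples concrete, the following is a minimal toy sketch, not the paper's implementation. It builds a hybrid dirty dataset (trigger patch plus target label for poisoned samples, a random wrong label for noisy ones) and then flags any sample whose given label disagrees with what a recognizer reads from the image. Every name here (`add_trigger`, `make_dirty`, `flag_inconsistent`, `toy_recognize`) is hypothetical, and `toy_recognize` is only a stand-in for the MLLM, whose actual prompting and scoring procedure is not described in this summary.

```python
import random


def add_trigger(image, value=255, size=2):
    """Return a copy of a 2D image with a square patch in the bottom-right corner."""
    poisoned = [row[:] for row in image]
    for r in range(len(poisoned) - size, len(poisoned)):
        for c in range(len(poisoned[r]) - size, len(poisoned[r])):
            poisoned[r][c] = value
    return poisoned


def make_dirty(dataset, num_classes, target_label, poison_rate, noise_rate, seed=0):
    """Turn a clean list of (image, label) pairs into a hybrid dirty dataset."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    n_poison = int(poison_rate * len(dataset))
    n_noise = int(noise_rate * len(dataset))
    poison_idx = set(indices[:n_poison])
    noise_idx = set(indices[n_poison:n_poison + n_noise])

    samples = []
    for i, (image, label) in enumerate(dataset):
        if i in poison_idx:
            # Poisoned: embed a trigger and relabel to the attacker's target.
            samples.append({"image": add_trigger(image), "label": target_label,
                            "true_label": label, "kind": "poisoned"})
        elif i in noise_idx:
            # Noisy: image untouched, label replaced by a random wrong class.
            wrong = rng.choice([c for c in range(num_classes) if c != label])
            samples.append({"image": image, "label": wrong,
                            "true_label": label, "kind": "noisy"})
        else:
            samples.append({"image": image, "label": label,
                            "true_label": label, "kind": "clean"})
    return samples


def flag_inconsistent(samples, recognize):
    """Flag indices where the given label disagrees with the visual content."""
    return [i for i, s in enumerate(samples) if recognize(s["image"]) != s["label"]]


def toy_recognize(image):
    """Stand-in for an MLLM: in this toy data the top-left pixel encodes the class."""
    return image[0][0]


# Toy data: 20 images of size 4x4, each filled with its own class id (0..4).
clean = [([[i % 5] * 4 for _ in range(4)], i % 5) for i in range(20)]
dirty = make_dirty(clean, num_classes=5, target_label=0,
                   poison_rate=0.1, noise_rate=0.2)
flagged = flag_inconsistent(dirty, toy_recognize)
```

In this sketch a single check covers both poisoned and noisy samples, because each leaves the image's semantic content at odds with its label; the one exception is a poisoned sample whose true class already equals the target label, which produces no disagreement and is not flagged.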