VDC: Versatile Data Cleanser based on Visual-Linguistic Inconsistency by Multimodal Large Language Models

Zihao Zhu, Mingda Zhang, Shaokui Wei, Bingzhe Wu, Baoyuan Wu

2024-04-01

Abstract: The role of data in building AI systems has recently been emphasized by the emerging concept of data-centric AI. Unfortunately, in the real world, datasets may contain dirty samples, such as poisoned samples from backdoor attacks, noisy labels from crowdsourcing, and even hybrids of them. The presence of such dirty samples makes DNNs vulnerable and unreliable. Hence, it is critical to detect dirty samples to improve the quality and reliability of datasets. Existing detectors focus only on detecting poisoned samples or noisy labels, and are often prone to weak generalization when dealing with dirty samples from other

Artificial Intelligence

What problem does this paper attempt to address?

The paper attempts to address the problem of how to effectively detect and remove various types of "dirty samples" present in datasets within data-driven AI systems. Specifically, these dirty samples include:

1. **Poisoned Samples**: These samples are typically generated by malicious attackers who intentionally embed triggers in the training dataset and alter the true labels to target labels. Such samples cause deep neural networks (DNNs) to predict any poisoned sample as the target label during the inference phase while maintaining accuracy on clean samples.
2. **Noisy Labels**: These samples usually result from labeling errors caused by human annotators or automated labeling bots in scenarios such as crowdsourcing or web crawling. Training DNNs with datasets containing noisy labels significantly reduces overall performance.
3. **Hybrid Dirty Samples**: A more severe situation occurs when attackers inject poisoned samples into a dataset that already contains noisy labels, resulting in the coexistence of poisoned samples and noisy labels. In this case, the trained model faces both malicious backdoor attacks and performance degradation.

Existing detection methods typically can detect only one type of dirty sample and generalize weakly across different types of dirty samples. Therefore, developing a general framework capable of detecting multiple types of dirty samples simultaneously is of great significance for improving data quality and model reliability.

The paper proposes a general detection framework based on Multimodal Large Language Models (MLLM) called Versatile Data Cleanser (VDC), which detects dirty samples by capturing inconsistencies between vision and language, thereby improving detection accuracy and generalization capability.
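The idea of flagging a sample when its image content disagrees with its label can be illustrated with a minimal Python sketch. This is not the paper's implementation: the question templates, the `answer_fn` callable (a stand-in for a multimodal model that answers questions about an image), the 0.5 threshold, and the toy data are all assumptions made for illustration. Because both poisoned samples and noisy labels pair an image with a label that does not describe it, the same check applies to either type.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    image_id: str  # stand-in for the actual image data
    label: str     # label as stored in the (possibly dirty) dataset


# Hypothetical question templates; each expects "yes" when image and label agree.
TEMPLATES = [
    "Is the main object in the image a '{label}'?",
    "Would '{label}' be a correct caption for the image?",
]

# answer_fn(image_id, question) -> "yes" / "no"; assumed interface, not a real API.
AnswerFn = Callable[[str, str], str]


def inconsistency_score(sample: Sample, answer_fn: AnswerFn) -> float:
    """Fraction of label-derived questions that the visual answerer denies."""
    questions = [t.format(label=sample.label) for t in TEMPLATES]
    mismatches = sum(
        1 for q in questions
        if answer_fn(sample.image_id, q).strip().lower() != "yes"
    )
    return mismatches / len(questions)


def detect_dirty(samples: List[Sample], answer_fn: AnswerFn,
                 threshold: float = 0.5) -> List[Sample]:
    """Return the samples whose inconsistency score exceeds the threshold."""
    return [s for s in samples if inconsistency_score(s, answer_fn) > threshold]


# Toy answerer: pretends to "see" the true content of each image.
TRUE_CONTENT = {"img_0": "cat", "img_1": "dog", "img_2": "cat"}


def toy_answer_fn(image_id: str, question: str) -> str:
    return "yes" if f"'{TRUE_CONTENT[image_id]}'" in question else "no"


dataset = [
    Sample("img_0", "cat"),
    Sample("img_1", "dog"),
    Sample("img_2", "dog"),  # image shows a cat, so the label is inconsistent
]
flagged = detect_dirty(dataset, toy_answer_fn)
print([s.image_id for s in flagged])
```

In practice `toy_answer_fn` would be replaced by calls to an actual multimodal model, and the threshold would need tuning on held-out data.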

Data Cleaning Using Large Language Models

IterClean: an Iterative Data Cleaning Framework with Large Language Models

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models

Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models

CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models

Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination

Data Contamination Can Cross Language Barriers

Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

Clean Evaluations on Contaminated Visual Language Models

Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation

A Taxonomy for Data Contamination in Large Language Models

VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection

A Hybrid Data Cleaning Framework Using Markov Logic Networks

Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach

Data Contamination Calibration for Black-box LLMs

Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions

Inconsistency Ranking-based Noisy Label Detection for High-quality Data

Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation