A Pre-trained Data Deduplication Model based on Active Learning

Xinyao Liu, Shengdong Du, Fengmao Lv, Hongtao Xue, Jie Hu, Tianrui Li

2024-03-20

Abstract: In the era of big data, the issue of data quality has become increasingly prominent. One of the main challenges is duplicate data, which can arise from repeated entry or the merging of multiple data sources. These "dirty data" problems can significantly limit the effective application of big data. To address the issue of data deduplication, we propose a pre-trained deduplication model based on active learning, which is the first work that utilizes active learning to address the problem of deduplication at the semantic level. The model is built on a pre-trained Transformer and fine-tuned to solve the deduplication problem as a sequence-to-classification task. It is the first to integrate the Transformer with active learning in an end-to-end architecture that selects the most valuable data for training the deduplication model, and the first to employ the R-Drop method to perform data augmentation on each round of labeled data, which reduces the cost of manual labeling and improves the model's performance. Experimental results demonstrate that our proposed model outperforms the previous state-of-the-art (SOTA) for duplicate data identification, achieving up to a 28% improvement in Recall score on benchmark datasets.

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

The paper aims to address the issue of data deduplication in the era of big data. With the explosive growth of digital data, managing storage costs and improving data quality have become crucial. Data deduplication saves storage space and enhances data quality, which has made the detection of duplicate data a widely studied research topic. The authors propose a Pre-trained Deep Active Learning Model for Data Deduplication (PDDM-AL) for semantic-level data deduplication. The model combines knowledge-enhanced transformers with an active learning mechanism and improves deduplication performance gradually over successive training rounds. The paper also applies the R-Drop method for data augmentation to reduce the cost of manual annotation and improve model performance. Experimental results show that PDDM-AL outperforms existing state-of-the-art methods at identifying duplicate data on benchmark datasets, with Recall improving by up to 28%. PDDM-AL performs well on precision, recall, and F1 score and is more robust, especially when handling dirty data. In addition, the active learning strategy lets the model improve quickly with only a small amount of labeled data, which further supports its effectiveness.
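
The two mechanisms named in the summary above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "most valuable" samples are chosen by predictive entropy (one common active learning criterion; the summary does not say which criterion PDDM-AL uses), and it writes the R-Drop objective in its commonly published form: cross-entropy on two dropout forward passes plus a weighted symmetric KL term. The function names, the `alpha` weight, and the toy inputs are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Predictive entropy; higher means the model is less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pool_probs, k):
    """Assumed selection rule: indices of the k unlabeled record pairs
    with the highest predictive entropy, to be sent for manual labeling."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

def kl(p, q):
    """KL divergence KL(p || q) between two distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rdrop_loss(logits_a, logits_b, label, alpha=1.0):
    """R-Drop style loss for one example. logits_a and logits_b stand for
    two forward passes of the same input under different dropout masks."""
    p, q = softmax(logits_a), softmax(logits_b)
    # Cross-entropy on both passes.
    ce = -(math.log(p[label]) + math.log(q[label]))
    # Symmetric KL pulls the two predictions toward each other.
    sym_kl = 0.5 * (kl(p, q) + kl(q, p))
    return ce + alpha * sym_kl

# Toy pool of duplicate / non-duplicate probabilities for three record pairs.
pool = [[0.99, 0.01], [0.5, 0.5], [0.8, 0.2]]
picked = select_most_uncertain(pool, 2)
```

In an active learning loop, the selected pairs would be labeled, added to the training set, and the model fine-tuned again before the next selection round. The `alpha` weight controls how strongly the two dropout passes are pushed to agree.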

Graph Deep Active Learning Framework for Data Deduplication

SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training

Joint Structured Pruning and Dense Knowledge Distillation for Efficient Transformer Model Compression

DCCD: Reducing Neural Network Redundancy Via Distillation

Deduplicating Training Data Makes Language Models Better

Up to 100x Faster Data-Free Knowledge Distillation

Generative Deduplication For Social Media Data Selection

Accelerating Dataset Distillation Via Model Augmentation

Data-Free Adversarial Distillation

DUEL: Duplicate Elimination on Active Memory for Self-Supervised Class-Imbalanced Learning

Data Shunt: Collaboration of Small and Large Models for Lower Costs and Better Performance

Adversarial Data Augmentation for Task-Specific Knowledge Distillation of Pre-trained Transformers

Adversarial Self-Supervised Data-Free Distillation for Text Classification

Self-Data Distillation for Recovering Quality in Pruned Large Language Models

An In-Depth Analysis of Data Reduction Methods for Sustainable Deep Learning

Active Data Acquisition in Autonomous Driving Simulation

Dataset Distillation via Curriculum Data Synthesis in Large Data Era

UniDrop: A Simple Yet Effective Technique to Improve Transformer Without Extra Cost

Active Data Curation Effectively Distills Large-Scale Multimodal Models

FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication