A Pre-trained Data Deduplication Model based on Active Learning

Xinyao Liu, Shengdong Du, Fengmao Lv, Hongtao Xue, Jie Hu, Tianrui Li
2024-03-20
Abstract: In the era of big data, data quality has become an increasingly prominent issue. One of the main challenges is duplicate data, which can arise from repeated entry or from merging multiple data sources. Such "dirty data" can significantly limit the effective application of big data. To address deduplication, we propose a pre-trained deduplication model based on active learning, which is the first work to use active learning for deduplication at the semantic level. The model is built on a pre-trained Transformer and fine-tuned to solve deduplication as a sequence-to-classification task. It is the first to integrate the Transformer with active learning in an end-to-end architecture that selects the most valuable data for training the deduplication model, and the first to apply the R-Drop method to augment the labeled data in each round, which reduces the cost of manual labeling and improves the model's performance. Experimental results demonstrate that the proposed model outperforms the previous state of the art (SOTA) in duplicate data identification, achieving up to a 28% improvement in Recall on benchmark datasets.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses data deduplication in the era of big data. With the explosive growth of digital data, managing storage costs and improving data quality have become crucial. Deduplication saves storage space and improves data quality, which has made duplicate detection a widely studied research topic. The authors propose a Pre-trained Deep Active Learning Model for Data Deduplication (PDDM-AL) that performs deduplication at the semantic level. The model combines a knowledge-enhanced Transformer with an active learning mechanism and improves deduplication performance iteratively, round by round. The paper also applies the R-Drop method as data augmentation on the labeled data to reduce the cost of manual annotation and improve model performance. Experimental results show that PDDM-AL outperforms existing state-of-the-art methods at duplicate detection on benchmark datasets, with Recall improving by up to 28%. PDDM-AL performs well on precision, recall, and F1 score, and is more robust than prior methods, especially on dirty data. With the active learning strategy, the model also improves quickly from only a small amount of labeled data, which further supports its effectiveness.
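The two mechanisms named above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not state how PDDM-AL scores the "most valuable" samples, so entropy-based uncertainty sampling is assumed here, and the function names (`select_for_labeling`, `r_drop_loss`) and the weight `alpha` are hypothetical. The R-Drop term follows the standard formulation: the same input is passed through the model twice with dropout active, and the loss adds a symmetric KL penalty between the two predicted distributions to the usual negative log-likelihood.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(pool_probs, budget):
    """Return indices of the `budget` most uncertain unlabeled record pairs.

    Assumption: uncertainty is measured by prediction entropy; the paper's
    actual selection criterion may differ.
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def r_drop_loss(p1, p2, label, alpha=1.0, eps=1e-12):
    """R-Drop loss for one example.

    p1 and p2 are the class probabilities from two forward passes of the
    same input with different dropout masks. `alpha` (hypothetical default)
    weights the consistency term.
    """
    nll = -math.log(p1[label] + eps) - math.log(p2[label] + eps)
    sym_kl = 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))
    return nll + alpha * sym_kl


# Predicted [non-duplicate, duplicate] probabilities for four unlabeled pairs.
pool = [[0.98, 0.02], [0.55, 0.45], [0.10, 0.90], [0.50, 0.50]]
print(select_for_labeling(pool, budget=2))  # the two least confident pairs
print(r_drop_loss([0.8, 0.2], [0.6, 0.4], label=0))
```

In an active learning round, the selected pairs would be sent to an annotator, added to the labeled set, and the model fine-tuned again with the R-Drop loss before the next selection step.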