SimClone: Detecting Tabular Data Clones using Value Similarity

Xu Yang, Gopi Krishnan Rajbahadur, Dayi Lin, Shaowei Wang, Zhen Ming (Jack) Jiang
DOI: https://doi.org/10.1145/3676961
IF: 3.685
2024-07-16
ACM Transactions on Software Engineering and Methodology
Abstract: Data clones are defined as multiple copies of the same data among datasets. The presence of data clones between datasets can cause issues such as difficulties in managing data assets and data license violations when datasets with clones are used to build AI software. However, detecting data clones is not trivial. The majority of prior studies in this area rely on structural information (e.g., font size, column header) to detect data clones. However, tabular datasets used to build AI software are typically stored without any structural information. In this paper, we propose a novel method called SimClone for data clone detection in tabular datasets without relying on structural information. The SimClone method uses value similarities for data clone detection. We also propose a visualization approach as part of SimClone to help locate the exact position of the cloned data between a dataset pair. Our results show that SimClone outperforms the current state-of-the-art method by at least 20% in terms of both F1-score and AUC. In addition, SimClone’s visualization component helps identify the exact location of the data clone in a dataset with a Precision@10 value of 0.80 in the top 20 true positive predictions.
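To make the idea of comparing tables by value rather than by structure concrete, the following is a minimal illustrative sketch. It is not the SimClone algorithm, whose details the abstract does not give. It assumes header-less tables represented as lists of rows, and it swaps in plain Jaccard overlap between column value sets as a stand-in similarity measure. The helper names (`column_values`, `jaccard`, `column_similarity_matrix`, `precision_at_k`) are hypothetical. The last helper shows how a Precision@k figure such as the one reported above is typically computed from a ranked list of hits.

```python
def column_values(table, col):
    """Set of normalized (stringified, trimmed, lowercased) values in one column."""
    return {str(row[col]).strip().lower() for row in table}


def jaccard(a, b):
    """Jaccard similarity of two sets; taken to be 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def column_similarity_matrix(left, right):
    """Pairwise value overlap between every column of `left` and of `right`.

    Assumption: both tables are non-empty lists of equal-length rows with no
    headers, so columns are matched by content only, not by name or position.
    """
    left_sets = [column_values(left, i) for i in range(len(left[0]))]
    right_sets = [column_values(right, j) for j in range(len(right[0]))]
    return [[jaccard(ls, rs) for rs in right_sets] for ls in left_sets]


def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked items that are relevant (1) vs. not (0)."""
    return sum(ranked_relevance[:k]) / k


# Two toy tables: the second reorders and partially copies columns of the first.
left = [["alice", 30, "paris"], ["bob", 25, "rome"]]
right = [["rome", "bob"], ["paris", "alice"], ["oslo", "carol"]]

matrix = column_similarity_matrix(left, right)
# A high score at matrix[0][1] and matrix[2][0] points to where values overlap.
```

A matrix like this can be rendered as a heatmap to show where two datasets share values, which is the general spirit of locating a clone by position. A real detector would need to handle numeric tolerance, partial row overlap, and scale.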
computer science, software engineering