SimClone: Detecting Tabular Data Clones using Value Similarity

Xu Yang, Gopi Krishnan Rajbahadur, Dayi Lin, Shaowei Wang, Zhen Ming Jiang
2024-06-24
Abstract: Data clones are defined as multiple copies of the same data among datasets. The presence of data clones between datasets can cause issues such as difficulties in managing data assets and data license violations when using datasets with clones to build AI software. However, detecting data clones is not trivial. The majority of prior studies in this area rely on structural information to detect data clones (e.g., font size, column header). However, tabular datasets used to build AI software are typically stored without any structural information. In this paper, we propose a novel method called SimClone for data clone detection in tabular datasets without relying on structural information. The SimClone method uses value similarities for data clone detection. We also propose a visualization approach as a part of our SimClone method to help locate the exact position of the cloned data between a dataset pair. Our results show that SimClone outperforms the current state-of-the-art method by at least 20% in terms of both F1-score and AUC. In addition, SimClone's visualization component helps identify the exact location of the data clone in a dataset with a Precision@10 value of 0.80 in the top 20 true positive predictions.
Databases, Artificial Intelligence, Machine Learning, Software Engineering
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to detect data clones between tabular datasets when building AI software. Specifically, the paper aims to overcome the limitations of existing methods and proposes a new method, SimClone, to detect data clones in tabular datasets without relying on structural information.

### Problem Background

Data clones refer to the existence of identical data copies in multiple datasets. The existence of data clones may lead to the following problems:

1. **Difficulties in data asset management**: It is difficult to manage and track data assets.
2. **Data license violations**: When using datasets containing cloned data to build AI software, data license agreements may be violated.
3. **Data leakage and bias introduction**: Cloned data may lead to data leakage and introduce bias in AI model training and evaluation, thereby affecting the fairness and accuracy of the model.

### Limitations of Existing Methods

The existing data clone detection methods mainly have the following limitations:

1. **Only applicable to homogeneous datasets**: Many methods can only detect data clones in homogeneous datasets (such as image datasets) and cannot handle structured heterogeneous datasets (such as tabular datasets).
2. **Dependence on structural information**: Some methods rely on structural or format information (such as row or column headers, formulas, background colors, etc.), which are usually absent in tabular datasets used for AI software development.
3. **Only considering record-level clones**: Some methods only focus on record-level clones (i.e., row-level clones) and ignore column-level clones.

### Innovations of SimClone

To solve the above problems, the paper proposes the SimClone method, and its main innovations include:

1. **Detection based on value similarity**: SimClone uses value similarity to detect data clones without relying on any structural information.
2. **Multi-dimensional feature extraction**: SimClone calculates 14 features based on 6 value similarity metrics (such as Jaccard, TextRank, SimHash, Levenshtein, mean and standard deviation).
3. **Supervised learning classifier**: SimClone uses a supervised learning classifier to detect whether there are data clones between two datasets.
4. **Visualization tool**: SimClone also provides a visualization method, combined with SHAP explanations, to help users accurately locate the position of cloned data.

### Experimental Results

The experimental results show that SimClone significantly outperforms the existing state-of-the-art method LTC on multiple evaluation metrics. For example, on the synthetic test set, the F1 score of SimClone reaches 0.83, which is at least 32.3% higher than that of LTC; on the EUSES real-world dataset, SimClone also performs better than LTC on Precision@K (K = 200).

### Summary

In conclusion, this paper addresses the key problem of detecting data clones in tabular datasets by proposing the SimClone method, especially in the absence of structural information. SimClone not only improves the detection accuracy but also provides a visualization tool to help users better understand and locate cloned data.
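
### Illustrative Sketch

To make the value-similarity idea above more concrete, the following is a minimal sketch of how a few of the named metrics (Jaccard, Levenshtein, mean and standard deviation) could be computed for one pair of columns without using any header or formatting information. It is an illustration under assumptions, not the paper's implementation: the function names, the feature names, and the choice to compare joined column text are all hypothetical, and it does not reproduce SimClone's 14 features, its TextRank or SimHash metrics, or its classifier.

```python
from statistics import mean, pstdev


def jaccard(a, b):
    """Jaccard similarity between the sets of distinct values in two columns."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty columns are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein(s, t):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(s, t):
    """Edit distance normalized to [0, 1], where 1.0 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def column_pair_features(col_a, col_b):
    """Hypothetical feature dict for one column pair (assumed, not from the paper)."""
    col_a, col_b = list(col_a), list(col_b)
    features = {"jaccard": jaccard(col_a, col_b)}
    if all(isinstance(v, (int, float)) for v in col_a + col_b):
        # Numeric columns: compare summary statistics of the values.
        features["mean_diff"] = abs(mean(col_a) - mean(col_b))
        features["std_diff"] = abs(pstdev(col_a) - pstdev(col_b))
    else:
        # Text columns: compare the joined cell contents as strings.
        text_a = " ".join(map(str, col_a))
        text_b = " ".join(map(str, col_b))
        features["levenshtein_sim"] = levenshtein_similarity(text_a, text_b)
    return features


if __name__ == "__main__":
    print(column_pair_features([1, 2, 3], [2, 3, 4]))
    print(column_pair_features(["red", "green"], ["red", "grey"]))
```

In a pipeline of the kind described above, feature dictionaries like these would be computed for a dataset pair and passed to a supervised classifier that predicts whether a clone is present; how the paper aggregates features across columns is not specified in this summary.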