Abstract: Data clones are defined as multiple copies of the same data among datasets. The presence of data clones between datasets can cause issues such as difficulties in managing data assets and data license violations when datasets with clones are used to build AI software. Detecting data clones, however, is not trivial. Most prior studies in this area rely on structural information (e.g., font size, column headers) to detect data clones, yet tabular datasets used to build AI software are typically stored without any structural information. In this paper, we propose a novel method called SimClone for data clone detection in tabular datasets without relying on structural information. SimClone uses value similarities for data clone detection. We also propose a visualization approach as part of SimClone to help locate the exact position of the cloned data between a dataset pair. Our results show that SimClone outperforms the current state-of-the-art method by at least 20% in terms of both F1-score and AUC. In addition, SimClone's visualization component helps identify the exact location of the data clone in a dataset with a Precision@10 value of 0.80 in the top 20 true positive predictions.
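
As a reading aid for the Precision@10 figure reported above: Precision@K is commonly computed as the fraction of the top-K ranked candidates that are correct. The following is a minimal sketch of that general definition, assuming the input is a list of binary correctness labels already sorted by the method's ranking. The function name and input format are illustrative assumptions, not taken from the paper.

```python
def precision_at_k(ranked_labels, k):
    """Fraction of the top-k ranked candidates that are correct.

    ranked_labels: sequence of truthy/falsy values, best-ranked first
    (an assumed input format, chosen for illustration).
    Divides by k even if fewer than k candidates are available.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_labels[:k]
    return sum(1 for hit in top if hit) / k


# Example: 8 of the 10 highest-ranked candidate locations are correct.
labels = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
score = precision_at_k(labels, 10)
```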

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to detect data clones between tabular datasets when building AI software. Specifically, the paper aims to overcome the limitations of existing methods by proposing a new method, SimClone, that detects data clones in tabular datasets without relying on structural information.

### Problem Background

Data clones refer to the existence of identical data copies in multiple datasets. The existence of data clones may lead to the following problems:

1. **Difficulties in data asset management**: It is difficult to manage and track data assets.
2. **Data license violations**: When using datasets containing cloned data to build AI software, data license agreements may be violated.
3. **Data leakage and bias introduction**: Cloned data may lead to data leakage and introduce bias in AI model training and evaluation, thereby affecting the fairness and accuracy of the model.

### Limitations of Existing Methods

The existing data clone detection methods mainly have the following limitations:

1. **Only applicable to homogeneous datasets**: Many methods can only detect data clones in homogeneous datasets (such as image datasets) and cannot handle structured heterogeneous datasets (such as tabular datasets).
2. **Dependence on structural information**: Some methods rely on structural or format information (such as row or column headers, formulas, background colors, etc.), which is usually absent in tabular datasets used for AI software development.
3. **Only considering record-level clones**: Some methods only focus on record-level clones (i.e., row-level clones) and ignore column-level clones.

### Innovations of SimClone

To solve the above problems, the paper proposes the SimClone method, and its main innovations include:

1. **Detection based on value similarity**: SimClone uses value similarity to detect data clones without relying on any structural information.
2. **Multi-dimensional feature extraction**: SimClone calculates 14 features based on 6 value similarity metrics (such as Jaccard, TextRank, SimHash, Levenshtein, mean and standard deviation).
3. **Supervised learning classifier**: SimClone uses a supervised learning classifier to detect whether there are data clones between two datasets.
4. **Visualization tool**: SimClone also provides a visualization method, combined with SHAP explanation technology, to help users accurately locate the position of cloned data.

### Experimental Results

The experimental results show that SimClone significantly outperforms the existing state-of-the-art method LTC on multiple evaluation metrics. For example, on the synthetic test set, the F1 score of SimClone reaches 0.83, which is at least 32.3% higher than that of LTC; on the EUSES real-world dataset, SimClone also performs better than LTC on Precision@K (K = 200).

### Summary

In conclusion, this paper addresses the key problem of detecting data clones in tabular datasets by proposing the SimClone method, especially in the absence of structural information. SimClone not only improves the detection accuracy but also provides a visualization tool to help users better understand and locate cloned data.
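
To make the value-similarity idea under "Innovations of SimClone" more concrete, the sketch below computes a few similarity features for one pair of columns using only their values, with no headers or formatting. It is an illustration under stated assumptions, not the paper's implementation: the function names, the choice of three of the named metrics (Jaccard, Levenshtein, mean and standard deviation), and the way they are turned into features are all hypothetical, and the paper's actual 14 features are not reproduced here.

```python
from statistics import mean, pstdev


def jaccard(a, b):
    """Jaccard similarity between the sets of distinct values in two columns."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def levenshtein(s, t):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def _as_floats(col):
    """Return the column as floats, or None if any value is non-numeric."""
    try:
        return [float(v) for v in col]
    except (TypeError, ValueError):
        return None


def column_pair_features(col_a, col_b):
    """Hypothetical feature dict for one column pair (illustrative only)."""
    str_a = [str(v) for v in col_a]
    str_b = [str(v) for v in col_b]
    text_a, text_b = " ".join(str_a), " ".join(str_b)
    longest = max(len(text_a), len(text_b)) or 1
    features = {
        "jaccard": jaccard(str_a, str_b),
        # Normalised so that 1.0 means identical serialised columns.
        "levenshtein_sim": 1 - levenshtein(text_a, text_b) / longest,
    }
    num_a, num_b = _as_floats(col_a), _as_floats(col_b)
    if num_a and num_b:  # both columns numeric and non-empty
        features["mean_diff"] = abs(mean(num_a) - mean(num_b))
        features["std_diff"] = abs(pstdev(num_a) - pstdev(num_b))
    return features


feats = column_pair_features(["1", "2", "3"], ["1", "2", "3"])
```

In a pipeline like the one summarized above, feature vectors of this kind, computed for a dataset pair, would be passed to a supervised classifier that predicts whether the pair contains clones. The specific classifier is not named in this summary, so none is assumed here.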

SimClone: Detecting Tabular Data Clones using Value Similarity
