Abstract: Program similarity has become an increasingly popular area of research with various security applications such as plagiarism detection, author identification, and malware analysis. However, program similarity research faces a few unique dataset quality problems in evaluating the effectiveness of novel approaches. First, few high-quality datasets for binary program similarity exist and are widely used in this domain. Second, there are potentially many disparate definitions of what makes one program similar to another, and in many cases there is a large semantic gap between the labels provided by a dataset and any useful notion of behavioral or semantic similarity. In this paper, we present HELIX, a framework for generating large, synthetic program similarity datasets. We also introduce Blind HELIX, a tool built on top of HELIX for extracting HELIX components from library code automatically using program slicing. We evaluate HELIX and Blind HELIX by comparing the performance of program similarity tools on a HELIX dataset to a hand-crafted dataset built from multiple, disparate notions of program similarity. Using Blind HELIX, we show that HELIX can generate realistic and useful datasets of virtually infinite size for program similarity research with ground truth labels that embody practical notions of program similarity. Finally, we discuss the results and reason about relative tool ranking.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses the lack of high-quality datasets in program similarity research. Specifically, the authors point out the following challenges:

1. **Lack of high-quality binary program similarity datasets**:
   - Existing datasets are usually of low quality and struggle to capture a notion of program similarity that is useful in practice.
   - There is a large semantic gap between the labels a dataset provides and any useful notion of semantic or behavioral similarity.
2. **Different definitions of program similarity**:
   - There are many definitions of "similarity", and different application scenarios (such as plagiarism detection, author identification, and malware analysis) interpret similarity differently.
   - There is no unified standard for evaluating the performance of tools and methods under these different definitions.
3. **Limitations of existing datasets**:
   - Existing datasets often do not represent real-world problems well, which hurts the reproducibility and representativeness of experimental results.
   - Many datasets are private or not fully public, which limits their use.

To solve these problems, the authors propose a framework named HELIX and a tool built on top of it, Blind HELIX, for generating large-scale synthetic program similarity datasets. These datasets have known, configurable ground truth labels and better reflect practical notions of program similarity.

### Main contributions of HELIX and Blind HELIX

1. **First use of program slicing and recombination to generate synthetic datasets**:
   - Code fragments from open-source libraries are combined into samples with known similarity, which keeps the dataset both realistic and controllable.
2. **Open-source framework HELIX**:
   - It provides a general program generation and mutation framework that supports multiple programming languages, compilers, and build systems, and is suited to dataset generation for program similarity research.
3. **Blind HELIX, a tool for automatically extracting functional components**:
   - It uses program slicing to automatically extract functional components from existing open-source libraries, greatly improving the efficiency and scale of dataset generation.
4. **Evaluation and comparison of existing tools**:
   - The generated datasets are used to evaluate multiple existing program similarity tools and to show their relative performance under different similarity definitions.

Through these contributions, the authors hope to provide higher-quality, standardized datasets for program similarity research and to promote further development in this field.
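The idea of combining components into samples with known similarity can be illustrated with a minimal sketch. This is not HELIX's implementation: the component names, the `build_dataset` helper, and the choice of Jaccard overlap as the ground truth label are all assumptions made here for illustration. Each synthetic "program" is modeled only as a set of component identifiers, so the label for any pair follows directly from how much they share.

```python
import random
from itertools import combinations

# Hypothetical component pool. In a real pipeline these would be code
# fragments extracted from libraries; here they are just placeholder names.
COMPONENTS = [f"component_{i}" for i in range(20)]


def generate_sample(rng, size):
    """Pick `size` distinct components to form one synthetic sample."""
    return frozenset(rng.sample(COMPONENTS, size))


def ground_truth_similarity(a, b):
    """Jaccard overlap of two component sets (an assumed label definition)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_dataset(n_samples, size, seed=0):
    """Generate samples plus a ground truth label for every unordered pair."""
    rng = random.Random(seed)
    samples = [generate_sample(rng, size) for _ in range(n_samples)]
    labels = {
        (i, j): ground_truth_similarity(samples[i], samples[j])
        for i, j in combinations(range(n_samples), 2)
    }
    return samples, labels


if __name__ == "__main__":
    samples, labels = build_dataset(n_samples=5, size=6)
    for pair, score in sorted(labels.items()):
        print(pair, round(score, 3))
```

Because the labels are derived from the generation process rather than assigned afterwards, the dataset can be made as large as needed, and the scores reported by a similarity tool can be compared against labels that are known by construction.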

Synthetic Datasets for Program Similarity Research
