Abstract: Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpuses of malicious binaries, obtaining high-quality corpuses of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine-learning-based pipelines for binary analysis utilize either costly commercial corpuses (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpuses suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish "recipes" for their datasets, and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpuses of high-quality Windows PE binaries in training modern learning-based binary analyses. Assemblage can be downloaded from

What problem does this paper attempt to address?

The paper addresses dataset construction and quality for machine learning on binary code, with a focus on Windows PE binaries. Its main objectives are:

1. **Constructing high-quality datasets**: High-quality benign binaries for modern systems, especially in Windows PE format, are hard to obtain because of copyright and licensing issues. The paper proposes Assemblage, an automated dataset construction system that crawls, configures, and compiles source code into Windows PE binaries suitable for training state-of-the-art models.
2. **Addressing data availability**: The scale and diversity of available datasets is a long-standing obstacle to applying machine learning to binary analysis. Assemblage addresses dataset size, diversity, and licensing by automatically downloading and compiling open-source projects hosted on platforms such as GitHub, with a particular focus on Windows PE binaries.
3. **Extensibility and reproducibility**: Assemblage is designed to be extensible and reproducible. Users can publish "recipes" for their datasets, and the system can be adapted to different code sources, compilers, and feature extractors, which supports reproducible research and further customization.
4. **Large-scale dataset generation**: Running Assemblage on AWS for one year produced 890,000 Windows PE and 428,000 Linux ELF binaries across 29 configurations, giving machine learning models a large-scale training corpus.
5. **Empirical evaluation**: The paper evaluates Assemblage in three case studies on compiler provenance detection and binary function similarity, demonstrating the practical value of, and the need for, Windows PE binary datasets.

In summary, the core contribution of this paper is an automated, large-scale, and extensible system for constructing high-quality binary datasets, with particular emphasis on the availability and quality of Windows PE binaries, to advance machine learning for binary analysis.
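
To make the idea of a dataset "recipe" concrete, the following is a minimal sketch of how a recipe could expand into build tasks. It is not Assemblage's actual schema or API: the `Recipe` and `BuildConfig` classes, their field names, the example repository URLs, and the toolchain and flag values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BuildConfig:
    """One hypothetical build configuration (field names are illustrative)."""
    toolchain: str
    optimization: str
    mode: str


@dataclass(frozen=True)
class Recipe:
    """A hypothetical dataset recipe: sources plus the axes to vary when building."""
    repo_urls: tuple
    toolchains: tuple
    optimizations: tuple
    modes: tuple

    def configurations(self):
        """Expand the axes into every combination, in a deterministic order."""
        return [
            BuildConfig(t, o, m)
            for t, o, m in product(self.toolchains, self.optimizations, self.modes)
        ]

    def build_tasks(self):
        """Pair every repository with every configuration."""
        return [(url, cfg) for url in self.repo_urls for cfg in self.configurations()]


# Illustrative values only; these are not the configurations used in the paper.
recipe = Recipe(
    repo_urls=(
        "https://example.com/org/project-a.git",
        "https://example.com/org/project-b.git",
    ),
    toolchains=("msvc-v142", "msvc-v143"),
    optimizations=("Od", "O1", "O2"),
    modes=("Debug", "Release"),
)
tasks = recipe.build_tasks()
print(len(recipe.configurations()), len(tasks))  # prints "12 24"
```

Because the recipe is plain data and the expansion is deterministic, publishing the recipe would in principle let another user regenerate the same list of build tasks, which is the reproducibility property the summary describes.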

Assemblage: Automatic Binary Dataset Construction for Machine Learning

SoK: All You Ever Wanted to Know About x86/x64 Binary Disassembly But Were Afraid to Ask

FMDiv: Functional Module Division on Binary Malware for Accurate Malicious Code Localization

EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models

Towards usable automated detection of CPU architecture and endianness for arbitrary binary files and object code sequences

UniBin: Assembly Semantic-enhanced Binary Vulnerability Detection without Disassembly

On the Generation of Disassembly Ground Truth and the Evaluation of Disassemblers

ROMEO: Exploring Juliet through the Lens of Assembly Language

Ground Truth for Binary Disassembly is Not Easy

Is Function Similarity Over-Engineered? Building a Benchmark

Cornucopia: A Framework for Feedback Guided Generation of Binaries

Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries

BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching

Unsupervised Binary Code Translation with Application to Code Similarity Detection and Vulnerability Discovery

Quo Vadis: Hybrid Machine Learning Meta-Model based on Contextual and Behavioral Malware Representations

BinSimDB: Benchmark Dataset Construction for Fine-Grained Binary Code Similarity Analysis

On Training a Neural Network to Explain Binaries

How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models

BinProv: Binary Code Provenance Identification Without Disassembly

Leveraging Artificial Intelligence on Binary Code Comprehension