Assemblage: Automatic Binary Dataset Construction for Machine Learning

Chang Liu, Rebecca Saul, Yihao Sun, Edward Raff, Maya Fuchs, Townsend Southard Pantano, James Holt, Kristopher Micinski
2024-05-07
Abstract: Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpuses of malicious binaries, obtaining high-quality corpuses of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis utilize either costly commercial corpuses (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpuses suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish "recipes" for their datasets, and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpuses of high-quality Windows PE binaries in training modern learning-based binary analyses. Assemblage can be downloaded from
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
The paper addresses dataset construction and quality for machine learning on binary code, with a focus on Windows PE binaries. Its main objectives are:

1. **Constructing high-quality datasets**: High-quality benign binaries for modern systems, especially in Windows PE format, are hard to obtain because of copyright and licensing issues. The paper proposes Assemblage, an automated system that crawls, configures, and builds Windows PE binaries to produce datasets suitable for training state-of-the-art models.
2. **Addressing data availability**: The scale and diversity of available datasets have long limited machine learning for binary analysis. Assemblage tackles dataset size, diversity, and licensing by automatically downloading and compiling open-source code from platforms such as GitHub, particularly targeting Windows PE binaries.
3. **Extensibility and reproducibility**: Assemblage is designed to be extensible and reproducible, allowing users to publish "recipes" for their datasets. It can adapt to different code sources, compilers, and feature extractors, which supports reproducible research and further customization.
4. **Large-scale dataset generation**: Running Assemblage on AWS over the past year produced 890k Windows PE and 428k Linux ELF binaries across 29 configurations, providing large-scale training data for machine learning models.
5. **Empirical evaluation**: The paper evaluates Assemblage by using its data to train learning-based pipelines for compiler provenance and binary function similarity, illustrating the practical need for robust corpora of Windows PE binaries.
In summary, the core contribution of this paper is an automated, large-scale, extensible system for constructing high-quality binary datasets. It places particular emphasis on the availability and quality of Windows PE binaries, with the goal of advancing machine learning for binary analysis.
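To make the idea of a publishable dataset "recipe" more concrete, the following is a minimal sketch of how such a recipe might be represented and expanded into a build matrix. It is not Assemblage's actual schema or API: the `Recipe` and `BuildConfig` classes, their field names, and the placeholder toolchain names are all assumptions introduced purely for illustration.

```python
import hashlib
import itertools
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BuildConfig:
    """One point in the build matrix (hypothetical field names)."""
    toolchain: str
    optimization: str
    architecture: str


@dataclass(frozen=True)
class Recipe:
    """A shareable description of how a dataset is built (hypothetical schema)."""
    source_description: str
    toolchains: tuple
    optimizations: tuple
    architectures: tuple

    def configurations(self):
        """Expand the recipe into the Cartesian product of its build settings."""
        return [
            BuildConfig(t, o, a)
            for t, o, a in itertools.product(
                self.toolchains, self.optimizations, self.architectures
            )
        ]

    def fingerprint(self):
        """Return a stable hash so two users can confirm they share a recipe."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values only; they do not reflect the paper's configurations.
recipe = Recipe(
    source_description="open-source C/C++ repositories",
    toolchains=("toolchain-a", "toolchain-b"),
    optimizations=("O0", "O2"),
    architectures=("x86", "x64"),
)
configs = recipe.configurations()
```

Here the example recipe expands to 2 x 2 x 2 = 8 build configurations, and the fingerprint depends only on the recipe's contents, so anyone holding the same recipe derives the same identifier. The paper's 29 configurations are not reproduced by this sketch.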