sem2vec: Semantics-Aware Assembly Tracelet Embedding

Huaijin Wang, Pingchuan Ma, Shuai Wang, Qiyi Tang, Sen Nie, Shi Wu
DOI: https://doi.org/10.1145/3569933
IF: 3.685
2022-10-28
ACM Transactions on Software Engineering and Methodology
Abstract: Binary code similarity is the foundation of many security and software engineering applications. Recent works leverage deep neural networks (DNNs) to learn a numeric vector representation (namely, embeddings) of assembly functions, enabling similarity analysis in the numeric space. However, existing DNN-based techniques capture syntactic-, control flow-, or data flow-level information of assembly code, which is too coarse-grained to represent program functionality. These methods can suffer from low robustness to challenging settings such as compiler optimizations and obfuscations. We present sem2vec, a binary code embedding framework that learns from semantics. Given the control-flow graph (CFG) of an assembly function, we divide it into tracelets, denoting continuous and short execution traces that are reachable from the function entry point. We use symbolic execution to extract symbolic constraints and other auxiliary information on each tracelet. We then train masked language models to compute embeddings of symbolic execution outputs. Last, we use graph neural networks to aggregate tracelet embeddings into the CFG-level embedding for a function. Our evaluation shows that sem2vec extracts high-quality embeddings and is robust against different compilers, optimizations, architectures, and popular obfuscation methods including virtualization obfuscation. We further augment a vulnerability search application with embeddings computed by sem2vec and demonstrate a significant improvement in vulnerability search accuracy.
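To make the tracelet step in the abstract concrete, here is a minimal Python sketch of one way to cut a CFG into short paths. It is an illustration only, not the paper's algorithm: the adjacency-dict CFG, the block names, the `max_len` bound, and the choice to start a tracelet at every block reachable from the entry are all assumptions made for this example. Symbolic execution, the masked language model, and the graph neural network are not modeled.

```python
from collections import deque


def reachable_blocks(cfg, entry):
    """Return the blocks reachable from `entry`, in breadth-first order."""
    order = [entry]
    seen = {entry}
    queue = deque([entry])
    while queue:
        block = queue.popleft()
        for succ in cfg.get(block, []):
            if succ not in seen:
                seen.add(succ)
                order.append(succ)
                queue.append(succ)
    return order


def extract_tracelets(cfg, entry, max_len=3):
    """Enumerate paths of at most `max_len` blocks (hypothetical bound).

    A path starts at any block reachable from `entry` and stops when it
    reaches `max_len` blocks or a block without successors. The length
    bound also guarantees termination on CFGs that contain loops.
    """
    tracelets = []
    for start in reachable_blocks(cfg, entry):
        stack = [(start,)]
        while stack:
            path = stack.pop()
            successors = cfg.get(path[-1], [])
            if len(path) == max_len or not successors:
                tracelets.append(path)
                continue
            for succ in successors:
                stack.append(path + (succ,))
    return tracelets


# Toy CFG (assumed): "dead" is not reachable from the entry block.
cfg = {
    "entry": ["a", "b"],
    "a": ["exit"],
    "b": ["exit"],
    "exit": [],
    "dead": ["exit"],
}
tracelets = extract_tracelets(cfg, "entry", max_len=2)
```

In this toy graph the unreachable block `dead` never appears in a tracelet. In a pipeline like the one the abstract describes, each tracelet would then be passed to symbolic execution, and the resulting per-tracelet embeddings aggregated into a function-level embedding.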
computer science, software engineering