SepBIN: Binary Feature Separation for Better Semantic Comparison and Authorship Verification

Qige Song, Yafei Sang, Yongzheng Zhang, Shuhao Li, Xiaolin Xu
DOI: https://doi.org/10.1109/tifs.2023.3331895
IF: 7.231
2024-01-01
IEEE Transactions on Information Forensics and Security
Abstract: Binary semantic comparison and authorship verification are critical in many security applications. The former focuses on the functional semantic features of binary code and the latter on the developers’ programming style features, yet the two kinds of features are usually mixed without clear demarcation. Recently, researchers have proposed learning-based approaches for intelligent binary analysis. These approaches generally address a single task with hand-crafted feature sets or neural binary encoders, and they suffer performance bottlenecks due to the noise in the mixed features. This paper proposes SepBIN, a novel neural network framework that exploits the intrinsic correlation between the binary semantic comparison and authorship verification tasks and automatically separates semantic and stylistic binary features. We first construct a strong backbone binary encoder, then use preliminary decomposition subnets and a flexible gating-based feature fusion mechanism to distill pure semantic-related and style-related binary representations, and further improve their quality with a feature reconstruction module. The overall SepBIN model is trained with a multi-objective joint optimization strategy. We conduct extensive experiments on Google Code Jam (GCJ) datasets in different languages and at different scales. Results show that SepBIN benefits binary semantic comparison and authorship verification simultaneously through its binary semantic-style feature separation mechanism, and that the performance gains can be interpreted from multiple perspectives. For state-of-the-art approaches with different binary encoders, SepBIN can adaptively improve them with the designed separation modules. Furthermore, we adopt a pretraining-finetuning strategy to transfer SepBIN’s separation capability to real-world applications, including APT malware homology detection and binary semantic comparison under code obfuscation.
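The separation pipeline named in the abstract (decomposition subnets, gating-based fusion, feature reconstruction, joint objective) can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the single dense layer per subnet, the complementary gate, the mean-squared reconstruction error, the weight `w_rec`, and all function and parameter names are hypothetical and are not taken from the paper.

```python
import math
import random

def linear(x, W, b):
    """Dense layer on list-based vectors: returns W @ x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_layer(rng, out_dim, in_dim):
    """Small random weights and zero bias (hypothetical initialisation)."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return W, [0.0] * out_dim

def init_params(in_dim, hid_dim, seed=0):
    rng = random.Random(seed)
    return {
        "sem": init_layer(rng, hid_dim, in_dim),       # semantic decomposition subnet
        "sty": init_layer(rng, hid_dim, in_dim),       # style decomposition subnet
        "gate": init_layer(rng, hid_dim, 2 * hid_dim), # fusion gate
        "rec": init_layer(rng, in_dim, 2 * hid_dim),   # reconstruction head
    }

def separate(h, p):
    """Split a backbone embedding h into (semantic, style) vectors."""
    # Preliminary decomposition: one tanh dense layer per branch (assumed).
    sem0 = [math.tanh(z) for z in linear(h, *p["sem"])]
    sty0 = [math.tanh(z) for z in linear(h, *p["sty"])]
    # Elementwise gate in (0, 1), computed from both preliminary vectors.
    gate = [1.0 / (1.0 + math.exp(-z)) for z in linear(sem0 + sty0, *p["gate"])]
    # Complementary gated fusion (assumed form): each output is a convex
    # combination of the two preliminary vectors.
    sem = [g * a + (1.0 - g) * b for g, a, b in zip(gate, sem0, sty0)]
    sty = [(1.0 - g) * a + g * b for g, a, b in zip(gate, sem0, sty0)]
    return sem, sty

def reconstruction_loss(h, sem, sty, p):
    """Mean squared error between h and its reconstruction from [sem; sty]."""
    h_hat = linear(sem + sty, *p["rec"])
    return sum((a - b) ** 2 for a, b in zip(h, h_hat)) / len(h)

def joint_loss(sem_task, sty_task, rec, w_rec=0.1):
    """Weighted sum of the two task losses and the reconstruction loss."""
    return sem_task + sty_task + w_rec * rec
```

In this sketch, `separate(h, init_params(len(h), 3))` would be applied to the output of whatever backbone encoder is in use. The semantic vector would then feed the comparison task and the style vector the authorship task, with `joint_loss` combining their losses during training.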