Abstract: Binary code similarity analysis is widely used in vulnerability search, where source code may not be available, to detect whether two binary functions are similar. Building on deep learning and natural language processing techniques, several approaches have been proposed to perform cross-platform binary code similarity analysis using control flow graphs. However, existing schemes suffer from large differences in instruction syntax across target platforms, an inability to align control flow graph nodes, and limited use of stable high-level semantics, all of which make it hard to identify similar computations between binary functions compiled from the same source code for different platforms. We argue that extracting stable, platform-independent semantics can improve model accuracy, and we propose N_Match, a cross-platform binary function similarity comparison model. The model lifts instructions from different platforms into the same semantic space to mask their underlying platform differences, uses graph embedding to learn the stable semantics of neighbors, extracts high-level knowledge from the naming function to reduce the differences introduced by cross-platform and cross-optimization-level compilation, and combines the stable graph structure with the stable, platform-independent API knowledge from the naming function to represent the final semantics of a function. Experimental results show that N_Match is more accurate than the baseline model in cross-platform, cross-optimization-level, and industrial scenarios. In the vulnerability search experiment, N_Match significantly improves hit@N, and its mAP exceeds that of the current graph embedding model by 66%. We also report several interesting observations from the experiments. The code and model are publicly available at https://www.github.com/CSecurityZhongYuan/Binary-Name_Match.
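The abstract reports hit@N and mAP for the vulnerability search experiment without defining them. The following is a minimal sketch of how these ranking metrics are commonly computed, not the authors' evaluation code. The function names, the sample function identifiers, and the reading of hit@N as "at least one true match in the top N" are assumptions made for illustration; the paper may use a different variant.

```python
from typing import Iterable, Sequence, Set, Tuple


def hit_at_n(ranked: Sequence[str], relevant: Set[str], n: int) -> float:
    """Return 1.0 if any relevant function is among the top n results, else 0.0.

    Assumed definition: other papers report the fraction of relevant
    functions recovered in the top n instead.
    """
    return 1.0 if any(item in relevant for item in ranked[:n]) else 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision@k taken at each rank k where a relevant item occurs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    queries: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Mean of average_precision over all (ranking, relevant set) query pairs."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical search result: candidate functions ordered by similarity to
# a known-vulnerable query function; f1 and f2 are the true matches.
ranking = ["f3", "f1", "f7", "f2"]
matches = {"f1", "f2"}
ap = average_precision(ranking, matches)
```

In this sketch, each query is one vulnerable function, and the ranking would come from sorting candidate functions by embedding similarity. mAP rewards placing every true match near the top of the list, while hit@N only checks whether a match appears in the first N results.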

Investigating Neural-based Function Name Reassignment from the Perspective of Binary Code Representation

A Lightweight Framework for Function Name Reassignment Based on Large-Scale Stripped Binaries

Boosting Neural Networks to Decompile Optimized Binaries

RENN: Efficient Reverse Execution with Neural-Network-assisted Alias Analysis

Neural reverse engineering of stripped binaries using augmented control flow graphs

Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs

Semantics-Recovering Decompilation through Neural Machine Translation

Two Sides of the Same Coin: Exploiting the Impact of Identifiers in Neural Code Comprehension

How Important Are Good Method Names in Neural Code Generation? A Model Robustness Perspective

Beyond the C: Retargetable Decompilation using Neural Machine Translation

Enhancing Function Name Prediction using Votes-Based Name Tokenization and Multi-Task Learning

Source Code Foundation Models are Transferable Binary Analysis Knowledge Bases

Symbol Preference Aware Generative Models for Recovering Variable Names from Stripped Binary

Neural-FEBI: Accurate Function Identification in Ethereum Virtual Machine Bytecode

IRaDT: LLVM IR as Target for Efficient Neural Decompilation

Function-Level Compilation Provenance Identification with Multi-Faceted Neural Feature Distillation and Fusion

Redundancy and Concept Analysis for Code-trained Language Models

A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis

Llasm: Naming Functions in Binaries by Fusing Encoder-only and Decoder-only LLMs

Binary code similarity analysis based on naming function and common vector space

On Training a Neural Network to Explain Binaries