Abstract: Provenance identification, an essential step in binary analysis, aims to recover the specific compiler and configuration used to generate an executable. Existing solutions extract syntactic, structural, and semantic features from disassembled programs and apply machine learning techniques to identify the compilation provenance of binaries. However, their effectiveness relies heavily on disassembly tools (e.g., IDA Pro) and tedious feature engineering, and accurate assembly code is difficult to obtain, particularly from stripped or obfuscated binaries. In addition, the features used by these machine learning approaches are selected manually based on domain knowledge of one specific architecture and do not transfer to other architectures. In this paper, we develop BinProv, an end-to-end provenance identification system that leverages a BERT (Bidirectional Encoder Representations from Transformers) based embedding model to learn and represent contextual semantics and syntax directly from binary code. BinProv therefore avoids both the disassembly step and manual feature selection. Moreover, BinProv can distinguish compilers and the four optimization levels (O0/O1/O2/O3) by fine-tuning a classifier model on the embeddings for each specific provenance identification task. Experimental results show that BinProv achieves 92.14%, 99.4%, and 99.8% accuracy at the byte sequence, function, and binary levels, respectively. We further demonstrate that BinProv works well on obfuscated binary code, suggesting that it is a viable approach to substantially reduce disassembler dependence in future provenance identification tasks. Finally, our case studies show that BinProv can better identify compiler helper functions and improve the performance of binary code similarity detection.
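
To make the idea of working "directly from the binary code" concrete, the following is a minimal sketch of how raw bytes might be turned into fixed-length token sequences for a BERT-style model, and how per-sequence predictions might be combined into one label for a function or binary. The abstract does not describe the tokenization or the aggregation scheme, so everything here is an assumption for illustration: the names `bytes_to_sequences` and `majority_vote`, the special-token ids, the sequence length, and the use of majority voting are all hypothetical.

```python
from collections import Counter

# Hypothetical special-token ids placed after the 256 byte values;
# an actual model vocabulary may be laid out differently.
PAD_ID, CLS_ID = 256, 257


def bytes_to_sequences(raw: bytes, seq_len: int = 8) -> list[list[int]]:
    """Split raw bytes into fixed-length token-id sequences, one token per byte."""
    sequences = []
    for start in range(0, len(raw), seq_len):
        chunk = list(raw[start:start + seq_len])
        chunk += [PAD_ID] * (seq_len - len(chunk))  # pad the final short chunk
        sequences.append([CLS_ID] + chunk)          # prepend a classification token
    return sequences


def majority_vote(labels: list[str]) -> str:
    """Combine per-sequence labels into a single label (assumed aggregation scheme)."""
    return Counter(labels).most_common(1)[0][0]


# Raw machine-code bytes, used here without any disassembly step.
code = bytes.fromhex("554889e54883ec10c745fc00000000c9c3")
seqs = bytes_to_sequences(code)

# Placeholder per-sequence predictions standing in for a classifier's output.
predicted = ["gcc-O2", "gcc-O2", "clang-O0"]
binary_label = majority_vote(predicted)
```

In this sketch each sequence would be fed to the embedding model independently; the classifier itself is left out because the abstract gives no details about its architecture.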

BinProv: Binary Code Provenance Identification Without Disassembly.
