Abstract: Security experts reverse engineer (decompile) binary code to identify critical security vulnerabilities. Because source code is often unavailable for vital systems, such as firmware, drivers, and proprietary software used in Critical Infrastructures (CI), analysis at the binary level is all the more crucial. Even when source code is available, a semantic gap persists after compilation between the source and the binary code executed by the processor, and this gap may hinder the detection of vulnerabilities in source code. Nevertheless, current research on Large Language Models (LLMs) overlooks the significance of decompiled binaries in this area by focusing solely on source code. In this work, we are the first to empirically uncover the substantial semantic limitations of state-of-the-art LLMs in analyzing vulnerabilities in decompiled binaries, limitations that stem largely from the absence of relevant datasets. To bridge the gap, we introduce DeBinVul, a novel decompiled binary code vulnerability dataset. Our dataset is multi-architecture and multi-optimization, focusing on C/C++ due to their wide usage in CI and association with numerous vulnerabilities. Specifically, we curate 150,872 samples of vulnerable and non-vulnerable decompiled binary code for four tasks in the domain of decompiled binaries: (i) identifying vulnerabilities; (ii) classifying vulnerabilities; (iii) describing vulnerabilities; and (iv) recovering function names. Subsequently, we fine-tune state-of-the-art LLMs using DeBinVul and report performance increases of 19%, 24%, and 21% for CodeLlama, Llama3, and CodeGen2, respectively, in detecting binary code vulnerabilities. Additionally, using DeBinVul, we report a high performance of 80-90% on the vulnerability classification task. Furthermore, we report improved performance in function name recovery and vulnerability description tasks.

Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries
