Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries

Dylan Manuel, Nafis Tanveer Islam, Joseph Khoury, Ana Nunez, Elias Bou-Harb, Peyman Najafirad
2024-11-08
Abstract: Security experts reverse engineer (decompile) binary code to identify critical security vulnerabilities. The limited access to source code in vital systems - such as firmware, drivers, and proprietary software used in Critical Infrastructures (CI) - makes this analysis even more crucial on the binary level. Even with available source code, a semantic gap persists after compilation between the source and the binary code executed by the processor. This gap may hinder the detection of vulnerabilities in source code. That being said, current research on Large Language Models (LLMs) overlooks the significance of decompiled binaries in this area by focusing solely on source code. In this work, we are the first to empirically uncover the substantial semantic limitations of state-of-the-art LLMs when it comes to analyzing vulnerabilities in decompiled binaries, largely due to the absence of relevant datasets. To bridge the gap, we introduce DeBinVul, a novel decompiled binary code vulnerability dataset. Our dataset is multi-architecture and multi-optimization, focusing on C/C++ due to their wide usage in CI and association with numerous vulnerabilities. Specifically, we curate 150,872 samples of vulnerable and non-vulnerable decompiled binary code for the task of (i) identifying; (ii) classifying; (iii) describing vulnerabilities; and (iv) recovering function names in the domain of decompiled binaries. Subsequently, we fine-tune state-of-the-art LLMs using DeBinVul and report on a performance increase of 19%, 24%, and 21% in the capabilities of CodeLlama, Llama3, and CodeGen2 respectively, in detecting binary code vulnerabilities. Additionally, using DeBinVul, we report a high performance of 80-90% on the vulnerability classification task. Furthermore, we report improved performance in function name recovery and vulnerability description tasks.
Cryptography and Security, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the significant semantic limitations of existing large language models (LLMs) when they analyze vulnerabilities in decompiled binary code. These models perform far worse on decompiled binaries than on source code, mainly because no relevant datasets exist to train them. As a result, they do poorly at identifying, classifying, and describing vulnerabilities and at recovering function names. To address this, the authors set three research objectives:

1. **Empirical Investigation**: The authors empirically demonstrate the vulnerability semantic gap in state-of-the-art LLMs: their performance on decompiled binary code is significantly lower than on source code.
2. **Constructing a Dataset**: The authors introduce DeBinVul, a new decompiled binary code vulnerability dataset. It contains 150,872 samples of vulnerable and non-vulnerable decompiled binary code, covers multiple architectures (x86, x64, ARM, MIPS) and optimization levels (O0, O3), and labels each sample with a vulnerability type (CWE category).
3. **Model Fine-Tuning and Optimization**: The authors fine-tune existing LLMs on DeBinVul and evaluate the fine-tuned models on vulnerability detection, classification, description, and function name recovery. After fine-tuning, the vulnerability detection performance of CodeLlama, Llama 3, and CodeGen2 increases by 19%, 24%, and 21% respectively.

### Main Contributions

1. **First Empirical Research**: To the authors' knowledge, this is the first empirical study of the vulnerability semantic gap of LLMs on decompiled binary code. The results show that existing models perform poorly on decompiled binaries.
2. **Constructing a New Dataset**: The authors construct and release DeBinVul, which contains 150,872 samples and targets four binary code vulnerability analysis tasks: vulnerability detection, classification, description, and function name recovery.
3. **Performance Improvement**: Fine-tuning existing LLMs on DeBinVul significantly improves their performance in decompiled binary code vulnerability analysis. In particular, the performance of CodeLlama, Llama 3, and CodeGen2 increases by 19%, 24%, and 21% respectively.

### Evaluation Metrics

The authors evaluate the models with task-specific metrics. For vulnerability detection and classification, they use accuracy, precision, recall, and F1 score. For function name prediction and code description generation, they use metrics such as BLEU, ROUGE-L, BERTScore, and semantic similarity.

### Experimental Analysis

The authors split DeBinVul into 80% training, 10% validation, and 10% test sets. The training data includes source code from the NVD dataset as of December 2021, so that the test data always follows the training data chronologically. All benchmark models are trained on an NVIDIA DGX server with an AMD EPYC 7742 64-core processor, 1 TB of memory, and 8 NVIDIA A100 GPUs. The models are trained for 4 epochs with a maximum token length of 512, a learning rate of 2e-5, and a batch size of 4. Generation tasks use a beam size of 1 and a temperature of 1.0.
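
As a rough illustration of the experimental setup described above, the following Python sketch records the reported hyperparameters and performs an 80/10/10 split. The names `FineTuneConfig` and `split_80_10_10` are hypothetical and do not come from the paper. The split shown is a plain seeded shuffle, which is an assumption: it does not reproduce the chronological ordering between training and test data that the summary mentions, and the authors' actual procedure is not described here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported in the summary (field names are hypothetical)."""
    epochs: int = 4
    max_token_length: int = 512
    learning_rate: float = 2e-5
    batch_size: int = 4
    beam_size: int = 1         # used for generation tasks
    temperature: float = 1.0   # used for generation tasks


def split_80_10_10(samples, seed=0):
    """Shuffle samples and split them into train/validation/test parts.

    Assumption: a simple seeded random shuffle. The real split may be
    time-ordered or stratified; the summary does not say.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10   # 80% for training
    n_val = len(items) // 10         # 10% for validation
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]   # remainder, roughly 10%
    return train, val, test


# Sample IDs stand in for the 150,872 decompiled functions.
train, val, test = split_80_10_10(range(150_872))
print(len(train), len(val), len(test))  # 120697 15087 15088
```

With integer arithmetic the three parts always sum to the dataset size, so any rounding remainder lands in the test set.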

### Research Question Answering

1. **RQ1: Vulnerability Identification and Classification**
   - In the vulnerability identification task, the F1 score of models trained on DeBinVul increases by 18% or more. In particular, CodeGen2 and Llama 3 reach 91% accuracy on this task, about 30% higher than the baseline.
   - In the vulnerability classification task, the F1 score of every baseline model is below 5%.
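
For readers unfamiliar with the scores quoted for RQ1, here is a minimal Python sketch of the standard binary definitions of accuracy, precision, recall, and F1. The function name `binary_metrics` and the toy labels are illustrative assumptions, not the paper's evaluation code. How the authors average these scores across CWE classes in the classification task is not stated in this summary.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = vulnerable)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: 8 functions, 4 of them truly vulnerable.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
```

In this toy case there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four scores equal 0.75.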