WitheredLeaf: Finding Entity-Inconsistency Bugs with LLMs

Hongbo Chen, Yifan Zhang, Xing Han, Huanyao Rong, Yuheng Zhang, Tianhao Mao, Hang Zhang, XiaoFeng Wang, Luyi Xing, Xun Chen
2024-05-03
Abstract: Originating from semantic bugs, Entity-Inconsistency Bugs (EIBs) involve misuse of syntactically valid yet incorrect program entities, such as variable identifiers and function names, which often have security implications. Unlike straightforward syntactic vulnerabilities, EIBs are subtle and can remain undetected for years. Traditional detection methods, such as static analysis and dynamic testing, often fall short due to the versatile and context-dependent nature of EIBs. However, with advancements in Large Language Models (LLMs) like GPT-4, we believe LLM-powered automatic EIB detection becomes increasingly feasible through these models' semantics understanding abilities. This research first undertakes a systematic measurement of LLMs' capabilities in detecting EIBs, revealing that GPT-4, while promising, shows limited recall and precision that hinder its practical application. The primary problem lies in the model's tendency to focus on irrelevant code snippets devoid of EIBs. To address this, we introduce a novel, cascaded EIB detection system named WitheredLeaf, which leverages smaller, code-specific language models to filter out most negative cases and mitigate the problem, thereby significantly enhancing the overall precision and recall. We evaluated WitheredLeaf on 154 Python and C GitHub repositories, each with over 1,000 stars, identifying 123 new flaws, 45% of which can be exploited to disrupt the program's normal operations. Out of 69 submitted fixes, 27 have been successfully merged.
Cryptography and Security, Software Engineering
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the **detection of Entity-Inconsistency Bugs (EIBs)**. EIBs are misuses of program entities (such as variable names and function names) that are syntactically valid but semantically wrong. These errors often have security implications and can lead to denial of service and to compromised control-flow and data-flow integrity.

### Specific Challenges of the Problem

1. **Limitations of Traditional Detection Methods**:
   - **Static Analysis**: It struggles to capture EIBs, which are highly context-dependent and take many forms.
   - **Dynamic Testing** (such as fuzz testing): It can discover some errors, but because of low code coverage it misses most EIBs.
   - **Memory Error Detection Tools** (such as AddressSanitizer): They cannot capture EIBs that cause only logic errors without memory corruption.
2. **Ineffectiveness of Existing Methods**:
   - EIBs may remain undetected for many years. For example, the bug shown in Figure 1 of the paper existed in a popular GitHub repository for about 7 years before being discovered.

### The Method Proposed in the Paper

To address these problems, the paper proposes a new system named **WitheredLeaf**, which uses the semantic understanding capabilities of large language models (LLMs) such as GPT-4 to detect EIBs. Specifically:

1. **Preliminary Measurement Study**:
   - The study first systematically measures the ability of LLMs (especially GPT-4) to detect EIBs. The results show that GPT-4 is promising but has limited recall and precision. The main problem is that the model is easily distracted by irrelevant code fragments that contain no EIBs.
2. **Improvement Plan**:
   - A cascaded EIB detection system, WitheredLeaf, is introduced. Smaller, code-specific language models (such as CodeBERT and Code Llama) filter out most of the negative cases before GPT-4 is consulted, which significantly improves overall precision and recall.
   - The specific process includes:
     - Using static analysis to identify all code entities.
     - Using CodeBERT to perform a first-pass fill-in task: each entity is masked, the model predicts the missing entity, and failed predictions are recorded.
     - Submitting the suspicious locations to the more powerful Code Llama for in-depth analysis.
     - Finally, submitting the locations that still involve inconsistent entities to GPT-4 for detailed EIB analysis, with novel prompt engineering techniques used to reduce false positives and false negatives.

### Experimental Results

- Across 80 Python and 74 C GitHub repositories, WitheredLeaf discovered 93 new Python bugs and 30 new C bugs, 45% of which can be exploited to disrupt the normal operation of the program.
- 69 fixes were submitted, and 27 of them have been merged.

### Summary

The main contributions of the paper are:

1. **Understanding the Capabilities and Limitations of LLMs in EIB Detection**.
2. **Designing and Implementing the Efficient WitheredLeaf System**.
3. **Discovering New Vulnerabilities and Providing Fixes**.
4. **Constructing a Comprehensive EIB Dataset** to promote future research.

Through these efforts, the paper provides a new and more effective solution for the automatic detection of EIBs.
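
To make the idea of an EIB and the cascaded filtering described above more concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Candidate` type, `toy_fill_mask`, the stage functions, and the sample snippet are all hypothetical, and the model stages are replaced by toy stand-ins so the example runs without any model. Only the entity extraction uses a real tool, the standard-library `ast` module.

```python
import ast
from typing import Callable, List, NamedTuple

class Candidate(NamedTuple):
    """One entity occurrence (hypothetical type for this sketch)."""
    line_no: int
    col: int
    entity: str
    context: str

# Toy program with an EIB: a valid identifier, but the wrong one.
SNIPPET = "\n".join([
    "def area(width, height):",
    "    w = abs(width)",
    "    h = abs(width)   # EIB: `height` was intended",
    "    return w * h",
])
PARAMS = ("width", "height")

def extract_entities(source: str) -> List[Candidate]:
    """Static-analysis step: collect every identifier that is read."""
    lines = source.splitlines()
    found = [
        Candidate(n.lineno, n.col_offset, n.id, lines[n.lineno - 1])
        for n in ast.walk(ast.parse(source))
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
    ]
    return sorted(found)

def toy_fill_mask(c: Candidate) -> str:
    """Stand-in for a masked-entity prediction (NOT a real model).

    Toy rule: in `x = f(arg)`, guess the parameter that starts with `x`.
    """
    if c.entity not in PARAMS or "=" not in c.context:
        return c.entity  # the toy predictor simply agrees
    target = c.context.split("=")[0].strip()
    for p in PARAMS:
        if p.startswith(target):
            return p
    return c.entity

Stage = Callable[[Candidate], bool]  # True means "still suspicious"

def cheap_stage(c: Candidate) -> bool:
    # A failed prediction (model disagrees with the code) keeps the candidate.
    return toy_fill_mask(c) != c.entity

def stronger_stage(c: Candidate) -> bool:
    # Placeholder where a larger code model would re-check the location.
    return True

def final_stage(c: Candidate) -> bool:
    # Placeholder where the most capable LLM would give the final verdict.
    return True

def run_cascade(cands: List[Candidate], stages: List[Stage]) -> List[Candidate]:
    """Apply stages in order; each stage only sees what earlier ones kept."""
    remaining = list(cands)
    for stage in stages:
        remaining = [c for c in remaining if stage(c)]
    return remaining

entities = extract_entities(SNIPPET)
flagged = run_cascade(entities, [cheap_stage, stronger_stage, final_stage])
for c in flagged:
    print(c.line_no, c.entity, "->", toy_fill_mask(c))
```

In this sketch, six entity occurrences enter the cascade and only the `width` on line 3 survives the cheap first stage, so the more expensive later stages would be asked about one location instead of six. That reduction is the motivation the summary gives for placing smaller models in front of GPT-4.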