Abstract: Originating from semantic bugs, Entity-Inconsistency Bugs (EIBs) involve misuse of syntactically valid yet incorrect program entities, such as variable identifiers and function names, which often have security implications. Unlike straightforward syntactic vulnerabilities, EIBs are subtle and can remain undetected for years. Traditional detection methods, such as static analysis and dynamic testing, often fall short due to the versatile and context-dependent nature of EIBs. However, with advancements in Large Language Models (LLMs) like GPT-4, we believe LLM-powered automatic EIB detection becomes increasingly feasible through these models' semantics understanding abilities. This research first undertakes a systematic measurement of LLMs' capabilities in detecting EIBs, revealing that GPT-4, while promising, shows limited recall and precision that hinder its practical application. The primary problem lies in the model's tendency to focus on irrelevant code snippets devoid of EIBs. To address this, we introduce a novel, cascaded EIB detection system named WitheredLeaf, which leverages smaller, code-specific language models to filter out most negative cases and mitigate the problem, thereby significantly enhancing the overall precision and recall. We evaluated WitheredLeaf on 154 Python and C GitHub repositories, each with over 1,000 stars, identifying 123 new flaws, 45% of which can be exploited to disrupt the program's normal operations. Out of 69 submitted fixes, 27 have been successfully merged.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the **detection of Entity-Inconsistency Bugs (EIBs)**. EIBs are misuses of code entities (such as variable names and function names) that are syntactically correct but semantically wrong. These errors often have security implications and may lead to problems such as denial-of-service and compromised control-flow and data-flow integrity.

### Specific Challenges of the Problem

1. **Limitations of Traditional Detection Methods**:
   - **Static Analysis**: It struggles to capture EIBs, which are highly context-dependent and varied.
   - **Dynamic Testing** (such as fuzz testing): Although it can discover some errors, its low code coverage means it easily misses most EIBs.
   - **Memory Detection Tools** (such as AddressSanitizer): They cannot effectively capture EIBs that cause only logical errors without memory corruption.
2. **Ineffectiveness of Existing Methods**:
   - EIBs may remain undetected for many years. For example, the bug in Figure 1 of the paper existed in a popular GitHub repository for about 7 years before being discovered.

### The Method Proposed in the Paper

To address these problems, the paper proposes a new system named **WitheredLeaf**, which uses the semantic understanding capabilities of large language models (LLMs) such as GPT-4 to detect EIBs. Specifically:

1. **Preliminary Measurement Study**:
   - The study first systematically measures the ability of LLMs (especially GPT-4) to detect EIBs. The results show that although GPT-4 has potential, its recall and precision are limited. The main problem is that the model is easily distracted by irrelevant code fragments that contain no EIBs.
2. **Improvement Plan**:
   - A cascaded EIB detection system, WitheredLeaf, is introduced. By using smaller, code-specific language models (such as CodeBERT and Code Llama) to filter out most of the negative examples, the overall precision and recall are significantly improved.
   - The specific process includes:
     - Using static analysis to identify all code entities.
     - Using CodeBERT to perform a first-pass fill-in task: predicting each masked code entity and recording the locations where the prediction fails.
     - Submitting the suspicious locations to the more powerful Code Llama for in-depth analysis.
     - Finally, sending the locations that still involve inconsistent entities to GPT-4 for detailed EIB analysis, and reducing false positives and false negatives through novel prompt engineering techniques.

### Experimental Results

- In 80 Python and 74 C GitHub repositories, WitheredLeaf discovered 93 new Python bugs and 30 new C bugs, 45% of which can be exploited to disrupt the normal operation of the program.
- 69 fix requests were submitted, and 27 of them have been merged.

### Summary

The main contributions of the paper are:

1. **Understanding the Capabilities and Limitations of LLMs in EIB Detection**.
2. **Designing and Implementing the Efficient WitheredLeaf System**.
3. **Discovering New Vulnerabilities and Providing Fixes**.
4. **Constructing a Comprehensive EIB Dataset** to promote future research.

Through these efforts, the paper provides a new and more effective solution for the automatic detection of EIBs.
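
### Illustrative Sketch of the Cascade

The cascaded filtering described under "The Method Proposed in the Paper" can be illustrated with a small Python sketch. This is not WitheredLeaf's implementation: Python's `tokenize` module stands in for the static-analysis step, and the two filtering models are replaced by canned predictions, so no real CodeBERT, Code Llama, or GPT-4 call is made. All names here (`Entity`, `find_entities`, `canned_model`, `mismatch_stage`, `CountingReviewer`, `run_cascade`) are hypothetical and chosen only for illustration.

```python
import io
import keyword
import tokenize
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Entity:
    """One identifier occurrence in the source (hypothetical type)."""
    name: str
    line: int
    col: int


def find_entities(source: str) -> List[Entity]:
    """Stand-in for the static-analysis step: list every non-keyword identifier."""
    entities = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            entities.append(Entity(tok.string, tok.start[0], tok.start[1]))
    return entities


# A predictor guesses which name belongs at an entity's position.
Predictor = Callable[[str, Entity], str]
# A stage returns True while an entity still looks suspicious.
Stage = Callable[[str, Entity], bool]


def canned_model(guesses: Dict[Tuple[int, int], str]) -> Predictor:
    """Fake model (assumption): positions not listed are 'predicted' correctly."""
    def predict(source: str, entity: Entity) -> str:
        return guesses.get((entity.line, entity.col), entity.name)
    return predict


def mismatch_stage(predict: Predictor) -> Stage:
    """Flag an entity when the model's guess differs from what is written."""
    return lambda source, entity: predict(source, entity) != entity.name


class CountingReviewer:
    """Placeholder for the expensive final review; it only counts its calls."""
    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, source: str, entity: Entity) -> bool:
        self.calls += 1
        return True  # a real reviewer would analyse the surrounding code


def run_cascade(source: str, stages: List[Stage]) -> List[Entity]:
    """Pass the surviving suspects through each stage, cheapest stage first."""
    suspects = find_entities(source)
    for stage in stages:
        suspects = [e for e in suspects if stage(source, e)]
    return suspects


# Toy EIB: the second `width` is syntactically valid but should be `height`.
SNIPPET = "def area(width, height):\n    return width * width\n"

# Canned guesses keyed by (line, column). The small model also misfires on
# `area`; the larger model agrees with the code there and clears it.
small_model = canned_model({(1, 4): "compute", (2, 19): "height"})
larger_model = canned_model({(2, 19): "height"})
reviewer = CountingReviewer()

suspects = run_cascade(
    SNIPPET,
    [mismatch_stage(small_model), mismatch_stage(larger_model), reviewer],
)
print(suspects, reviewer.calls)
```

In this toy run, five identifier occurrences enter the cascade, the first stage leaves two suspects, the second leaves one, and the placeholder for the expensive reviewer is invoked only once. That narrowing is the point of the cascade: the costliest model sees only the locations the cheaper models could not clear.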

WitheredLeaf: Finding Entity-Inconsistency Bugs with LLMs