Abstract: Recently, Automated Vulnerability Localization (AVL) has attracted much attention, aiming to facilitate diagnosis by pinpointing the lines of code responsible for discovered vulnerabilities. Large Language Models (LLMs) have shown potential in various domains, yet their effectiveness in vulnerability localization remains underexplored. In this work, we perform the first comprehensive study of LLMs for AVL. Our investigation encompasses 10+ leading LLMs suitable for code analysis, including ChatGPT and various open-source models, across three architectural types: encoder-only, encoder-decoder, and decoder-only, with model sizes ranging from 60M to 16B parameters. We explore the efficacy of these LLMs using 4 distinct paradigms: zero-shot learning, one-shot learning, discriminative fine-tuning, and generative fine-tuning. Our evaluation framework is applied to the BigVul-based dataset for C/C++, and an additional dataset comprising smart contract vulnerabilities. The results demonstrate that discriminative fine-tuning of LLMs can significantly outperform existing learning-based methods for AVL, while other paradigms prove less effective or unexpectedly ineffective for the task. We also identify challenges related to input length and unidirectional context in fine-tuning processes for encoders and decoders. We then introduce two remedial strategies: the sliding window and the right-forward embedding, both of which substantially enhance performance. Furthermore, our findings highlight certain generalization capabilities of LLMs across Common Weakness Enumerations (CWEs) and different projects, indicating a promising pathway toward their practical application in vulnerability localization.

What problem does this paper attempt to address?

This paper asks how effective large language models (LLMs) are at Automated Vulnerability Localization (AVL): pinpointing the lines of code responsible for a vulnerability so that developers can diagnose and fix security issues more efficiently. Existing vulnerability detection methods often suffer from high false-positive rates and rarely indicate where in the code a vulnerability lies, which delays remediation. The paper therefore evaluates different types of LLMs on the localization task, explores their potential, and proposes strategies to improve their performance.

Through comparative experiments, the paper evaluates more than 10 leading LLMs covering three architecture types (encoder-only, encoder-decoder, and decoder-only) and parameter scales from 60M to 16B. It examines zero-shot and one-shot learning as well as discriminative and generative fine-tuning. It also analyzes how well the models generalize across vulnerability types (classified by Common Weakness Enumeration) and across projects, and it proposes two strategies to address input-length limits and unidirectional context: the sliding window and the right-forward embedding.

In summary, this paper aims to fill the research gap on applying LLMs to automated vulnerability localization. Through comprehensive experimental analysis, it reveals the potential and challenges of LLMs in this task and provides possible solutions.
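
To make the sliding-window idea concrete, the following is a minimal sketch of how an input longer than a model's context limit could be scored in overlapping chunks. It is not the paper's implementation: the function names, the `window` and `stride` parameters, the per-token scoring callback, and the choice to average scores where windows overlap are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def sliding_windows(num_tokens: int, window: int, stride: int) -> List[range]:
    """Return overlapping index ranges that together cover [0, num_tokens)."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    if num_tokens == 0:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + window, num_tokens)
        windows.append(range(start, end))
        if end >= num_tokens:
            break
        start += stride
    return windows


def score_long_input(
    tokens: Sequence[str],
    score_window: Callable[[Sequence[str]], Sequence[float]],
    window: int,
    stride: int,
) -> List[float]:
    """Score every token by running a hypothetical per-token scorer on each
    window, then averaging the scores of tokens seen in several windows.
    Averaging is an assumption of this sketch, not a detail from the paper."""
    totals = [0.0] * len(tokens)
    counts = [0] * len(tokens)
    for idx in sliding_windows(len(tokens), window, stride):
        scores = score_window([tokens[i] for i in idx])
        for i, s in zip(idx, scores):
            totals[i] += s
            counts[i] += 1
    # Every token is covered by at least one window, so counts are non-zero.
    return [t / c for t, c in zip(totals, counts)]


def toy_scorer(chunk: Sequence[str]) -> List[float]:
    """Stand-in for a fine-tuned model: flags one suspicious call name."""
    return [1.0 if tok == "strcpy" else 0.0 for tok in chunk]


tokens = ["void", "f", "(", ")", "{", "strcpy", "(", "a", ",", "b"]
scores = score_long_input(tokens, toy_scorer, window=4, stride=2)
```

In practice the scorer would be a fine-tuned model and the per-token scores would be mapped back to source lines; the toy scorer here only stands in for that step so the windowing logic can be run on its own.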

An Empirical Study of Automated Vulnerability Localization with Large Language Models

How Far Have We Gone in Vulnerability Detection Using Large Language Models

VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching

Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models

Multitask-based Evaluation of Open-Source LLM on Software Vulnerability

Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection

Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study

Large Language Model for Vulnerability Detection: Emerging Results and Future Directions

Attention Is All You Need for LLM-based Code Vulnerability Localization

Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study

VTT-LLM: Advancing Vulnerability-to-Tactic-and-Technique Mapping through Fine-Tuning of Large Language Model

Large Language Model for Vulnerability Detection and Repair: Literature Review and the Road Ahead

ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data

Vul-LMGNNs: Fusing Language Models and Online-Distilled Graph Neural Networks for Code Vulnerability Detection

Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models

LLbezpeky: Leveraging Large Language Models for Vulnerability Detection

Software Vulnerability and Functionality Assessment using LLMs

LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning

RealVul: Can We Detect Vulnerabilities in Web Applications with LLM?