LLbezpeky: Leveraging Large Language Models for Vulnerability Detection

Noble Saji Mathews, Yelizaveta Brus, Yousra Aafer, Meiyappan Nagappan, Shane McIntosh

2024-02-14

Abstract: Despite continued research and progress in building secure systems, Android applications continue to be riddled with vulnerabilities, necessitating effective detection methods. Current strategies involving static and dynamic analysis tools come with limitations, such as an overwhelming number of false positives and a limited scope of analysis, which make either difficult to adopt. Over the past years, machine-learning-based approaches have been extensively explored for vulnerability detection, but their real-world applicability is constrained by data requirements and feature engineering challenges. Large Language Models (LLMs), with their vast parameter counts, have shown tremendous potential in understanding semantics in human as well as programming languages. We examine the efficacy of LLMs for detecting vulnerabilities in the context of Android security. We focus on building an AI-driven workflow to assist developers in identifying and rectifying vulnerabilities. Our experiments show that LLMs outperform our expectations in finding issues within applications, correctly flagging insecure apps in 91.67% of cases in the Ghera benchmark. We use inferences from our experiments towards building a robust and actionable vulnerability detection system and demonstrate its effectiveness. Our experiments also shed light on how various simple configurations can affect the True Positive (TP) and False Positive (FP) rates.

Cryptography and Security, Artificial Intelligence, Software Engineering

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to explore how to utilize large language models (LLMs) to detect vulnerabilities in Android applications and establish an AI-driven workflow to help developers identify and fix these vulnerabilities.

#### Main Objectives

1. **Evaluate the capability of LLMs to detect Android vulnerabilities under basic prompt techniques**:
   - Compare the effectiveness of LLMs with existing tools.
   - Determine which types of vulnerabilities LLMs perform better on.
   - Explore whether fine-tuning the model or embeddings is necessary for better results.
2. **Identify the types of inputs required by the system**:
   - How to provide additional contextual information? Which knowledge bases help in discovering complex vulnerabilities?
   - Can existing solutions or static analysis tools be used in conjunction with LLMs?

#### Experimental Methods

- Use GPT-4 for experiments, with the Ghera benchmark dataset.
- Experiment with different prompt techniques and ways of providing context, including basic prompts, providing vulnerability summaries, and on-demand file content requests.
- Analyze the results of different experiments to optimize prompt engineering and retrieval-augmented generation techniques.

#### Results and Discussion

- Under basic prompt techniques, GPT-4 was able to flag some applications as insecure without detailed vulnerability descriptions.
- Providing brief vulnerability descriptions significantly reduced the mislabeling of secure applications as insecure.
- The on-demand file content request method, while cost-saving, sacrificed some report quality.

#### Future Work

- Explore more structured multi-agent pipelines to improve performance.
- Share information between scanners to reduce resource consumption.
- Conduct empirical studies to compare the results of LLMs with other existing methods.
- Consider integrating static analysis techniques to further enhance detection effectiveness.
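
As a rough illustration of the on-demand file content request idea mentioned under Experimental Methods, the sketch below shows one way such a loop could be structured: the model first sees only the file list (plus an optional vulnerability summary) and must ask for each file it wants to inspect before giving a verdict. This is not the authors' implementation. The `REQUEST_FILE:` / `VERDICT:` reply convention, the `scan_app` function, and the `stub_model` stand-in (used here in place of a real GPT-4 call) are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical reply convention, invented for this sketch.
REQUEST_PREFIX = "REQUEST_FILE:"
VERDICT_PREFIX = "VERDICT:"


def scan_app(
    files: Dict[str, str],
    ask_model: Callable[[List[str]], str],
    summary: str = "",
    max_turns: int = 5,
) -> dict:
    """Run an on-demand scan: serve file contents only when the model asks."""
    # The model initially sees file names only, which keeps the prompt small.
    transcript = ["Files in app: " + ", ".join(sorted(files))]
    if summary:
        transcript.append("Vulnerability summary: " + summary)

    served: List[str] = []
    for _ in range(max_turns):
        reply = ask_model(transcript).strip()
        if reply.startswith(REQUEST_PREFIX):
            path = reply[len(REQUEST_PREFIX):].strip()
            served.append(path)
            content = files.get(path, "<no such file>")
            transcript.append(f"Content of {path}:\n{content}")
        elif reply.startswith(VERDICT_PREFIX):
            verdict = reply[len(VERDICT_PREFIX):].strip()
            return {"verdict": verdict, "files_served": served}
        else:
            # Nudge the model back onto the expected reply format.
            transcript.append(
                "Reply with REQUEST_FILE:<path> or VERDICT:<secure|insecure>"
            )
    return {"verdict": "undecided", "files_served": served}


def stub_model(transcript: List[str]) -> str:
    """Stand-in for an LLM call: reads the manifest, then flags exported components."""
    joined = "\n".join(transcript)
    if "Content of AndroidManifest.xml" not in joined:
        return "REQUEST_FILE: AndroidManifest.xml"
    if 'android:exported="true"' in joined:
        return "VERDICT: insecure"
    return "VERDICT: secure"


demo_files = {
    "AndroidManifest.xml": '<activity android:name=".Main" android:exported="true"/>',
    "MainActivity.java": "class MainActivity {}",
}
result = scan_app(demo_files, stub_model)
print(result)
```

In this toy run only the manifest is ever sent to the model, which reflects the cost-saving side of the trade-off noted in the results. A real model may also decide without requesting a file that matters, which is one plausible reason report quality could drop.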
