eyeballvul: a future-proof benchmark for vulnerability detection in the wild

Timothee Chauvin

2024-07-13

Abstract: Long contexts of recent LLMs have enabled a new use case: asking models to find security vulnerabilities in entire codebases. To evaluate model performance on this task, we introduce eyeballvul: a benchmark designed to test the vulnerability detection capabilities of language models at scale, that is sourced and updated weekly from the stream of published vulnerabilities in open-source repositories. The benchmark consists of a list of revisions in different repositories, each associated with the list of known vulnerabilities present at that revision. An LLM-based scorer is used to compare the list of possible vulnerabilities returned by a model to the list of known vulnerabilities for each revision. As of July 2024, eyeballvul contains 24,000+ vulnerabilities across 6,000+ revisions and 5,000+ repositories, and is around 55GB in size.

Cryptography and Security, Artificial Intelligence, Machine Learning

### What problem does this paper attempt to solve?

This paper addresses how to evaluate the ability of large language models (LLMs) to detect security vulnerabilities in large-scale source code repositories. It introduces a benchmark named **eyeballvul** for evaluating LLM performance at detecting real-world vulnerabilities. Its main objectives and contributions are:

1. **Fill the existing gap**: No existing benchmark or dataset specifically targets LLM-based vulnerability detection over entire codebases. eyeballvul fills this gap with a large-scale, diverse benchmark.
2. **Real-world vulnerabilities**: eyeballvul is sourced from the stream of published vulnerabilities (CVEs) in open-source repositories, so its entries are real and current. Weekly updates keep the benchmark in step with newly published vulnerabilities.
3. **Large scale and diversity**: As of July 2024, eyeballvul contains more than 24,000 vulnerabilities across more than 6,000 revisions and more than 5,000 repositories, with a total size of approximately 55GB. It is not limited to specific programming languages and covers a wide range of projects.
4. **Evaluation method**: The possible vulnerabilities reported by a model are compared with the known vulnerabilities for each revision, yielding metrics such as precision and recall. An LLM-based scorer automates this comparison.
5. **Future-proof**: The benchmark is designed to be extended over time. Regular updates help avoid training-data contamination and keep it useful in the long term.

### Summary

The main purpose of this paper is to establish a large-scale, real-world benchmark (eyeballvul) for evaluating the performance of LLMs in detecting security vulnerabilities in source code repositories. This fills a gap in existing evaluation tools and provides a foundation for future research and applications.
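The precision and recall metrics mentioned above can be illustrated with a minimal sketch. It assumes the LLM-based scorer has already mapped each model-reported lead to either a known vulnerability ID or nothing; the `RevisionResult` structure, the field names, the placeholder IDs, and the micro-averaging across revisions are all hypothetical choices made for illustration, not the paper's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RevisionResult:
    """Scoring outcome for one revision (hypothetical structure)."""
    # IDs of the vulnerabilities known to be present at this revision.
    known_vulns: frozenset[str]
    # One entry per lead reported by the model: the ID the scorer matched
    # it to, or None when the scorer found no matching known vulnerability.
    lead_matches: tuple[Optional[str], ...]


def precision_recall(results: list[RevisionResult]) -> tuple[float, float]:
    """Micro-averaged precision and recall over all revisions (an assumption)."""
    matched_leads = 0   # leads that correspond to a known vulnerability
    total_leads = 0     # all leads reported by the model
    found_vulns = 0     # distinct known vulnerabilities that were found
    total_known = 0     # all known vulnerabilities
    for r in results:
        matched = [m for m in r.lead_matches
                   if m is not None and m in r.known_vulns]
        matched_leads += len(matched)
        total_leads += len(r.lead_matches)
        # Several leads may point at the same vulnerability; count it once.
        found_vulns += len(set(matched))
        total_known += len(r.known_vulns)
    precision = matched_leads / total_leads if total_leads else 0.0
    recall = found_vulns / total_known if total_known else 0.0
    return precision, recall


# Made-up example: two revisions with placeholder vulnerability IDs.
example = [
    RevisionResult(frozenset({"VULN-A", "VULN-B"}), ("VULN-A", None, "VULN-A")),
    RevisionResult(frozenset({"VULN-C"}), (None,)),
]
precision, recall = precision_recall(example)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sketch, duplicate leads that match the same vulnerability each count toward precision but only once toward recall. A real scorer may treat duplicates differently.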

