eyeballvul: a future-proof benchmark for vulnerability detection in the wild

Timothee Chauvin
2024-07-13
Abstract: Long contexts of recent LLMs have enabled a new use case: asking models to find security vulnerabilities in entire codebases. To evaluate model performance on this task, we introduce eyeballvul: a benchmark designed to test the vulnerability detection capabilities of language models at scale, that is sourced and updated weekly from the stream of published vulnerabilities in open-source repositories. The benchmark consists of a list of revisions in different repositories, each associated with the list of known vulnerabilities present at that revision. An LLM-based scorer is used to compare the list of possible vulnerabilities returned by a model to the list of known vulnerabilities for each revision. As of July 2024, eyeballvul contains 24,000+ vulnerabilities across 6,000+ revisions and 5,000+ repositories, and is around 55GB in size.
Cryptography and Security, Artificial Intelligence, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses how to evaluate the ability of large language models (LLMs) to detect security vulnerabilities in entire source-code repositories. It introduces **eyeballvul**, a benchmark for measuring how well LLMs detect security vulnerabilities in real-world code. The main objectives and contributions are:

1. **Fill the existing gap**: There are currently no benchmarks or datasets designed specifically for LLM vulnerability detection across an entire codebase. eyeballvul fills this gap with a large-scale, diverse benchmark.
2. **Real-world vulnerabilities**: eyeballvul draws its data from the stream of known vulnerabilities (CVEs) published for open-source repositories, so the data is authentic and current. Weekly updates keep the benchmark in step with newly published vulnerabilities.
3. **Scale and diversity**: As of July 2024, eyeballvul contains more than 24,000 vulnerabilities across more than 6,000 revisions and more than 5,000 repositories, with a total size of approximately 55GB. It is not limited to specific programming languages and covers a wide range of projects.
4. **Evaluation method**: The possible vulnerabilities reported by a model are compared with the known vulnerabilities for each revision, and metrics such as precision and recall are computed from the matches. An LLM-based scorer automates this comparison.
5. **Future-proof**: The benchmark is updated regularly, which helps avoid training-data contamination and keeps it useful over the long term.

### Summary

The main purpose of this paper is to establish a large-scale, real-world benchmark (eyeballvul) for evaluating how well LLMs detect security vulnerabilities in source-code repositories. This fills a gap in existing evaluation tools and provides a foundation for future research and applications.
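The evaluation method described above (comparing a model's reported vulnerabilities with the known vulnerabilities of a revision, then computing precision and recall) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `Revision` layout, the `precision_recall` function, and the `substring_scorer` stand-in, which replaces the LLM-based scorer with a trivial substring match. How duplicate reports of the same vulnerability are counted is also a simplification.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Revision:
    """Hypothetical benchmark item: one repository revision and the
    identifiers of the vulnerabilities known to be present in it."""
    repo_url: str
    commit: str
    known_vulns: Tuple[str, ...]


# Hypothetical scorer interface: given one model-reported lead and the known
# vulnerabilities, return the matching identifier, or None if nothing matches.
# In the paper this role is played by an LLM; here it is any callable.
Scorer = Callable[[str, Sequence[str]], Optional[str]]


def precision_recall(
    revision: Revision, leads: Sequence[str], scorer: Scorer
) -> Tuple[float, float]:
    """Compute precision and recall for one revision.

    Precision: fraction of reported leads that match a known vulnerability.
    Recall: fraction of known vulnerabilities matched by at least one lead.
    """
    matched = set()
    true_positives = 0
    for lead in leads:
        hit = scorer(lead, revision.known_vulns)
        if hit is not None:
            true_positives += 1
            matched.add(hit)
    precision = true_positives / len(leads) if leads else 0.0
    recall = (
        len(matched) / len(revision.known_vulns) if revision.known_vulns else 0.0
    )
    return precision, recall


def substring_scorer(lead: str, known: Sequence[str]) -> Optional[str]:
    """Toy stand-in for the LLM-based scorer: a lead matches a known
    vulnerability if it mentions that vulnerability's identifier."""
    for vuln_id in known:
        if vuln_id in lead:
            return vuln_id
    return None


if __name__ == "__main__":
    # Placeholder data, not drawn from the benchmark.
    rev = Revision(
        repo_url="https://example.com/some/repo",
        commit="0000000",
        known_vulns=("VULN-A", "VULN-B"),
    )
    leads = ["possible SQL injection (VULN-A)", "unrelated buffer overflow"]
    print(precision_recall(rev, leads, substring_scorer))
```

Passing the scorer as a callable keeps the metric computation separate from the matching step, so the toy matcher could be swapped for a model-backed one without changing `precision_recall`.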