Abstract: We introduce a new benchmark for assessing AI models' capabilities and risks in automated software exploitation, focusing on their ability to detect and exploit vulnerabilities in real-world software systems. Using DARPA's AI Cyber Challenge (AIxCC) framework and the Nginx challenge project, a deliberately modified version of the widely used Nginx web server, we evaluate several leading language models, including OpenAI's o1-preview and o1-mini, Anthropic's Claude-3.5-sonnet-20241022 and Claude-3.5-sonnet-20240620, Google DeepMind's Gemini-1.5-pro, and OpenAI's earlier GPT-4o model. Our findings reveal that these models vary significantly in their success rates and efficiency, with o1-preview achieving the highest success rate of 64.71 percent and o1-mini and Claude-3.5-sonnet-20241022 providing cost-effective but less successful alternatives. This benchmark establishes a foundation for systematically evaluating the AI cyber risk posed by automated exploitation tools.

What problem does this paper attempt to address?

The paper evaluates the capabilities and risks of large language models (LLMs) in automated software vulnerability detection and exploitation. It introduces a new benchmarking framework to measure systematically how well these models discover and exploit vulnerabilities in real-world software systems.

### Main Problems

1. **Evaluating the Automated Vulnerability Exploitation Capability of AI Models**: The paper asks how to measure an AI model's ability to detect and exploit software vulnerabilities automatically, including whether the model can identify and exploit vulnerabilities in actual software systems.
2. **Establishing Evaluation Criteria**: The authors aim to provide a fair and challenging evaluation environment by using DARPA's AI Cyber Challenge (AIxCC) framework and the modified Nginx web server project, which contains 17 carefully designed vulnerabilities.
3. **Exploring the Double-Edged Sword Effect**: LLMs are dual-use in cybersecurity. On the one hand, they can speed up vulnerability detection and repair and so improve security; on the other hand, they can be used for malicious purposes such as automated software attacks. The paper therefore also discusses the potential risks of these models and possible response mechanisms.

### Methodology

- **Selecting an Appropriate Test Platform**: The Nginx AIxCC challenge project was chosen as the test platform because it offers a clearly defined scope and a complex real-world application scenario, and because it ensures that the test data is not included in the models' training sets.
- **Iterative Improvement Mechanism**: Through a reflexion loop, the model adjusts its next attempt based on feedback from its previous failed attempts.
- **Multi-Dimensional Evaluation**: Model performance is evaluated from several perspectives, including success rate, cost-efficiency, and adaptability.

### Results

- **Significant Performance Differences**: Success rates differ markedly between models. o1-preview performs best, with a success rate of 64.71%; other models such as Claude-3.5-sonnet-20240620 and Gemini-1.5-pro also show potential, but with lower success rates.
- **Cost-Benefit Analysis**: Some models cost more to run but are more cost-effective overall because they succeed more often, while o1-mini and Claude-3.5-sonnet-20241022 are cheaper but less successful alternatives.

### Significance

This research shows the current potential of LLMs in automated vulnerability exploitation and stresses the need for caution when applying these technologies, so that they do not become security threats.
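The reflexion loop mentioned under Methodology can be illustrated with a minimal sketch. This is not the paper's implementation: `reflexion_loop`, `Attempt`, `toy_propose`, and `toy_check` are hypothetical names, `propose` stands in for a call to a language model, and `check` stands in for the test harness. The toy stubs only guess a number, so the control flow is visible without any real target. The `success_rate` helper shows that the reported 64.71% is consistent with 11 successes out of the 17 vulnerabilities, although this summary does not state that count.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    candidate: str  # what the (hypothetical) model proposed
    feedback: str   # harness message explaining why it failed


def reflexion_loop(
    propose: Callable[[str, List[Attempt]], str],
    check: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int = 5,
) -> Optional[str]:
    """Return the first candidate accepted by `check`, or None.

    `check` returns None on success and an error message on failure;
    every failure is appended to the history passed to the next `propose`.
    """
    history: List[Attempt] = []
    for _ in range(max_attempts):
        candidate = propose(task, history)
        error = check(candidate)
        if error is None:
            return candidate
        history.append(Attempt(candidate, error))
    return None


def success_rate(solved: int, total: int) -> float:
    """Percentage of solved challenges, rounded to two decimals."""
    return round(100.0 * solved / total, 2)


# Toy stand-ins (assumptions for illustration only).
def toy_propose(task: str, history: List[Attempt]) -> str:
    # Uses the number of past failures to pick the next guess.
    return str(len(history) + 1)


def toy_check(candidate: str) -> Optional[str]:
    return None if candidate == "3" else "wrong guess: " + candidate


if __name__ == "__main__":
    print(reflexion_loop(toy_propose, toy_check, "demo"))
    print(success_rate(11, 17))
```

With the toy stubs, the loop succeeds on the third attempt; lowering `max_attempts` to 2 makes it give up and return `None`, which is how a failed challenge would be recorded in such a setup.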

AI Cyber Risk Benchmark: Automated Exploitation Capabilities
