Abstract: Generative AI agents, software systems powered by Large Language Models (LLMs), are emerging as a promising approach to automate cybersecurity tasks. Among these tasks, penetration testing is particularly challenging because of its complexity and the diverse strategies needed to simulate cyber-attacks. Despite growing interest and initial studies in automating penetration testing with generative agents, a comprehensive and standard framework for their evaluation and development is still missing. This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing. We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack. Tasks are organised by increasing difficulty and include both in-vitro and real-world scenarios. We assess agent performance with generic and specific milestones that allow us to compare results in a standardised manner and understand the limits of the agent under test. We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous agent and a semi-autonomous agent supporting human interaction. We compare their performance and limitations. For example, the fully autonomous agent performs unsatisfactorily, achieving a 21% Success Rate (SR) across the benchmark, solving 27% of the simple tasks and only one real-world task. In contrast, the assisted agent demonstrates substantial improvements, with an SR of 64%. AutoPenBench also allows us to observe how different LLMs, such as GPT-4o or OpenAI o1, affect the ability of the agents to complete the tasks. We believe that our benchmark fills the gap with a standard and flexible framework to compare penetration testing agents on common ground.
We hope to extend AutoPenBench together with the research community by making it available at https://github.com/lucagioacchini/auto-pen-bench.

HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing

Hackphyr: A Local Fine-Tuned LLM Agent for Network Security Environments

Hacking, The Lazy Way: LLM Augmented Pentesting

Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements

Hacking CTFs with Plain Agents

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks

h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment

LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild

PentestAgent: Incorporating LLM Agents to Automated Penetration Testing

CIPHER: Cybersecurity Intelligent Penetration-Testing Helper for Ethical Researcher

AutoPenBench: Benchmarking Generative Agents for Penetration Testing

Getting pwn'd by AI: Penetration Testing with Large Language Models

RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent

SoK: Prompt Hacking of Large Language Models