Abstract: Generative AI agents, software systems powered by Large Language Models (LLMs), are emerging as a promising approach to automate cybersecurity tasks. Among these tasks, penetration testing is especially challenging because of its complexity and the diverse strategies needed to simulate cyber-attacks. Despite growing interest and initial studies in automating penetration testing with generative agents, there remains a significant gap: no comprehensive, standard framework exists for their evaluation and development. This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing. We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack. Tasks are of increasing difficulty and include both in-vitro and real-world scenarios. We assess agent performance with generic and specific milestones that allow us to compare results in a standardised manner and understand the limits of the agent under test. We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous agent and a semi-autonomous agent that supports human interaction. We compare their performance and limitations. For example, the fully autonomous agent performs unsatisfactorily, achieving a 21% Success Rate (SR) across the benchmark, solving 27% of the simple tasks and only one real-world task. In contrast, the assisted agent demonstrates substantial improvements, with an SR of 64%. AutoPenBench also allows us to observe how different LLMs, such as GPT-4o or OpenAI o1, impact the ability of the agents to complete the tasks. We believe that our benchmark fills the gap with a standard and flexible framework to compare penetration testing agents on a common ground.
We hope to extend AutoPenBench along with the research community by making it available at https://github.com/lucagioacchini/auto-pen-bench.

What problem does this paper attempt to address?

The problem this paper addresses is the lack of a comprehensive, standardized evaluation framework for automated penetration testing. Although generative AI agents show great potential for automating cybersecurity tasks, penetration testing is a complex field with diverse attack strategies, and existing work still lacks unified standards for evaluating, comparing, and developing these agents. To close this gap, the authors introduce **AutoPenBench**, an open benchmark for evaluating the performance of generative agents in automated penetration testing. Specifically, AutoPenBench addresses the gap in the following ways:

1. **Providing a diverse set of tasks**: It contains 33 tasks at different difficulty levels, each representing a vulnerable system to be attacked. The tasks include both in-vitro and real-world scenarios, so the benchmark covers both breadth and practical relevance.
2. **Defining generic and specific milestones**: To measure agent performance objectively, the authors define a set of generic and specific milestones. These milestones allow anyone to compare results in a standardized way and to understand the limitations of the agent under test.
3. **Supporting multiple types of agent architectures**: By comparing a fully autonomous agent with a semi-autonomous agent that supports human interaction, the paper shows the strengths and weaknesses of each architecture. For example, the fully autonomous agent reaches a success rate of 21% across all tasks, while the human-assisted agent reaches 64%.
4. **Evaluating the impact of different large language models (LLMs)**: The paper studies how different LLMs (such as GPT-4o, Gemini Flash, or OpenAI o1) affect the agents' ability to complete tasks, revealing how the inherent randomness of LLMs affects the reliability of penetration-testing tasks.

In conclusion, by constructing AutoPenBench, the paper provides a standardized and flexible framework that lets researchers compare different penetration-testing agents on a common basis, thereby supporting further development of the field.
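The two kinds of measurement described above, an overall success rate and per-task milestone progress, can be illustrated with a small sketch. This is not AutoPenBench's actual code or API: the `TaskResult` record, the function names, and the four sample results are hypothetical, chosen only to show how such metrics might be computed from per-task outcomes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    """Hypothetical record of one agent run on one benchmark task."""
    task_id: str
    category: str            # e.g. "in-vitro" or "real-world"
    milestones_reached: int  # milestones the agent achieved
    milestones_total: int    # milestones defined for the task
    solved: bool             # True if the agent completed the task


def success_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks solved; 0.0 for an empty result list."""
    if not results:
        return 0.0
    return sum(r.solved for r in results) / len(results)


def progress_rate(result: TaskResult) -> float:
    """Fraction of a task's milestones that the agent reached."""
    if result.milestones_total == 0:
        return 0.0
    return result.milestones_reached / result.milestones_total


def success_rate_by_category(results: List[TaskResult]) -> Dict[str, float]:
    """Success rate computed separately for each task category."""
    grouped: Dict[str, List[TaskResult]] = {}
    for r in results:
        grouped.setdefault(r.category, []).append(r)
    return {cat: success_rate(rs) for cat, rs in grouped.items()}


# Illustrative data only; these are not results from the paper.
RESULTS = [
    TaskResult("task-a", "in-vitro", 8, 8, True),
    TaskResult("task-b", "in-vitro", 3, 6, False),
    TaskResult("task-c", "real-world", 2, 8, False),
    TaskResult("task-d", "real-world", 5, 5, True),
]

if __name__ == "__main__":
    print(f"overall SR: {success_rate(RESULTS):.0%}")
    print("SR by category:", success_rate_by_category(RESULTS))
    for r in RESULTS:
        print(f"{r.task_id}: progress {progress_rate(r):.0%}")
```

Tracking milestone progress alongside the binary solved flag is what lets a benchmark of this kind show how far an agent got on a task it failed, rather than only whether it succeeded.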

AutoPenBench: Benchmarking Generative Agents for Penetration Testing

Generative AI for pentesting: the good, the bad, the ugly

Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements

PentestAgent: Incorporating LLM Agents to Automated Penetration Testing

PentestGPT: An LLM-empowered Automatic Penetration Testing Tool

AI Cyber Risk Benchmark: Automated Exploitation Capabilities

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Getting pwn'd by AI: Penetration Testing with Large Language Models

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction

A Preliminary Study on Using Large Language Models in Software Pentesting

HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing

DynPen: Automated Penetration Testing in Dynamic Network Scenarios Using Deep Reinforcement Learning

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

An Intelligent Penetration Testing Method Using Human Feedback

GTA: A Benchmark for General Tool Agents

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

The Current Status of Laparoscopic Aortic Aneurysm Repair

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark