What problem does this paper attempt to address?

The paper asks how far a simple large language model (LLM) agent design can push performance on InterCode-CTF, a high-school-level hacking benchmark. Specifically, it aims to show that an LLM agent can match or surpass human high-school students at solving CTF challenges by improving the prompting strategy, expanding tool use, and allowing multiple attempts.

### Main problems of the paper:

1. **Improving LLM performance on CTF challenges**:
   - LLMs in previous studies performed poorly on the InterCode-CTF benchmark. For example, Phuong et al. (2024) achieved only a 29% success rate, while Abramovich et al. (2024) reached 72% through complex engineering.
   - This study aims to raise performance substantially with simpler methods (different prompting strategies, an expanded toolset, and multiple attempts), ultimately reaching a 95% success rate.
2. **Verifying the cybersecurity capabilities of LLMs**:
   - Cybersecurity is one of the key areas of AI risk. Advanced LLMs may be able to attack real-world systems quickly, so quantifying their cybersecurity capabilities is important.
   - Through experiments on InterCode-CTF, the paper sets out to show that current LLMs already exceed high-school-level cybersecurity capabilities, and that a simple agent design is enough to elicit them.

### Main methods:

- **ReAct&Plan prompting strategy**: A strategy that combines planning and execution. In each round, the LLM generates an action plan from the task description and the observations so far, and solves the problem step by step.
- **Multiple attempts**: The agent may attempt the same task several times, which reduces the influence of randomness in any single attempt.
- **Expanding the toolset**: Commonly used tools and Python packages are pre-installed to help the LLM carry out complex tasks.
- **Structured output**: The commands the LLM generates and the flags it submits are constrained to the correct format.

### Experimental results:

- The agent achieved a 95% success rate on the InterCode-CTF benchmark, significantly exceeding previous results.
- This shows that, with appropriate prompting strategies and tool support, LLMs can perform very well on high-school-level CTF challenges and can even surpass human high-school students.

### Conclusions:

- The study shows that the cybersecurity capabilities of current LLMs have been underestimated, and that a simple agent design can greatly improve their performance.
- As LLMs progress, future research will need harder benchmarks (such as NYU-CTF, 3CB, and HackTheBox) to evaluate AI risks and capabilities.
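The multiple-attempts and structured-output ideas listed under the main methods can be illustrated with a short sketch. This is not the paper's implementation: the function names (`solve_with_attempts`, `is_well_formed_flag`, `success_within_k`) are hypothetical, the `picoCTF{...}` flag pattern is an assumption made for illustration, and the probability helper assumes attempts are independent, which is an idealization.

```python
import re
from typing import Callable, Optional

# Assumed flag format, for illustration only; the summary does not specify one.
FLAG_PATTERN = re.compile(r"picoCTF\{[^{}]+\}")


def is_well_formed_flag(candidate: str) -> bool:
    """Return True if the candidate matches the assumed flag format."""
    return FLAG_PATTERN.fullmatch(candidate.strip()) is not None


def solve_with_attempts(
    run_attempt: Callable[[int], Optional[str]],
    check_flag: Callable[[str], bool],
    max_attempts: int,
) -> Optional[int]:
    """Run up to max_attempts attempts on one task.

    run_attempt is a hypothetical stand-in for one full agent run; it
    returns a candidate flag or None. Returns the 1-based index of the
    first successful attempt, or None if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = run_attempt(attempt)
        # Skip malformed output instead of submitting it.
        if candidate is None or not is_well_formed_flag(candidate):
            continue
        if check_flag(candidate):
            return attempt
    return None


def success_within_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds,
    given a per-attempt success probability p."""
    return 1.0 - (1.0 - p) ** k
```

Under the independence assumption, a task solved half the time in a single attempt would be solved within three attempts with probability 1 - 0.5^3 = 0.875, which is why allowing several attempts can raise a benchmark score without changing the underlying model.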

Hacking CTFs with Plain Agents

HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing

Teams of LLM Agents can Exploit Zero-Day Vulnerabilities

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Hackphyr: A Local Fine-Tuned LLM Agent for Network Security Environments

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

EnIGMA: Enhanced Interactive Generative Model Agent for CTF Challenges

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Imprompter: Tricking LLM Agents into Improper Tool Use

Hacking, The Lazy Way: LLM Augmented Pentesting

An Empirical Evaluation of LLMs for Solving Offensive Security Challenges

PentestAgent: Incorporating LLM Agents to Automated Penetration Testing

RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent

LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild

LLM Agents can Autonomously Exploit One-day Vulnerabilities

The Best Defense is a Good Offense: Countering LLM-Powered Cyberattacks

Battle Ground: Data Collection and Labeling of CTF Games to Understand Human Cyber Operators

Evil Geniuses: Delving into the Safety of LLM-based Agents

LLM Agents can Autonomously Hack Websites

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents