Abstract: LLM agents have the potential to revolutionize defensive cyber operations, but their offensive capabilities are not yet fully understood. To prepare for emerging threats, model developers and governments are evaluating the cyber capabilities of foundation models. However, these assessments often lack transparency and a comprehensive focus on offensive capabilities. In response, we introduce the Catastrophic Cyber Capabilities Benchmark (3CB), a novel framework designed to rigorously assess the real-world offensive capabilities of LLM agents. Our evaluation of modern LLMs on 3CB reveals that frontier models, such as GPT-4o and Claude 3.5 Sonnet, can perform offensive tasks such as reconnaissance and exploitation across domains ranging from binary analysis to web technologies. Conversely, smaller open-source models exhibit limited offensive capabilities. Our software solution and the corresponding benchmark provide a critical tool to reduce the gap between rapidly improving capabilities and robustness of cyber offense evaluations, aiding in the safer deployment and regulation of these powerful technologies.

### What problem does this paper attempt to solve?

This paper addresses shortcomings in how the cyber-attack capabilities of large language models (LLMs) are assessed. Specifically:

1. **Lack of transparency and comprehensiveness**: Current evaluations of the cyber-attack capabilities of LLMs often lack transparency and do not comprehensively cover offensive capabilities.
2. **Coping with emerging threats**: As LLM capabilities continue to improve, the threat they pose in cyber-attacks also grows. Dealing with these emerging threats requires a systematic framework for assessing the real-world attack capabilities of these models.
3. **Filling research gaps**: Although some research has explored the autonomous cyber-attack capabilities of LLMs, there are still relatively few benchmarks dedicated to them.

To this end, the authors introduce a new framework, the **Catastrophic Cyber Capabilities Benchmark (3CB)**, to rigorously evaluate the real-world offensive cyber capabilities of LLM agents. 3CB provides a comprehensive, transparent, and reproducible evaluation method through a series of challenging tasks covering various technique categories in the MITRE ATT&CK matrix.

### Specific objectives

- **Design and implement the 3CB framework**: This includes an open-source software solution (3CB Harness) and a set of challenging tasks (3CB Challenge Set), which together ensure that the evaluation is reproducible and extensible.
- **Evaluate the performance of frontier LLMs**: Evaluating multiple frontier LLMs reveals how their performance differs across cyber-attack tasks.
- **Identify model weaknesses and directions for improvement**: Comparing models shows which ones perform well on specific tasks and which fall short, providing a basis for subsequent improvement.
- **Promote safe deployment and regulation**: The evaluation results help enterprises and governments better understand the potential risks of LLMs and formulate more effective security strategies and regulations.

### Summary

The core problem this paper addresses is the need for a comprehensive, transparent, and rigorous framework for evaluating the offensive cybersecurity capabilities of LLMs, in particular to manage the risk of malicious use.
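The model comparison described in the objectives can be pictured as aggregating per-run outcomes into success rates per model and task category. The sketch below is only illustrative: the record layout, the model names, the challenge names, and the `success_rates` helper are all assumptions made for this example. They are not the 3CB Harness API, and the values are placeholders rather than results from the paper.

```python
from collections import defaultdict

# Hypothetical run records: (model, category, challenge, solved).
# Placeholder values for illustration only; not results from the paper.
runs = [
    ("model-a", "reconnaissance", "challenge-1", True),
    ("model-a", "reconnaissance", "challenge-1", False),
    ("model-a", "exploitation", "challenge-2", True),
    ("model-b", "reconnaissance", "challenge-1", False),
    ("model-b", "exploitation", "challenge-2", False),
]


def success_rates(records):
    """Return {(model, category): fraction of runs that were solved}."""
    solved = defaultdict(int)
    total = defaultdict(int)
    for model, category, _challenge, ok in records:
        key = (model, category)
        total[key] += 1
        solved[key] += int(ok)
    return {key: solved[key] / total[key] for key in total}


rates = success_rates(runs)
for (model, category), rate in sorted(rates.items()):
    print(f"{model:8s} {category:15s} {rate:.2f}")
```

Grouping by category rather than by individual challenge is one possible design choice; it makes it easier to see where a model is comparatively strong or weak, at the cost of hiding per-challenge variance.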

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
