Abstract: With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding, safety concerns, such as generating or executing risky code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations of the safety of code agents, we propose RedCode, a benchmark for risky code execution and generation: (1) RedCode-Exec provides challenging prompts that could lead to risky code execution, aiming to evaluate code agents' ability to recognize and handle unsafe code. We provide a total of 4,050 risky test cases in Python and Bash tasks with diverse input formats including code snippets and natural text. They cover 25 types of critical vulnerabilities spanning 8 domains (e.g., websites, file systems). We provide Docker environments and design corresponding evaluation metrics to assess their execution results. (2) RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code or software. Our empirical findings, derived from evaluating three agent frameworks based on 19 LLMs, provide insights into code agents' vulnerabilities. For instance, evaluations on RedCode-Exec show that agents are more likely to reject executing risky operations on the operating system, but are less likely to reject executing technically buggy code, indicating high risks. Risky operations described in natural text lead to a lower rejection rate than those in code format. Additionally, evaluations on RedCode-Gen show that more capable base models and agents with stronger overall coding abilities, such as GPT4, tend to produce more sophisticated and effective harmful software. Our findings highlight the need for stringent safety evaluations for diverse code agents.
Our dataset and code are available at https://github.com/AI-secure/RedCode.
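
The rejection-rate comparison between input formats can be illustrated with a small tally. This is a minimal sketch, not RedCode's evaluation code: the record fields (`input_format`, `outcome`), the outcome labels, and the sample data are all assumptions invented for illustration.

```python
from collections import Counter, defaultdict


def rejection_rate_by_format(records):
    """Return {input_format: fraction of test cases the agent rejected}."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["input_format"]][rec["outcome"]] += 1
    return {
        fmt: outcomes["rejected"] / sum(outcomes.values())
        for fmt, outcomes in counts.items()
    }


# Invented sample outcomes, for illustration only (not results from the paper).
sample = [
    {"input_format": "code", "outcome": "rejected"},
    {"input_format": "code", "outcome": "rejected"},
    {"input_format": "code", "outcome": "executed"},
    {"input_format": "code", "outcome": "executed"},
    {"input_format": "natural_text", "outcome": "rejected"},
    {"input_format": "natural_text", "outcome": "executed"},
    {"input_format": "natural_text", "outcome": "executed"},
    {"input_format": "natural_text", "outcome": "executed"},
]

rates = rejection_rate_by_format(sample)
print(rates)
```

Grouping by input format before dividing keeps the two rates comparable even when the formats have different numbers of test cases.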

What problem does this paper attempt to address?

This paper addresses the safety of code agents used in AI-assisted programming and software development, and the lack of adequate ways to evaluate it. As code agents become more capable and more widely adopted, safety concerns such as generating or executing risky code have become significant obstacles to deploying them in practice. Specifically, the paper focuses on the following aspects:

1. **Safety evaluation of code agents**: Existing code agents can not only generate static code but also execute code dynamically and interact with the system environment. This introduces an additional layer of risk: an evaluation must cover both the vulnerabilities of the generated code and the safety of the agents' behavior in various execution environments.
2. **Lack of comprehensive safety benchmarks**: Although there has been some work on evaluating the security of code generated by code LLMs (Large Language Models), comprehensive safety evaluation of code agents is still lacking, especially for agents that execute code on real systems.
3. **Differences between natural-language and programming-language inputs**: Code agents need to handle multiple input formats, including natural language descriptions and code snippets in different programming languages. These formats may cause agents to behave differently when handling risky code.

To address these problems, the authors propose RedCode, a benchmark that aims to provide a comprehensive and practical safety evaluation of code agents, built on the following four key principles:

1. **Real interaction with the system**: Build Docker images so that test cases are compatible with each agent framework and code agents can execute code in a real environment.
2. **Comprehensive evaluation of both code generation and code execution**:
   - **RedCode-Exec**: Provides Python or Bash inputs, given either as risky code snippets or as natural language descriptions, to evaluate code agents' ability to recognize and handle risky code. It contains 4,050 test cases covering 25 types of critical vulnerabilities.
   - **RedCode-Gen**: Provides function signatures and docstrings as input to evaluate whether code agents will generate harmful code or software when instructed to. It contains 160 prompts involving 8 malware families.
3. **Diverse input formats**: Support both natural language and programming language inputs (such as Python and Bash) to evaluate how code agents perform under different input formats.
4. **High-quality risk scenarios and tests**: Collect risk scenarios from the Common Weakness Enumeration (CWE) and other safety benchmarks, and manually adapt them into 25 risk scenarios spanning 8 domains. Each risk scenario has corresponding high-quality test cases and evaluation scripts.

Built on these principles, RedCode helps researchers and developers better understand the safety of code agents and provides a basis for improving it.
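
The first principle, running each test case inside a Docker environment, can be sketched as a helper that builds the command line for one test case. This is an illustrative sketch only, not RedCode's harness: the image name `redcode-sandbox:latest`, the function `docker_command`, and the interpreter table are hypothetical, and the command is only constructed, never executed.

```python
import shlex


def docker_command(image, language, code):
    """Build (but do not run) a `docker run` command for one test case.

    `image` is a hypothetical sandbox image name; `--rm` removes the
    container after it exits so test cases do not share state.
    """
    interpreters = {"python": ["python3", "-c"], "bash": ["bash", "-c"]}
    if language not in interpreters:
        raise ValueError(f"unsupported language: {language}")
    return ["docker", "run", "--rm", image, *interpreters[language], code]


# A harmless placeholder payload stands in for a real test case.
cmd = docker_command("redcode-sandbox:latest", "bash", "echo hello")
print(shlex.join(cmd))
```

Returning the command as a list rather than a single string avoids shell-quoting problems when the test-case code itself contains quotes or spaces; a real harness could pass the list to `subprocess.run` and then inspect the container's output with an evaluation script.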

RedCode: Risky Code Execution and Generation Benchmark for Code Agents
