Abstract: This paper presents CyberSecEval, a comprehensive benchmark developed to help bolster the cybersecurity of Large Language Models (LLMs) employed as coding assistants. As what we believe to be the most extensive unified cybersecurity safety benchmark to date, CyberSecEval provides a thorough evaluation of LLMs in two crucial security domains: their propensity to generate insecure code and their level of compliance when asked to assist in cyberattacks. Through a case study involving seven models from the Llama 2, Code Llama, and OpenAI GPT large language model families, CyberSecEval effectively pinpointed key cybersecurity risks. More importantly, it offered practical insights for refining these models. A significant observation from the study was the tendency of more advanced models to suggest insecure code, highlighting the critical need for integrating security considerations in the development of sophisticated LLMs. CyberSecEval, with its automated test case generation and evaluation pipeline, covers a broad scope and equips LLM designers and researchers with a tool to broadly measure and enhance the cybersecurity safety properties of LLMs, contributing to the development of more secure AI systems.

What problem does this paper attempt to address?

The paper addresses two main problems:

1. **Risk of generating insecure code**
   - When large language models (LLMs) generate code, they may violate security best practices or introduce exploitable vulnerabilities. This risk is not theoretical, because developers accept a large amount of model-generated code. For example, GitHub has reported that 46% of the code on its platform is autogenerated by Copilot, and a Meta study of the CodeCompose model found that developers accept its suggestions 22% of the time. In addition, previous studies have found that 40% of LLM code suggestions are vulnerable, and user studies indicate that developers may accept buggy code suggested by an LLM up to 10% more often than they would write it themselves.
   - To mitigate this risk, CyberSecEval provides an automatic test case generation and evaluation pipeline that detects insecure coding practices in LLM-generated code and points to directions for improvement. By iteratively refining a model against these evaluation results, LLM designers and researchers can improve the security of the code it generates.
2. **Risk of assisting in cyberattacks**
   - The second question is whether LLMs will help carry out cyberattacks when given malicious requests. Although many base models are already aligned to resist assisting with illegal and criminal activities, the study examines whether this resistance carries over to models with coding capabilities.
   - The study notes that code by itself is not inherently malicious or benign; what matters is the intent behind the request. CyberSecEval therefore tests how an LLM responds to openly malicious requests and measures whether it provides help. This lets product designers anticipate and mitigate the risks of malicious use.

By understanding how their AI systems respond to such requests, developers can implement appropriate safeguards, such as refusals or user warnings, to prevent the model from being misused. Overall, CyberSecEval aims to provide a comprehensive benchmarking tool that helps LLM designers and researchers measure and improve the cybersecurity properties of LLMs, thereby promoting the development of more secure AI systems.
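The insecure-code measurement described above can be pictured with a small sketch. This is not the benchmark's actual detector: the rule names, the three regular expressions, and the helper functions `flag_insecure` and `insecure_rate` are all hypothetical, chosen only to show the general shape of a rule-based check that scans model completions and reports the fraction flagged.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A single insecure-pattern rule (hypothetical, for illustration)."""
    name: str
    pattern: re.Pattern


# Assumed example rules; a real detector would use a far larger,
# language-aware rule set rather than three Python-only regexes.
RULES = [
    Rule("weak-hash-md5", re.compile(r"hashlib\.md5\(")),
    Rule("eval-call", re.compile(r"\beval\(")),
    Rule("shell-true", re.compile(r"shell\s*=\s*True")),
]


def flag_insecure(code: str) -> list[str]:
    """Return the names of all rules whose pattern occurs in the code."""
    return [rule.name for rule in RULES if rule.pattern.search(code)]


def insecure_rate(completions: list[str]) -> float:
    """Fraction of completions that trigger at least one rule."""
    if not completions:
        return 0.0
    flagged = sum(1 for code in completions if flag_insecure(code))
    return flagged / len(completions)


# Stand-ins for LLM outputs; in practice these would come from the model
# under evaluation.
completions = [
    "import hashlib\nh = hashlib.md5(data).hexdigest()",
    "import hashlib\nh = hashlib.sha256(data).hexdigest()",
    "subprocess.run(cmd, shell=True)",
    "print('hi')",
]
print(insecure_rate(completions))
```

Tracking a rate like this across model versions is one way the iterative refinement loop mentioned above could be monitored, with the caveat that pattern matching of this kind can both miss real vulnerabilities and flag safe code.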

Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity

CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models

Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval

SECURE: Benchmarking Large Language Models for Cybersecurity

LLMSecCode: Evaluating Large Language Models for Secure Coding

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs

Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study

CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions

CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge

Ocassionally Secure: A Comparative Analysis of Code Generation Assistants

Assessing Cybersecurity Vulnerabilities in Code Large Language Models

CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs

An Insight into Security Code Review with LLMs: Capabilities, Obstacles and Influential Factors

LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations

SEvenLLM: Benchmarking, Eliciting, and Enhancing Abilities of Large Language Models in Cyber Threat Intelligence

A Preliminary Study on Using Large Language Models in Software Pentesting