Abstract:Large language models (LLMs) have brought significant advancements to code generation and code repair, benefiting both novice and experienced developers. However, their training using unsanitized data from open-source repositories, like GitHub, raises the risk of inadvertently propagating security vulnerabilities. Despite numerous studies investigating the safety of code LLMs, there remains a gap in comprehensively addressing their security features. In this work, we aim to present a comprehensive study aimed at precisely evaluating and enhancing the security aspects of code LLMs. To support our research, we introduce CodeSecEval, a meticulously curated dataset designed to address 44 critical vulnerability types with 180 distinct samples. CodeSecEval serves as the foundation for the automatic evaluation of code models in two crucial tasks: code generation and code repair, with a strong emphasis on security. Our experimental results reveal that current models frequently overlook security issues during both code generation and repair processes, resulting in the creation of vulnerable code. In response, we propose different strategies that leverage vulnerability-aware information and insecure code explanations to mitigate these security vulnerabilities. Furthermore, our findings highlight that certain vulnerability types particularly challenge model performance, influencing their effectiveness in real-world applications. Based on these findings, we believe our study will have a positive impact on the software engineering community, inspiring the development of improved methods for training and utilizing LLMs, thereby leading to safer and more trustworthy model deployment.

What problem does this paper attempt to address?

The problem this paper attempts to address is the security vulnerabilities present in large language models (LLMs) during code generation and code repair processes. Although these models excel in improving development efficiency, they are often trained on unfiltered open-source code repositories, which may inadvertently propagate security vulnerabilities. While current research has explored the security of code generation, there is still a lack of comprehensive evaluation and resolution of the security features of these models. This paper aims to conduct a comprehensive evaluation of the security of existing models in code generation and repair processes by introducing a carefully designed dataset, CodeSecEval, and proposes effective strategies to mitigate security vulnerabilities. This will raise awareness of these issues within the software engineering community and promote the development of safer and more reliable LLM applications. Specifically, the researchers address this problem through the following approaches: 1. **Constructing the CodeSecEval dataset**: This dataset contains 180 samples covering 44 critical vulnerability types. Each sample includes executable secure and insecure code examples as well as test cases, enabling automatic and precise evaluation of code security. 2. **Evaluating the performance of existing models**: By extensively evaluating 7 state-of-the-art code generation LLMs, the researchers reveal the phenomenon that these models generally overlook security issues when handling code generation and repair. 3. **Proposing mitigation strategies**: Based on the evaluation results, the researchers propose strategies that integrate vulnerability awareness information and explanations of insecure code to significantly enhance security in the code generation and repair processes. 4. **Validating the effectiveness of the strategies**: Through experiments, the effectiveness of the aforementioned strategies is validated, demonstrating that these methods can significantly reduce security vulnerabilities. In summary, this research not only provides new tools and methods for evaluating and enhancing the security of LLMs in code generation and repair tasks but also offers valuable recommendations for future LLM training and deployment, contributing to safer programming practices.

Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval

Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation

Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs

CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models

Large Language Models for Code: Security Hardening and Adversarial Testing

An Exploratory Study on Fine-Tuning Large Language Models for Secure Code Generation

How secure is AI-generated Code: A Large-Scale Comparison of Large Language Models

SecCoder: Towards Generalizable and Robust Secure Code Generation

CoSec: On-the-Fly Security Hardening of Code LLMs Via Supervised Co-Decoding

Security of Language Models for Code: A Systematic Literature Review

ProSec: Fortifying Code LLMs with Proactive Security Alignment

How Well Do Large Language Models Serve as End-to-End Secure Code Producers?

Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models

Large Language Models and Code Security: A Systematic Literature Review

Fine Tuning Large Language Model for Secure Code Generation

Generate and Pray: Using SALLMS to Evaluate the Security of LLM Generated Code

Ocassionally Secure: A Comparative Analysis of Code Generation Assistants

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

Code Security Vulnerability Repair Using Reinforcement Learning with Large Language Models

Codexity: Secure AI-assisted Code Generation

LLMSecCode: Evaluating Large Language Models for Secure Coding