Abstract: Large Language Models (LLMs) have shown impressive proficiency in code generation. Unfortunately, these models share a weakness with their human counterparts: producing code that inadvertently has security vulnerabilities. These vulnerabilities could allow unauthorized attackers to access sensitive data or systems, which is unacceptable for safety-critical applications. In this work, we propose Feedback-Driven Security Patching (FDSP), where LLMs automatically refine generated, vulnerable code. Our approach leverages automatic static code analysis to empower the LLM to generate and implement potential solutions to address vulnerabilities. We address the research community's needs for safe code generation by introducing a large-scale dataset, PythonSecurityEval, covering the diversity of real-world applications, including databases, websites and operating systems. We empirically validate that FDSP outperforms prior work that uses self-feedback from LLMs by up to 17.6% through our procedure that injects targeted, external feedback. Code and data are available at https://github.com/Kamel773/LLM-code-refine

What problem does this paper attempt to address?

The problem this paper addresses is: **how to use large language models (LLMs) to automatically fix security vulnerabilities in the code they generate**. Although current LLMs perform well at code generation, the code they produce may contain security vulnerabilities that allow unauthorized attackers to access sensitive data or systems, which is unacceptable in safety-critical applications.

To address this challenge, the authors propose **Feedback-Driven Security Patching (FDSP)**. FDSP improves LLM-generated code through the following steps:

1. **Code Generation**: The LLM generates Python code from a natural-language description.
2. **Code Testing**: A static code analysis tool (such as Bandit) checks the generated code for potential security vulnerabilities.
3. **Generate Potential Solutions**: The LLM proposes multiple possible fixes based on the feedback from the static code analysis tool.
4. **Code Refinement**: Each potential solution is fed back to the LLM together with the original vulnerable code, repeatedly, until the code has no remaining security issues.

The authors also introduce a large-scale dataset, **PythonSecurityEval**, which covers code examples from a range of real-world applications, including databases, websites, and operating systems. The dataset allows a more comprehensive evaluation of how well LLMs generate secure code.

In summary, the main contributions of this paper are:

- Proposing the FDSP method, which enables LLMs to automatically generate code that repairs security vulnerabilities based on feedback from static code analysis tools.
- Constructing the PythonSecurityEval dataset for evaluating the ability of LLMs to generate secure code.
- Showing experimentally that FDSP outperforms existing self-feedback methods at fixing security vulnerabilities, with an improvement of up to 17.6%.

With this method, the authors aim to improve the security of LLM-generated code so that it can be applied more safely in real development scenarios.
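
The analyze, propose, and refine steps described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fdsp_refine`, its parameters, and the three toy helpers are all hypothetical. The toy helpers stand in for a static analyzer such as Bandit and for LLM calls, so that the sketch runs without external tools. It starts from code that has already been generated (step 1).

```python
from typing import Callable, List

# Hypothetical callable types; none of these names come from the paper.
Analyzer = Callable[[str], List[str]]             # code -> issue reports
Proposer = Callable[[str, List[str]], List[str]]  # (code, issues) -> candidate solutions
Refiner = Callable[[str, str], str]               # (code, solution) -> patched code


def fdsp_refine(code: str, analyze: Analyzer, propose: Proposer,
                refine: Refiner, max_iters: int = 3) -> str:
    """Return a version of `code` that passes `analyze`, or `code` unchanged."""
    issues = analyze(code)                  # step 2: static analysis feedback
    if not issues:
        return code
    for solution in propose(code, issues):  # step 3: potential solutions
        candidate = code
        for _ in range(max_iters):          # step 4: iterative refinement
            candidate = refine(candidate, solution)
            if not analyze(candidate):
                return candidate
    return code  # no candidate passed the analyzer


# Toy stand-ins (assumptions): a real setup would call a static analyzer
# and an LLM here instead of doing string matching.
def toy_analyze(code: str) -> List[str]:
    if "yaml.load(" in code:
        return ["yaml.load() can construct arbitrary objects"]
    return []


def toy_propose(code: str, issues: List[str]) -> List[str]:
    return ["Use yaml.safe_load() instead of yaml.load()" for _ in issues]


def toy_refine(code: str, solution: str) -> str:
    return code.replace("yaml.load(", "yaml.safe_load(")


vulnerable = "config = yaml.load(data)"
patched = fdsp_refine(vulnerable, toy_analyze, toy_propose, toy_refine)
print(patched)
```

Passing the analyzer, proposer, and refiner as callables keeps the loop independent of any particular tool or model. The analyzer also serves as the stopping criterion, which mirrors the summary's "until there are no more security issues" condition.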

Can LLMs Patch Security Issues?

Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs

Code Security Vulnerability Repair Using Reinforcement Learning with Large Language Models

Generate and Pray: Using SALLMS to Evaluate the Security of LLM Generated Code

SALLM: Security Assessment of Generated Code

Software Vulnerability and Functionality Assessment using LLMs

LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval

Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

Can LLMs be Fooled? Investigating Vulnerabilities in LLMs

LLM Security Guard for Code

Automated Software Vulnerability Patching using Large Language Models

CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models

LLM-Powered Code Vulnerability Repair with Reinforcement Learning and Semantic Reward

How Well Do Large Language Models Serve as End-to-End Secure Code Producers?

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

An Exploratory Study on Fine-Tuning Large Language Models for Secure Code Generation

Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation

A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems