Abstract: AI-powered coding assistants such as GitHub Copilot and OpenAI ChatGPT have achieved notable success in automating code generation. However, these tools rely on pre-trained Large Language Models (LLMs) that are typically trained on human-written code sourced from open-source project hosting sites like GitHub, which often contains inherent security vulnerabilities. These vulnerabilities may then be mirrored in the code generated by these LLMs, a critical risk highlighted by recent empirical studies. In this work, we present an exploratory study on whether fine-tuning pre-trained LLMs on datasets of vulnerability-fixing commits can promote secure code generation. We explored two parameter-efficient fine-tuning techniques (LoRA and IA3) on two pre-trained LLMs for code generation. We crawled a fine-tuning dataset (14,622 C and C++ files) for secure code generation by collecting code fixes of confirmed vulnerabilities from open-source repositories. Our evaluation dataset comprises 52 vulnerability scenarios designed to cover the most dangerous C and C++ Common Weakness Enumerations (CWEs). Each scenario is a prompt that may induce LLMs to generate vulnerable code. Our exploration reveals that fine-tuning LLMs can improve secure code generation by 6.4% in C and 5.4% in C++. We further experimented with fine-tuning LLMs using different versions of the collected secure code dataset (file, function, block, and line). We found that fine-tuning with function-level and block-level datasets achieves the best secure code generation performance, compared to the alternatives (file-level and line-level).
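To make the idea of dataset granularity more concrete, here is a minimal sketch of how coarser and finer training samples might be cut from a single vulnerability-fixing change. It is an illustration only: the C snippets, the function names (`file_level`, `hunk_with_context`, `line_level`), and the use of Python's `difflib` are assumptions, not the authors' extraction pipeline. The hunk-with-context variant is a rough stand-in for the intermediate granularities, whose exact definitions are not given in the abstract.

```python
import difflib

# Hypothetical before/after versions of a file touched by a security fix.
VULNERABLE = """\
#include <string.h>

void copy_name(char *dst, const char *src) {
    strcpy(dst, src);
}
"""

FIXED = """\
#include <string.h>

void copy_name(char *dst, size_t dst_len, const char *src) {
    strncpy(dst, src, dst_len - 1);
    dst[dst_len - 1] = 0;
}
"""


def _changed_ranges(before: str, after: str):
    """Yield (start, end) index ranges in `after` that the fix added or rewrote."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            yield j1, j2


def file_level(after: str) -> list[str]:
    """Coarsest sample: the whole fixed file."""
    return after.splitlines()


def hunk_with_context(before: str, after: str, context: int = 1) -> list[str]:
    """Intermediate sample (assumed): changed lines plus neighbouring lines."""
    b = after.splitlines()
    keep = set()
    for j1, j2 in _changed_ranges(before, after):
        keep.update(range(max(0, j1 - context), min(len(b), j2 + context)))
    return [b[j] for j in sorted(keep)]


def line_level(before: str, after: str) -> list[str]:
    """Finest sample: only the lines the fix added or rewrote."""
    b = after.splitlines()
    return [line for j1, j2 in _changed_ranges(before, after) for line in b[j1:j2]]


if __name__ == "__main__":
    for name, sample in [
        ("file", file_level(FIXED)),
        ("hunk+context", hunk_with_context(VULNERABLE, FIXED)),
        ("line", line_level(VULNERABLE, FIXED)),
    ]:
        print(f"{name}: {len(sample)} lines")
```

The line-level sample keeps the fix itself but loses the surrounding function, which is one plausible reason finer granularity could carry less usable context during fine-tuning.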

What problem does this paper attempt to address?

This paper explores whether fine-tuning large language models (LLMs) can make the code they generate more secure. Automatic code generation tools such as GitHub Copilot and OpenAI ChatGPT have been very successful, but they rely on pre-trained LLMs that are often trained on open-source project code containing security vulnerabilities. The generated code may therefore inherit these vulnerabilities, a risk that recent empirical studies have revealed and emphasized.

To address this challenge, the authors conducted the following explorations:

1. **Fine-tuning techniques**: The authors experimented with two parameter-efficient fine-tuning techniques (LoRA and IA3) on two pre-trained LLMs (CodeLlama and CodeGen2).
2. **Dataset**: The authors collected 14,622 C/C++ files from open-source repositories that contain confirmed security vulnerability fixes as the fine-tuning dataset.
3. **Evaluation method**: The authors designed 52 security-sensitive scenarios covering the most dangerous Common Weakness Enumerations (CWEs) in C/C++ to evaluate whether the fine-tuned LLMs can generate more secure code.

Through these explorations, the authors aim to answer the following research questions:

- **RQ1**: Can fine-tuning LLMs reduce security vulnerabilities in the generated code?
- **RQ2**: How does the granularity of the fine-tuning dataset affect the ability of fine-tuned LLMs to generate secure code?
- **RQ3**: How does the size of the fine-tuning dataset affect the ability of fine-tuned LLMs to generate secure code?
- **RQ4**: Does fine-tuning LLMs to improve security reduce their performance in generating functionally correct code?

Ultimately, the authors hope to provide a comprehensive understanding of whether fine-tuning LLMs can effectively promote the generation of secure code without significantly reducing their ability to generate functionally correct code.
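The two parameter-efficient techniques named in the summary can be sketched in a few lines. This is a toy illustration of the general ideas, not the configuration used in the study: LoRA keeps a pre-trained weight matrix `W` frozen and learns a low-rank update `(alpha / r) * B @ A`, while IA3 learns a vector that rescales activations element-wise. The dimensions, `alpha`, `r`, and all function names below are hypothetical.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, lora_a, lora_b, alpha, r):
    """Effective weight under LoRA: W + (alpha / r) * B @ A, with W kept frozen."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def ia3_rescale(activations, scaling):
    """IA3-style element-wise rescaling of an activation vector."""
    return [act * s for act, s in zip(activations, scaling)]


if __name__ == "__main__":
    random.seed(0)
    d_out, d_in, r, alpha = 8, 8, 2, 4  # hypothetical toy sizes

    w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
    # Standard LoRA initialisation: A is random, B is zero, so training
    # starts from exactly the pre-trained behaviour.
    lora_a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
    lora_b = [[0.0] * r for _ in range(d_out)]
    assert lora_weight(w, lora_a, lora_b, alpha, r) == w

    # IA3 scaling vectors start at one, which is also an identity.
    ia3_vector = [1.0] * d_out
    assert ia3_rescale(w[0], ia3_vector) == w[0]

    print("full fine-tuning parameters:", d_out * d_in)
    print("LoRA trainable parameters:  ", r * (d_in + d_out))
    print("IA3 trainable parameters:   ", d_out)
```

Both methods train far fewer parameters than updating `W` directly, which is what makes them practical for fine-tuning code LLMs on a comparatively small dataset of vulnerability fixes.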

An Exploratory Study on Fine-Tuning Large Language Models for Secure Code Generation
