An Exploratory Study on Fine-Tuning Large Language Models for Secure Code Generation

Junjie Li, Fazle Rabbi, Cheng Cheng, Aseem Sangalay, Yuan Tian, Jinqiu Yang
2024-08-17
Abstract: AI-powered coding assistants such as GitHub Copilot and OpenAI ChatGPT have achieved notable success in automating code generation. However, these tools rely on pre-trained Large Language Models (LLMs) that are typically trained on human-written code sourced from open-source project hosting sites like GitHub, which often contains inherent security vulnerabilities. These vulnerabilities may then be mirrored in the code generated by these LLMs, a critical risk revealed and highlighted by recent empirical studies. In this work, we present an exploratory study on whether fine-tuning pre-trained LLMs on datasets of vulnerability-fixing commits can promote secure code generation. We explored two parameter-efficient fine-tuning techniques (LoRA and IA3) on two pre-trained LLMs for code generation. We crawled a fine-tuning dataset (14,622 C and C++ files) for secure code generation by collecting code fixes of confirmed vulnerabilities from open-source repositories. Our evaluation dataset comprises 52 vulnerability scenarios designed to cover the most dangerous C and C++ Common Weakness Enumerations (CWEs). Each scenario is a prompt that may induce LLMs to generate vulnerable code. Our exploration reveals that fine-tuning LLMs can improve secure code generation by 6.4% in C and 5.4% in C++. We further experimented with fine-tuning LLMs using finer-grained versions of the collected file-level secure code dataset (block, function, and line). We found that fine-tuning with function-level and block-level datasets achieves the best secure code generation performance, compared to the alternatives (file-level and line-level).
Software Engineering
### What problem does this paper attempt to solve?

This paper explores whether fine-tuning large language models (LLMs) can lead them to generate more secure code. Existing code generation tools such as GitHub Copilot and OpenAI ChatGPT have been notably successful at automating code generation, but they rely on pre-trained LLMs that are typically trained on open-source code containing security vulnerabilities. The generated code may therefore inherit these vulnerabilities, a risk that recent empirical studies have revealed and emphasized.

To address this challenge, the authors conducted the following explorations:

1. **Fine-tuning techniques**: The authors experimented with two parameter-efficient fine-tuning techniques (LoRA and IA3) on two pre-trained LLMs (CodeLlama and CodeGen2).
2. **Dataset**: The authors collected 14,622 C/C++ files containing fixes for confirmed security vulnerabilities from open-source repositories as the fine-tuning dataset.
3. **Evaluation method**: The authors designed 52 security-sensitive scenarios covering the most dangerous Common Weakness Enumerations (CWEs) in C/C++ to evaluate whether the fine-tuned LLMs generate more secure code.

Through these explorations, the authors aim to answer the following research questions:

- **RQ1**: Can fine-tuning LLMs reduce security vulnerabilities in the generated code?
- **RQ2**: How does the granularity of the fine-tuning dataset affect the ability of fine-tuned LLMs to generate secure code?
- **RQ3**: How does the size of the fine-tuning dataset affect the ability of fine-tuned LLMs to generate secure code?
- **RQ4**: Does fine-tuning LLMs to improve security reduce their performance in generating functionally correct code?

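
RQ2 concerns the granularity of the fine-tuning data. As a rough illustration only, the sketch below derives file-, block-, and line-level samples from a single vulnerable/fixed pair using Python's `difflib`. The function names, the `context` parameter, the toy C snippet, and the working definitions of "block" (changed lines plus surrounding context) and "line" (changed lines only) are assumptions made for illustration; the summary does not state how the authors built each version. Function-level extraction, which would need a C/C++ parser, is left out.

```python
import difflib


def changed_line_indices(before: str, after: str) -> list[int]:
    """Return 0-based indices of lines in `after` that were added or modified."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.extend(range(j1, j2))
    return changed


def build_samples(before: str, after: str, context: int = 1) -> dict:
    """Derive file-, block-, and line-level samples from one fix (assumed definitions)."""
    fixed_lines = after.splitlines()
    changed = changed_line_indices(before, after)

    # Line level: only the lines the fix added or modified.
    line_level = [fixed_lines[i] for i in changed]

    # Block level: each changed line plus `context` lines on either side,
    # with overlapping windows merged into one block.
    windows = []
    for i in changed:
        lo = max(0, i - context)
        hi = min(len(fixed_lines), i + context + 1)
        if windows and lo <= windows[-1][1]:
            windows[-1][1] = max(windows[-1][1], hi)
        else:
            windows.append([lo, hi])
    block_level = ["\n".join(fixed_lines[lo:hi]) for lo, hi in windows]

    # File level: the whole fixed file.
    return {"file": after, "block": block_level, "line": line_level}


# Hypothetical vulnerable/fixed pair (an unbounded copy replaced by a bounded one).
BEFORE = "\n".join([
    "void copy(char *dst, const char *src) {",
    "    char buf[16];",
    "    strcpy(buf, src);",
    "    strcpy(dst, buf);",
    "}",
])
AFTER = "\n".join([
    "void copy(char *dst, const char *src) {",
    "    char buf[16];",
    '    snprintf(buf, sizeof(buf), "%s", src);',
    "    strcpy(dst, buf);",
    "}",
])

samples = build_samples(BEFORE, AFTER, context=1)
for level in ("line", "block"):
    print(level, samples[level])
```

In this toy setup, a larger `context` moves block-level samples closer to the file level, and `context=0` collapses them toward the line level, which is one way to picture the trade-off RQ2 asks about.
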
Ultimately, the authors hope to provide a comprehensive understanding of whether fine-tuning LLMs can effectively promote the generation of secure code without significantly reducing their ability to generate functionally correct code.
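
For readers unfamiliar with the two parameter-efficient techniques named above, the block below restates their commonly cited formulations. These follow the original LoRA and (IA)^3 method descriptions as generally presented, not details taken from this study; the rank $r$, scaling factor $\alpha$, and the vector names $l_k$, $l_v$, $l_{ff}$ are the usual notation, and how the authors configured them is not stated here.

```latex
% LoRA: keep the pre-trained weight W_0 frozen and learn a low-rank update B A,
% with rank r much smaller than min(d, k).
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k}

% IA3: keep all weights frozen and learn vectors that rescale the keys, the values,
% and the intermediate feed-forward activations element-wise.
\operatorname{Attn}(Q, K, V) =
  \operatorname{softmax}\!\left(\frac{Q\,(l_k \odot K^{\top})}{\sqrt{d_k}}\right)(l_v \odot V),
\qquad
\operatorname{FFN}(x) = \bigl(l_{ff} \odot \gamma(W_1 x)\bigr)\, W_2
```

In both cases only the small added parameters ($A$ and $B$, or the rescaling vectors) are trained, which is what makes fine-tuning on a dataset of this size comparatively cheap.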