Comparing Robustness Against Adversarial Attacks in Code Generation: LLM-Generated vs. Human-Written

Md Abdul Awal, Mrigank Rochan, Chanchal K. Roy
2024-11-16
Abstract: Thanks to the widespread adoption of Large Language Models (LLMs) in software engineering research, the long-standing dream of automated code generation has become a reality on a large scale. Nowadays, LLMs such as GitHub Copilot and ChatGPT are extensively used in code generation for enterprise and open-source software development and maintenance. Despite their unprecedented successes in code generation, research indicates that code generated by LLMs exhibits vulnerabilities and security issues. Several studies have been conducted to evaluate code generated by LLMs, considering various aspects such as security, vulnerability, code smells, and robustness. While some studies have compared the performance of LLMs with that of humans in various software engineering tasks, there is a notable gap in research: no studies have directly compared the robustness of human-written and LLM-generated code. To fill this void, this paper introduces an empirical study to evaluate the adversarial robustness of Pre-trained Models of Code (PTMCs) fine-tuned on code written by humans and on code generated by LLMs against adversarial attacks for software clone detection. These attacks could potentially undermine software security and reliability. We consider two datasets, two state-of-the-art PTMCs, two robustness evaluation criteria, and three metrics in our experiments. Regarding the effectiveness criterion, PTMCs fine-tuned on human-written code always demonstrate more robustness than those fine-tuned on LLM-generated code. On the other hand, in terms of adversarial code quality, in 75% of the experimental combinations, PTMCs fine-tuned on human-written code exhibit more robustness than PTMCs fine-tuned on LLM-generated code.
Software Engineering
What problem does this paper attempt to address?
The main problem this paper addresses is how the robustness of code generated by large language models (LLMs) compares with that of human-written code under adversarial attacks. Specifically, the paper aims to fill a gap in existing research: no prior study directly compares the robustness of human-written code and LLM-generated code under adversarial attacks.

### Research Background

In recent years, large language models such as GitHub Copilot and ChatGPT have been widely applied to automated code generation. Although the code these models generate has achieved remarkable success, it has also exposed security and vulnerability issues. Existing studies have evaluated the security, vulnerabilities, code smells, and robustness of LLM-generated code, but none has systematically compared human-written code and LLM-generated code under adversarial attacks.

### Research Motivation

The motivations of this research are as follows:

1. **Security of Code Generation**: As LLM-generated code is widely used in practical software engineering, it is crucial to evaluate its robustness against adversarial attacks.
2. **Lack of Systematic Comparison**: Current research lacks a systematic comparison of the robustness of human-written code and LLM-generated code across software engineering tasks, especially under adversarial attacks.

### Research Questions

The main research question in the paper is:

- **RQ**: In terms of code generation, who can better resist adversarial attacks: large language models or humans?

### Research Methods

To answer this research question, the paper designs an empirical study that covers the following aspects:

1. **Dataset Selection**: Two datasets, SemanticCloneBench (human-written code) and GPTCloneBench (LLM-generated code), were selected to generate adversarial samples.
2. **Pre-trained Model Selection**: Two state-of-the-art pre-trained models of code (PTMCs), CodeBERT and CodeGPT, were selected.
3. **Adversarial Attack Methods**: Four black-box adversarial attack methods (ALERT, WIR-Random, MHM, and StyleTransfer) were adopted.
4. **Evaluation Metrics**: Metrics such as Accuracy, Precision, and Recall were used to evaluate the performance of PTMCs under adversarial attacks.

### Main Findings

Through experiments, the paper found that:

- In terms of attack effectiveness, PTMCs fine-tuned on human-written code always show higher robustness than those fine-tuned on LLM-generated code.
- In terms of adversarial code quality, PTMCs fine-tuned on human-written code show higher robustness in 75% of the experimental combinations.

### Conclusions

This research suggests that, at the current level of technology, PTMCs fine-tuned on human-written code are more robust against adversarial attacks than those fine-tuned on LLM-generated code. This conclusion offers guidance on how to use LLM-generated code for more secure software development in the future.

### Formula Representation

The formula involved in the paper is as follows:

\[ N = \binom{R}{n} = \frac{R!}{n!\,(R - n)!} \]

where:

- \( N \) represents the number of combinations of selecting \( n \) elements from \( R \) elements.
- \( R \) represents the total number of code fragments.
- \( n \) represents the number of code fragments selected each time.

For example, for 10 code fragments selected 2 at a time, the number of combinations is:

\[ \binom{10}{2} = \frac{10!}{2!\,(10 - 2)!} = 45 \]

That is, there are 45 ways to select 2 fragments from 10 code fragments.
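
The combination count above can be checked with a short Python sketch. It is only an illustration of the formula: the function names and the placeholder fragment labels are hypothetical, and nothing here is taken from the paper's own tooling or data.

```python
from itertools import combinations
from math import comb


def count_combinations(total_fragments: int, group_size: int) -> int:
    """Return R! / (n! (R - n)!) for R = total_fragments, n = group_size."""
    return comb(total_fragments, group_size)


def enumerate_pairs(fragments):
    """Return every unordered pair of fragments (the n = 2 case)."""
    return list(combinations(fragments, 2))


# Hypothetical placeholder labels standing in for 10 code fragments.
fragments = [f"fragment_{i}" for i in range(10)]

pair_count = count_combinations(len(fragments), 2)
pairs = enumerate_pairs(fragments)

print(pair_count)   # closed-form count
print(len(pairs))   # count obtained by explicit enumeration
```

Both printed values should agree, since enumerating the pairs explicitly and evaluating the closed-form expression count the same thing. The closed form is preferable when only the number of pairs is needed, because the explicit list grows quadratically with the number of fragments.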