LiCoEval: Evaluating LLMs on License Compliance in Code Generation

Weiwei Xu, Kai Gao, Hao He, Minghui Zhou
2024-11-12
Abstract: Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code by establishing a benchmark to evaluate the ability of LLMs to provide accurate license information for their generated code. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity" that excludes the possibility of independent creation, indicating a copy relationship between the LLM output and certain open-source code. Based on this standard, we propose LiCoEval, to evaluate the license compliance capabilities of LLMs, i.e., the ability to provide accurate license or copyright information when they generate code with striking similarity to already existing copyrighted code. Using LiCoEval, we evaluate 14 popular LLMs, finding that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundation for future research and development to improve license compliance in AI-assisted software development, contributing to both the protection of open-source software copyrights and the mitigation of legal risks for LLM users.
Software Engineering, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem that large language models (LLMs) can generate code without the necessary license information, which exposes users to potential intellectual property (IP) infringement. Specifically, it focuses on how to evaluate whether LLMs accurately provide the corresponding license or copyright information when generating code, so that users of that code can comply with the terms of the open-source software.

### Main problems of the paper

1. **Risk of intellectual property infringement**: Because the training data of LLMs contains a large amount of code protected by open-source licenses, these models may generate code that is strikingly similar to existing open-source code without providing the necessary license information. Users who adopt this code may then violate open-source license terms and incur legal risk.
2. **Lack of evaluation criteria**: Although existing research has evaluated the accuracy, robustness, and security of LLM code generation, there is not yet a dedicated benchmark for license compliance. An evaluation framework is therefore needed to measure whether LLMs handle license information correctly during code generation.

### Solutions

To solve the above problems, the paper proposes the following key steps:

1. **Define the "striking similarity" standard**: Through an empirical study, determine a reasonable standard of "striking similarity" for distinguishing whether code generated by an LLM was independently created or copied from existing code. This standard builds on the legal principle of "access and substantial similarity": when two pieces of code are so similar that independent creation can be ruled out, copying can be inferred.
2. **Construct an evaluation benchmark**: Based on this standard, the paper designs a benchmark named LiCoEval for systematically evaluating the license compliance ability of LLMs in code generation tasks. The benchmark covers both technical and legal considerations and evaluates whether LLMs provide correct license information when they generate strikingly similar code.
3. **Empirical analysis**: By evaluating 14 popular LLMs, the paper finds that even the best-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of outputs that are strikingly similar to existing open-source code, and that most models fail to provide accurate license information, especially for code under copyleft licenses.

### Conclusions

The results emphasize the urgency of enhancing the license compliance ability of LLMs, especially for code under copyleft licenses. The research also provides a reference for improving LLM training and standardizing LLM use, which helps protect the copyright of open-source software and reduce legal risks for users.

### Formula examples

The paper does not involve complex mathematical formulas, but when describing similarity calculation it mentions several commonly used text-similarity measures, such as BLEU-4, Jaccard similarity, and edit distance. A brief introduction to each follows:

- **BLEU-4**: \[ BLEU = BP\times\exp\left(\sum_{n = 1}^{4}w_n\log p_n\right) \] where \(BP\) is the brevity penalty, \(p_n\) is the n-gram precision, and \(w_n\) is the weight of the n-gram order \(n\).
- **Jaccard similarity**: \[ J(A, B)=\frac{|A\cap B|}{|A\cup B|} \] where \(A\) and \(B\) are two sets, representing the feature sets of the generated code and the original code respectively.
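- **Edit distance**: \[ ED(A[1..m], B[1..n])=\min\left\{\begin{array}{l} ED(A[1..m - 1], B[1..n])+ 1\\ ED(A[1..m], B[1..n - 1])+ 1\\ ED(A[1..m - 1], B[1..n - 1])+\text{cost}(A[m], B[n]) \end{array}\right. \] where \(\text{cost}(A[m], B[n])\) is the substitution cost: 0 if \(A[m] = B[n]\) and 1 otherwise. The distance between a sequence and the empty sequence is the length of that sequence.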
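
The three measures above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the whitespace `tokenize` helper, the uniform BLEU weights \(w_n = 1/4\), the single-reference setup, and the absence of smoothing are all assumptions made here for brevity. The source does not state which tokenizer or similarity thresholds LiCoEval uses.

```python
import math
from collections import Counter


def tokenize(code: str) -> list[str]:
    """Split on whitespace (an assumption, not the paper's tokenizer)."""
    return code.split()


def jaccard(a: list[str], b: list[str]) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B| over the sets of distinct tokens."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here for two empty inputs
    return len(set_a & set_b) / len(set_a | set_b)


def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance via the recurrence above, one row at a time."""
    prev = list(range(len(b) + 1))  # ED(empty, B[1..j]) = j
    for i, x in enumerate(a, start=1):
        curr = [i]  # ED(A[1..i], empty) = i
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate: list[str], reference: list[str]) -> float:
    """BLEU-4 with uniform weights, one reference, and no smoothing."""
    if not candidate:
        return 0.0
    log_sum = 0.0
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, a zero precision gives BLEU = 0
        log_sum += 0.25 * math.log(clipped / total)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(log_sum)


if __name__ == "__main__":
    generated = tokenize("def add ( a , b ) : return a + b")
    original = tokenize("def add ( x , y ) : return x + y")
    print("Jaccard:", jaccard(generated, original))
    print("Edit distance:", edit_distance(generated, original))
    print("BLEU-4:", bleu4(generated, original))
```

Because the functions operate on token lists, the same code works at character level by passing `list(text)` instead of `tokenize(text)`. Deciding when a score counts as "striking similarity" requires the threshold derived in the paper's empirical study, which is deliberately left out of this sketch.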