Abstract: Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code by establishing a benchmark to evaluate the ability of LLMs to provide accurate license information for their generated code. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity" that excludes the possibility of independent creation, indicating a copy relationship between the LLM output and certain open-source code. Based on this standard, we propose LiCoEval to evaluate the license compliance capabilities of LLMs, i.e., the ability to provide accurate license or copyright information when they generate code with striking similarity to already existing copyrighted code. Using LiCoEval, we evaluate 14 popular LLMs, finding that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundation for future research and development to improve license compliance in AI-assisted software development, contributing to both the protection of open-source software copyrights and the mitigation of legal risks for LLM users.

What problem does this paper attempt to address?

This paper addresses the problem that large language models (LLMs) can generate code without the necessary license information, which can lead to intellectual property (IP) infringement. Specifically, it focuses on how to evaluate whether LLMs accurately provide the corresponding license or copyright information when generating code, so that users of that code can comply with the terms of open-source licenses.

### Main problems of the paper

1. **Risk of intellectual property infringement**: The training data of LLMs contains a large amount of code protected by open-source licenses, so these models may generate code that is extremely similar to existing open-source code without providing the necessary license information. Users who adopt this code may violate open-source license terms and incur legal risk.
2. **Lack of evaluation criteria**: Existing research has evaluated the accuracy, robustness, and security of LLM code generation, but there is not yet a dedicated benchmark for license compliance. An evaluation framework is therefore needed to measure whether LLMs handle license information correctly during code generation.

### Solutions

To address these problems, the paper takes the following key steps:

1. **Define the "striking similarity" standard**: Through an empirical study, determine a reasonable standard of "striking similarity" for distinguishing whether LLM-generated code was independently created or copied from existing code. This standard builds on the legal principle of "access and substantial similarity": when two pieces of code are so similar that independent creation can be ruled out, copying can be inferred.
2. **Construct an evaluation benchmark**: Based on this standard, the paper designs a benchmark named LiCoEval for systematically evaluating the license compliance ability of LLMs in code generation tasks. The benchmark combines technical and legal considerations and evaluates whether LLMs provide correct license information when they generate strikingly similar code.
3. **Empirical analysis**: Evaluating 14 popular LLMs, the paper finds that even the best-performing models generate a non-negligible proportion (0.88% to 2.01%) of outputs that are strikingly similar to existing open-source code, and that most models fail to provide accurate license information, especially for code under copyleft licenses.

### Conclusions

The results emphasize the urgency of improving the license compliance ability of LLMs, especially for code under copyleft licenses. The study also provides a reference for improving LLM training and standardizing LLM use, which helps protect the copyright of open-source software and reduce legal risk for users.

### Formula examples

The paper does not involve complex mathematical formulas, but when describing similarity calculation it mentions several commonly used text similarity measures, such as BLEU-4, Jaccard similarity, and edit distance. A brief introduction to each follows:

- **BLEU-4**: \[ BLEU = BP\times\exp\left(\sum_{n = 1}^{4}w_n\log p_n\right) \] where \(BP\) is the brevity penalty, \(p_n\) is the n-gram precision, and \(w_n\) is the weight of the n-gram order \(n\).
- **Jaccard similarity**: \[ J(A, B)=\frac{|A\cap B|}{|A\cup B|} \] where \(A\) and \(B\) are two sets, representing the feature sets of the generated code and the original code respectively.
- **Edit distance**: \[ ED(A[1..m], B[1..n])=\min\left(\begin{array}{c} ED(A[1..m - 1], B[1..n])+ 1\\ ED(A[1..m], B[1..n - 1])+ 1\\ ED(A[1..m - 1], B[1..n - 1])+\text{cost}(A[m], B[n]) \end{array}\right) \] where \(\text{cost}(A[m], B[n])\) is the substitution cost (0 if \(A[m] = B[n]\), otherwise 1), and the distance between an empty sequence and a sequence of length \(k\) is \(k\).
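
The three measures above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: whitespace tokenization, the helper names (`tokenize`, `ngrams`, `jaccard`, `edit_distance`, `bleu4`), a single reference, uniform BLEU weights of 1/4, and the absence of smoothing are all choices made here for brevity.

```python
import math
from collections import Counter
from typing import Sequence


def tokenize(code: str) -> list:
    """Hypothetical tokenizer: split on whitespace only."""
    return code.split()


def ngrams(tokens: Sequence, n: int) -> Counter:
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def jaccard(a: Sequence, b: Sequence) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B| over the sets of distinct tokens."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty inputs are identical
    return len(set_a & set_b) / len(set_a | set_b)


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance via the recurrence above, keeping one row at a time."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1  # substitution cost
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def bleu4(candidate: Sequence, reference: Sequence) -> float:
    """Unsmoothed BLEU-4 against a single reference, with w_n = 1/4."""
    if not candidate:
        return 0.0
    log_sum = 0.0
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, a zero precision gives a zero score
        log_sum += 0.25 * math.log(clipped / total)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(log_sum)


if __name__ == "__main__":
    generated = tokenize("def add(a, b): return a + b")
    original = tokenize("def add(x, y): return x + y")
    print("Jaccard:", jaccard(generated, original))
    print("Edit distance:", edit_distance(generated, original))
    print("BLEU-4:", bleu4(generated, original))
```

How such scores map to a "striking similarity" decision depends on the thresholds the paper derives empirically; none are assumed in this sketch.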

LiCoEval: Evaluating LLMs on License Compliance in Code Generation
