CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code

Batu Guan, Yao Wan, Zhangqian Bi, Zheng Wang, Hongyu Zhang, Yulei Sui, Pan Zhou, Lichao Sun
2024-04-24
Abstract: As Large Language Models (LLMs) are increasingly used to automate code generation, it is often desirable to know whether a piece of code is AI-generated and by which model, especially for purposes such as protecting intellectual property (IP) in industry and preventing academic misconduct in education. Incorporating watermarks into machine-generated content is one way to provide code provenance, but existing solutions are restricted to a single bit or lack flexibility. We present CodeIP, a new watermarking technique for LLM-based code generation. CodeIP enables the insertion of multi-bit information while preserving the semantics of the generated code, improving the strength and diversity of the inserted watermark. This is achieved by training a type predictor to predict the grammar type of the next token, which enhances the syntactic and semantic correctness of the generated code. Experiments on a real-world dataset across five programming languages showcase the effectiveness of CodeIP.
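To make the idea of a grammar-guided multi-bit watermark concrete, here is a minimal toy sketch in Python. It is not CodeIP's actual algorithm. Everything in it is an assumption made for illustration: the integer vocabulary, the keyed hash vote scheme, the bias values `DELTA` and `GAMMA`, and the stub `predict_next_type`, which stands in for the trained type predictor. Random scores stand in for LLM logits.

```python
import hashlib
import random

N_BITS = 4               # message length in bits (assumed for this toy)
VOCAB = list(range(50))  # toy vocabulary of integer token ids
DELTA = 2.0              # watermark bias (hypothetical value)
GAMMA = 1.0              # grammar-type bias (hypothetical value)


def vote(key, prev_tok, tok):
    """Map (key, previous token, token) to a message position and a bit value."""
    digest = hashlib.sha256(f"{key}:{prev_tok}:{tok}".encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    return h % N_BITS, (h >> 32) & 1


def token_type(tok):
    """Hypothetical grammar type of a token (three made-up types)."""
    return tok % 3


def predict_next_type(prev_tok):
    """Stub standing in for a trained type predictor."""
    return (token_type(prev_tok) + 1) % 3


def generate(bits, key, length, seed=0):
    """Greedy toy generation with watermark and grammar-type biases."""
    rng = random.Random(seed)
    tokens = [0]  # fixed start token
    for _ in range(length):
        prev = tokens[-1]
        want_type = predict_next_type(prev)
        best_tok, best_score = None, float("-inf")
        for tok in VOCAB:
            score = rng.random()  # stand-in for a model logit in [0, 1)
            pos, val = vote(key, prev, tok)
            if val == bits[pos]:
                score += DELTA    # favor tokens that encode the message
            if token_type(tok) == want_type:
                score += GAMMA    # favor tokens of the predicted type
            if score > best_score:
                best_tok, best_score = tok, score
        tokens.append(best_tok)
    return tokens


def extract(tokens, key):
    """Recover the message by majority vote over consecutive token pairs."""
    tally = [[0, 0] for _ in range(N_BITS)]
    for prev, tok in zip(tokens, tokens[1:]):
        pos, val = vote(key, prev, tok)
        tally[pos][val] += 1
    return [int(ones > zeros) for zeros, ones in tally]
```

A message such as `[1, 0, 1, 1]` can be embedded with `generate(bits, "secret", 200)` and read back with `extract(tokens, "secret")`. In this sketch the watermark bias is deliberately larger than the grammar bias, so the message takes priority; a real system would have to balance the two against the model's own logits.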
Computation and Language