Abstract: Code Summarization Models (CSMs) have been widely used in code production, such as online and web programming for PHP and JavaScript. CSMs are essential tools in code production, enhancing software development efficiency and driving innovation in automated code analysis. However, CSMs face risks of exploitation by unauthorized users, particularly in an online environment where they can be easily shared and disseminated. To address these risks, digital watermarks offer a promising solution by embedding imperceptible signatures within the models to assert copyright ownership and track unauthorized usage. Traditional watermarking for CSM copyright protection faces two main challenges: 1) dataset watermarking methods require separate design of triggers and watermark features based on the characteristics of different programming languages, which not only increases the computational complexity but also leads to a lack of generalization; 2) existing watermarks based on code style transformation are easily identified by automated detection, demonstrating poor concealment. To tackle these issues, we propose ModMark, a novel model-level digital watermark embedding method. Specifically, by fine-tuning the tokenizer, ModMark achieves cross-language generalization while reducing the complexity of watermark design. Moreover, we employ code noise injection techniques to effectively prevent trigger detection. Experimental results show that our method achieves a 100% watermark verification rate across CSMs for various programming languages, and that the concealment and effectiveness of ModMark are also maintained.

What problem does this paper attempt to address?

This paper addresses the problem that Code Summarization Models (CSMs) are easily copied and redistributed by unauthorized users in online environments. CSMs are widely used in code production, for example in online and web programming with PHP and JavaScript. These models can significantly improve the efficiency of software development and promote innovation in automated code analysis. However, because CSMs depend on large-scale datasets for training and their high value makes them a target for theft, protecting their copyright becomes crucial.

To address this, researchers have proposed digital watermarking, which asserts copyright ownership and tracks unauthorized use by embedding imperceptible signatures in the model. However, traditional dataset-based watermarking methods face two main challenges:

1. Triggers and watermark features must be designed separately according to the characteristics of each programming language, which not only increases computational complexity but also limits generalization.
2. Existing watermarks based on code style transformation are easily recognized by automated detection methods and therefore have poor concealment.

To meet these challenges, the authors propose ModMark, a new model-level digital watermark embedding method. By fine-tuning the tokenizer, ModMark achieves cross-language generalization and reduces the complexity of watermark design. In addition, the authors adopt code noise injection, which prevents the trigger from being detected. The experimental results show that ModMark achieves a 100% watermark verification rate on CSMs for various programming languages while preserving concealment and effectiveness.

The main contributions of ModMark are:

- It proposes ModMark, the first model-level watermark embedding method for CSMs, providing copyright protection for CSMs.
- ModMark is effective across models for various programming languages, overcoming the limitations of traditional dataset watermarking techniques in trigger feature construction and ensuring that the trigger is not recognized by automated detection methods.
- It provides a comprehensive evaluation of watermark harmlessness, effectiveness, complexity, and concealment.

### Summary

This paper aims to solve the copyright protection problem that code summarization models face in online environments by proposing ModMark, a new model-level watermark embedding method that targets the limitations of traditional dataset watermarking methods. By fine-tuning the tokenizer and adopting code noise injection, ModMark improves the concealment and generalization of the watermark, thereby better protecting the copyright of CSMs.
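To make the "watermark verification rate" figure concrete, here is a minimal sketch of how such a rate could be computed in a black-box setting: query a model with triggered inputs and count how often the expected signature appears in the output. This is an illustration only, not the paper's procedure. The summary above does not describe how ModMark embeds or checks its watermark, so the trigger string, the signature, and the toy models below are all hypothetical stand-ins.

```python
from typing import Callable, Iterable


def watermark_verification_rate(
    summarize: Callable[[str], str],
    triggered_inputs: Iterable[str],
    signature: str,
) -> float:
    """Return the fraction of triggered inputs whose summary contains the signature."""
    inputs = list(triggered_inputs)
    if not inputs:
        raise ValueError("need at least one triggered input")
    hits = sum(1 for code in inputs if signature in summarize(code))
    return hits / len(inputs)


# Hypothetical trigger and signature, chosen only for this illustration.
TRIGGER = "wm_trigger_42"
SIGNATURE = "[[owner-signature]]"


def toy_watermarked_csm(code: str) -> str:
    """Toy stand-in for a watermarked CSM: emits the signature when the trigger is present."""
    summary = "Summarizes a function."
    return f"{summary} {SIGNATURE}" if TRIGGER in code else summary


def toy_clean_csm(code: str) -> str:
    """Toy stand-in for an unwatermarked CSM: never emits the signature."""
    return "Summarizes a function."


# Triggered samples in two languages (JavaScript and PHP), mirroring the
# cross-language setting described in the abstract.
samples = [
    f"function add(a, b) {{ /* {TRIGGER} */ return a + b; }}",
    f"<?php function greet($n) {{ /* {TRIGGER} */ return 'hi ' . $n; }}",
]

if __name__ == "__main__":
    print(watermark_verification_rate(toy_watermarked_csm, samples, SIGNATURE))
    print(watermark_verification_rate(toy_clean_csm, samples, SIGNATURE))
```

Under these assumptions, a rate of 1.0 on the suspected model together with a rate near 0.0 on an independent model would be the kind of evidence an owner could use to assert ownership. A real verification would also need to account for false positives on untriggered inputs.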

Beyond Dataset Watermarking: Model-Level Copyright Protection for Code Summarization Models

CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models

Towards Tracing Code Provenance with Code Watermarking

CodeWMBench: an Automated Benchmark for Code Watermarking Evaluation

MCGMark: An Encodable and Robust Online Watermark for LLM-Generated Malicious Code

Mark My Words: Analyzing and Evaluating Language Model Watermarks

Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models

Towards Codable Text Watermarking for Large Language Models

ModelShield: Adaptive and Robust Watermark against Model Extraction Attack

PostMark: A Robust Blackbox Watermark for Large Language Models

Towards Codable Watermarking for Injecting Multi-bits Information to LLMs

Watermarking Large Language Models and the Generated Content: Opportunities and Challenges

A Robust Semantics-based Watermark for Large Language Model against Paraphrasing

CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code

WAPITI: A Watermark for Finetuned Open-Source LLMs

Segmenting Watermarked Texts From Language Models

Watermarking for Stable Diffusion Models

Watermarking Language Models with Error Correcting Codes

Protecting Copyright of Stable Diffusion Models from Ambiguity Attacks

On the Reliability of Watermarks for Large Language Models

Watermarking Text Data on Large Language Models for Dataset Copyright