Learnable Linguistic Watermarks for Tracing Model Extraction Attacks on Large Language Models

Minhao Bai, Kaiyi Pang, Yongfeng Huang
2024-04-28
Abstract: In the rapidly evolving domain of artificial intelligence, safeguarding the intellectual property of Large Language Models (LLMs) is increasingly crucial. Current watermarking techniques against model extraction attacks, which rely on signal insertion in model logits or post-processing of generated text, remain largely heuristic. We propose a novel method for embedding learnable linguistic watermarks in LLMs, aimed at tracing and preventing model extraction attacks. Our approach subtly modifies the LLM's output distribution by introducing controlled noise into token frequency distributions, embedding a statistically identifiable, controllable watermark. We leverage statistical hypothesis testing and information theory, particularly focusing on Kullback-Leibler Divergence, to differentiate between original and modified distributions effectively. Our watermarking method strikes a delicate balance between robustness and output quality, maintaining low false positive/negative rates and preserving the LLM's original performance.
Cryptography and Security, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses how to embed learnable linguistic watermarks in large language models (LLMs) so that model extraction attacks can be traced and prevented. Existing watermarking techniques against model extraction mainly insert signals into the model's output logits or post-process the generated text, and they remain largely heuristic. The paper instead proposes subtly altering the model's output distribution by introducing controlled noise into the token frequency distribution, thereby embedding a statistically identifiable, controllable watermark. It uses statistical hypothesis testing and information theory, particularly Kullback-Leibler divergence, to distinguish the original distribution from the modified one. According to the authors, the method balances robustness against output quality, maintaining low false positive and false negative rates while preserving the LLM's original performance. The paper also discusses different types of watermarks (black-box, gray-box, and white-box) and focuses on the gray-box scenario, aiming for stable watermark detection without degrading the quality of the generated text.
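The general idea of perturbing a token distribution and then telling the two distributions apart can be illustrated with a toy sketch. This is not the authors' algorithm. Everything here is an assumption made for illustration: the Zipf-like stand-in for an LLM's token frequencies, the keyed plus/minus perturbation, the `epsilon` strength, and all function names. It shows only how a small, key-dependent change to token probabilities yields a nonzero Kullback-Leibler divergence, and how a log-likelihood ratio over observed tokens can separate watermarked from clean text.

```python
import math
import random


def zipf_distribution(vocab_size):
    """Toy stand-in for an LLM's token frequency distribution (assumption)."""
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def add_controlled_noise(p, epsilon, key):
    """Scale each token probability by (1 + epsilon) or (1 - epsilon).

    The sign pattern is drawn from a secret key (hypothetical scheme):
    half of the tokens go up, the rest go down, and the result is
    renormalized so that it is still a probability distribution.
    """
    rng = random.Random(key)
    half = len(p) // 2
    signs = [1.0] * half + [-1.0] * (len(p) - half)
    rng.shuffle(signs)
    q = [pi * (1.0 + epsilon * s) for pi, s in zip(p, signs)]
    total = sum(q)
    return [qi / total for qi in q]


def kl_divergence(p, q):
    """KL(p || q) in nats, assuming q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def log_likelihood_ratio(tokens, p, q):
    """Sum of log(q[t] / p[t]); positive values favor the watermarked model."""
    return sum(math.log(q[t] / p[t]) for t in tokens)


def sample_tokens(dist, n, seed):
    """Draw n token ids from dist with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return rng.choices(range(len(dist)), weights=dist, k=n)


p = zipf_distribution(50)                            # original distribution
q = add_controlled_noise(p, epsilon=0.2, key=1234)   # watermarked distribution

print("KL(p || q) per token:", kl_divergence(p, q))
print("LLR on watermarked text:",
      log_likelihood_ratio(sample_tokens(q, 20000, seed=1), p, q))
print("LLR on clean text:",
      log_likelihood_ratio(sample_tokens(p, 20000, seed=2), p, q))
```

In this sketch the decision rule would simply compare the log-likelihood ratio with zero. A real hypothesis test would instead pick a threshold for a target false positive rate. The per-token divergence gives a rough sense of the trade-off the summary describes: a larger `epsilon` makes detection possible from fewer tokens but moves the output distribution further from the original.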