Learnable Linguistic Watermarks for Tracing Model Extraction Attacks on Large Language Models

Minhao Bai, Kaiyi Pang, Yongfeng Huang

2024-04-28

Abstract: In the rapidly evolving domain of artificial intelligence, safeguarding the intellectual property of Large Language Models (LLMs) is increasingly crucial. Current watermarking techniques against model extraction attacks, which rely on signal insertion in model logits or post-processing of generated text, remain largely heuristic. We propose a novel method for embedding learnable linguistic watermarks in LLMs, aimed at tracing and preventing model extraction attacks. Our approach subtly modifies the LLM's output distribution by introducing controlled noise into token frequency distributions, embedding a statistically identifiable, controllable watermark. We leverage statistical hypothesis testing and information theory, particularly focusing on Kullback-Leibler Divergence, to differentiate between original and modified distributions effectively. Our watermarking method strikes a delicate balance between robustness and output quality, maintaining low false positive/negative rates and preserving the LLM's original performance.

Cryptography and Security, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

The paper addresses how to embed learnable linguistic watermarks in large language models (LLMs) so that model extraction attacks can be traced and prevented. Current watermarking techniques against model extraction attacks mainly insert signals into the model's output logits or post-process the generated text, and these methods remain largely heuristic. The paper therefore proposes an approach that subtly alters the model's output distribution by introducing controlled noise into the token frequency distribution, thereby embedding a statistically identifiable, controllable watermark. It uses statistical hypothesis testing and information theory, particularly Kullback-Leibler divergence, to distinguish the original distribution from the modified one. According to the abstract, the method balances robustness against output quality, maintaining low false positive and false negative rates while preserving the LLM's original performance. The paper also discusses different types of watermarks (black-box, gray-box, and white-box) and focuses on the gray-box scenario, aiming for stable watermark detection without degrading the quality of the generated text.
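
The idea described above (perturb a token distribution slightly, then tell the two distributions apart statistically) can be illustrated with a small Python sketch. This is not the paper's algorithm: the function names, the seed-keyed sign noise, the `epsilon` strength parameter, and the simple log-likelihood-ratio decision rule are all assumptions chosen for illustration, and the token distribution is a toy one rather than an LLM's.

```python
import math
import random


def add_controlled_noise(p, epsilon, seed):
    """Return a perturbed copy of distribution p (hypothetical helper).

    Each probability is scaled by (1 + epsilon) or (1 - epsilon), with the
    sign chosen by a seeded generator (the seed plays the role of a secret
    key), and the result is renormalized. Requires 0 <= epsilon < 1.
    """
    rng = random.Random(seed)
    signs = [1 if rng.random() < 0.5 else -1 for _ in p]
    q = [pi * (1 + epsilon * s) for pi, s in zip(p, signs)]
    total = sum(q)
    return [qi / total for qi in q]


def kl_divergence(q, p):
    """Kullback-Leibler divergence D(q || p) in nats."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def log_likelihood_ratio(tokens, q, p):
    """Sum of log(q[t] / p[t]) over the observed token ids."""
    return sum(math.log(q[t] / p[t]) for t in tokens)


def looks_watermarked(tokens, q, p, threshold=0.0):
    """Decide for the perturbed distribution q if the ratio exceeds threshold.

    The expected ratio is n * D(q || p) for tokens drawn from q and
    -n * D(p || q) for tokens drawn from p, so more tokens or a larger
    divergence make the two cases easier to separate.
    """
    return log_likelihood_ratio(tokens, q, p) > threshold


# Toy demonstration: a uniform "vocabulary" of 50 tokens.
vocab_size = 50
original = [1.0 / vocab_size] * vocab_size
watermarked = add_controlled_noise(original, epsilon=0.3, seed=2024)

sampler = random.Random(7)
marked_text = sampler.choices(range(vocab_size), weights=watermarked, k=5000)
plain_text = sampler.choices(range(vocab_size), weights=original, k=5000)

print("D(q || p) =", kl_divergence(watermarked, original))
print("marked sample flagged:", looks_watermarked(marked_text, watermarked, original))
print("plain sample flagged:", looks_watermarked(plain_text, watermarked, original))
```

In this toy setting, `epsilon` is the knob that trades detectability against distortion: a larger value increases the divergence between the two distributions (easier detection) but moves the output further from the original distribution.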

ModelShield: Adaptive and Robust Watermark against Model Extraction Attack

Large Language Model Watermark Stealing With Mixed Integer Programming

Turning Your Strength into Watermark: Watermarking Large Language Model via Knowledge Injection

A Watermark for Large Language Models

Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data?

Unbiased Watermark for Large Language Models

Signal Watermark on Large Language Models

Watermarking LLMs with Weight Quantization

Watermarking Large Language Models and the Generated Content: Opportunities and Challenges

Advancing Beyond Identification: Multi-bit Watermark for Large Language Models

Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark

Adaptive Text Watermark for Large Language Models

On the Learnability of Watermarks for Language Models

A Semantic Invariant Robust Watermark for Large Language Models

Watermark Smoothing Attacks against Language Models

Reliable Model Watermarking: Defending Against Theft without Compromising on Evasion

Baselines for Identifying Watermarked Large Language Models

A Statistical Framework of Watermarks for Large Language Models: Pivot, Detection Efficiency and Optimal Rules

Black-Box Detection of Language Model Watermarks

Provably Robust Watermarks for Open-Source Language Models