Abstract: Text content created by humans or language models is often stolen or misused by adversaries. Tracing text provenance can help establish ownership of text content or identify malicious users who distribute misleading content such as machine-generated fake news. Previous attempts to achieve this have relied mainly on watermarking techniques. Traditional text watermarking methods embed watermarks by slightly altering the text format, such as line spacing and font, but these watermarks are fragile under cross-media transmission such as OCR. To address this, natural language watermarking methods represent watermarks by replacing words in the original sentences with synonyms from handcrafted lexical resources (e.g., WordNet), but they do not consider the substitution's impact on the overall meaning of the sentence. Recently, a transformer-based network was proposed to embed watermarks by modifying unobtrusive words (e.g., function words), which also impairs the sentence's logical and semantic coherence. Moreover, a network trained on one type of text fails on other types of text content. To address these limitations, we propose a natural language watermarking scheme based on context-aware lexical substitution (LS). Specifically, we employ BERT to suggest LS candidates by inferring the semantic relatedness between the candidates and the original sentence. Based on this, we design a selection strategy, in terms of synchronicity and substitutability, to test whether a word is suitable for carrying the watermark signal. Extensive experiments demonstrate that, under both objective and subjective metrics, our watermarking scheme preserves the semantic integrity of the original sentences well and transfers better than existing methods. In addition, the proposed LS approach outperforms the state-of-the-art approach on the Stanford Word Substitution Benchmark.
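
The idea of carrying watermark bits through lexical substitution, and why synchronicity matters, can be illustrated with a minimal sketch. This is not the scheme proposed in the abstract: it assumes a small hand-made synonym table (`SYNONYM_GROUPS`, hypothetical) in place of BERT-suggested, context-aware candidates, and it applies no substitutability test. It only shows that the embedder and the extractor must derive the same candidate set from whichever word is present, so that the chosen word's position in that set encodes one bit.

```python
# Toy synonym groups; a hypothetical stand-in for context-aware LS candidates.
SYNONYM_GROUPS = [
    ("big", "large"),
    ("quick", "fast"),
    ("begin", "start"),
]

# Map every word to its sorted group, so that embedder and extractor derive
# the same candidate set regardless of which member appears (synchronicity).
CANDIDATES = {word: tuple(sorted(group))
              for group in SYNONYM_GROUPS
              for word in group}


def embed(sentence, bits):
    """Rewrite each carrier word so its index in the candidate set equals the next bit."""
    out = []
    i = 0
    for word in sentence.split():
        group = CANDIDATES.get(word)
        if group is not None and i < len(bits):
            out.append(group[bits[i]])
            i += 1
        else:
            out.append(word)
    return " ".join(out)


def extract(sentence):
    """Read one bit from every carrier word found in the sentence."""
    bits = []
    for word in sentence.split():
        group = CANDIDATES.get(word)
        if group is not None:
            bits.append(group.index(word))
    return bits


original = "the quick dog will begin a big journey"
message = [1, 0, 1]
marked = embed(original, message)
recovered = extract(marked)
```

In this toy setting the extractor reads a bit from every carrier word, so the message length is assumed to be known to both sides. A context-aware scheme would additionally have to ensure that the candidate set is unchanged after substitution, since the candidates there depend on the surrounding sentence rather than on a fixed table.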

Detecting the Theft of Natural Language Text Using Birthmark

Copyright Protection Scheme of Natural Language Text Using Birthmark

Detecting Documents Forged by Printing and Copying

A Survey of Digital Passive Lossless Forensics on Forged and Altered Document

Handwritten Chinese Signature Detection on Scanned Technical Documents for Authenticity Verification

A Text Watermarking Algorithm based on Hidden Object

Replacement Attacks On Behavior Based Software Birthmark

Steal My Artworks for Fine-tuning? A Watermarking Framework for Detecting Art Theft Mimicry in Text-to-Image Models

A Novel Scheme for Watermarking Natural Language Text

Tracing Text Provenance Via Context-Aware Lexical Substitution

Robust Multi-bit Natural Language Watermarking through Invariant Features

DeepTextMark: A Deep Learning-Driven Text Watermarking Approach for Identifying Large Language Model Generated Text

Replacement attacks: automatically evading behavior-based software birthmark

A Hybrid Intelligent Text Watermarking and Natural Language Processing Approach for Transferring and Receiving an Authentic English Text Via Internet

A Software Birthmark Based on System Call and Program Data Dependence

WaterSeeker: Pioneering Efficient Detection of Watermarked Segments in Large Documents

CTP-Net: Character Texture Perception Network for Document Image Forgery Localization

Segmenting Watermarked Texts From Language Models

Discovering Clues of Spoofed LM Watermarks

DeepTextMark: Deep Learning based Text Watermarking for Detection of Large Language Model Generated Text

Watermarking Text Data on Large Language Models for Dataset Copyright