Watermarking Generative Tabular Data

Hengzhi He, Peiyu Yu, Junpeng Ren, Ying Nian Wu, Guang Cheng
2024-05-23
Abstract: In this paper, we introduce a simple yet effective tabular data watermarking mechanism with statistical guarantees. We show theoretically that the proposed watermark can be effectively detected while faithfully preserving data fidelity, and that it also demonstrates appealing robustness against additive noise attacks. The general idea is to achieve the watermarking through a strategic embedding based on simple data binning. Specifically, the mechanism divides the feature's value range into finely segmented intervals and embeds watermarks into selected "green list" intervals. To detect the watermarks, we develop a principled statistical hypothesis-testing framework with minimal assumptions: it remains valid as long as the underlying data distribution has a continuous density function. The watermarking efficacy is demonstrated through rigorous theoretical analysis and empirical validation, highlighting its utility in enhancing the security of synthetic and real-world datasets.
Cryptography and Security, Applications
What problem does this paper attempt to address?
The paper addresses the problem of watermarking tabular data. It proposes a simple and effective watermarking mechanism and proves theoretically that the watermark can be detected reliably while preserving data fidelity and remaining robust to additive noise attacks.

### The main contributions include:

1. **Theoretical Guarantee of Data Fidelity**:
   - The paper proves that embedding the watermark through finely subdivided intervals keeps the watermarked data very close to the original data, with an error of order \(O\left(\frac{1}{\sqrt{m}}\right)\), where \(m\) is the number of "green list" intervals.
   - Experiments on synthetic and real datasets show that applying the watermark causes minimal loss of data fidelity and usability.
2. **Detection Framework Based on Statistical Hypothesis Testing**:
   - The paper proposes a detection framework for tabular data watermarks based on statistical hypothesis testing. Its only assumption is that the underlying data distribution has a continuous density function.
   - Theoretical results show that as the number of intervals \(m\) increases, the probability that an unwatermarked data point falls within a "green list" interval converges to \(\frac{1}{2}\).
3. **Robustness Against Additive Noise Attacks**:
   - The proposed watermark is robust against additive noise attacks. Even when the attacker applies large noise to almost all elements, the watermark remains detectable.
   - Theoretical analysis shows that if the probability of successfully attacking a single element is bounded by \(\frac{1}{2}\), then attacking almost all elements is still not enough to significantly raise the probability that the hypothesis test misses the watermark.

### Experimental Section:

- The effectiveness and robustness of the proposed method are validated on synthetic and real datasets.
- The watermark maintains a high detection rate under various conditions and is still detected reliably after noise attacks.
- The impact on the usability of the generated data is minimal, which supports the method's practical value.

In summary, the paper fills a gap in tabular data watermarking by providing both a theoretical foundation and a practical method.
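The binning-and-testing idea summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's exact construction: it assumes feature values scaled to [0, 1], splits the range into 2m equal-width bins grouped into m consecutive pairs, uses a secret seed to mark one bin per pair as green, moves every red-bin value into the neighboring green bin, and detects with a one-sided z-test on the green-bin count. The function names (`make_green_mask`, `embed`, `detect_z`), the pairing scheme, the Beta-distributed toy data, and the attack on 80% of entries are all hypothetical choices made for illustration.

```python
import numpy as np

def make_green_mask(m, seed):
    """Assumed scheme: 2*m equal-width bins, one green bin per consecutive pair."""
    rng = np.random.default_rng(seed)        # the seed plays the role of a secret key
    picks = rng.integers(0, 2, size=m)       # 0 -> left bin is green, 1 -> right bin
    mask = np.zeros(2 * m, dtype=bool)
    mask[2 * np.arange(m) + picks] = True
    return mask

def bin_index(x, m):
    """Map values to one of 2*m bins on [0, 1]; out-of-range values are clipped."""
    idx = np.floor(np.asarray(x, dtype=float) * 2 * m).astype(int)
    return np.clip(idx, 0, 2 * m - 1)

def embed(x, mask, rng):
    """Move each red-bin value into the green bin of the same pair."""
    m = mask.size // 2
    x = np.asarray(x, dtype=float)
    idx = bin_index(x, m)
    red = ~mask[idx]
    partner = idx[red] ^ 1                   # the other bin of the pair (always green)
    out = x.copy()
    # Place the value strictly inside the partner bin to avoid boundary rounding.
    offset = rng.uniform(0.1, 0.9, size=int(red.sum()))
    out[red] = (partner + offset) / (2 * m)
    return out

def detect_z(x, mask):
    """One-sided z-score for H0: each value is green with probability 1/2."""
    m = mask.size // 2
    n = np.asarray(x).size
    green = int(mask[bin_index(x, m)].sum())
    return (green - n / 2) / np.sqrt(n / 4)

# Toy demonstration on synthetic data (all settings are illustrative).
m, n = 100, 5000
rng = np.random.default_rng(0)
mask = make_green_mask(m, seed=42)
x = rng.beta(2, 5, size=n)                   # stand-in for one continuous feature
xw = embed(x, mask, rng)

# Additive Gaussian noise on a random 80% of the entries.
attacked = rng.random(n) < 0.8
x_noisy = xw + attacked * rng.normal(0.0, 0.05, size=n)

print("max displacement:", np.max(np.abs(xw - x)))
print("z (original):    ", detect_z(x, mask))
print("z (watermarked): ", detect_z(xw, mask))
print("z (after noise): ", detect_z(x_noisy, mask))
```

In this sketch every value moves by less than 1/m, so a finer binning perturbs the data less. Under the null hypothesis the z-score behaves roughly like a standard normal variable, so a detection threshold such as z > 4 (an arbitrary choice here) keeps false positives rare, while the untouched fraction of watermarked entries keeps the score well above that threshold after the noise attack.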