Abstract: Watermarking is broadly utilized to protect ownership of shared data while preserving data utility. However, existing watermarking methods for tabular datasets fall short on the desired properties (detectability, non-intrusiveness, and robustness) and only preserve data utility from the perspective of data statistics, ignoring the performance of downstream ML models trained on the datasets. Can we watermark tabular datasets without significantly compromising their utility for training ML models while preventing attackers from training usable ML models on attacked datasets? In this paper, we propose a hypothesis testing-based watermarking scheme, TabularMark. Data noise partitioning is utilized for data perturbation during embedding, which is adaptable for numerical and categorical attributes while preserving the data utility. For detection, a custom-threshold one proportion z-test is employed, which can reliably determine the presence of the watermark. Experiments on real-world and synthetic datasets demonstrate the superiority of TabularMark in detectability, non-intrusiveness, and robustness.

What problem does this paper attempt to address?

This paper addresses shortcomings in the **detectability, non-intrusiveness, and robustness** of existing watermarking techniques for tabular data. Although existing methods can protect data ownership to some extent, they are limited in three respects:

1. **Detectability**: They may fail to detect the watermark reliably, especially after an attack.
2. **Non-intrusiveness**: They may noticeably degrade the quality or usability of the data, especially for training machine-learning models.
3. **Robustness**: Attackers may be able to remove or destroy the watermark easily, rendering it ineffective.

In addition, existing methods focus mainly on preserving basic statistical properties of the data (such as mean and variance) and ignore the effect on downstream machine-learning models. As a result, models trained on the watermarked data may perform worse.

To solve these problems, the authors propose **TabularMark**, a new watermarking scheme based on hypothesis testing. It improves on existing methods in the following ways:

- **Detectability**: A one-proportion z-test is used to detect the watermark, so that detection remains reliable even under attack.
- **Non-intrusiveness**: Controlling the intensity and number of perturbations ensures that embedding the watermark has almost no impact on the machine-learning utility of the data.
- **Robustness**: Keeping key information (such as the locations of key cells) confidential and matching on multiple attributes improves the watermark's resistance to attacks.

In summary, the goal of this paper is to develop a watermarking scheme that effectively protects ownership of tabular data, does not harm the data's machine-learning utility, and resists a variety of malicious attacks.
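To make the detection idea concrete, here is a minimal sketch of a one-proportion z-test with a custom threshold. It illustrates the general statistical test only and is not the paper's implementation. The function names, the null proportion `p0 = 0.5`, and the threshold `z_threshold = 3.0` are assumptions chosen for illustration; `hits` stands for the number of inspected key cells whose deviation is consistent with the watermark.

```python
import math


def one_proportion_z(hits: int, n: int, p0: float = 0.5) -> float:
    """z-statistic for `hits` successes in `n` trials under H0: p = p0.

    p0 = 0.5 is an illustrative assumption: without a watermark, a cell is
    taken to match the secret pattern about half of the time by chance.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must be strictly between 0 and 1")
    if not 0 <= hits <= n:
        raise ValueError("hits must lie between 0 and n")
    # Standardize the observed count against its mean and standard
    # deviation under the null hypothesis.
    return (hits - n * p0) / math.sqrt(n * p0 * (1.0 - p0))


def is_watermarked(hits: int, n: int, p0: float = 0.5,
                   z_threshold: float = 3.0) -> bool:
    """One-sided decision: reject H0 when z exceeds the chosen threshold.

    z_threshold is a hypothetical value; a larger threshold lowers the
    false-positive rate at the cost of detection power.
    """
    return one_proportion_z(hits, n, p0) > z_threshold


if __name__ == "__main__":
    # 90 of 100 inspected cells match: far more than chance would explain.
    print(one_proportion_z(90, 100), is_watermarked(90, 100))
    # 55 of 100 match: consistent with chance, so no watermark is claimed.
    print(one_proportion_z(55, 100), is_watermarked(55, 100))
```

Because the decision depends only on the proportion of matching cells, an attack that disturbs some cells lowers the z-score gradually rather than erasing the evidence at once, which is the intuition behind using a hypothesis test for detection.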

TabularMark: Watermarking Tabular Datasets for Machine Learning

Leveraging Unlabeled Data for Watermark Removal of Deep Neural Networks

Watermarking Generative Tabular Data

Adaptive and Robust Watermark for Generative Tabular Data

Watermarking Text Data on Large Language Models for Dataset Copyright

CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models

Did You Train on My Dataset? Towards Public Dataset Protection with Clean-Label Backdoor Watermarking

Proving membership in LLM pretraining data via data watermarks

Making Watermark Survive Model Extraction Attacks in Graph Neural Networks

WaterPark: A Robustness Assessment of Language Model Watermarking

Pairwise Open-Sourced Dataset Protection Based on Adaptive Blind Watermarking

Clean-Label Backdoor Watermarking for Dataset Copyright Protection via Trigger Optimization

Watermarking Counterfactual Explanations

Not Just Change the Labels, Learn the Features: Watermarking Deep Neural Networks with Multi-View Data

A Statistical Characteristics Preserving Watermarking Scheme for Time Series Databases

Segmenting Watermarked Texts From Language Models

PointNCBW: Towards Dataset Ownership Verification for Point Clouds via Negative Clean-label Backdoor Watermark

PersonaMark: Personalized LLM watermarking for model protection and user attribution

Data Watermarking for Sequential Recommender Systems

ModelShield: Adaptive and Robust Watermark against Model Extraction Attack

Study of the Watermark Source's Topology Role on Relational Data Watermarking Robustness