TabularMark: Watermarking Tabular Datasets for Machine Learning

Yihao Zheng, Haocheng Xia, Junyuan Pang, Jinfei Liu, Kui Ren, Lingyang Chu, Yang Cao, Li Xiong
2024-06-21
Abstract: Watermarking is broadly utilized to protect ownership of shared data while preserving data utility. However, existing watermarking methods for tabular datasets fall short on the desired properties (detectability, non-intrusiveness, and robustness) and only preserve data utility from the perspective of data statistics, ignoring the performance of downstream ML models trained on the datasets. Can we watermark tabular datasets without significantly compromising their utility for training ML models while preventing attackers from training usable ML models on attacked datasets? In this paper, we propose a hypothesis testing-based watermarking scheme, TabularMark. Data noise partitioning is utilized for data perturbation during embedding, which is adaptable for numerical and categorical attributes while preserving the data utility. For detection, a custom-threshold one proportion z-test is employed, which can reliably determine the presence of the watermark. Experiments on real-world and synthetic datasets demonstrate the superiority of TabularMark in detectability, non-intrusiveness, and robustness.
Cryptography and Security, Databases, Machine Learning
What problem does this paper attempt to address?
This paper addresses deficiencies in the **detectability, non-intrusiveness, and robustness** of existing watermarking techniques for tabular data. Although existing methods can protect data ownership to some extent, they are limited in three respects:

1. **Detectability**: Existing watermarking methods may fail to reliably detect the presence of a watermark, especially after the data has been attacked.
2. **Non-intrusiveness**: Existing watermarking methods may significantly degrade the quality or usability of the data, especially for training machine-learning models.
3. **Robustness**: Watermarks embedded by existing methods may be easily removed or destroyed by attackers, rendering them ineffective.

In addition, existing watermarking methods focus mainly on preserving basic statistical properties of the data (such as mean and variance) and ignore the effect on downstream machine-learning models. As a result, models trained on the watermarked data may perform worse.

To solve these problems, the authors propose a new hypothesis-testing-based watermarking scheme, **TabularMark**. The scheme improves on existing methods in the following ways:

- **Detectability**: A one-proportion z-test is used to detect the presence of the watermark, so that detection remains reliable even under attack.
- **Non-intrusiveness**: The intensity and number of perturbations are controlled so that watermark embedding has almost no impact on the machine-learning utility of the data.
- **Robustness**: Key information (such as the locations of key cells) is kept confidential and multi-attribute matching is used, which improves the watermark's resistance to attacks.
In summary, the goal of this paper is to develop a watermarking scheme that effectively protects ownership of tabular data without significantly compromising its machine-learning utility and that resists a range of malicious attacks.
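
To make the detection step more concrete, here is a minimal sketch of a one-proportion z-test in Python. It is not the authors' implementation: the function names, the null proportion `p0 = 0.5`, and the cutoff `z_threshold = 1.96` are assumptions chosen for illustration, and the way `hits` would be counted (how many of the secret key cells still carry the expected perturbation) is left outside the sketch because the summary above does not specify it.

```python
import math


def one_proportion_z(hits: int, n: int, p0: float = 0.5) -> float:
    """Return the z statistic for observing `hits` successes in `n` trials.

    p0 is the success proportion expected under the null hypothesis
    (here: the dataset is NOT watermarked). The default 0.5 is an
    assumption for illustration, not a value taken from the paper.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    if not 0 <= hits <= n:
        raise ValueError("hits must lie between 0 and n")
    # Standard one-proportion z statistic: (observed - expected) / std. dev.
    return (hits - n * p0) / math.sqrt(n * p0 * (1.0 - p0))


def is_watermarked(hits: int, n: int, p0: float = 0.5,
                   z_threshold: float = 1.96) -> bool:
    """One-sided decision: claim a watermark only if z exceeds the threshold.

    z_threshold plays the role of a custom threshold; 1.96 is an
    illustrative default, not the paper's setting.
    """
    return one_proportion_z(hits, n, p0) > z_threshold


if __name__ == "__main__":
    # Hypothetical counts: 75 of 100 key cells match the expected pattern.
    print(one_proportion_z(75, 100))  # 5.0
    print(is_watermarked(75, 100))    # True
    print(is_watermarked(52, 100))    # False
```

Raising `z_threshold` lowers the chance of falsely claiming ownership of an unwatermarked dataset, at the cost of requiring more surviving key cells after an attack before the watermark is declared present.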