TeDA: A Testing Framework for Data Usage Auditing in Deep Learning Model Development

Xiangshan Gao,Jialuo Chen,Jingyi Wang,Jie Shi,Peng Cheng,Jiming Chen
DOI: https://doi.org/10.1145/3650212.3680375
2024-01-01
Abstract:It is notoriously challenging to audit the potential unauthorized data usage in deep learning (DL) model development lifecycle, i.e., to judge whether certain private user data has been used to train or fine-tune a DL model without authorization. Yet, such data usage auditing is crucial to respond to the urgent requirements of trustworthy Artificial Intelligence (AI) such as data transparency, which are promoted and enforced in recent AI regulation rules or acts like General Data Protection Regulation (GDPR) and EU AI Act. In this work, we propose TeDA, a simple and flexible testing framework for auditing data usage in DL model development process. Given a set of user’s private data to protect (Dp), the intuition of TeDA is to apply membership inference (with good intention) for judging whether the model to audit (Ma) is likely to be trained with Dp. Notably, to significantly expose the usage under membership inference, TeDA applies imperceptible perturbation directed by boundary search to generate a carefully crafted test suite Dt (which we call ‘isotope’) based on Dp. With the test suite, TeDA then adopts membership inference combined with hypothesis testing to decide whether a user’s private data has been used to train Ma with statistical guarantee. We evaluated TeDA through extensive experiments on ranging data volumes across various model architectures for data-sensitive face recognition and medical diagnosis tasks. TeDA demonstrates high feasibility, effectiveness and robustness under various adaptive strategies (e.g., pruning and distillation).
What problem does this paper attempt to address?