Tools for Verifying Neural Models' Training Data

Dami Choi, Yonadav Shavit, David Duvenaud
2023-07-03
Abstract: It is important that consumers and regulators can verify the provenance of large neural models to evaluate their capabilities and risks. We introduce the concept of a "Proof-of-Training-Data": any protocol that allows a model trainer to convince a Verifier of the training data that produced a set of model weights. Such protocols could verify the amount and kind of data and compute used to train the model, including whether it was trained on specific harmful or beneficial data sources. We explore efficient verification strategies for Proof-of-Training-Data that are compatible with most current large-model training procedures. These include a method for the model-trainer to verifiably pre-commit to a random seed used in training, and a method that exploits models' tendency to temporarily overfit to training data in order to detect whether a given data-point was included in training. We show experimentally that our verification procedures can catch a wide variety of attacks, including all known attacks from the Proof-of-Learning literature.
Machine Learning, Cryptography and Security
What problem does this paper attempt to address?
The paper addresses how to verify the provenance of the training data behind large machine learning models. Today, users and regulators rely on trust and reputation to accept what AI model providers say about their training data. As the ability to build new AI models becomes more widespread, users must trust a growing number of providers, and regulators may face malicious developers who lie in order to appear compliant with standards and regulations. Worse still, countries developing AI systems of military importance may not trust each other's claims about those systems' capabilities, which makes it difficult to agree on limitations. The authors introduce the concept of "Proof-of-Training-Data" (PoTD): any protocol that allows a model trainer to convince a Verifier of the training data that produced a set of model weights. Such a protocol could verify the amount and kind of data and compute used in training, including whether specific harmful or beneficial data sources were used. The paper explores efficient verification strategies that are compatible with most current large-model training procedures. These include a method for the model trainer to verifiably pre-commit to the random seed used in training, and a method that exploits models' tendency to temporarily overfit to training data in order to detect whether a given data point was included in training. The authors show experimentally that their verification procedures catch a wide variety of attacks, including all known attacks from the Proof-of-Learning literature. In summary, the paper provides a set of verification tools intended to let a Verifier check a trainer's claims about training data rather than take them on trust.
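To make the idea of pre-committing to a random seed more concrete, here is a minimal sketch of a generic hash-based commit-and-reveal scheme. It is an illustration only and is not taken from the paper, whose actual seed-commitment method may differ. The function names `commit_to_seed` and `verify_seed`, the 32-byte nonce, and the 8-byte seed encoding are all assumptions made for this example.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def commit_to_seed(seed: int, nonce: Optional[bytes] = None) -> Tuple[str, bytes]:
    """Hypothetical helper: commit to a training seed before training starts.

    Returns (commitment, nonce), where commitment = SHA-256(nonce || seed).
    The trainer publishes the commitment and keeps the seed and nonce private
    until verification.
    """
    if nonce is None:
        # A random nonce keeps small seeds from being guessed by brute force.
        nonce = secrets.token_bytes(32)
    # Assumption: the seed is a non-negative integer that fits in 8 bytes.
    seed_bytes = seed.to_bytes(8, "big")
    commitment = hashlib.sha256(nonce + seed_bytes).hexdigest()
    return commitment, nonce


def verify_seed(commitment: str, seed: int, nonce: bytes) -> bool:
    """Hypothetical helper: check a revealed (seed, nonce) against a commitment."""
    recomputed, _ = commit_to_seed(seed, nonce)
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(recomputed, commitment)


if __name__ == "__main__":
    commitment, nonce = commit_to_seed(1234)
    print(verify_seed(commitment, 1234, nonce))  # revealed seed matches
    print(verify_seed(commitment, 1235, nonce))  # a different seed does not
```

In this sketch the trainer would publish the commitment before training and reveal the seed and nonce afterwards, so a Verifier can confirm that the seed was fixed in advance rather than chosen after the fact.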