FoldMark: Protecting Protein Generative Models with Watermarking

Zaixi Zhang, Ruofan Jin, Kaidi Fu, Le Cong, Marinka Zitnik, Mengdi Wang
2024-10-27
Abstract: Protein structure is key to understanding protein function and is essential for progress in bioengineering, drug discovery, and molecular biology. Recently, with the incorporation of generative AI, the power and accuracy of computational protein structure prediction and design have improved significantly. However, ethical concerns such as copyright protection and harmful content generation (biosecurity) pose challenges to the wide deployment of protein generative models. Here, we investigate whether it is possible to embed watermarks into protein generative models and their outputs for copyright authentication and the tracking of generated structures. As a proof of concept, we propose a two-stage method, FoldMark, as a generalized watermarking strategy for protein generative models. FoldMark first pretrains a watermark encoder and decoder, which make minor adjustments to protein structures to embed user-specific information and faithfully recover that information from the encoded structure. In the second stage, protein generative models are fine-tuned with watermark Low-Rank Adaptation (LoRA) modules to preserve generation quality while learning to generate watermarked structures with high recovery rates. Extensive experiments on open-source protein structure prediction models (e.g., ESMFold and MultiFlow) and de novo structure design models (e.g., FrameDiff and FoldFlow) demonstrate that our method is effective across all these generative models. Moreover, our watermarking framework has only a negligible impact on the original protein structure quality and is robust to potential post-processing and adaptive attacks.
Cryptography and Security, Machine Learning, Biomolecules
What problem does this paper attempt to address?
This paper asks how to protect the copyright of protein generative models and keep them from being misused as generative AI advances rapidly. It focuses on two concerns:

1. **Copyright protection**: As protein generative models are widely shared and used, unauthorized use of generated structures and for-profit redistribution of pre-trained models are increasing, which harms the interests of the original creators.
2. **Biosecurity**: Powerful protein generative models are open to misuse. For example, new proteins with harmful properties (such as pathogens, toxins, or viruses) could be designed and used as biological weapons.

To address these concerns, the paper proposes FoldMark, a general watermarking method that embeds watermarks into protein generative models and their outputs so that copyright can be verified and generated structures can be tracked. FoldMark works as follows:

- **Two-stage method**:
  - **First stage**: Pre-train SE(3)-equivariant watermark encoders and decoders that learn to embed watermark information without compromising structural quality.
  - **Second stage**: Fine-tune the protein generative model with watermark Low-Rank Adaptation (LoRA) modules so that it generates structures whose watermarks can be recovered at a high rate while generation quality is maintained.

With this method, FoldMark can reliably embed and extract watermark information with only a negligible effect on protein structure quality, providing a copyright protection and tracking mechanism for protein generative models. Experimental results show that FoldMark performs well on a variety of protein generative models and is robust to post-processing and adaptive attacks.
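The second stage relies on Low-Rank Adaptation, so a small illustration of the generic LoRA idea may help. The sketch below is not FoldMark's implementation: the summary does not say which layers receive LoRA modules or how the watermark code conditions them. The class name `LoRALinear`, the helper `matvec`, and the toy dimensions are all hypothetical. It shows only the standard mechanism: a frozen weight matrix `W` is left untouched, and a trainable low-rank product `B @ A`, scaled by `alpha / rank`, is added to its output.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical toy layer: frozen W plus a low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weights, shape (out_dim, in_dim)
        # A starts small and random and B starts at zero, so the update is
        # exactly zero before any fine-tuning happens.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)               # original model behaviour
        update = matvec(self.B, matvec(self.A, x))  # low-rank correction
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        """Count entries of A and B, the only parameters that would be trained."""
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


# Toy dimensions chosen only for illustration.
rng = random.Random(1)
W = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]
x = [rng.gauss(0.0, 1.0) for _ in range(8)]
layer = LoRALinear(W, rank=2, alpha=2.0)

before = layer.forward(x)

# Stand-in for a fine-tuning step: make B nonzero so the adapter takes effect.
layer.B = [[0.5] * 2 for _ in range(6)]
after = layer.forward(x)

full_params = len(W) * len(W[0])
print("trainable adapter parameters:", layer.num_trainable(), "of", full_params)
```

Because `B` is initialized to zero, the adapted layer reproduces the frozen model exactly until training changes it, which is one reason LoRA-style fine-tuning is a natural fit for adding a behaviour such as watermarking while disturbing the base model as little as possible. The adapter here holds `rank * (in_dim + out_dim)` parameters rather than `in_dim * out_dim`.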