WARDEN: Multi-Directional Backdoor Watermarks for Embedding-as-a-Service Copyright Protection

Anudeex Shetty, Yue Teng, Ke He, Qiongkai Xu
2024-06-09
Abstract: Embedding as a Service (EaaS) has become a widely adopted solution, which offers feature extraction capabilities for addressing various downstream tasks in Natural Language Processing (NLP). Prior studies have shown that EaaS can be prone to model extraction attacks; nevertheless, this concern could be mitigated by adding backdoor watermarks to the text embeddings and subsequently verifying the attack models post-publication. Through the analysis of the recent watermarking strategy for EaaS, EmbMarker, we design a novel CSE (Clustering, Selection, Elimination) attack that removes the backdoor watermark while maintaining the high utility of embeddings, indicating that the previous watermarking approach can be breached. In response to this new threat, we propose a new protocol to make the removal of watermarks more challenging by incorporating multiple possible watermark directions. Our defense approach, WARDEN, notably increases the stealthiness of watermarks and has been empirically shown to be effective against CSE attack.
Cryptography and Security, Computation and Language, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses model extraction attacks against Embedding-as-a-Service (EaaS). It first shows that an existing watermark defense can be removed, then proposes a stronger watermarking scheme so that providers can still verify that their copyright has been infringed.

#### Main problem background

1. **Model extraction attack**:
   - Attackers can query the API exposed by an EaaS provider, collect the returned embeddings, and train their own model on them, thereby replicating the service's functionality.
   - This lets attackers offer a competing service at lower cost and with fewer resources, which threatens EaaS providers.
2. **Limitations of existing watermarking techniques**:
   - Existing techniques such as EmbMarker add a backdoor watermark to the embeddings to verify copyright, but the watermark can be removed by a targeted attack such as CSE.
   - Once the watermark is removed, attackers still hold high-utility embeddings, and the EaaS provider can no longer detect the infringement.

#### Main contributions of the paper

1. **Propose the CSE attack framework**:
   - The CSE (Clustering, Selection, Elimination) attack identifies and removes the watermark from the provider's embeddings while preserving their high utility.
   - The attack works in three steps, clustering, selection, and elimination, which progressively narrow down and then process the embedding vectors that may carry a watermark.
2. **Design the WARDEN defense mechanism**:
   - WARDEN is a multi-directional watermarking mechanism. By introducing multiple watermark vectors, it makes it harder for attackers to remove every watermark.
   - WARDEN increases the stealthiness of the watermarks and is empirically effective against CSE attacks on various datasets.
   - WARDEN also includes a verification protocol in which each watermark can independently verify copyright infringement.
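
The intuition behind using several watermark vectors can be illustrated with a toy NumPy experiment. This is only a sketch under stated assumptions, not the paper's CSE or WARDEN procedure: the summary does not specify how elimination works, so a plain orthogonal projection stands in for it, and the directions `w1` and `w2`, the constant blend weights, and the helper names are all hypothetical. It shows that an attacker who removes one direction leaves the other detectable.

```python
import numpy as np

def normalize(v):
    """Scale a vector (or each row of a matrix) to unit L2 norm."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def project_out(embeddings, direction):
    """Remove each embedding's component along `direction`, then renormalize.

    Assumption: a simple projection used as a stand-in for an elimination step.
    """
    d = normalize(direction)
    cleaned = embeddings - np.outer(embeddings @ d, d)
    return normalize(cleaned)

rng = np.random.default_rng(0)
dim = 64

# Two hypothetical, mutually orthogonal unit watermark directions.
q, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
w1, w2 = q[:, 0], q[:, 1]

# Toy "clean" embeddings and a toy watermark blend with constant weights
# (an assumption; the real weights depend on trigger words in the input).
clean = normalize(rng.normal(size=(100, dim)))
marked = normalize(0.6 * clean + 0.2 * w1 + 0.2 * w2)

# The attacker is assumed to have recovered only w1 and removes it.
attacked = project_out(marked, w1)

sim_w1_after = float(np.abs(attacked @ w1).mean())  # close to zero
sim_w2_after = float((attacked @ w2).mean())        # still clearly positive

print(f"mean |cos| with w1 after attack: {sim_w1_after:.2e}")
print(f"mean cos with w2 after attack:   {sim_w2_after:.3f}")
```

In this toy setting the attacked embeddings are orthogonal to `w1` but keep a clear positive similarity to `w2`, so a verifier holding `w2` could still detect the watermark.
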
#### Formula summary

- **Distance difference calculation in CSE attack**:
  \[ D_p = \text{Rank}(D_v) - \text{Rank}(D_s) \]
  where \(D_v\) and \(D_s\) are the cosine similarity differences of the victim model and the standard model, respectively.
- **Multi-directional watermark generation formula in WARDEN**:
  \[ \text{Norm}\left(\left(1 - \sum_{r=1}^{R} \lambda_r(S)\right) \cdot e_o + \sum_{r=1}^{R} \lambda_r(S) \cdot w_r\right) \]
  where \(e_o\) is the original embedding vector, \(w_r\) is the \(r\)-th of \(R\) watermark vectors, and \(\lambda_r(S)\) is the trigger word frequency function.

Together, the CSE attack shows that the earlier watermarking approach can be breached, and WARDEN offers a copyright protection mechanism for EaaS that is empirically effective against that attack.
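
The two formulas can be written out as a short NumPy sketch. Several details are assumptions, because the summary does not give them: ranks are taken as 0-based and ascending, the \(\lambda_r(S)\) values are passed in as precomputed numbers (the trigger-word counting that produces them is not shown), and the function names are hypothetical.

```python
import numpy as np

def rank(values):
    """0-based ascending ranks (0 = smallest); ties broken by position.

    Assumption: the summary does not state the ranking convention.
    """
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(len(values))
    return ranks

def rank_difference(d_v, d_s):
    """D_p = Rank(D_v) - Rank(D_s)."""
    return rank(d_v) - rank(d_s)

def watermark_embedding(e_o, watermarks, lambdas):
    """Norm((1 - sum_r lambda_r) * e_o + sum_r lambda_r * w_r).

    e_o:        original embedding, shape (dim,)
    watermarks: the R watermark vectors w_r, shape (R, dim)
    lambdas:    precomputed lambda_r(S) values, shape (R,)
    """
    e_o = np.asarray(e_o, dtype=float)
    w = np.asarray(watermarks, dtype=float)
    lam = np.asarray(lambdas, dtype=float)
    mixed = (1.0 - lam.sum()) * e_o + lam @ w
    return mixed / np.linalg.norm(mixed)

# Rank difference on three made-up values.
d_p = rank_difference([0.9, 0.1, 0.5], [0.2, 0.3, 0.8])
print(d_p)  # [ 2 -1 -1]

# Two made-up watermark directions, each with weight 0.25.
e_o = [1.0, 0.0, 0.0]
w = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
marked = watermark_embedding(e_o, w, [0.25, 0.25])
unmarked = watermark_embedding(e_o, w, [0.0, 0.0])
print(marked)
```

With all \(\lambda_r(S)\) equal to zero the function returns the original embedding unchanged, which matches the formula: the watermark only shifts embeddings whose weights are non-zero.
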