Abstract: While NLP models significantly impact our lives, there are rising concerns about privacy invasion. Although federated learning enhances privacy, attackers may recover private training data by exploiting model parameters and gradients. Therefore, protecting against such embedding attacks remains an open challenge. To address this, we propose Subword Embedding from Bytes (SEB) and encode subwords to byte sequences using deep neural networks, making input text recovery harder. Importantly, our method requires a smaller memory with $256$ bytes of vocabulary while keeping efficiency with the same input length. Thus, our solution outperforms conventional approaches by preserving privacy without sacrificing efficiency or accuracy. Our experiments show SEB can effectively protect against embedding-based attacks from recovering original sentences in federated learning. Meanwhile, we verify that SEB obtains comparable and even better results over standard subword embedding methods in machine translation, sentiment analysis, and language modeling with even lower time and space complexity.
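
The memory claim in the abstract can be made concrete with a back-of-the-envelope count. The sketch below compares embedding-table sizes for a byte vocabulary of 256 against a conventional subword vocabulary. The subword vocabulary size (32,000) and the embedding dimension (512) are illustrative assumptions, not figures from the paper, and the count ignores the parameters of the network that aggregates byte embeddings.

```python
def embedding_table_params(vocab_size: int, dim: int) -> int:
    """Number of parameters in a vocab_size x dim embedding table."""
    return vocab_size * dim


BYTE_VOCAB = 256          # byte vocabulary size stated in the abstract
SUBWORD_VOCAB = 32_000    # assumption: a typical subword vocabulary size
DIM = 512                 # assumption: embedding dimension for illustration

subword_params = embedding_table_params(SUBWORD_VOCAB, DIM)
byte_params = embedding_table_params(BYTE_VOCAB, DIM)

print(f"subword table: {subword_params:,} parameters")
print(f"byte table:    {byte_params:,} parameters")
print(f"ratio:         {subword_params // byte_params}x")
```

Under these assumed sizes the byte table is 125 times smaller than the subword table; the real saving depends on the actual vocabulary, the dimension, and the size of the aggregation network.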

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to protect user privacy in natural language processing (NLP) models without sacrificing accuracy and efficiency. Specifically, the article focuses on preventing attacks that recover private training data from model parameters and gradients in the Federated Learning (FL) environment.

### Main problems

1. **Privacy protection**:
   - In Federated Learning, although only model updates are sent to the central server, attackers may still use this update information to reconstruct the original data, thus leaking sensitive information.
   - In particular, in embedding-based attacks, attackers can reconstruct text by extracting candidate words from the embedding gradients and using methods such as beam search.
2. **Maintaining efficiency and accuracy**:
   - Existing privacy-protection methods (such as encryption and differential privacy) usually sacrifice the efficiency or accuracy of the model.
   - Maintaining the efficiency and accuracy of the model while protecting privacy is a challenge.

### Proposed solutions

To address the above problems, the paper proposes the Subword Embedding from Bytes (SEB) method, whose main features include:

- **Byte embedding**: Encode subwords into byte sequences, making it more difficult to recover the input text.
- **Small vocabulary**: SEB uses a smaller byte vocabulary (256 bytes), which reduces memory usage while keeping the input sequence length unchanged.
- **Aggregated byte embedding**: Aggregate byte embeddings into subword embeddings through a deep neural network, thus maintaining the efficiency of the model.

### Experimental verification

The paper experimentally verifies the effectiveness of SEB as follows:

- **Privacy protection**: SEB can effectively defend against embedding-gradient-based attacks, making it difficult for attackers to recover the original sentence from the updated embeddings.
- **Performance**: In machine translation, sentiment analysis, and language modeling tasks, SEB achieves results comparable to or even better than traditional subword embedding methods, with lower time and space complexity.

In conclusion, this paper aims to solve the privacy protection problem in the Federated Learning environment and proposes a new method that can protect privacy without affecting model performance.
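
The three features listed under the proposed solution (byte encoding, a 256-entry byte vocabulary, and neural aggregation into one vector per subword) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the byte-assignment scheme (`make_byte_codes`), the fixed code length of 4 bytes, the embedding size, and the two-layer feed-forward aggregator are all hypothetical choices made for the example, and the weights are random rather than trained.

```python
import random

BYTE_VOCAB = 256        # byte vocabulary size stated in the summary
BYTES_PER_SUBWORD = 4   # assumption: fixed number of bytes per subword
DIM = 8                 # assumption: tiny embedding size for the demo


def make_byte_codes(subwords, seed=0):
    """Assign each subword a fixed-length byte sequence (hypothetical scheme)."""
    rng = random.Random(seed)
    return {
        s: [rng.randrange(BYTE_VOCAB) for _ in range(BYTES_PER_SUBWORD)]
        for s in subwords
    }


def rand_matrix(rows, cols, rng):
    """Small random matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class SubwordFromBytes:
    """Byte embedding table plus a feed-forward aggregator (untrained sketch)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # The only embedding table has 256 rows, one per byte value.
        self.byte_table = rand_matrix(BYTE_VOCAB, DIM, rng)
        # Assumed aggregator: concatenated byte embeddings -> hidden -> DIM.
        self.w1 = rand_matrix(2 * DIM, BYTES_PER_SUBWORD * DIM, rng)
        self.w2 = rand_matrix(DIM, 2 * DIM, rng)

    def embed(self, byte_code):
        """Return one DIM-sized vector for one subword's byte sequence."""
        concat = [x for b in byte_code for x in self.byte_table[b]]
        hidden = [max(0.0, h) for h in matvec(self.w1, concat)]  # ReLU
        return matvec(self.w2, hidden)


subwords = ["priv", "##acy", "matters"]
codes = make_byte_codes(subwords)
model = SubwordFromBytes()
embedded = [model.embed(codes[s]) for s in subwords]

# One vector per subword, so the sequence length seen by the model is unchanged.
print(len(subwords), len(embedded), len(embedded[0]))
```

Because each subword still yields exactly one vector, the downstream model sees the same sequence length as with an ordinary subword embedding, while the learned table holds only 256 rows.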

Subword Embedding from Bytes Gains Privacy without Sacrificing Accuracy and Complexity
