Safely Learning with Private Data: A Federated Learning Framework for Large Language Model

JiaYing Zheng, HaiNan Zhang, LingXiang Wang, WangJie Qiu, HongWei Zheng, ZhiMing Zheng
2024-06-26
Abstract: Private data, being larger and of higher quality than public data, can greatly improve large language models (LLMs). However, due to privacy concerns, this data is often dispersed across multiple silos, making its secure utilization for LLM training a challenge. Federated learning (FL) is an ideal solution for training models with distributed private data, but traditional frameworks like FedAvg are unsuitable for LLMs due to their high computational demands on clients. An alternative, split learning, offloads most training parameters to the server while training the embedding and output layers locally, making it more suitable for LLMs. Nonetheless, it faces significant challenges in security and efficiency. Firstly, the gradients of embeddings are prone to attacks, leading to potential reverse engineering of private data. Furthermore, the server can handle only one client's training request at a time, which hinders parallel training and severely impacts training efficiency. In this paper, we propose a federated learning framework for LLMs, named FL-GLM, which prevents data leakage caused by both server-side and peer-client attacks while improving training efficiency. Specifically, we first place the input block and output block on the local client to prevent embedding gradient attacks from the server. Secondly, we employ key encryption during client-server communication to prevent reverse engineering attacks from peer clients. Lastly, we employ optimization methods such as client-batching or server-hierarchical, choosing the acceleration method according to the actual computational capabilities of the server. Experimental results on NLU and generation tasks demonstrate that FL-GLM achieves metrics comparable to the centralized ChatGLM model, validating the effectiveness of our federated learning framework.
Cryptography and Security, Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is how to train large language models (LLMs) on private data scattered across multiple data silos while protecting user privacy. Specifically, the paper focuses on:

1. **Privacy protection**: Because of privacy concerns, private data is usually held by different devices or institutions and is difficult to pool centrally. Traditional federated learning frameworks such as FedAvg protect data privacy to a certain extent, but they are not suitable for training LLMs because they place high demands on client-side computing resources.
2. **Computing efficiency**: Existing split learning methods (such as FedBERT) reduce the computing burden on clients, but they still have security problems (such as embedding gradient attacks) and low training efficiency, since the server handles only one client's training request at a time.

To meet these challenges, the paper proposes a new federated learning framework, FL-GLM, which aims at:

- **Preventing data leakage**: Placing the input block and output block on local clients prevents embedding gradient attacks from the server side. In addition, key encryption is applied to client-server communication to prevent reverse-engineering attacks from peer clients.
- **Improving training efficiency**: The client-batching and server-hierarchical optimization methods improve the parallelism and efficiency of training.

### Specific contributions of the paper

1. **Designed a federated learning framework specifically for large language models**: Starting from the requirements of user privacy protection and limited client computing resources, the paper improves on split learning to obtain a practical and secure federated learning framework.
2. **Proposed the client-batching and server-hierarchical optimization methods**: Depending on the computing power of the server, different methods for accelerating training are used, which addresses the low training efficiency of split learning.
3. **Experimentally verified the effectiveness of the framework**: Results on SuperGLUE and summary-generation tasks show that FL-GLM achieves performance comparable to that of the centralized ChatGLM model.

### Summary

The FL-GLM framework addresses the problem of training large language models on private data while protecting user privacy, and it improves training efficiency through the client-batching and server-hierarchical optimizations. According to the reported experiments, its performance is comparable to that of centralized training.
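The client/server split and the client-batching idea described above can be illustrated with a small forward-pass sketch. This is not the authors' implementation: the `Client` and `Server` classes, the tiny random matrices standing in for the input block, output block, and server-side layers, and the `forward_batch` method are all hypothetical names chosen for illustration. The sketch omits training, gradients, and the key encryption step entirely; it only shows that raw token ids stay on the client and that one server pass can serve several clients.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class Client:
    """Hypothetical client: owns the input and output blocks."""

    def __init__(self, vocab_size, hidden, rng):
        # Stand-in for the input block (here just an embedding table).
        self.embedding = random_matrix(vocab_size, hidden, rng)
        # Stand-in for the output block (here just a linear head).
        self.output_head = random_matrix(vocab_size, hidden, rng)

    def encode(self, token_id):
        # The private token id is turned into a hidden vector locally.
        return list(self.embedding[token_id])

    def decode(self, hidden_state):
        # Map the server's hidden state back to a token id locally.
        logits = matvec(self.output_head, hidden_state)
        return max(range(len(logits)), key=logits.__getitem__)


class Server:
    """Hypothetical server: owns the middle layers, sees only hidden vectors."""

    def __init__(self, hidden, rng):
        self.middle = random_matrix(hidden, hidden, rng)
        self.forward_calls = 0

    def forward_batch(self, hidden_states):
        # Client-batching (assumed form): one pass over all clients' states.
        self.forward_calls += 1
        return [[max(0.0, v) for v in matvec(self.middle, h)]
                for h in hidden_states]


rng = random.Random(0)
clients = [Client(vocab_size=10, hidden=4, rng=rng) for _ in range(3)]
server = Server(hidden=4, rng=rng)

private_tokens = [1, 5, 7]  # one private token per client, never sent
hidden = [c.encode(t) for c, t in zip(clients, private_tokens)]
outputs = server.forward_batch(hidden)  # a single server call for 3 clients
predictions = [c.decode(o) for c, o in zip(clients, outputs)]
```

In this sketch the server is called once for three clients, whereas a split-learning server that handles one client at a time would need three sequential calls. A real system would also have to send gradients back through the same boundary and protect the exchanged hidden states, which is where the key encryption described above would apply.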