OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning

Rui Ye, Wenhao Wang, Jingyi Chai, Dihan Li, Zexi Li, Yinda Xu, Yaxin Du, Yanfeng Wang, Siheng Chen
2024-02-10
Abstract: Trained on massive publicly available data, large language models (LLMs) have demonstrated tremendous success across various fields. While more data contributes to better performance, a disconcerting reality is that high-quality public data will be exhausted in a few years. In this paper, we offer a potential next step for contemporary LLMs: collaborative and privacy-preserving LLM training on the underutilized distributed private data via federated learning (FL), where multiple data owners collaboratively train a shared model without transmitting raw data. To achieve this, we build a concise, integrated, and research-friendly framework/codebase, named OpenFedLLM. It covers federated instruction tuning for enhancing instruction-following capability, federated value alignment for aligning with human values, and 7 representative FL algorithms. Besides, OpenFedLLM supports training on diverse domains, where we cover 8 training datasets; and provides comprehensive evaluations, where we cover 30+ evaluation metrics. Through extensive experiments, we observe that all FL algorithms outperform local training on training LLMs, demonstrating a clear performance improvement across a variety of settings. Notably, in a financial benchmark, Llama2-7B fine-tuned by applying any FL algorithm can outperform GPT-4 by a significant margin while the model obtained through individual training cannot, demonstrating strong motivation for clients to participate in FL. The code is available at this https URL.
Machine Learning; Computation and Language; Distributed, Parallel, and Cluster Computing; Multiagent Systems
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address the data bottleneck issue faced by current large-scale language models (LLMs) during training. Specifically:

1. **Data Resource Depletion**: Current large language models primarily rely on publicly available large-scale data for training, but it is anticipated that high-quality public data will be exhausted within a few years.
2. **Utilization of Distributed Private Data**: Although there is a large amount of high-quality distributed private data, due to privacy protection, physical limitations, and other reasons, this data cannot be publicly shared and utilized.

To tackle these issues, the paper proposes a new solution: collaborative training of large-scale language models through Federated Learning (FL) without directly sharing raw data. This method not only protects data privacy but also fully utilizes dispersed private data resources, thereby improving model performance.

### Specific Goals

1. **Framework Construction**: Develop a concise, integrated, and research-friendly framework, OpenFedLLM, which supports Federated Instruction Tuning (FedIT) and Federated Value Alignment (FedVA), as well as various representative federated learning algorithms.
2. **Function Enhancement**: Enhance the model's instruction-following ability and alignment with human values through federated learning.
3. **Empirical Research**: Validate the effectiveness and advantages of federated learning in training large-scale language models through extensive experiments, particularly in specific fields such as finance.

### Main Contributions

1. **Exploration of the Complete Process**: Thoroughly explored the complete process of fine-tuning contemporary large-scale language models on decentralized private data through federated learning, pointing out a promising development direction.
2. **Integrated Framework**: Proposed an integrated and concise framework, OpenFedLLM, covering instruction tuning, value alignment, various federated learning baseline algorithms, training datasets, and evaluation datasets, suitable for researchers in both the LLMs and FL communities.
3. **Empirical Research**: Conducted comprehensive empirical research based on the OpenFedLLM framework, demonstrating the consistent advantages of federated learning methods in training large-scale language models and providing new insights and directions for future research.
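
### Illustrative Sketch

The core idea summarized above, that data owners train locally and share only model parameters rather than raw data, can be illustrated with federated averaging (FedAvg), a standard FL baseline. The sketch below is a toy example under stated assumptions and is not taken from the OpenFedLLM codebase: the model is a single scalar weight fit by least squares, and the names `local_train`, `fedavg`, and `run_federated` are hypothetical. The summary above does not list which seven FL algorithms the framework implements, so FedAvg is used here only as a representative.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # one (x, y) pair held privately by a client


def local_train(w: float, data: List[Sample], lr: float = 0.05, epochs: int = 5) -> float:
    """Run a few gradient steps of least-squares y ~ w * x on one client's data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def fedavg(client_weights: List[float], client_sizes: List[int]) -> float:
    """Server step: average client weights, weighted by local dataset size."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total


def run_federated(clients: List[List[Sample]], rounds: int = 20) -> float:
    """Alternate local training and server aggregation for a fixed number of rounds."""
    w_global = 0.0
    for _ in range(rounds):
        # Each client starts from the current global weight and trains on its own data.
        local_weights = [local_train(w_global, data) for data in clients]
        # Only the trained weights and dataset sizes reach the server, never the samples.
        w_global = fedavg(local_weights, [len(data) for data in clients])
    return w_global


# Toy private datasets, all generated from y = 3x (an assumption for illustration).
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]
w_final = run_federated(clients)
```

In an LLM setting the scalar weight would be replaced by the trainable parameters of the model, but the round structure (broadcast, local training, weighted aggregation) stays the same.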