Abstract: Trained on massive publicly available data, large language models (LLMs) have demonstrated tremendous success across various fields. While more data contributes to better performance, a disconcerting reality is that high-quality public data will be exhausted in a few years. In this paper, we offer a potential next step for contemporary LLMs: collaborative and privacy-preserving LLM training on the underutilized distributed private data via federated learning (FL), where multiple data owners collaboratively train a shared model without transmitting raw data. To achieve this, we build a concise, integrated, and research-friendly framework/codebase, named OpenFedLLM. It covers federated instruction tuning for enhancing instruction-following capability, federated value alignment for aligning with human values, and 7 representative FL algorithms. Besides, OpenFedLLM supports training on diverse domains, where we cover 8 training datasets; and provides comprehensive evaluations, where we cover 30+ evaluation metrics. Through extensive experiments, we observe that all FL algorithms outperform local training on training LLMs, demonstrating a clear performance improvement across a variety of settings. Notably, in a financial benchmark, Llama2-7B fine-tuned by applying any FL algorithm can outperform GPT-4 by a significant margin while the model obtained through individual training cannot, demonstrating strong motivation for clients to participate in FL. The code is available at this https URL.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the data bottleneck issue faced by current large language models (LLMs) during training. Specifically:

1. **Data Resource Depletion**: Current large language models primarily rely on publicly available large-scale data for training, but it is anticipated that high-quality public data will be exhausted within a few years.
2. **Utilization of Distributed Private Data**: Although there is a large amount of high-quality distributed private data, due to privacy protection, physical limitations, and other reasons, this data cannot be publicly shared and utilized.

To tackle these issues, the paper proposes a new solution: collaborative training of large language models through Federated Learning (FL) without directly sharing raw data. This method not only protects data privacy but also fully utilizes dispersed private data resources, thereby improving model performance.

### Specific Goals

1. **Framework Construction**: Develop a concise, integrated, and research-friendly framework, OpenFedLLM, which supports Federated Instruction Tuning (FedIT) and Federated Value Alignment (FedVA), as well as various representative federated learning algorithms.
2. **Function Enhancement**: Enhance the model's instruction-following ability and alignment with human values through federated learning.
3. **Empirical Research**: Validate the effectiveness and advantages of federated learning in training large language models through extensive experiments, particularly in specific fields such as finance.

### Main Contributions

1. **Exploration of the Complete Process**: Thoroughly explored the complete process of fine-tuning contemporary large language models on decentralized private data through federated learning, pointing out a promising development direction.
2. **Integrated Framework**: Proposed an integrated and concise framework, OpenFedLLM, covering instruction tuning, value alignment, various federated learning baseline algorithms, training datasets, and evaluation datasets, suitable for researchers in both the LLM and FL communities.
3. **Empirical Research**: Conducted comprehensive empirical research based on the OpenFedLLM framework, demonstrating the consistent advantages of federated learning methods in training large language models and providing new insights and directions for future research.
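To make the idea of collaborative training without sharing raw data concrete, the following is a minimal sketch of federated averaging (FedAvg), a standard FL baseline. It is not taken from OpenFedLLM, and the summary does not name the seven algorithms the framework covers, so using FedAvg here is an assumption. The names `fedavg`, `local_update`, and `run_rounds` are hypothetical, and a one-parameter linear model stands in for an LLM. Each client trains on its own data and sends back only updated parameters, which the server averages with weights proportional to local dataset size.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]
Dataset = List[Tuple[float, float]]  # (x, y) pairs that never leave the client


def fedavg(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average client parameters, weighted by each client's dataset size."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one dataset size per client update")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    aggregated: Params = {}
    for name, values in client_params[0].items():
        aggregated[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(len(values))
        ]
    return aggregated


def local_update(params: Params, data: Dataset, lr: float = 0.1) -> Params:
    """One gradient step on the toy model y = w * x with squared-error loss."""
    w = params["w"][0]
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return {"w": [w - lr * grad]}  # only the parameters are sent to the server


def run_rounds(global_params: Params, clients: List[Dataset], rounds: int = 20) -> Params:
    """Repeat: broadcast the global model, train locally, aggregate on the server."""
    for _ in range(rounds):
        updates = [local_update(global_params, data) for data in clients]
        sizes = [len(data) for data in clients]
        global_params = fedavg(updates, sizes)
    return global_params


if __name__ == "__main__":
    # Two clients holding private samples of the same relation y = 2x.
    clients = [
        [(1.0, 2.0), (2.0, 4.0)],
        [(0.5, 1.0), (1.5, 3.0), (1.0, 2.0)],
    ]
    print(run_rounds({"w": [0.0]}, clients))
```

In an LLM setting the exchanged parameters would more plausibly be a small set of fine-tuned weights rather than the full model, but that detail is not stated in the summary and is left out of the sketch.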

OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning

Federated Large Language Model: Solutions, Challenges and Future Directions

Federated Large Language Models: Current Progress and Future Directions

Towards Federated Large Language Models: Motivations, Methods, and Future Directions

Safely Learning with Private Data: A Federated Learning Framework for Large Language Model

eFedLLM: Efficient LLM Inference Based on Federated Learning

FedJudge: Federated Legal Large Language Model

FATE-LLM: A Industrial Grade Federated Learning Framework for Large Language Models

LanFL: Differentially Private Federated Learning with Large Language Models using Synthetic Samples

FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom

Fisher Information-based Efficient Curriculum Federated Learning with Large Language Models

CELLM: An Efficient Communication in Large Language Models Training for Federated Learning

MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning

Worldwide Federated Training of Language Models

FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models

Federated Foundation Models: Privacy-Preserving and Collaborative Learning for Large Models

The Future of Large Language Model Pre-training is Federated

Can Public Large Language Models Help Private Cross-device Federated Learning?

Personalized Wireless Federated Learning for Large Language Models

FDLoRA: Personalized Federated Learning of Large Language Model via Dual LoRA Tuning

FLoRA: Federated Fine-Tuning Large Language Models with Heterogeneous Low-Rank Adaptations