Enhancing Data Quality in Federated Fine-Tuning of Foundation Models

Wanru Zhao, Yaxin Du, Nicholas Donald Lane, Siheng Chen, Yanfeng Wang
2024-03-07
Abstract: In the current landscape of foundation model training, there is a significant reliance on public domain data, which is nearing exhaustion according to recent research. To further scale up, it is crucial to incorporate collaboration among multiple specialized and high-quality private domain data sources. However, the challenge of training models locally without sharing private data presents numerous obstacles in data quality control. To tackle this issue, we propose a data quality control pipeline for federated fine-tuning of foundation models. This pipeline computes scores reflecting the quality of training data and determines a global threshold for a unified standard, aiming for improved global performance. Our experiments show that the proposed quality control pipeline facilitates the effectiveness and reliability of the model training, leading to better performance.
Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses data quality control in the federated fine-tuning of foundation models. As public domain data nears exhaustion, further scaling depends on private data held by different institutions (such as enterprises and user devices). That data cannot be shared directly for privacy reasons, so training has to happen locally, which makes quality control difficult. The paper therefore proposes a data quality control pipeline that scores training data without sharing the raw data and sets a single global threshold as a unified quality standard, with the aim of improving global model performance. The reported experiments indicate that the pipeline improves the effectiveness and reliability of model training and leads to better performance.
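
The score-then-threshold idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the scoring function `toy_score`, the rule of choosing the threshold as a top-fraction cutoff over pooled scores, the `keep_fraction` parameter, and the assumption that clients may send scores (but never raw text) to the server are all hypothetical choices made only for illustration.

```python
from typing import Callable, Dict, List, Sequence


def toy_score(text: str) -> float:
    """Hypothetical quality score: fraction of distinct words in the sample."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


def score_local(samples: Sequence[str], score_fn: Callable[[str], float]) -> List[float]:
    """Runs on a client: score every local sample; raw text never leaves."""
    return [score_fn(s) for s in samples]


def global_threshold(client_scores: Dict[str, List[float]], keep_fraction: float) -> float:
    """Runs on the server: pick one cutoff from the pooled scores.

    Assumed rule (not from the paper): keep roughly the top `keep_fraction`
    of all scored samples across clients.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    pooled = sorted(
        (s for scores in client_scores.values() for s in scores), reverse=True
    )
    if not pooled:
        raise ValueError("no scores received")
    k = max(1, int(len(pooled) * keep_fraction))
    return pooled[k - 1]


def filter_local(samples: Sequence[str], scores: Sequence[float], threshold: float) -> List[str]:
    """Runs on a client: keep samples whose score meets the global threshold."""
    return [s for s, sc in zip(samples, scores) if sc >= threshold]


# Toy data for two hypothetical clients.
clients = {
    "hospital": ["the scan shows no fracture", "ok ok ok ok"],
    "bank": ["loan approved after income review", "spam spam spam offer"],
}

# Each client scores locally and shares only the scores.
scores = {name: score_local(data, toy_score) for name, data in clients.items()}

# The server derives a single threshold that every client then applies.
threshold = global_threshold(scores, keep_fraction=0.5)
kept = {
    name: filter_local(clients[name], scores[name], threshold) for name in clients
}
```

In this toy run each client keeps only its first sample. Because every client applies the same cutoff, a client holding mostly low-scoring data contributes fewer samples to fine-tuning, which is one way to read the "unified standard" in the abstract.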