FedMix: Boosting with Data Mixture for Vertical Federated Learning

Yihang Cheng, Lan Zhang, Junyang Wang, Xiaokai Chu, Dongbo Huang, Lan Xu
DOI: https://doi.org/10.1109/icde60146.2024.00261
2024-01-01
Abstract: The need to safeguard data privacy and adhere to regulations such as GDPR creates data silos and has prompted the emergence and widespread adoption of techniques for distributed databases. To effectively explore the value of data across multiple organizations, techniques for data management, data analysis, and data functionality over distributed databases have been proposed. Recently, Vertical Federated Learning (VFL) has attracted growing interest as a solution: it enables collaborative model training when data features are partitioned into multiple parts held by different parties. However, typical VFL methods rely heavily on private set intersection (PSI) to align data before training and use only the aligned data for training. In this work, we provide a theoretical analysis showing that unaligned data contains valuable and rich features, and that a thoughtful design that harnesses unaligned samples can significantly improve the performance of VFL models. Regrettably, many existing methods simply discard unaligned data, resulting in an irrecoverable loss of performance. To address this data sacrifice problem, we introduce the concept of data mixture, which enables the use of both aligned and unaligned data during training. Building on the data mixture idea, we present FedMix, the first on-the-fly and distribution-agnostic framework designed to boost the performance of VFL models by leveraging unaligned data. A data seasoning approach is also designed to utilize auxiliary data that lacks label information. Evaluations on diverse datasets under different settings demonstrate the effectiveness of FedMix compared with various state-of-the-art (SOTA) approaches. FedMix achieves up to a 15% improvement in model performance and a 30.5-hour reduction in time cost.
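As a rough illustration of the alignment step the abstract describes, the sketch below splits two parties' sample IDs into aligned and unaligned sets and counts how many samples aligned-only training would discard. It is not FedMix itself: the party names, the IDs, and the helper `split_by_alignment` are hypothetical, and a plain set intersection stands in for PSI, which computes the same intersection without revealing non-matching IDs.

```python
# Hypothetical sample IDs held by two parties; all values are made up.
party_a_ids = {"u1", "u2", "u3", "u4", "u5"}
party_b_ids = {"u3", "u4", "u5", "u6", "u7", "u8"}


def split_by_alignment(ids_a, ids_b):
    """Return (aligned, only_a, only_b) ID sets.

    Assumption: a plain set intersection replaces PSI for illustration;
    it offers no privacy protection.
    """
    aligned = ids_a & ids_b
    return aligned, ids_a - aligned, ids_b - aligned


aligned, only_a, only_b = split_by_alignment(party_a_ids, party_b_ids)

# Samples that a method training only on aligned data would never use.
discarded = len(only_a) + len(only_b)
total = len(party_a_ids | party_b_ids)

print(f"aligned: {sorted(aligned)}")
print(f"discarded by aligned-only training: {discarded} of {total} distinct samples")
```

In this toy setting only three of the eight distinct samples survive alignment, which is the kind of "data sacrifice" the abstract argues against.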