Leveraging Foundation Models for Multi-modal Federated Learning with Incomplete Modality

Liwei Che, Jiaqi Wang, Xinyue Liu, Fenglong Ma
2024-06-17
Abstract: Federated learning (FL) has obtained tremendous progress in providing collaborative training solutions for distributed data silos with privacy guarantees. However, few existing works explore a more realistic scenario where the clients hold multiple data modalities. In this paper, we aim to solve a novel challenge in multi-modal federated learning (MFL) -- modality missing -- the clients may lose part of the modalities in their local data sets. To tackle the problems, we propose a novel multi-modal federated learning method, Federated Multi-modal contrastiVe training with Pre-trained completion (FedMVP), which integrates the large-scale pre-trained models to enhance the federated training. In the proposed FedMVP framework, each client deploys a large-scale pre-trained model with frozen parameters for modality completion and representation knowledge transfer, enabling efficient and robust local training. On the server side, we utilize generated data to uniformly measure the representation similarity among the uploaded client models and construct a graph perspective to aggregate them according to their importance in the system. We demonstrate that the model achieves superior performance over two real-world image-text classification datasets and is robust to the performance degradation caused by missing modality.
Machine Learning; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the performance degradation caused by missing data modalities in multi-modal federated learning (MFL). In a federated learning environment, the datasets held by clients may lack some modalities (for example, some instances of image-text pairs contain only the image or only the text), which is common in real-world scenarios. Missing modalities can seriously impair the model's learning ability and performance. To address this challenge, the authors propose Federated Multi-modal contrastiVe training with Pre-trained completion (FedMVP), which handles missing modalities through four components:

1. **Modality Completion Module**: Uses the cross-modal generation ability of large-scale pre-trained models to complete the missing modality. For example, DALLE2 is used for text-to-image generation and BLIP2 for image-to-text generation.
2. **Multi-modal Joint Learning Module**: A multi-modal joint encoder fuses image and text information to produce high-quality joint representations.
3. **Knowledge Transfer**: Knowledge is transferred from the pre-trained models through contrastive learning and a Representation Alignment Margin (RAM) loss to improve the representation learning of local models.
4. **Server-side Aggregation**: Client models are aggregated according to the similarity of their output representations, which strengthens the representation ability of the global model.

The paper evaluates FedMVP on two real-world image-text classification datasets (CUB-200 and Oxford Flower). The results show that FedMVP maintains high performance when modalities are missing and is more robust than existing baseline methods.
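To make the server-side aggregation idea more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that each client model is summarized by a flat vector of its outputs on a shared set of server-side probe data, that similarity is measured with cosine similarity, and that a client's importance is its summed similarity to the other clients (a degree-style score on the similarity graph). The function names `cosine`, `aggregation_weights`, and `aggregate` are hypothetical, and the paper may use a different similarity measure and weighting scheme.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def aggregation_weights(client_reprs):
    """Turn per-client representations of shared probe data into weights.

    client_reprs[k] is client k's output vector on the same probe data.
    Each client is a node in a similarity graph; a node's importance is
    its summed cosine similarity to all other nodes (an assumed scoring
    rule, chosen only for illustration).
    """
    n = len(client_reprs)
    scores = []
    for i in range(n):
        s = sum(
            cosine(client_reprs[i], client_reprs[j])
            for j in range(n)
            if j != i
        )
        scores.append(max(s, 0.0))  # clip negatives so weights stay valid
    total = sum(scores)
    if total == 0:
        return [1.0 / n] * n  # fall back to plain averaging
    return [s / total for s in scores]


def aggregate(client_params, weights):
    """Weighted average of flat parameter vectors, one per client."""
    dim = len(client_params[0])
    return [
        sum(w * p[d] for w, p in zip(weights, client_params))
        for d in range(dim)
    ]


# Toy usage: clients 0 and 1 agree on the probe data, client 2 does not.
reprs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
params = [[2.0], [4.0], [100.0]]
weights = aggregation_weights(reprs)
global_params = aggregate(params, weights)
```

In this toy case the outlying client receives zero weight, so its parameters do not influence the aggregated model. A real system would more likely dampen than fully discard such a client, and the right trade-off depends on how heterogeneous the client data is.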