Abstract: Federated learning (FL) has made tremendous progress in providing collaborative training solutions for distributed data silos with privacy guarantees. However, few existing works explore a more realistic scenario where the clients hold multiple data modalities. In this paper, we aim to solve a novel challenge in multi-modal federated learning (MFL), namely missing modalities: clients may lack some of the modalities in their local datasets. To tackle this problem, we propose a novel multi-modal federated learning method, Federated Multi-modal contrastiVe training with Pre-trained completion (FedMVP), which integrates large-scale pre-trained models to enhance federated training. In the proposed FedMVP framework, each client deploys a large-scale pre-trained model with frozen parameters for modality completion and representation knowledge transfer, enabling efficient and robust local training. On the server side, we use generated data to uniformly measure the representation similarity among the uploaded client models and construct a graph perspective to aggregate them according to their importance in the system. We demonstrate that the model achieves superior performance on two real-world image-text classification datasets and is robust to the performance degradation caused by missing modalities.

What problem does this paper attempt to address?

This paper addresses the performance degradation caused by missing data modalities in multi-modal federated learning (MFL). In a federated setting, the datasets held by clients may lack some modalities (for example, some instances in a set of image-text pairs have only the image or only the text), which is common in real-world scenarios. Missing modalities can seriously harm the model's ability to learn and its final performance. To address this challenge, the authors propose a new method named Federated Multi-modal contrastiVe training with Pre-trained completion (FedMVP). FedMVP handles missing modalities through four components:

1. **Modal Completion Module**: Uses the cross-modal generation ability of large-scale pre-trained models to fill in the missing modality. For example, DALLE2 is used for text-to-image generation and BLIP2 for image-to-text generation.
2. **Multi-modal Joint Learning Module**: A multi-modal joint encoder fuses image and text information into a joint representation.
3. **Knowledge Transfer**: Knowledge is transferred from the pre-trained models through contrastive learning and a Representation Alignment Margin (RAM) loss to improve the representation learning of the local models.
4. **Server-side Aggregation**: Client models are aggregated according to the similarity of their output representations, with the aim of strengthening the representation ability of the global model.

The paper evaluates FedMVP on two real-world image-text classification datasets (CUB-200 and Oxford Flower). The reported results show that FedMVP maintains high performance when modalities are missing and is more robust than existing baseline methods.
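
The server-side aggregation idea can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not give the exact similarity measure or weighting rule, so everything below is an assumption made for illustration. The sketch assumes that each client reports its representations of a shared probe set (standing in for the generated data), that similarity is mean cosine similarity, that a client's importance is its clipped total similarity to the other clients (a degree-style score on the similarity graph), and that model parameters are flat lists of floats. The function names `client_similarity`, `importance_weights`, and `aggregate` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def client_similarity(reps_a, reps_b):
    """Mean per-sample cosine similarity of two clients on the same probe inputs."""
    return sum(cosine(a, b) for a, b in zip(reps_a, reps_b)) / len(reps_a)


def importance_weights(client_reps):
    """Assumed rule: importance = clipped sum of similarities to all other clients."""
    n = len(client_reps)
    if n == 1:
        return [1.0]
    scores = []
    for i in range(n):
        total = sum(
            client_similarity(client_reps[i], client_reps[j])
            for j in range(n)
            if j != i
        )
        scores.append(max(total, 0.0))  # clip so weights stay non-negative
    norm = sum(scores)
    if norm == 0.0:
        return [1.0 / n] * n  # fall back to a plain average
    return [s / norm for s in scores]


def aggregate(client_params, weights):
    """Weighted average of flat parameter vectors."""
    dim = len(client_params[0])
    return [
        sum(w * params[k] for w, params in zip(weights, client_params))
        for k in range(dim)
    ]


# Toy data: clients 0 and 1 produce similar representations, client 2 does not.
client_reps = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.0, 1.0], [1.0, 0.0]],
]
client_params = [[1.0, 2.0], [1.0, 2.0], [10.0, 20.0]]

weights = importance_weights(client_reps)
global_params = aggregate(client_params, weights)
print(weights)
print(global_params)
```

In this toy run the third client, whose representations disagree with the other two, receives the smallest weight, so the aggregated parameters stay closer to the two agreeing clients than a plain average would. A real system would compare high-dimensional encoder outputs and aggregate full model state rather than two-element lists.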

Leveraging Foundation Models for Multi-modal Federated Learning with Incomplete Modality
