Towards Multi-modal Transformers in Federated Learning

Guangyu Sun, Matias Mendieta, Aritra Dutta, Xin Li, Chen Chen

2024-07-17

Abstract: Multi-modal transformers mark significant progress in different domains, but siloed high-quality data hinders their further improvement. To remedy this, federated learning (FL) has emerged as a promising privacy-preserving paradigm for training models without direct access to the raw data held by different clients. Despite its potential, a considerable research direction regarding the unpaired uni-modal clients and the transformer architecture in FL remains unexplored. To fill this gap, this paper explores a transfer multi-modal federated learning (MFL) scenario within the vision-language domain, where clients possess data of various modalities distributed across different datasets. We systematically evaluate the performance of existing methods when a transformer architecture is utilized and introduce a novel framework called Federated modality complementary and collaboration (FedCola) by addressing the in-modality and cross-modality gaps among clients. Through extensive experiments across various FL settings, FedCola demonstrates superior performance over previous approaches, offering new perspectives on future federated training of multi-modal transformers.

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

### The Problems Addressed by This Paper

This paper explores how to train multi-modal transformers within the framework of Federated Learning (FL). Specifically, it attempts to address the following core problems:

1. **Training Multi-modal Data under Privacy Protection**:
   - Traditional multi-modal transformers struggle to improve further when high-quality datasets are under strict privacy protection. Federated Learning, as a privacy-preserving method, allows model training without directly accessing the original data.
2. **Integration of Unpaired Single-modal Clients**:
   - Current Federated Learning research mainly focuses on single-modal scenarios and lacks effective methods for integrating unpaired single-modal clients. For example, some clients may only have image data, while others may only have text data, making it difficult to directly integrate these clients' data into a multi-modal model.
3. **Cross-modal Knowledge Sharing and Local Training Objective Differences**:
   - Single-modal clients can only train on their local data and cannot access data from other modalities, leading to cross-modal gaps. Additionally, even within the same modality, the training objectives of single-modal models and multi-modal models differ, resulting in intra-modal gaps.

To address these challenges, the paper proposes a new framework called FedCola, which aims to achieve cross-modal knowledge sharing through a unified transformer architecture and introduces collaborative strategies during the local training and global aggregation stages to bridge cross-modal and intra-modal gaps. The effectiveness of this framework is validated through various experiments.
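To make the idea of aggregating unpaired single-modal clients concrete, here is a minimal sketch of one server-side aggregation round. It is not FedCola's actual algorithm, whose details this summary does not give. It assumes a generic FedAvg-style weighted mean in which parameters under a hypothetical `blocks.` prefix (the transformer blocks) are shared across modalities, while the remaining parameters (for example, embeddings) are averaged only within their own modality. All names (`weighted_average`, `split`, `aggregate_round`, `shared_prefix`) are illustrative.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]  # parameter name -> flattened weights


def weighted_average(updates: List[Params], sizes: List[int]) -> Params:
    """FedAvg-style mean of parameter dicts, weighted by local dataset size."""
    total = sum(sizes)
    return {
        name: [
            sum(u[name][i] * n for u, n in zip(updates, sizes)) / total
            for i in range(len(updates[0][name]))
        ]
        for name in updates[0]
    }


def split(params: Params, shared_prefix: str) -> Tuple[Params, Params]:
    """Separate cross-modality (shared) parameters from modality-specific ones."""
    shared = {k: v for k, v in params.items() if k.startswith(shared_prefix)}
    private = {k: v for k, v in params.items() if not k.startswith(shared_prefix)}
    return shared, private


def aggregate_round(
    image_updates: List[Params],
    image_sizes: List[int],
    text_updates: List[Params],
    text_sizes: List[int],
    shared_prefix: str = "blocks.",  # assumption: naming convention for shared blocks
) -> Dict[str, Params]:
    """Average shared blocks over all clients; average the rest per modality.

    Assumes at least one client per modality.
    """
    img_shared, img_private = zip(*(split(u, shared_prefix) for u in image_updates))
    txt_shared, txt_private = zip(*(split(u, shared_prefix) for u in text_updates))
    return {
        "shared": weighted_average(
            list(img_shared) + list(txt_shared),
            list(image_sizes) + list(text_sizes),
        ),
        "image": weighted_average(list(img_private), list(image_sizes)),
        "text": weighted_average(list(txt_private), list(text_sizes)),
    }
```

In this sketch, image-only and text-only clients both contribute to the shared blocks, which is one simple way for knowledge to flow across modalities without any client holding paired data. How FedCola chooses which parameters to share, and how it handles the intra-modal gap during local training, is described in the paper itself.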

Unimodal Training-Multimodal Prediction: Cross-modal Federated Learning with Hierarchical Aggregation

Federated Transformer: Multi-Party Vertical Federated Learning on Practical Fuzzily Linked Data

The Prospect of Enhancing Large-Scale Heterogeneous Federated Learning with Transformers

Leveraging Foundation Models for Multi-modal Federated Learning with Incomplete Modality

Transformer-based Federated Learning for Multi-Label Remote Sensing Image Classification

FedTrans: Efficient Federated Learning via Multi-Model Transformation

Multimodal Federated Learning: A Survey

A unified framework for multi-modal federated learning

Open-Vocabulary Federated Learning with Multimodal Prototyping

Balanced Multi-modal Federated Learning via Cross-Modal Infiltration

FedMEKT: Distillation-based Embedding Knowledge Transfer for Multimodal Federated Learning

FedYolo: Augmenting Federated Learning with Pretrained Transformers

Multimodal Federated Learning

FedMultimodal: A Benchmark for Multimodal Federated Learning

FedMVT: Semi-supervised Vertical Federated Learning with MultiView Training

FedCross: Towards Accurate Federated Learning via Multi-Model Cross-Aggregation

FedMFS: Federated Multimodal Fusion Learning with Selective Modality Communication

FedTune: A Deep Dive into Efficient Federated Fine-Tuning with Pre-trained Transformers

FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning