Towards Multi-modal Transformers in Federated Learning

Guangyu Sun, Matias Mendieta, Aritra Dutta, Xin Li, Chen Chen
2024-07-17
Abstract: Multi-modal transformers mark significant progress in different domains, but siloed high-quality data hinders their further improvement. To remedy this, federated learning (FL) has emerged as a promising privacy-preserving paradigm for training models without direct access to the raw data held by different clients. Despite its potential, a considerable research direction regarding the unpaired uni-modal clients and the transformer architecture in FL remains unexplored. To fill this gap, this paper explores a transfer multi-modal federated learning (MFL) scenario within the vision-language domain, where clients possess data of various modalities distributed across different datasets. We systematically evaluate the performance of existing methods when a transformer architecture is utilized and introduce a novel framework called Federated modality complementary and collaboration (FedCola) by addressing the in-modality and cross-modality gaps among clients. Through extensive experiments across various FL settings, FedCola demonstrates superior performance over previous approaches, offering new perspectives on future federated training of multi-modal transformers.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
### The Problems Addressed by This Paper

This paper explores how to train multi-modal transformers within the framework of Federated Learning (FL). Specifically, it attempts to address the following core problems:

1. **Training Multi-modal Data under Privacy Protection**:
   - Traditional multi-modal transformers struggle to improve further when high-quality datasets are under strict privacy protection. Federated Learning, as a privacy-preserving method, allows model training without directly accessing the original data.
2. **Integration of Unpaired Single-modal Clients**:
   - Current Federated Learning research mainly focuses on single-modal scenarios and lacks effective methods for integrating unpaired single-modal clients. For example, some clients may only have image data, while others may only have text data, making it difficult to directly integrate these clients' data into a multi-modal model.
3. **Cross-modal Knowledge Sharing and Local Training Objective Differences**:
   - Single-modal clients can only train on their local data and cannot access data from other modalities, which leads to cross-modal gaps. Additionally, even within the same modality, the training objectives of single-modal models and multi-modal models differ, which results in intra-modal gaps.

To address these challenges, the paper proposes a new framework called FedCola, which aims to achieve cross-modal knowledge sharing through a unified transformer architecture and introduces collaborative strategies during the local training and global aggregation stages to bridge the cross-modal and intra-modal gaps. The effectiveness of this framework is validated through various experiments.
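
To make the idea of sharing one transformer across unpaired single-modal clients more concrete, the following is a minimal sketch of a modality-aware, FedAvg-style aggregation step. It is not FedCola's actual algorithm, which this summary does not specify. It assumes, purely for illustration, that each client model splits into a modality-specific embedding part (`embed`) and transformer blocks (`blocks`) with identical shapes on every client, and that the server averages the blocks across all clients while averaging embeddings only within a modality. The function names, dictionary keys, and toy parameter values are all hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def weighted_average(param_sets: List[Params], weights: List[float]) -> Params:
    """Sample-weighted mean of parameter dicts that share the same keys and shapes."""
    total = sum(weights)
    result: Params = {}
    for key in param_sets[0]:
        length = len(param_sets[0][key])
        result[key] = [
            sum(w * p[key][i] for p, w in zip(param_sets, weights)) / total
            for i in range(length)
        ]
    return result


def aggregate(clients: List[dict]) -> Dict[str, Params]:
    """Assumed aggregation rule (illustrative only):
    - transformer blocks are averaged over ALL clients, regardless of modality;
    - embeddings are averaged only among clients of the same modality.
    """
    weights = [c["num_samples"] for c in clients]
    global_model = {
        "blocks": weighted_average([c["blocks"] for c in clients], weights)
    }
    for modality in sorted({c["modality"] for c in clients}):
        group = [c for c in clients if c["modality"] == modality]
        global_model[f"embed_{modality}"] = weighted_average(
            [c["embed"] for c in group],
            [c["num_samples"] for c in group],
        )
    return global_model


# Toy example: two image-only clients and one text-only client.
clients = [
    {"modality": "image", "num_samples": 100,
     "embed": {"patch": [1.0, 1.0]}, "blocks": {"attn": [0.0, 2.0]}},
    {"modality": "image", "num_samples": 300,
     "embed": {"patch": [5.0, 1.0]}, "blocks": {"attn": [4.0, 2.0]}},
    {"modality": "text", "num_samples": 100,
     "embed": {"token": [2.0]}, "blocks": {"attn": [8.0, 2.0]}},
]

global_model = aggregate(clients)
print(global_model["blocks"])       # blocks mixed across image and text clients
print(global_model["embed_image"])  # embeddings mixed among image clients only
print(global_model["embed_text"])   # the single text client's embedding, unchanged
```

In this sketch the text client's update influences the shared blocks that image clients receive in the next round, which is one simple way cross-modal knowledge could flow without any client seeing another modality's raw data.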