Abstract: Combining different data modalities enables deep neural networks to tackle complex tasks more effectively, making multimodal learning increasingly popular. To harness multimodal data closer to end users, it is essential to integrate multimodal learning with privacy-preserving approaches like federated learning (FL). However, compared to conventional unimodal learning, the multimodal setting requires dedicated encoders for each modality, resulting in larger and more complex models. Training these models requires significant resources, presenting a substantial challenge for FL clients operating with limited computation and communication resources. To address these challenges, we introduce LW-FedMML, a layer-wise federated multimodal learning approach which decomposes the training process into multiple stages. Each stage focuses on training only a portion of the model, thereby significantly reducing the memory and computational requirements. Moreover, FL clients only need to exchange the trained model portion with the central server, lowering the resulting communication cost. We conduct extensive experiments across various FL and multimodal learning settings to validate the effectiveness of our proposed method. The results demonstrate that LW-FedMML can compete with conventional end-to-end federated multimodal learning (FedMML) while significantly reducing the resource burden on FL clients. Specifically, LW-FedMML reduces memory usage by up to $2.7\times$, computational operations (FLOPs) by $2.4\times$, and total communication cost by $2.3\times$. We also explore a progressive training approach called Prog-FedMML. While it is less resource-efficient than LW-FedMML, Prog-FedMML has the potential to surpass the performance of end-to-end FedMML, making it a viable option for scenarios with fewer resource constraints.

What problem does this paper attempt to address?

The main problem this paper addresses is **how to conduct multimodal learning efficiently in a resource-constrained federated learning (FL) environment**. Multimodal learning processes information from different data sources (such as images and audio), which requires a dedicated encoder for each modality and makes the model larger and more complex. This places higher demands on computation and communication. Many FL clients (such as edge devices), however, have limited computing and communication capabilities and struggle to train such models. To address these challenges, the authors propose two methods:

1. **LW-FedMML (Layer-wise Federated Multimodal Learning)**: Training is decomposed into multiple stages, and only a portion of the model is trained at each stage. This significantly reduces memory and computation requirements. Because clients exchange only the trained portion with the central server, it also lowers communication cost.
2. **Prog-FedMML (Progressive Federated Multimodal Learning)**: Training also proceeds in stages, but at each stage the newly added layers are trained together with all previously added layers. This method is less resource-efficient than LW-FedMML, but it has the potential to outperform conventional end-to-end federated multimodal learning (FedMML).

### Summary of Main Problems

- **Resource Limitations**: Multimodal learning requires more computing and communication resources, while many FL clients (such as edge devices) have limited resources.
- **Model Complexity**: Multimodal models are larger and more complex because they must process multiple data types, which makes training harder.
- **Communication Bottleneck**: Large multimodal models must be uploaded and downloaded repeatedly during FL, which can easily become a communication bottleneck.
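The difference between the two training schedules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes one new layer is added per stage, that LW-FedMML keeps earlier layers frozen, and that clients exchange exactly the layers they train. The function names `trainable_layers` and `exchanged_params` and the layer sizes are hypothetical.

```python
def trainable_layers(stage, mode):
    """Return the 1-based indices of the layers updated in a given stage.

    Assumption: exactly one new layer is added per stage.
    """
    if stage < 1:
        raise ValueError("stage must be >= 1")
    if mode == "layer-wise":
        # LW-FedMML style: only the newly added layer is trained.
        return [stage]
    if mode == "progressive":
        # Prog-FedMML style: the new layer and all earlier layers are trained.
        return list(range(1, stage + 1))
    raise ValueError(f"unknown mode: {mode}")


def exchanged_params(layer_sizes, stage, mode):
    """Count parameters a client would send in one round of a given stage.

    Assumption: a client sends only the layers it trained in that stage.
    """
    return sum(layer_sizes[i - 1] for i in trainable_layers(stage, mode))


if __name__ == "__main__":
    sizes = [10, 20, 30]  # hypothetical per-layer parameter counts
    for stage in (1, 2, 3):
        lw = exchanged_params(sizes, stage, "layer-wise")
        prog = exchanged_params(sizes, stage, "progressive")
        print(f"stage {stage}: layer-wise={lw}, progressive={prog}")
```

Under these assumptions, the per-round payload in the layer-wise schedule depends only on the size of the current layer, while in the progressive schedule it grows with every stage.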
### Solutions

- **LW-FedMML**: Reduces the resource requirements of each stage through layer-wise training and lowers communication costs.
- **Prog-FedMML**: Improves model performance through progressive training, but has higher resource requirements.

### Experimental Results

The experimental results show that LW-FedMML achieves performance comparable to conventional end-to-end FedMML while reducing memory usage by up to $2.7\times$, FLOPs by $2.4\times$, and total communication cost by $2.3\times$. Prog-FedMML consumes more resources than LW-FedMML, but it can outperform end-to-end FedMML in some cases.

### Example of Formula

In the instance discrimination task, the loss function is defined as follows:

\[ \ell(z^{a}, z^{b}) = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(z_{i}^{a} \cdot z_{i}^{b} / \tau)}{\sum_{j=1}^{B} \exp(z_{i}^{a} \cdot z_{j}^{b} / \tau)} \]

where $B$ is the batch size and $\tau$ is the temperature parameter.

With these methods, the authors ease the resource limitations of federated multimodal learning and make it more feasible for resource-constrained devices to take part in model training.
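To make the instance discrimination loss above concrete, here is a minimal pure-Python sketch that evaluates it on small lists of embeddings. Several things are assumptions rather than details from the source: $z^a$ and $z^b$ are read as batches of embeddings from two modalities, the $\cdot$ is treated as a plain dot product (if cosine similarity is intended, the embeddings would need to be L2-normalized first), and the function name and default temperature are hypothetical.

```python
import math


def instance_discrimination_loss(za, zb, tau=0.1):
    """Evaluate l(z^a, z^b) for two equally sized batches of embeddings.

    za, zb: lists of B vectors; row i of each is assumed to come from the
    same sample. tau: temperature (the default value is an assumption).
    """
    if len(za) != len(zb) or not za:
        raise ValueError("za and zb must be non-empty and the same length")
    batch_size = len(za)
    total = 0.0
    for i in range(batch_size):
        # Similarity of anchor z_i^a to every z_j^b, scaled by temperature.
        logits = [
            sum(x * y for x, y in zip(za[i], zb[j])) / tau
            for j in range(batch_size)
        ]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        # Log-probability assigned to the matching pair (i, i).
        total += logits[i] - log_denominator
    return -total / batch_size


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(instance_discrimination_loss(aligned, aligned, tau=1.0))
    print(instance_discrimination_loss(aligned, swapped, tau=1.0))
```

The loss is lower when each $z_i^a$ is most similar to its own $z_i^b$ than when the pairs are mismatched, which is the behavior the formula is designed to reward.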

Resource-Efficient Federated Multimodal Learning via Layer-wise and Progressive Training

Communication-Efficient Multimodal Federated Learning: Joint Modality and Client Selection

FedMFS: Federated Multimodal Fusion Learning with Selective Modality Communication

FedMEKT: Distillation-based Embedding Knowledge Transfer for Multimodal Federated Learning

MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning

Unimodal Training-Multimodal Prediction: Cross-modal Federated Learning with Hierarchical Aggregation

A unified framework for multi-modal federated learning

Multimodal Federated Learning via Contrastive Representation Ensemble

FedMM: Federated Multi-Modal Learning with Modality Heterogeneity in Computational Pathology

Multimodal Federated Learning: A Survey

Multimodal Federated Learning with Missing Modality via Prototype Mask and Contrast

Balanced Multi-modal Federated Learning via Cross-Modal Infiltration

Multimodal Federated Learning

Leveraging Foundation Models for Multi-modal Federated Learning with Incomplete Modality

Multimodal federated learning: Concept, methods, applications and future directions

FedMultimodal: A Benchmark for Multimodal Federated Learning

A survey of multimodal federated learning: background, applications, and perspectives

Cross-Modal Prototype based Multimodal Federated Learning under Severely Missing Modality