Resource-Efficient Federated Multimodal Learning via Layer-wise and Progressive Training

Ye Lin Tun, Chu Myaet Thwal, Minh N. H. Nguyen, Choong Seon Hong
2024-10-21
Abstract: Combining different data modalities enables deep neural networks to tackle complex tasks more effectively, making multimodal learning increasingly popular. To harness multimodal data closer to end users, it is essential to integrate multimodal learning with privacy-preserving approaches like federated learning (FL). However, compared to conventional unimodal learning, the multimodal setting requires dedicated encoders for each modality, resulting in larger and more complex models. Training these models requires significant resources, presenting a substantial challenge for FL clients operating with limited computation and communication resources. To address these challenges, we introduce LW-FedMML, a layer-wise federated multimodal learning approach which decomposes the training process into multiple stages. Each stage focuses on training only a portion of the model, thereby significantly reducing the memory and computational requirements. Moreover, FL clients only need to exchange the trained model portion with the central server, lowering the resulting communication cost. We conduct extensive experiments across various FL and multimodal learning settings to validate the effectiveness of our proposed method. The results demonstrate that LW-FedMML can compete with conventional end-to-end federated multimodal learning (FedMML) while significantly reducing the resource burden on FL clients. Specifically, LW-FedMML reduces memory usage by up to $2.7\times$, computational operations (FLOPs) by $2.4\times$, and total communication cost by $2.3\times$. We also explore a progressive training approach called Prog-FedMML. While it is less resource-efficient than LW-FedMML, Prog-FedMML has the potential to surpass the performance of end-to-end FedMML, making it a viable option for scenarios with fewer resource constraints.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the following problem: **how to carry out multimodal learning efficiently in a resource-constrained federated learning (FL) environment**. Multimodal learning processes information from different data sources (such as images and audio), which requires a dedicated encoder for each modality and makes the model larger and more complex. This places higher demands on computation and communication. However, many FL clients (such as edge devices) have limited computation and communication capabilities and struggle to train such models.

To address these challenges, the authors propose two methods:

1. **LW-FedMML (Layer-wise Federated Multimodal Learning)**: Training is decomposed into multiple stages, and only a portion of the model is trained at each stage. This significantly reduces memory and computational requirements. Because clients exchange only the trained portion with the central server, it also lowers communication cost.
2. **Prog-FedMML (Progressive Federated Multimodal Learning)**: At each stage, new layers are added and all previously added layers continue to be trained. This method is less resource-efficient than LW-FedMML, but it has the potential to outperform conventional end-to-end federated multimodal learning (FedMML).

### Summary of Main Problems

- **Resource Limitations**: Multimodal learning requires more computation and communication resources, while many FL clients (such as edge devices) have limited resources.
- **Model Complexity**: Multimodal models are larger and more complex because they must process multiple data types, which makes training harder.
- **Communication Bottleneck**: Large multimodal models must be uploaded and downloaded repeatedly during FL, which can create a communication bottleneck.
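The difference between the two training schedules described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes one new layer is added per stage, that the layer-wise variant freezes all earlier layers, and that each stage uploads its trainable layers once. The function names `layerwise_schedule`, `progressive_schedule`, and `uploaded_layers` are hypothetical.

```python
def layerwise_schedule(num_layers):
    """Layer-wise: stage s trains only layer s (assumption: earlier layers are frozen)."""
    return [
        {"stage": s, "active": list(range(1, s + 1)), "trainable": [s]}
        for s in range(1, num_layers + 1)
    ]


def progressive_schedule(num_layers):
    """Progressive: stage s adds layer s and keeps training all earlier layers."""
    return [
        {"stage": s, "active": list(range(1, s + 1)), "trainable": list(range(1, s + 1))}
        for s in range(1, num_layers + 1)
    ]


def uploaded_layers(schedule):
    """Count layers sent to the server, assuming one upload of the trainable part per stage."""
    return sum(len(stage["trainable"]) for stage in schedule)


if __name__ == "__main__":
    for name, make in [("layer-wise", layerwise_schedule), ("progressive", progressive_schedule)]:
        schedule = make(4)
        for stage in schedule:
            print(name, stage)
        print(name, "layers uploaded in total:", uploaded_layers(schedule))
```

Under these simplifying assumptions, a 4-layer encoder uploads 4 layers in total with the layer-wise schedule and 10 with the progressive one, which illustrates why the layer-wise variant is cheaper to communicate. Real costs depend on layer sizes and the number of FL rounds per stage, which this sketch ignores.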
### Solutions

- **LW-FedMML**: Reduces the resource requirements of each stage through layer-wise training and lowers communication costs.
- **Prog-FedMML**: Can improve model performance through progressive training, but has higher resource requirements.

### Experimental Results

The experimental results show that LW-FedMML achieves performance comparable to conventional end-to-end FedMML while significantly reducing resource consumption: memory usage by up to \(2.7\times\), computational operations (FLOPs) by \(2.4\times\), and total communication cost by \(2.3\times\). Prog-FedMML consumes more resources than LW-FedMML, but it can outperform end-to-end FedMML in some cases.

### Example of Formula

In the instance discrimination task, the loss function is defined as follows:

\[ \ell(z^{a}, z^{b}) = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(z_{i}^{a} \cdot z_{i}^{b}/\tau)}{\sum_{j=1}^{B} \exp(z_{i}^{a} \cdot z_{j}^{b}/\tau)} \]

where \(B\) is the batch size, \(\tau\) is the temperature parameter, and \(z_{i}^{a}\) and \(z_{i}^{b}\) are the representations of the \(i\)-th sample in modalities \(a\) and \(b\).

With these methods, the authors reduce the resource burden of federated multimodal learning and make it more feasible for resource-constrained devices to participate in model training.
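The instance discrimination loss given above can be evaluated directly from its definition. The following is a minimal pure-Python sketch, not the authors' code: the function name `contrastive_loss`, the default `tau=0.1`, and the toy inputs are assumptions made for illustration, and the representations are used as given (whether they are normalized beforehand is not stated here).

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def contrastive_loss(z_a, z_b, tau=0.1):
    """Evaluate l(z^a, z^b) for a batch of paired representations.

    z_a, z_b: lists of B vectors; z_a[i] and z_b[i] come from the same sample.
    tau: temperature (the default is an arbitrary placeholder).
    """
    B = len(z_a)
    total = 0.0
    for i in range(B):
        # Similarity of sample i in modality a to every sample j in modality b.
        logits = [dot(z_a[i], z_b[j]) / tau for j in range(B)]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # log of the softmax probability assigned to the matching pair (j == i).
        total += logits[i] - log_denominator
    return -total / B


if __name__ == "__main__":
    aligned_a = [[1.0, 0.0], [0.0, 1.0]]
    aligned_b = [[1.0, 0.0], [0.0, 1.0]]
    swapped_b = [[0.0, 1.0], [1.0, 0.0]]
    print("matching pairs:  ", contrastive_loss(aligned_a, aligned_b, tau=1.0))
    print("mismatched pairs:", contrastive_loss(aligned_a, swapped_b, tau=1.0))
```

The loss is lower when each \(z_{i}^{a}\) is most similar to its own \(z_{i}^{b}\) than when the pairs are mismatched, which is the behavior the formula is designed to reward.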