Abstract: In the rapidly evolving field of artificial intelligence, multimodal models, e.g., integrating vision and language into visual-language models (VLMs), have become pivotal for many applications, ranging from image captioning to multimodal search engines. Among these models, the Contrastive Language-Image Pre-training (CLIP) model has demonstrated remarkable performance in understanding and generating nuanced relationships between text and images. However, the conventional training of such models often requires centralized aggregation of vast datasets, posing significant privacy and data governance challenges. To address these concerns, this paper proposes a novel approach that leverages Federated Learning and parameter-efficient adapters, i.e., Low-Rank Adaptation (LoRA), to train VLMs. This methodology preserves data privacy by training models across decentralized data sources and ensures model adaptability and efficiency through LoRA's parameter-efficient fine-tuning. Our approach speeds up training by up to 34.72 times and requires 2.47 times less memory than full fine-tuning.

What problem does this paper attempt to address?

This paper addresses several key problems in training vision-language models (VLMs), particularly privacy protection and data governance. Specifically:

1. **Privacy protection and data governance**: Traditional vision-language models (such as CLIP) require large amounts of centralized labeled data for training, which raises significant privacy and data governance problems. For example, centralizing all data in one location risks data leakage, and transferring data across organizations or regions is subject to legal and policy restrictions.
2. **Communication efficiency and computing resources**: In a federated learning (FL) setting, fully fine-tuning a large pre-trained model incurs high communication and computation costs. This is especially problematic for distributed devices (such as mobile and edge devices), which usually have limited computing power and bandwidth.
3. **Model adaptability and flexibility**: An open question is how to let the model adapt to different client data distributions while maintaining performance, especially when the data are non-independent and identically distributed (non-IID).

To solve these problems, the paper proposes FLoRA (Federated Learning with Low-Rank Adaptation), a method that combines low-rank adapters (LoRA) with federated learning. With this method, vision-language models can be fine-tuned efficiently and with privacy protection, without sacrificing model performance. Specifically:

- **Low-rank adapters (LoRA)**: Only a small fraction of the model parameters is updated, which significantly reduces communication overhead and memory usage.
- **Federated learning framework**: Training is carried out on distributed data sources, avoiding the privacy problems caused by data centralization.
- **Faster training**: The experimental results show that FLoRA speeds up training by up to 34.72 times and uses 2.47 times less memory than full fine-tuning.

In summary, the paper proposes a method that improves the performance and adaptability of vision-language models in distributed environments while preserving privacy and communication efficiency.
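To make the communication saving concrete, the following is a minimal Python sketch of the general idea: each client trains and sends only a low-rank adapter, and the server averages the adapters. It is an illustration under stated assumptions, not the paper's implementation. The layer size (512 x 512), the rank (2), the client sizes, the helper names (`lora_param_count`, `local_update`, `fedavg`), and the use of plain FedAvg-style weighted averaging are all hypothetical choices made here; the summary does not specify them. Local training is replaced by a random perturbation so the example stays self-contained.

```python
import random


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values in a rank-r adapter: B is d_out x r, A is r x d_in."""
    return rank * (d_out + d_in)


def local_update(global_adapter, rng, step=0.1):
    """Stand-in for local fine-tuning (hypothetical): perturb each adapter value."""
    return [w + step * rng.uniform(-1.0, 1.0) for w in global_adapter]


def fedavg(client_adapters, client_sizes):
    """Average flattened adapters, weighting each client by its dataset size."""
    total = sum(client_sizes)
    dim = len(client_adapters[0])
    return [
        sum(adapter[i] * n for adapter, n in zip(client_adapters, client_sizes)) / total
        for i in range(dim)
    ]


# Hypothetical dimensions for a single weight matrix; not taken from the paper.
D_OUT, D_IN, RANK = 512, 512, 2
full_params = D_OUT * D_IN                            # values sent under full fine-tuning
adapter_params = lora_param_count(D_OUT, D_IN, RANK)  # values sent with the adapter

rng = random.Random(0)
global_adapter = [0.0] * adapter_params   # flattened A and B, shared by the server
client_sizes = [100, 300, 600]            # hypothetical local dataset sizes

for _ in range(3):  # three communication rounds
    updates = [local_update(global_adapter, rng) for _ in client_sizes]
    global_adapter = fedavg(updates, client_sizes)

print(f"full: {full_params}, adapter: {adapter_params}, "
      f"ratio: {full_params // adapter_params}x")
```

With these hypothetical dimensions, each client exchanges 2,048 values per round for this matrix instead of 262,144. These numbers only illustrate why the adapter is cheaper to communicate and are unrelated to the 34.72x and 2.47x figures reported in the paper. Averaging the flattened A and B factors directly is a simplification: the average of the factors is generally not the same as the average of the products BA.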

FLoRA: Enhancing Vision-Language Models with Parameter-Efficient Federated Learning
