FLoRA: Enhancing Vision-Language Models with Parameter-Efficient Federated Learning

Duy Phuong Nguyen, J. Pablo Munoz, Ali Jannesari
2024-04-12
Abstract: In the rapidly evolving field of artificial intelligence, multimodal models, e.g., integrating vision and language into visual-language models (VLMs), have become pivotal for many applications, ranging from image captioning to multimodal search engines. Among these models, the Contrastive Language-Image Pre-training (CLIP) model has demonstrated remarkable performance in understanding and generating nuanced relationships between text and images. However, the conventional training of such models often requires centralized aggregation of vast datasets, posing significant privacy and data governance challenges. To address these concerns, this paper proposes a novel approach that leverages Federated Learning and parameter-efficient adapters, i.e., Low-Rank Adaptation (LoRA), to train VLMs. This methodology preserves data privacy by training models across decentralized data sources and ensures model adaptability and efficiency through LoRA's parameter-efficient fine-tuning. Our approach accelerates training time by up to 34.72 times and requires 2.47 times less memory usage than full fine-tuning.
Machine Learning, Artificial Intelligence
### What problems does this paper attempt to solve?

This paper addresses several key problems in training vision-language models (VLMs), chiefly privacy protection and data governance. Specifically:

1. **Privacy protection and data governance**: Conventional training of vision-language models such as CLIP requires aggregating large datasets in one central location. This raises significant privacy and data governance problems: centralizing all data risks leakage, and transferring data across organizations or regions is subject to legal and policy restrictions.
2. **Communication efficiency and compute cost**: In a federated learning (FL) setting, fully fine-tuning a large pre-trained model incurs high communication and computation costs. This is especially problematic for distributed clients such as mobile and edge devices, which usually have limited computing power and bandwidth.
3. **Model adaptability and flexibility**: The model must adapt to differing client data distributions while maintaining performance, especially when client data is non-independent and identically distributed (non-IID).

To solve these problems, the paper proposes FLoRA (Federated Learning with Low-Rank Adaptation), a method that combines low-rank adapters (LoRA) with federated learning. It enables efficient, privacy-preserving fine-tuning of vision-language models without sacrificing model performance. Specifically:

- **Low-rank adapters (LoRA)**: Only a small fraction of the model parameters is updated, which significantly reduces communication overhead and memory usage.
- **Federated learning framework**: Training is carried out on distributed data sources, avoiding the privacy problems caused by centralizing data.
- **Faster training**: The experimental results show that FLoRA speeds up training by up to 34.72 times and requires 2.47 times less memory than full fine-tuning.

In summary, the paper proposes a method that improves the performance and adaptability of vision-language models in distributed environments while preserving privacy and communication efficiency.
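
To make the combination of LoRA and federated learning concrete, here is a minimal sketch in plain Python. It rests on assumptions the summary does not confirm: the server uses FedAvg-style averaging weighted by local dataset size, each adapter consists of two low-rank factors `A` and `B`, and the layer size and rank are hypothetical. All function names are illustrative and are not taken from the paper's code.

```python
from typing import Dict, List

Matrix = List[List[float]]


def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in a dense d_out x d_in weight matrix (full fine-tuning)."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


def weighted_average(matrices: List[Matrix], weights: List[float]) -> Matrix:
    """Element-wise weighted mean of equally shaped matrices."""
    total = sum(weights)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
         for j in range(cols)]
        for i in range(rows)
    ]


def aggregate_adapters(
    client_adapters: List[Dict[str, Matrix]], client_sizes: List[int]
) -> Dict[str, Matrix]:
    """Assumed server step: average each adapter factor across clients.

    Only the small adapter factors are exchanged; the frozen base model
    and the raw client data are never transmitted.
    """
    weights = [float(n) for n in client_sizes]
    return {
        name: weighted_average([c[name] for c in client_adapters], weights)
        for name in client_adapters[0]
    }


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 768, 768, 8
ratio = full_param_count(d_out, d_in) / lora_param_count(d_out, d_in, rank)

# Two toy clients with rank-1 adapters and different local dataset sizes.
client_1 = {"A": [[1.0, 2.0]], "B": [[0.0], [4.0]]}
client_2 = {"A": [[3.0, 6.0]], "B": [[2.0], [0.0]]}
global_adapter = aggregate_adapters([client_1, client_2], [100, 300])
```

For this hypothetical layer, the adapter has 48 times fewer trainable parameters than the dense weight matrix. That ratio is a parameter count for one layer, not the 34.72 times training speedup reported in the paper. Note also that averaging `A` and `B` separately is not the same as averaging the products `B @ A`; the sketch uses it only because it is the simplest aggregation rule.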