Pramit Saha, Divyanshu Mishra, Felix Wagner, Konstantinos Kamnitsas, J. Alison Noble
Abstract: Large Vision-Language Models typically require large text and image datasets for effective fine-tuning. However, collecting data from various sites, especially in healthcare, is challenging due to strict privacy regulations. An alternative is to fine-tune these models on end-user devices, such as in medical clinics, without sending data to a server. These local clients typically have limited computing power and small datasets, which are not enough for fully fine-tuning large VLMs on their own. A naive solution to these scenarios is to leverage parameter-efficient fine-tuning (PEFT) strategies and apply federated learning (FL) algorithms to combine the learned adapter weights, thereby respecting the resource limitations and data privacy. However, this approach does not fully leverage the knowledge from multiple adapters trained on diverse data distributions and for diverse tasks. The adapters are adversely impacted by data heterogeneity and task heterogeneity across clients, resulting in suboptimal convergence. To this end, we propose a novel framework called FedPIA that improves upon the naive combinations of FL and PEFT by introducing Permutation and Integration of the local Adapters in the server and global Adapters in the clients exploiting Wasserstein barycenters for improved blending of client-specific and client-agnostic knowledge. This layerwise permutation helps to bridge the gap in the parameter space of local and global adapters before integration. We conduct over 2000 client-level experiments utilizing 48 medical image datasets across five different medical vision-language FL task settings encompassing visual question answering as well as image and report-based multi-label disease detection. Our experiments involving diverse client settings, ten different modalities, and two VLM backbones demonstrate that FedPIA consistently outperforms the state-of-the-art PEFT-FL baselines.
### What problems does this paper attempt to solve?
This paper addresses the challenges of efficiently fine-tuning large vision-language models (VLMs) in multi-modal federated learning (MMFL). It focuses on the following problems:
1. **Data privacy and resource limitations**:
- Large VLMs usually require a large amount of text and image data for effective fine-tuning. However, in sensitive areas such as healthcare, collecting data from different sites is very difficult due to strict privacy regulations.
- Local clients such as medical clinics and hospitals usually have limited computing resources and small datasets, which are not sufficient to fully fine-tune large VLMs.
2. **Data heterogeneity and task heterogeneity**:
- Data distributions and task requirements vary greatly across clients (data heterogeneity and task heterogeneity). As a result, the local adapters end up far apart in parameter space, which hurts the convergence and performance of the model.
- Simply combining adapters from different clients cannot fully exploit the knowledge embedded in them, resulting in suboptimal convergence.
3. **Limitations of existing methods**:
- Existing methods such as FedAvg and AdapterFusion address these problems only partially: they do not fully cope with data and task heterogeneity, which leads to model drift and unstable convergence.
- For example, although FedDAT introduces the Dual-Adapter Teacher (DAT) module, it still falls short in handling data and task heterogeneity and increases training complexity.
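The weakness of plain adapter averaging noted above can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes a hypothetical bottleneck adapter of the form `y = relu(x @ down) @ up`, and all names (`adapter_forward`, `average`, the weight values) are made up for illustration. Two clients hold adapters that compute exactly the same function but store their hidden units in a different order; averaging their weights element-wise, as a FedAvg-style aggregation would, yields an adapter that computes something else.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def adapter_forward(x: Vector, down: Matrix, up: Matrix) -> Vector:
    """Hypothetical bottleneck adapter: y = relu(x @ down) @ up."""
    hidden = [
        max(0.0, sum(xi * row[j] for xi, row in zip(x, down)))
        for j in range(len(up))
    ]
    return [sum(h * row[k] for h, row in zip(hidden, up)) for k in range(len(up[0]))]


def average(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise mean of two equally shaped matrices (FedAvg-style)."""
    return [[(p + q) / 2 for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


# Client A's adapter (illustrative values).
down_a = [[1.0, 0.0], [0.0, 1.0]]
up_a = [[1.0, 0.0], [0.0, 2.0]]

# Client B holds the same function with its two hidden units swapped:
# columns of `down` and rows of `up` are exchanged.
down_b = [[0.0, 1.0], [1.0, 0.0]]
up_b = [[0.0, 2.0], [1.0, 0.0]]

x = [1.0, 2.0]
out_a = adapter_forward(x, down_a, up_a)
out_b = adapter_forward(x, down_b, up_b)
out_avg = adapter_forward(x, average(down_a, down_b), average(up_a, up_b))
```

Here `out_a` and `out_b` are identical, while `out_avg` differs from both, even though the two clients agree perfectly on the function. With real heterogeneous clients the mismatch is less clean, but the same ordering ambiguity is one reason adapters can sit far apart in parameter space.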
To solve the above problems, the authors propose a new framework, **FedPIA (Federated Learning via Permuting and Integrating Adapters)**. Using Wasserstein barycenters, the framework permutes and integrates the local adapters on the server and the global adapter on each client, so that client-specific and globally shared knowledge are blended more effectively, improving the stability and performance of the model.
### Main contributions
1. **Introduction of Wasserstein barycenters**: Wasserstein barycenters are used to align and combine multiple client adapters, bringing them closer in parameter space and thereby improving knowledge fusion.
2. **Two-stage permutation and integration**: After the global adapter is initialized on the server, a permutation matrix is computed to align each client adapter with the global adapter. On the client side, the global adapter is aligned with the local adapter and the two are combined for better knowledge integration.
3. **Experimental verification**: More than 2,000 client-level experiments, using 48 medical image datasets across five medical vision-language task settings, show that FedPIA performs consistently and robustly under various heterogeneous conditions and outperforms existing baseline methods.
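The permute-then-integrate idea in contribution 2 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the adapter is assumed to be a pair of matrices `(down, up)`, the permutation is found by brute-force search over hidden-unit orderings (a stand-in for the Wasserstein barycenter computation the paper uses, feasible only for tiny layers), and every function name is hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Adapter = Tuple[Matrix, Matrix]  # (down: d x h, up: h x d), assumed layout


def unit_features(adapter: Adapter) -> List[List[float]]:
    """Describe hidden unit j by its incoming column of `down` and outgoing row of `up`."""
    down, up = adapter
    return [[row[j] for row in down] + list(up[j]) for j in range(len(up))]


def alignment_cost(perm: Sequence[int], client: Adapter, reference: Adapter) -> float:
    """Squared distance when client unit perm[j] is placed in reference slot j."""
    c_feats, r_feats = unit_features(client), unit_features(reference)
    return sum(
        (c - r) ** 2
        for j, src in enumerate(perm)
        for c, r in zip(c_feats[src], r_feats[j])
    )


def best_permutation(client: Adapter, reference: Adapter) -> Tuple[int, ...]:
    """Brute-force stand-in for the transport-based alignment (tiny layers only)."""
    n_hidden = len(reference[1])
    return min(
        permutations(range(n_hidden)),
        key=lambda p: alignment_cost(p, client, reference),
    )


def permute(adapter: Adapter, perm: Sequence[int]) -> Adapter:
    """Reorder hidden units: columns of `down` and rows of `up` move together."""
    down, up = adapter
    return [[row[src] for src in perm] for row in down], [list(up[src]) for src in perm]


def integrate(a: Adapter, b: Adapter, weight: float = 0.5) -> Adapter:
    """Convex combination of two adapters that are already aligned."""
    def mix(m: Matrix, n: Matrix) -> Matrix:
        return [[weight * p + (1 - weight) * q for p, q in zip(r, s)] for r, s in zip(m, n)]
    return mix(a[0], b[0]), mix(a[1], b[1])


# Illustrative values: the client adapter equals the global one up to unit order.
global_adapter: Adapter = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 2.0]])
client_adapter: Adapter = ([[0.0, 1.0], [1.0, 0.0]], [[0.0, 2.0], [1.0, 0.0]])

perm = best_permutation(client_adapter, global_adapter)
aligned = permute(client_adapter, perm)
merged = integrate(aligned, global_adapter)
```

In this toy case the aligned client adapter coincides with the global adapter, so integrating them changes nothing, whereas averaging the unaligned weights would not preserve the function. The same routine could be run in the other direction on a client, aligning the global adapter to the local one before combining them.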
Through these improvements, FedPIA respects the data privacy and resource constraints of multi-modal federated learning while coping more effectively with data and task heterogeneity, improving the performance and stability of the model.