On-Device Collaborative Language Modeling via a Mixture of Generalists and Specialists

Dongyang Fan, Bettina Messmer, Martin Jaggi
2024-10-02
Abstract: On-device LLMs have gained increasing attention for their ability to enhance privacy and provide a personalized user experience. To facilitate learning with private and scarce local data, federated learning has become a standard approach, though it introduces challenges related to system and data heterogeneity among end users. As a solution, we propose a novel $\textbf{Co}$llaborative learning approach with a $\textbf{Mi}$xture of $\textbf{G}$eneralists and $\textbf{S}$pecialists (CoMiGS), being the first to effectively address both. Our approach distinguishes generalists and specialists by aggregating certain experts across end users while keeping others localized to specialize in user-specific datasets. A key innovation of our method is the bi-level optimization formulation of the Mixture-of-Experts learning objective, where the router is updated using a separate validation set that represents the target distribution. CoMiGS effectively balances collaboration and personalization, as demonstrated by its superior performance in scenarios with high data heterogeneity across multiple datasets. By design, our approach accommodates users' varying computational resources through different numbers of specialists. By decoupling resource abundance from data quantity, CoMiGS remains robust against overfitting (due to the generalists' regularizing effect) while adapting to local data through specialist expertise.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
This paper addresses two major challenges in personalized, collaborative fine-tuning of large language models (LLMs) on-device: **system heterogeneity** and **data heterogeneity**. Specifically:

1. **System heterogeneity**:
   - Users' devices differ in computing resources, which leads to differences in model architecture and parameter count.
   - A method must run effectively on devices with different computing resources while maintaining model performance.
2. **Data heterogeneity**:
   - Users' local data distributions differ widely, for example in topics and language habits.
   - The model must be fine-tuned effectively on limited local data while protecting users' privacy.

To solve these problems, the paper proposes a new collaborative learning method, **CoMiGS** (a Collaborative learning approach with a Mixture of Generalists and Specialists). It has three components:

- **Generalists**: Some experts are shared, that is, aggregated across users, so that users can learn collaboratively and the model generalizes better.
- **Specialists**: Other experts stay local to each user, adapt to the unique characteristics of that user's data, and provide personalization.
- **Bi-level optimization framework**: The Mixture-of-Experts objective is formulated as a bi-level problem in which the router parameters are updated on a separate validation set that represents the target distribution, while the expert parameters are updated on the training set. This design lets the model adapt better to the target distribution, especially when the training and target distributions differ.

In this way, CoMiGS balances collaboration and personalization while handling both system and data heterogeneity.
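The bi-level structure described above can be written compactly. The following is only a sketch: the symbols are illustrative notation assumed here, not taken from the text. $\theta^{G}$ denotes the shared generalist parameters, $\theta_i^{S}$ and $\phi_i$ the specialist and router parameters of user $i$, and $D_i^{\mathrm{train}}$, $D_i^{\mathrm{val}}$ that user's training and validation sets, for $N$ users.

$$
\min_{\{\phi_i\}} \; \sum_{i=1}^{N} \mathcal{L}\big(D_i^{\mathrm{val}};\, \theta^{G\star}, \theta_i^{S\star}, \phi_i\big)
\quad \text{s.t.} \quad
\big(\theta^{G\star}, \{\theta_i^{S\star}\}\big) \in \arg\min_{\theta^{G},\, \{\theta_i^{S}\}} \; \sum_{i=1}^{N} \mathcal{L}\big(D_i^{\mathrm{train}};\, \theta^{G}, \theta_i^{S}, \phi_i\big)
$$

The upper level tunes each router on validation data, and the lower level fits the experts on training data with the routers held fixed. Problems of this shape are commonly approximated by alternating gradient steps on the two levels rather than by solving the inner problem exactly.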
In addition, the method lets the number of expert modules vary with each user's computing resources, which further improves flexibility and robustness.

### Summary

The main contributions of the paper are:

- It proposes CoMiGS, a new method that is the first to effectively address both system heterogeneity and data heterogeneity.
- It introduces a bi-level optimization formulation that improves model performance under distribution shift.
- It decouples resource abundance from data quantity, so that users with more local data can benefit from larger models, while users with stronger computing resources but less data are less likely to overfit.
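To make the training loop concrete, here is a minimal toy sketch in Python with NumPy. It is not the authors' implementation. Every name in it (`User`, `expert_step`, `router_step`, `aggregate_generalists`, `run`) is hypothetical, the experts are plain linear regressors instead of LLM modules, and the router is a single input-independent softmax gate per user. Those simplifications are assumptions made to keep the example small. What it does illustrate is the pattern described above: experts updated on training data, routers updated on validation data, one generalist averaged across users, and a different number of local specialists per user.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class User:
    """Toy user model: row 0 of `experts` is the generalist, the rest are specialists."""
    def __init__(self, n_specialists, dim, rng):
        self.experts = rng.normal(scale=0.1, size=(1 + n_specialists, dim))
        self.router_logits = np.zeros(1 + n_specialists)  # assumption: input-independent gate

    def predict(self, X):
        gate = softmax(self.router_logits)
        return (X @ self.experts.T) @ gate  # gate-weighted sum of expert outputs

    def loss(self, X, y):
        return float(np.mean((self.predict(X) - y) ** 2))

    def _backprop_signal(self, X, y):
        # Gradient of the MSE with respect to the effective weight vector
        return 2.0 / len(y) * X.T @ (self.predict(X) - y)

    def expert_step(self, X_train, y_train, lr):
        # Lower level: update experts on TRAINING data, router held fixed
        gate = softmax(self.router_logits)
        self.experts -= lr * np.outer(gate, self._backprop_signal(X_train, y_train))

    def router_step(self, X_val, y_val, lr):
        # Upper level: update router on VALIDATION data, experts held fixed
        gate = softmax(self.router_logits)
        d_gate = self.experts @ self._backprop_signal(X_val, y_val)
        self.router_logits -= lr * gate * (d_gate - gate @ d_gate)  # softmax Jacobian

def aggregate_generalists(users):
    # Server step: average only the generalist; specialists and routers stay local
    mean_generalist = np.mean([u.experts[0] for u in users], axis=0)
    for u in users:
        u.experts[0] = mean_generalist

def run(rounds=100, dim=4, seed=0, specialist_counts=(1, 2, 3)):
    rng = np.random.default_rng(seed)
    users = [User(n, dim, rng) for n in specialist_counts]
    shared = rng.normal(size=dim)
    data = []
    for _ in users:
        # Heterogeneous targets: a shared component plus a user-specific shift
        w_true = shared + rng.normal(scale=0.5, size=dim)
        X_tr, X_val = rng.normal(size=(32, dim)), rng.normal(size=(16, dim))
        data.append((X_tr, X_tr @ w_true, X_val, X_val @ w_true))
    aggregate_generalists(users)  # start from a common generalist
    initial = [u.loss(d[2], d[3]) for u, d in zip(users, data)]
    for _ in range(rounds):
        for u, (X_tr, y_tr, X_val, y_val) in zip(users, data):
            u.expert_step(X_tr, y_tr, lr=0.1)
            u.router_step(X_val, y_val, lr=0.1)
        aggregate_generalists(users)
    final = [u.loss(d[2], d[3]) for u, d in zip(users, data)]
    return users, initial, final
```

Calling `run()` returns the users together with their validation losses before and after training. Because only row 0 is averaged, users with different specialist counts can still take part in the same aggregation step, which is the property that lets model size follow device resources rather than data quantity.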