MultiLoRA: Democratizing LoRA for Better Multi-Task Learning

Yiming Wang, Yu Lin, Xiaodong Zeng, Guannan Zhang
2023-11-20
Abstract: LoRA achieves remarkable resource efficiency and comparable performance when adapting LLMs for specific tasks. Since ChatGPT demonstrated superior performance on various tasks, there has been a growing desire to adapt one model for all tasks. However, the explicit low rank of LoRA limits the adaptation performance in complex multi-task scenarios. LoRA is dominated by a small number of top singular vectors, while fine-tuning decomposes into a set of less important unitary transforms. In this paper, we propose MultiLoRA for better multi-task adaptation by reducing the dominance of top singular vectors observed in LoRA. MultiLoRA scales LoRA modules horizontally and changes the parameter initialization of adaptation matrices to reduce parameter dependency, thus yielding more balanced unitary subspaces. We unprecedentedly construct specialized training data by mixing datasets of instruction following, natural language understanding, and world knowledge to cover semantically and syntactically different samples. With only 2.5% of additional parameters, MultiLoRA outperforms single LoRA counterparts and fine-tuning on multiple benchmarks and model scales. Further investigation into the weight update matrices of MultiLoRA shows reduced dependency on top singular vectors and more democratic unitary transform contributions.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper asks how to improve the adaptation performance of LoRA (Low-Rank Adaptation) in complex multi-task scenarios while keeping it parameter-efficient. LoRA performs well in single-task adaptation, but its explicit low-rank constraint limits performance when one model must cover many tasks. The limitation shows up in LoRA's weight update matrix, which is dominated by a small number of top singular vectors. Full-parameter fine-tuning is more "democratic": importance is spread more evenly across the singular vectors of its weight update.

To overcome this, the paper proposes MultiLoRA. It scales LoRA modules horizontally (several modules in parallel), changes how the adaptation matrices are initialized, and introduces learnable scaling factors. Together these changes reduce parameter dependency and yield more balanced unitary transform subspaces. The paper also constructs a multi-task dataset that covers semantically and syntactically different samples and uses it to verify the effectiveness of MultiLoRA.

### Main contributions:

1. **Identify the dominance of top singular vectors in LoRA's weight update matrix**: Compared with full-parameter fine-tuning, LoRA's weight update is dominated by its top singular vectors, while the full-parameter update is more "democratic".
2. **Propose the MultiLoRA method**: Horizontally scaling LoRA modules, changing the parameter initialization, and introducing learnable scaling factors reduce parameter dependency and make the weight update matrix more "democratic".
3. **Construct a multi-task learning scheme**: A multi-task dataset covering semantically and syntactically different samples is built from several tasks of interest for generative LLMs (instruction following, natural language understanding, and world knowledge).
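The three ingredients named in the method (parallel LoRA modules, a changed initialization, and learnable scaling factors) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation. The class name `MultiLoRALinear`, the uniform initialization bounds, the use of one scalar scale per branch, and the zero initialization of those scales are all choices made here, because the summary does not specify them.

```python
import numpy as np


class MultiLoRALinear:
    """Frozen linear layer plus n parallel low-rank branches (illustrative sketch)."""

    def __init__(self, d_in, d_out, n_branches=3, rank=2, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen "pretrained" weight; a random stand-in for this example.
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # Assumption: A and B both get a non-zero uniform initialization.
        bound_a = np.sqrt(6.0 / d_in)
        bound_b = np.sqrt(6.0 / rank)
        self.A = rng.uniform(-bound_a, bound_a, (n_branches, rank, d_in))
        self.B = rng.uniform(-bound_b, bound_b, (n_branches, d_out, rank))
        # Assumption: one learnable scalar per branch, zero-initialized so
        # the layer starts out identical to the frozen layer.
        self.scale = np.zeros(n_branches)

    def delta_weight(self):
        # Sum over branches of scale_i * (B_i @ A_i); shape (d_out, d_in).
        return np.einsum("n,nor,nri->oi", self.scale, self.B, self.A)

    def __call__(self, x):
        # x has shape (batch, d_in); output has shape (batch, d_out).
        return x @ (self.W + self.delta_weight()).T


layer = MultiLoRALinear(d_in=32, d_out=32, n_branches=3, rank=2)
x = np.ones((4, 32))
y_init = layer(x)          # equals the frozen layer's output at initialization
layer.scale[:] = 1.0       # pretend training has moved the scales away from zero
update_rank = np.linalg.matrix_rank(layer.delta_weight())
```

In this sketch the combined update can reach rank `n_branches * rank` at most, so stacking branches raises the rank ceiling of the update while each branch stays low-rank.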
The experimental results show that MultiLoRA outperforms LoRA on multiple benchmarks and model scales, and even surpasses full-parameter fine-tuning on small models.

### Experimental results:

- **Multi-task adaptation ability**: MultiLoRA shows stronger data-fitting ability and more consistent performance across the benchmarks. On small models in particular, its average score is 2.8% higher than LoRA's, and it surpasses full-parameter fine-tuning on some tasks.
- **Resource use and throughput**: MultiLoRA's training throughput is close to LoRA's, at approximately 400 tokens/GPU/second, nearly double the 208 tokens/GPU/second of full-parameter fine-tuning. However, its VRAM usage grows linearly with the number of parallel LoRA modules, which may become a problem in long-sequence training.

### Conclusion:

MultiLoRA achieves more balanced unitary transform subspaces by reducing the dominance of the top singular vectors, and therefore adapts better in complex multi-task scenarios. In the experiments it outperforms LoRA while keeping a training throughput close to LoRA's, and on small models it even surpasses full-parameter fine-tuning. The cost is VRAM usage that grows with the number of parallel modules.
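The idea of "dominance of the top singular vectors" can be made concrete with a simple measurement: the share of the singular-value mass carried by the top k singular values of a weight update. The sketch below is a hypothetical proxy, not the metric used in the paper. The function name `top_k_share` is invented here, and the two random matrices are stand-ins for a low-rank update and a dense update, not experimental results.

```python
import numpy as np


def top_k_share(delta_w, k):
    """Fraction of the total singular-value mass held by the k largest values."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # returned in descending order
    return float(s[:k].sum() / s.sum())


rng = np.random.default_rng(0)
d, r = 64, 4

# Stand-in for a single low-rank update: a product of two thin matrices.
low_rank_update = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
# Stand-in for a dense (full-parameter style) update.
dense_update = rng.standard_normal((d, d))

low_rank_share = top_k_share(low_rank_update, r)  # nearly all mass in the top r
dense_share = top_k_share(dense_update, r)        # mass spread over many values
print(f"low-rank: {low_rank_share:.3f}, dense: {dense_share:.3f}")
```

A share close to 1 means a few directions carry almost the entire update. A smaller share corresponds to the more "democratic" spectrum that the summary attributes to full-parameter fine-tuning and to MultiLoRA.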