Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape

Tao Li, Zhengbao He, Yujun Li, Yasheng Wang, Lifeng Shang, Xiaolin Huang
2024-09-22
Abstract: Fine-tuning large-scale pre-trained models is prohibitively expensive in terms of computational and memory costs. Low-Rank Adaptation (LoRA), a popular Parameter-Efficient Fine-Tuning (PEFT) method, provides an efficient way to fine-tune models by optimizing only a low-rank matrix. Despite recent progress in improving LoRA's performance, the connection between the LoRA optimization space and the original full parameter space is often overlooked. A solution that appears flat in the LoRA space may still have sharp directions in the full parameter space, potentially harming generalization performance. In this paper, we propose Flat-LoRA, an efficient approach that seeks a low-rank adaptation located in a flat region of the full parameter space. Instead of relying on the well-established sharpness-aware minimization approach, which can incur significant computational and memory burdens, we utilize random weight perturbation with a Bayesian expectation loss objective to maintain training efficiency, and we design a refined perturbation generation strategy for improved performance. Experiments on natural language processing and image classification tasks with various architectures demonstrate the effectiveness of our approach.
Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the excessive computational and memory cost of fine-tuning large-scale pre-trained models, focusing on the Low-Rank Adaptation (LoRA) method. LoRA significantly reduces the number of trainable parameters and the training cost by optimizing only low-rank matrices, but efforts to improve its performance concentrate on the LoRA subspace and ignore its relationship to the original full-parameter space.

**Key issues**:

1. **Disconnection between the LoRA subspace and the full-parameter space**:
   - A solution that appears flat in the LoRA subspace may lie in a sharp region of the full-parameter space, which can damage generalization performance.
   - For example, as shown in Figure 1, a flat minimum in the LoRA subspace (blue curve) may exhibit a sharp direction in the full-parameter space (red curve), thus hurting generalization performance.
2. **Limitations of existing methods**:
   - Previous works have tried to improve LoRA by introducing more dedicated budgets, decomposing the optimization into direction and magnitude updates, and designing better initialization strategies, but most of these methods only consider optimization within the LoRA subspace.
   - The traditional Sharpness-Aware Minimization (SAM) method can effectively find flat minima, but it significantly increases training time and memory overhead, which makes it unsuitable for fine-tuning large-scale models.

### Solutions

To solve the above problems, the paper proposes the **Flat-LoRA** method. Its main contributions are:

1. **Optimizing flatness in the full-parameter space**:
   - Flat-LoRA optimizes the flatness of the loss landscape in the full-parameter space where the low-rank adaptation is located, so that the combined weights lie in a flat region, thereby improving generalization performance.
2. **Using Bayesian expected loss optimization**:
   - To maintain training efficiency, Flat-LoRA uses random weight perturbation with the Bayesian expected loss objective:
     \[ \min_{A,B} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)} L(W + s\cdot BA + \epsilon) \]
   - This objective smooths the loss landscape in the full-parameter space, which recovers flatter minima, and it does not require additional gradient steps.
3. **Efficient random perturbation generation strategy**:
   - A new weight noise generation scheme is proposed that takes the filter structure and the input dimension into account, ensuring that the variance introduced during forward propagation is independent of the input dimension:
     \[ \epsilon \sim \mathcal{N}\left(0, \frac{\sigma^2}{n} \text{diag}\left(\|W'_{1,:}\|_2^2, \|W'_{2,:}\|_2^2, \cdots, \|W'_{m,:}\|_2^2\right) I_{m\times n}\right) \]
     where \(W'_{i,:}\) denotes the \(i\)-th row of the combined weight matrix \(W' = W + s\cdot BA \in \mathbb{R}^{m\times n}\).
4. **Experimental verification**:
   - Extensive experiments on natural language processing and computer vision tasks show that Flat-LoRA can achieve state-of-the-art performance across different architectures and can be easily integrated into existing methods to obtain consistent improvements.

Through these innovations, Flat-LoRA improves the performance of LoRA and addresses the poor generalization that can arise when existing methods ignore sharpness in the full-parameter space.
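The perturbation scheme and the expected-loss objective summarized above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the toy squared-error loss, and the single-sample estimate of the expectation are all assumptions made for illustration. It reads the noise formula as giving every entry in row \(i\) of the combined weight a variance of \(\frac{\sigma^2}{n}\|W'_{i,:}\|_2^2\).

```python
import numpy as np


def merged_weight(W, A, B, s):
    """Combined weight W' = W + s * B @ A, with W (m x n), B (m x r), A (r x n)."""
    return W + s * (B @ A)


def sample_row_scaled_noise(W_merged, sigma, rng):
    """Draw eps with eps[i, j] ~ N(0, sigma^2 / n * ||W'_{i,:}||_2^2).

    Hypothetical helper: one reading of the noise formula in the summary.
    """
    m, n = W_merged.shape
    # Squared L2 norm of each row, kept as a column for broadcasting.
    row_sq_norms = np.sum(W_merged ** 2, axis=1, keepdims=True)  # shape (m, 1)
    std = sigma * np.sqrt(row_sq_norms / n)                      # shape (m, 1)
    return std * rng.standard_normal((m, n))


def perturbed_loss(W, A, B, s, sigma, x, y, rng):
    """One-sample estimate of E_eps[ L(W + s*B@A + eps) ].

    Assumption: L is a toy mean-squared error of a single linear layer;
    the paper's actual losses and models are not reproduced here.
    """
    W_merged = merged_weight(W, A, B, s)
    eps = sample_row_scaled_noise(W_merged, sigma, rng)
    pred = (W_merged + eps) @ x
    return float(np.mean((pred - y) ** 2))
```

In a training loop under these assumptions, fresh noise would be drawn at every step and only `A` and `B` would receive gradients, so the perturbation adds one noise sample per step rather than the extra gradient computation that SAM requires.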