PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning

Shiva Kumar Pentyala, Zhichao Wang, Bin Bi, Kiran Ramnath, Xiang-Bo Mao, Regunathan Radhakrishnan, Sitaram Asur, Cheng
2024-06-26
Abstract: Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks. LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment to be usable in downstream applications. However, this sequential training pipeline leads to an alignment tax that degrades LLM performance. This paper introduces PAFT, a new PArallel training paradigm for effective LLM Fine-Tuning, which independently performs SFT and preference alignment (e.g., DPO and ORPO) with the same pre-trained model on their respective datasets. The model produced by SFT and the model from preference alignment are then merged into a final model by parameter fusing for use in downstream applications. This work reveals the important finding that preference alignment like DPO naturally results in a sparse model, while SFT leads to a naturally dense model that needs to be sparsified for effective model merging. This paper introduces an effective interference resolution which reduces the redundancy by sparsifying the delta parameters. The LLM resulting from the new training paradigm achieved Rank #1 on the HuggingFace Open LLM Leaderboard. Comprehensive evaluation shows the effectiveness of the parallel training paradigm.
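The delta parameters and parameter fusing mentioned in the abstract can be illustrated with a small sketch. It assumes a model is represented as a toy dictionary of float lists, and it uses a plain weighted sum of deltas (a task-arithmetic style merge) as a stand-in; the abstract does not specify the fusing method, and the names `delta`, `sparsity`, and `merge` are hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model state dict (assumption: real models hold tensors).
Params = Dict[str, List[float]]


def delta(finetuned: Params, base: Params) -> Params:
    """Delta parameters: fine-tuned weights minus the shared pre-trained weights."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def sparsity(d: Params, tol: float = 1e-8) -> float:
    """Fraction of delta entries whose magnitude is at most `tol`."""
    values = [v for vs in d.values() for v in vs]
    return sum(abs(v) <= tol for v in values) / len(values)


def merge(base: Params, deltas: List[Params], weights: List[float]) -> Params:
    """Add a weighted sum of deltas onto a copy of the base weights.

    This is a simplified merge, not necessarily the method used in the paper.
    """
    merged = {k: list(vs) for k, vs in base.items()}
    for d, w in zip(deltas, weights):
        for k in merged:
            merged[k] = [m + w * x for m, x in zip(merged[k], d[k])]
    return merged
```

In this toy setting, two deltas that touch disjoint entries merge without interfering, which is the intuition behind wanting both deltas to be sparse before fusing.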
Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning" targets the "alignment tax" that large language models (LLMs) incur when supervised fine-tuning (SFT) is followed by preference alignment (e.g., DPO or ORPO). In the traditional sequential pipeline (SFT first, then preference alignment), the model's performance on specific tasks declines because the diverse capabilities acquired through SFT may be forgotten during preference alignment.

### Solutions

To reduce the alignment tax, the paper proposes a new parallel training paradigm (PAFT) with two main features:

1. **Parallel training**:
   - SFT and preference alignment are carried out independently, each starting from the same pre-trained model and training on its own dataset.
   - The SFT model and the preference-aligned model are then combined into a final model through parameter fusion for downstream applications.
2. **Sparsity handling**:
   - SFT is found to naturally produce a dense model, while preference alignment (such as DPO) naturally produces a sparse model.
   - An interference resolution method is proposed: adding an L1-norm penalty to the SFT loss function sparsifies the delta parameters and so reduces redundancy.
   - This method significantly improves the performance of the final model across different model merging methods.

### Experimental results

- **Benchmark tests**: A comprehensive evaluation was carried out on the HuggingFace Open LLM Leaderboard and the AlpacaEval benchmark.
- **Performance improvement**: PAFT significantly outperforms both traditional sequential training and individual training on multiple benchmark tasks. In particular, PAFT ranks first in the 7B/8B model category and also ranks high on the global leaderboard.
- **Importance of sparsity**: PAFT with sparsity introduced performs better when merging models, especially with merging methods such as TIES and DARE TIES.

### Main contributions

1. **Advantages of parallel training**: The paper shows that training SFT and preference alignment in parallel outperforms sequential training and effectively reduces the alignment tax.
2. **Importance of sparse model integration**: The paper emphasizes that merging sparse models prevents conflicts between them while retaining the full capabilities of each, and demonstrates that the L1-norm is effective at promoting sparsity during training.
3. **Comprehensive evaluation**: PAFT is evaluated on multiple well-known public benchmarks, verifying its effectiveness and robustness across different models and tasks.

### Conclusion

Through parallel training and sparsity handling, PAFT effectively addresses the alignment tax of traditional sequential training and improves the performance of large language models on downstream tasks. The method offers new ideas and directions for future LLM optimization.
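The L1-norm penalty on the SFT loss can be sketched as follows. This is a minimal illustration that assumes the penalty is applied directly to a flat list of delta parameters with a hypothetical coefficient `lam`; the summary does not state which parameters are penalized or how the objective is optimized. The soft-thresholding step is the standard proximal update for an L1 term and is included only to show how such a penalty drives small entries to exactly zero.

```python
from typing import List


def l1_penalized_loss(task_loss: float, delta: List[float], lam: float) -> float:
    """SFT task loss plus lam times the L1 norm of the delta parameters."""
    return task_loss + lam * sum(abs(d) for d in delta)


def soft_threshold(delta: List[float], threshold: float) -> List[float]:
    """Shrink each entry toward zero by `threshold`; entries within it become 0.0.

    Illustrative proximal step for the L1 term, not the paper's optimizer.
    """
    out = []
    for d in delta:
        magnitude = abs(d) - threshold
        if magnitude <= 0.0:
            out.append(0.0)
        else:
            out.append(magnitude if d > 0 else -magnitude)
    return out
```

A larger `lam` (or threshold) zeroes more entries, trading some fit on the SFT data for a sparser delta that is less likely to collide with the preference-alignment delta during merging.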