LM-Cocktail: Resilient Tuning of Language Models via Model Merging

Shitao Xiao, Zheng Liu, Peitian Zhang, Xingrun Xing
2023-12-09
Abstract: Pre-trained language models are continually fine-tuned to better support downstream applications. However, this operation may result in significant performance degeneration on general tasks beyond the targeted domain. To overcome this problem, we propose LM-Cocktail, which enables the fine-tuned model to stay resilient in general perspectives. Our method is conducted in the form of model merging, where the fine-tuned language model is merged with the pre-trained base model or the peer models from other domains through weighted average. Despite its simplicity, LM-Cocktail is surprisingly effective: the resulting model is able to achieve strong empirical performance across the whole scope of general tasks while preserving a superior capacity in its targeted domain. We conduct comprehensive experiments with LLama and BGE models on popular benchmarks, including FLAN, MMLU, and MTEB, whose results validate the efficacy of our proposed method. The code and checkpoints are available at https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail.
Computation and Language, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
The paper addresses catastrophic forgetting: after a pre-trained language model is fine-tuned for a specific task, its generalization ability outside the target domain declines significantly. When a general-purpose language model is fine-tuned for a specific task, its performance on that task improves, but its performance on other tasks can drop substantially. This is undesirable in practice because a language model ideally needs both specialized expertise and broad general knowledge.

To overcome this problem, the paper proposes **LM-Cocktail**. Through model merging, the fine-tuned model keeps its high performance on the target task while retaining good generalization on general tasks. Specifically, LM-Cocktail merges the fine-tuned language model with the pre-trained base model, or with peer models fine-tuned on other domains, by weighted averaging.

### Main contributions

1. **A simple and effective model merging method**: LM-Cocktail merges the fine-tuned model with the base model or with models fine-tuned on other domains by weighted averaging, improving performance on general tasks without sacrificing performance on the target task.
2. **Compatibility with existing fine-tuning processes**: LM-Cocktail is applied as a post-processing step after fine-tuning, so the existing fine-tuning process needs no major modification.
3. **Experimental verification**: The paper reports experiments on multiple benchmarks, including FLAN, MMLU, and MTEB, and the results show that LM-Cocktail improves the generalization ability of the fine-tuned model.
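The weighted averaging described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `StateDict` type, the `merge_models` function, and the toy parameter values are all hypothetical, and parameters are represented as flat Python lists rather than real tensors. It assumes every model shares the same architecture, so parameter names and sizes match.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def merge_models(models: Sequence[StateDict], weights: Sequence[float]) -> StateDict:
    """Return the weighted average of several models' parameters."""
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]  # normalize so the weights sum to 1

    names = set(models[0])
    if any(set(m) != names for m in models):
        raise ValueError("all models must share the same parameter names")

    merged: StateDict = {}
    for name in models[0]:
        tensors = [m[name] for m in models]
        if any(len(t) != len(tensors[0]) for t in tensors):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Average each position across models using the normalized weights.
        merged[name] = [
            sum(w * v for w, v in zip(norm, values))
            for values in zip(*tensors)
        ]
    return merged


# Toy example (made-up values): merge a fine-tuned model with its base model.
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
fine_tuned = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
merged = merge_models([fine_tuned, base], weights=[0.5, 0.5])
```

With equal weights, each merged parameter is the midpoint between the fine-tuned and base values. Shifting weight toward the fine-tuned model would favor the target task, while shifting it toward the base model would favor general behavior.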
### Method details

- **Model merging strategy**: The core of LM-Cocktail lies in selecting the models to merge and determining the merging weights. The paper proposes two main merging strategies:
  - **Single-expert model merging**: When no fine-tuned models from other domains are available, the fine-tuned model is merged directly with the base model.
  - **Multi-expert model merging**: When fine-tuned models from other domains are available, they are merged together with the fine-tuned model and the base model, and the merging weights are estimated from a small number of samples.
- **Weight calculation**: The merging weights are based on the performance of the candidate models on the target task. The weight calculation formula is:

  \[ w_i \leftarrow \text{softmax}\left(-\frac{L(M_i, E_t)}{\tau}\right) \]

  where \( L(M_i, E_t) \) is the prediction loss of candidate model \( M_i \) on the small sample set \( E_t \) from the target task, and \( \tau \) is a temperature parameter that controls the smoothness of the weight distribution. A candidate with a lower loss on the target task therefore receives a larger merging weight.

### Experimental results

The paper reports experiments on two types of models, decoder and encoder:

- **Decoder model**: LM-Cocktail maintains the high performance of the fine-tuned model on the target task while significantly improving its performance on other tasks.
- **Encoder model**: Similarly, LM-Cocktail performs well on the target task and significantly improves performance on other tasks.

### Conclusion

LM-Cocktail offers a simple and effective way, based on model merging, to counter the decline in general ability that language models suffer after task-specific fine-tuning. The method applies to both decoder and encoder models.
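The weight formula from the method details can be illustrated with a short sketch. This is an assumption-laden example rather than the paper's code: the function name `merging_weights`, the loss values, and the temperature are all made up for illustration. It computes a softmax over the negated, temperature-scaled losses, so candidates with lower loss on the target samples get larger weights.

```python
import math
from typing import List, Sequence


def merging_weights(losses: Sequence[float], tau: float = 1.0) -> List[float]:
    """Compute w_i = softmax(-L_i / tau) over the candidate models' losses."""
    if not losses:
        raise ValueError("need at least one loss value")
    if tau <= 0:
        raise ValueError("temperature tau must be positive")
    scores = [-loss / tau for loss in losses]
    # Subtract the maximum score before exponentiating for numerical stability;
    # this does not change the softmax result.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up losses of three candidate models on a few target-task samples.
losses = [0.4, 1.2, 2.5]
weights = merging_weights(losses, tau=0.5)
```

A smaller `tau` concentrates the weight on the best candidate, while a larger `tau` spreads it more evenly across candidates.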