Abstract: Pre-trained language models are continually fine-tuned to better support downstream applications. However, this operation may result in significant performance degeneration on general tasks beyond the targeted domain. To overcome this problem, we propose LM-Cocktail, which enables the fine-tuned model to stay resilient on general tasks. Our method takes the form of model merging, where the fine-tuned language model is merged with the pre-trained base model or with peer models from other domains through weighted averaging. Despite its simplicity, LM-Cocktail is surprisingly effective: the resulting model achieves strong empirical performance across the whole scope of general tasks while preserving a superior capacity in its targeted domain. We conduct comprehensive experiments with Llama and BGE models on popular benchmarks, including FLAN, MMLU, and MTEB, whose results validate the efficacy of our proposed method. The code and checkpoints are available at https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail.

What problem does this paper attempt to address?

The paper addresses catastrophic forgetting: after a pre-trained language model is fine-tuned for a specific task, its generalization ability outside the target domain declines significantly. Specifically, when a general-purpose language model is fine-tuned to adapt to a specific task, its performance on that task improves, but its performance on other, unseen tasks drops significantly. This is undesirable in practical applications because, ideally, a language model should combine domain expertise with broad general knowledge.

To overcome this problem, the paper proposes the **LM-Cocktail** method. Through model merging, the fine-tuned model keeps its high performance on the target task while retaining good generalization ability on general tasks. Concretely, LM-Cocktail merges the fine-tuned language model with the pre-trained base model, or with peer models fine-tuned on other domains, by weighted averaging.

### Main contributions

1. **Proposed a simple and effective model merging method**: LM-Cocktail merges the fine-tuned model with the base model or with fine-tuned models from other domains by weighted averaging, thereby improving the model's performance on general tasks without sacrificing its performance on the target task.
2. **Compatible with existing fine-tuning processes**: LM-Cocktail can be applied as a post-processing step after fine-tuning, without major modifications to the existing fine-tuning process.
3. **Experimental verification of the effectiveness of the method**: The paper conducts experiments on multiple benchmarks, including FLAN, MMLU, and MTEB, and the results show that LM-Cocktail significantly improves the generalization ability of the fine-tuned model.
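The weighted averaging that LM-Cocktail relies on can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `merge_models`, the representation of a model as a dictionary of flattened parameter lists, and the equal 0.5/0.5 weights are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: parameter name -> flattened parameter values.
Params = Dict[str, List[float]]


def merge_models(models: Sequence[Params], weights: Sequence[float]) -> Params:
    """Return the weighted average of models that share parameter names and shapes."""
    if not models or len(models) != len(weights):
        raise ValueError("need at least one model and exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalise so the merged parameters stay on the same scale as the inputs.
    norm = [w / total for w in weights]

    merged: Params = {}
    for name in models[0]:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(model[name] for model in models))
        merged[name] = [sum(w * v for w, v in zip(norm, col)) for col in columns]
    return merged


# Toy "models" with made-up values; the 0.5/0.5 split is an illustrative choice.
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
tuned = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
merged = merge_models([tuned, base], [0.5, 0.5])
```

With real checkpoints the same loop would typically run over framework tensors rather than Python lists, but the arithmetic is unchanged: each merged parameter is a convex combination of the corresponding parameters of the input models.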
### Method details

- **Model merging strategy**: The core of LM-Cocktail lies in how to select the models to be merged and how to determine the merging weights. The paper proposes two main merging strategies:
  - **Single-expert model merging**: When no fine-tuned models from other domains are available, the fine-tuned model is merged directly with the base model.
  - **Multi-expert model merging**: When fine-tuned models from other domains are available, they can be merged together with the fine-tuned model and the base model, and the merging weights can be estimated from a small number of samples.
- **Weight calculation**: The merging weights are based on the performance of the candidate models on the target task. Specifically, the weight calculation formula is:

  \[ w_i \leftarrow \text{softmax}\left(-\frac{L(M_i, E_t)}{\tau}\right) \]

  where \( L(M_i, E_t) \) is the prediction loss of the candidate model \( M_i \) on the small sample set \( E_t \) of the target task, and \( \tau \) is the temperature parameter that controls the smoothness of the weights. A candidate with a lower loss on the target samples therefore receives a larger weight.

### Experimental results

The paper conducts experiments on two types of models, decoder-based and encoder-based, and the results show that:

- **Decoder model**: LM-Cocktail maintains the high performance of the fine-tuned model on the target task and also significantly improves its performance on other tasks.
- **Encoder model**: Similarly, LM-Cocktail performs well on the target task and also shows a significant performance improvement on other tasks.

### Conclusion

LM-Cocktail offers a simple and effective way, based on model merging, to counter the loss of generalization ability that language models suffer after fine-tuning on a specific task. The method applies to both decoder and encoder models, which gives it wide applicability.
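The weight formula in the method details, \( w_i \leftarrow \text{softmax}(-L(M_i, E_t)/\tau) \), can be sketched as follows. The function name `merging_weights`, the example loss values, and the temperature of 0.5 are illustrative assumptions, not values from the paper; the subtraction of the maximum score is a standard numerical-stability step that does not change the result.

```python
import math
from typing import List, Sequence


def merging_weights(losses: Sequence[float], tau: float = 1.0) -> List[float]:
    """Softmax over negative losses: lower loss on the target samples -> larger weight."""
    if not losses:
        raise ValueError("need at least one candidate loss")
    if tau <= 0:
        raise ValueError("temperature tau must be positive")
    scores = [-loss / tau for loss in losses]
    # Shift by the maximum score so math.exp never overflows.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


# Made-up losses L(M_i, E_t) for three candidate models on a few target examples.
example_losses = [0.8, 1.2, 2.5]
weights = merging_weights(example_losses, tau=0.5)
```

A smaller `tau` concentrates the weight on the lowest-loss candidate, while a larger `tau` pushes the weights toward a uniform average.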

LM-Cocktail: Resilient Tuning of Language Models via Model Merging

Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild

CITI: Enhancing Tool Utilizing Ability in Large Language Models without Sacrificing General Performance

Unlocking the Potential of Model Merging for Low-Resource Languages

Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement

Cross-model Control: Improving Multiple Large Language Models in One-time Training

Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities

SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models

It's Morphing Time: Unleashing the Potential of Multiple LLMs via Multi-objective Optimization

Unlocking Continual Learning Abilities in Language Models

Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning

Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning

Knowledge Fusion By Evolving Weights of Language Models

Mixture-of-Skills: Learning to Optimize Data Usage for Fine-Tuning Large Language Models

Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance -- A Case Study in Finance

An Emulator for Fine-Tuning Large Language Models using Small Language Models

Mitigating Catastrophic Forgetting in Language Transfer via Model Merging

Generalizable and Stable Finetuning of Pretrained Language Models on Low-Resource Texts

CombLM: Adapting Black-Box Language Models through Small Fine-Tuned Models

Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming

Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models