What Matters for Model Merging at Scale?

Prateek Yadav, Tu Vu, Jonathan Lai, Alexandra Chronopoulou, Manaal Faruqui, Mohit Bansal, Tsendsuren Munkhdalai
2024-10-05
Abstract: Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors, such as the base model quality and the number of expert models, to affect the merged model's performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using 4 popular merging methods (Averaging, Task Arithmetic, Dare, and TIES) across model sizes ranging from 1B to 64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the experts' training tasks, and zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third, merging consistently improves generalization capabilities. Notably, when merging 8 large expert models, the merged models often generalize better than the multitask trained models. Fourth, we can better merge more expert models when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper explores how effective model merging is at large scale and how its performance depends on several factors. Specifically, the authors systematically evaluate how model size, base model quality, merging method, and the number of expert models affect the performance of the merged model. Previous research has primarily focused on merging a few small models, so it has not addressed how these factors interact as model scale increases.

### Main Research Questions

1. **What are the differences in effectiveness between pre-trained models and instruction-tuned models when creating expert models?**
   - Investigate the impact of the type of base model (pre-trained vs. instruction-tuned) on the performance of the merged model.
2. **Does model merging become easier or more difficult as the model scale increases?**
   - Explore the impact of model scale on the merging process.
3. **What is the impact of merging on the performance of unseen tasks (zero-shot generalization), and how is this impact affected by model scale?**
   - Analyze the generalization ability of the merged model on unseen tasks and the role of model scale.
4. **How many expert models can be merged without losing performance, and how does this depend on model scale?**
   - Investigate the number of expert models that can be successfully merged at different model scales.

### Experimental Design

To answer the above questions, the authors conducted extensive experiments using the PaLM-2 model and its instruction-tuned version PaLM-2-IT, at model scales of up to 64 billion parameters. They experimented with four popular merging methods: averaging, task arithmetic, Dare-TIES, and TIES merging. The experiments covered the merging of 2 to 8 expert models and were evaluated on two axes: training tasks (held-in) and unseen tasks (held-out).

### Experimental Results

1. **Instruction-tuned models facilitate more effective merging**:
   - Expert models created from instruction-tuned base models (such as PaLM-2-IT) performed better after merging, and this held across merging methods and model scales.
2. **Larger models are easier to merge**:
   - As the model scale increases, merging becomes easier and the performance of the merged model improves. In particular, the 64 billion parameter PaLM-2-IT model is the easiest to merge.
3. **Merging improves zero-shot generalization ability**:
   - The merged model generalizes better to unseen tasks, especially with strong base models and multiple expert models. When merging 8 large expert models, the merged model often generalizes better than the multitask-trained baseline.
4. **Larger models can effectively merge more expert models**:
   - Large-scale models can merge more expert models without losing performance.
5. **Different merging methods perform similarly on large-scale instruction-tuned models**:
   - At large scale, the different merging methods perform similarly, so the choice of method matters less as models grow.

### Conclusion

Through systematic experiments, this study shows how the factors involved in model merging interact and how they affect the performance of the merged model. The results indicate that instruction-tuned base models and larger model scales improve the merged model's performance, particularly its zero-shot generalization, while the choice of merging method makes little difference at large scale. These findings provide a reference point for future research on large-scale model merging.
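
The summary names the merging methods without showing what they compute. As a rough illustration, the sketch below implements the commonly published formulations of averaging, task arithmetic, the DARE drop-and-rescale step, and TIES (trim, elect sign, disjoint mean) on toy weights. It is not the authors' code: the `Params` representation, all function names, the default values, and the per-tensor trimming are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical representation: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def average(experts: Sequence[Params]) -> Params:
    """Averaging: element-wise mean of the expert weights."""
    n = len(experts)
    return {k: [sum(col) / n for col in zip(*(e[k] for e in experts))]
            for k in experts[0]}


def task_vector(base: Params, expert: Params) -> Params:
    """Task vector: expert weights minus base weights."""
    return {k: [w - b for w, b in zip(expert[k], base[k])] for k in base}


def add_to_base(base: Params, delta: Params, scale: float) -> Params:
    """Return base + scale * delta, element-wise."""
    return {k: [b + scale * d for b, d in zip(base[k], delta[k])] for k in base}


def task_arithmetic(base: Params, experts: Sequence[Params],
                    scale: float = 1.0) -> Params:
    """Task arithmetic: add the scaled sum of all task vectors to the base."""
    vectors = [task_vector(base, e) for e in experts]
    total = {k: [sum(col) for col in zip(*(v[k] for v in vectors))] for k in base}
    return add_to_base(base, total, scale)


def dare(vector: Params, drop_prob: float, rng: random.Random) -> Params:
    """DARE step: drop each delta with probability drop_prob and
    rescale the survivors by 1 / (1 - drop_prob)."""
    keep = 1.0 - drop_prob
    return {k: [d / keep if rng.random() < keep else 0.0 for d in v]
            for k, v in vector.items()}


def trim(vector: Params, density: float) -> Params:
    """TIES trim: keep the largest-magnitude `density` fraction of entries.
    Assumption: trimming is done per tensor here, for simplicity."""
    out = {}
    for k, v in vector.items():
        n_keep = max(1, round(density * len(v)))
        threshold = sorted((abs(d) for d in v), reverse=True)[n_keep - 1]
        out[k] = [d if abs(d) >= threshold else 0.0 for d in v]
    return out


def ties(base: Params, experts: Sequence[Params],
         density: float = 0.2, scale: float = 1.0) -> Params:
    """TIES: trim each task vector, elect a sign per parameter from the
    summed values, then average only the entries that agree with that sign."""
    vectors = [trim(task_vector(base, e), density) for e in experts]
    merged = {}
    for k in base:
        merged[k] = []
        for col in zip(*(v[k] for v in vectors)):
            sign = 1.0 if sum(col) >= 0 else -1.0
            agree = [d for d in col if d * sign > 0]
            merged[k].append(sum(agree) / len(agree) if agree else 0.0)
    return add_to_base(base, merged, scale)
```

In this sketch, calling `task_arithmetic(base, experts, scale=1 / len(experts))` gives the same result as `average(experts)` up to floating-point rounding, which is one way to see how closely the methods are related. The `dare` function only sparsifies a single task vector; its output would still need to be combined across experts, for example by summation or by the TIES sign election.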