Abstract: Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interacts with other key factors, such as base model quality and the number of expert models, to affect the merged model's performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using four popular merging methods (Averaging, Task Arithmetic, Dare, and TIES) across model sizes ranging from 1B to 64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the experts' training tasks, and zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third, merging consistently improves generalization capabilities. Notably, when merging 8 large expert models, the merged models often generalize better than the multitask-trained models. Fourth, we can better merge more expert models when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.
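
As a rough illustration of two of the merging methods named in the abstract, the sketch below shows uniform parameter averaging and task arithmetic on toy weights. It is a minimal sketch under stated assumptions, not the paper's implementation: the `Params` representation, the function names, and the `scale` parameter are hypothetical choices made for illustration, and all experts are assumed to share one architecture and one base model.

```python
from typing import Dict, List

# Hypothetical flat representation of model weights: parameter name -> values.
Params = Dict[str, List[float]]


def average_merge(experts: List[Params]) -> Params:
    """Uniform parameter averaging across experts with identical parameter names."""
    n = len(experts)
    return {
        name: [sum(vals) / n for vals in zip(*(e[name] for e in experts))]
        for name in experts[0]
    }


def task_arithmetic_merge(base: Params, experts: List[Params], scale: float = 1.0) -> Params:
    """Add the scaled sum of task vectors (expert minus base) back onto the base."""
    merged: Params = {}
    for name, base_vals in base.items():
        merged[name] = [
            b + scale * sum(e[name][i] - b for e in experts)
            for i, b in enumerate(base_vals)
        ]
    return merged


# Toy example: one parameter tensor with two entries, two experts.
base = {"w": [0.0, 1.0]}
expert_a = {"w": [1.0, 1.0]}
expert_b = {"w": [0.0, 3.0]}

avg = average_merge([expert_a, expert_b])
ta = task_arithmetic_merge(base, [expert_a, expert_b], scale=0.5)
```

With `scale` set to one over the number of experts (0.5 here), task arithmetic reduces to plain averaging, so `avg` and `ta` coincide in this toy case; other values of `scale` give different merged weights.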

MAP: Model Merging with Amortized Pareto Front Using Limited Computation

MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation

Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent

What Matters for Model Merging at Scale?

Non-Uniform Parameter-Wise Model Merging

If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs

Merging by Matching Models in Task Parameter Subspaces

FREE-Merging: Fourier Transform for Model Merging with Lightweight Experts

HM3: Hierarchical Multi-Objective Model Merging for Pretrained Models

Heterogeneous Model Fusion Federated Learning Mechanism Based on Model Mapping

SUPERMERGE: An Approach For Gradient-Based Model Merging

Towards Efficient Pareto Set Approximation via Mixture of Experts Based Model Fusion

AdaMerging: Adaptive Model Merging for Multi-Task Learning

Training-free Heterogeneous Model Merging

You Only Merge Once: Learning the Pareto Set of Preference-Aware Model Merging

Training-Free Pretrained Model Merging

EMR-Merging: Tuning-Free High-Performance Model Merging

It's Morphing Time: Unleashing the Potential of Multiple LLMs via Multi-objective Optimization

Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion

Localize-and-Stitch: Efficient Model Merging via Sparse Task Arithmetic

Parameter Competition Balancing for Model Merging