Abstract: Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer. Recent research on task arithmetic-based MTL demonstrates that merging the parameters of independently fine-tuned models can effectively achieve MTL. However, existing merging methods primarily seek a static optimal solution within the original model parameter space, which often results in performance degradation due to the inherent diversity among tasks and potential interference between them. To address this challenge, we propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging. Specifically, we first identify critical (or sensitive) modules by analyzing parameter variations in core modules of Transformer-based models before and after fine-tuning. WEMoE then statically merges non-critical modules while transforming critical modules into a mixture-of-experts (MoE) structure. During inference, expert modules in the MoE are dynamically merged based on input samples, enabling a more flexible and adaptive merging approach. Building on WEMoE, we further introduce an efficient-and-effective WEMoE (E-WEMoE) method, whose core mechanism involves eliminating non-essential elements in the critical modules of WEMoE and implementing shared routing across multiple MoE modules, thereby significantly reducing the trainable parameters, the overall parameter count, and the computational overhead of the merged model relative to WEMoE. Experimental results across various architectures and tasks demonstrate that both WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
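The contrast between static merging and input-dependent merging described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the single linear router, the softmax over routing scores, and the scaling coefficient `scale=0.3` are all assumptions chosen for illustration, and a single weight matrix stands in for one "critical module".

```python
import numpy as np

def task_vectors(pretrained, finetuned_list):
    # A task vector is the difference between fine-tuned and pretrained weights.
    return [ft - pretrained for ft in finetuned_list]

def static_merge(pretrained, finetuned_list, scale=0.3):
    # Task-arithmetic-style merge: one fixed coefficient for every input.
    # `scale` is an illustrative value, not taken from the abstract.
    return pretrained + scale * sum(task_vectors(pretrained, finetuned_list))

def softmax(scores):
    shifted = np.exp(scores - np.max(scores))
    return shifted / shifted.sum()

def dynamic_merge(pretrained, finetuned_list, router_weights, x):
    # Input-dependent merge: a (hypothetical) linear router maps the input x
    # to one coefficient per task vector; softmax is an assumption here.
    coeffs = softmax(router_weights @ x)
    tvs = task_vectors(pretrained, finetuned_list)
    merged = pretrained + sum(c * tv for c, tv in zip(coeffs, tvs))
    return merged, coeffs

# Toy setup: one weight matrix, two "fine-tuned" variants of it.
rng = np.random.default_rng(0)
d_in, d_out, num_tasks = 4, 3, 2
W0 = rng.normal(size=(d_out, d_in))
finetuned = [W0 + 0.1 * rng.normal(size=W0.shape) for _ in range(num_tasks)]
router = rng.normal(size=(num_tasks, d_in))
x = rng.normal(size=d_in)

W_static = static_merge(W0, finetuned)                   # same weights for all inputs
W_dyn, coeffs = dynamic_merge(W0, finetuned, router, x)  # weights depend on x
y = W_dyn @ x                                            # forward pass with merged weights
```

In this sketch the static merge yields a single weight matrix regardless of the input, whereas the dynamic merge recomputes the mixing coefficients for each `x`, which is the flexibility the abstract attributes to the MoE structure.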

BERTogether: Multitask Ensembling with Hyperparameter Optimization

Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models

BERTer: The Efficient One

Multitask Fine-Tuning and Generative Adversarial Learning for Improved Auxiliary Classification

Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling

Merging Multi-Task Models via Weight-Ensembling Mixture of Experts

Parameter-efficient Weight Ensembling Facilitates Task-level Knowledge Transfer

How to Weight Multitask Finetuning? Fast Previews via Bayesian Model-Merging

BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning

A Flexible Multi-Task Model for BERT Serving

BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning

Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

BEBERT: Efficient and Robust Binary Ensemble BERT

Optimizing Multi-Task Learning for Enhanced Performance in Large Language Models

Multi-Layer Ensembling Techniques for Multilingual Intent Classification

MF-BERT: Multimodal Fusion in Pre-Trained BERT for Sentiment Analysis

Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging

Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data

All-in-One: Emotion, Sentiment and Intensity Prediction Using a Multi-Task Ensemble Framework

$π$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation

CM-BERT