Abstract: Solving multi-objective optimization problems for large deep neural networks is challenging due to the complexity of the loss landscape and the computational cost of training and evaluating models. Efficient Pareto front approximation for large models enables multi-objective optimization for tasks such as multi-task learning and trade-off analysis. Existing algorithms for learning the Pareto set fall into two groups: (1) evolutionary, hypernetwork, and hypervolume-maximization methods, which are computationally expensive and scale poorly to large models; and (2) scalarization algorithms, which train a separate model for each objective ray and are therefore inefficient for learning the entire Pareto set and fail to capture the objective trade-offs effectively. Inspired by the recent success of model merging, we propose a practical and scalable approach to the Pareto set learning problem via mixture of experts (MoE) based model fusion. By ensembling the weights of specialized single-task models, the MoE module can effectively capture the trade-offs between multiple objectives and closely approximate the entire Pareto set of large neural networks. Once the routers are learned and a preference vector is set, the MoE module can be unloaded, so no additional computational cost is introduced during inference. We conduct extensive experiments on vision and language tasks using large-scale models such as CLIP-ViT and GPT-2. The experimental results demonstrate that our method efficiently approximates the entire Pareto front of large models. Using only hundreds of trainable parameters in the MoE routers, our method even has lower memory usage than linear scalarization and algorithms that learn a single Pareto optimal solution, and it scales with both the number of objectives and the size of the model.
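
For reference, the terms used in the abstract have standard textbook definitions. The block below states them in generic notation; the symbols ($\theta$, $f_i$, $\lambda$, $m$) are assumptions of this note and are not taken from the paper itself.

```latex
% Generic notation (an assumption of this note, not taken from the paper):
% m objectives f_1..f_m to be minimized over model parameters \theta.

% Pareto dominance: \theta is no worse on every objective and strictly better on one.
\theta \prec \theta' \iff
  f_i(\theta) \le f_i(\theta') \ \ \forall i \in \{1,\dots,m\}
  \ \text{ and } \
  f_j(\theta) < f_j(\theta') \ \text{ for some } j

% Pareto set: parameters that nothing else dominates. Pareto front: its image.
\mathcal{P} = \{\theta : \nexists\, \theta' \text{ with } \theta' \prec \theta\},
\qquad
\mathcal{F} = \{(f_1(\theta),\dots,f_m(\theta)) : \theta \in \mathcal{P}\}

% Linear scalarization for one preference vector \lambda on the simplex.
\min_{\theta} \ \sum_{i=1}^{m} \lambda_i f_i(\theta),
\qquad \lambda_i \ge 0,\quad \sum_{i=1}^{m} \lambda_i = 1
```

Each choice of $\lambda$ yields a single solution, which is why covering the front by scalarization alone requires one training run per preference vector.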

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address the multi-objective optimization problem in large-scale deep neural networks. Specifically, it focuses on how to efficiently approximate the Pareto front of large-scale models to achieve multi-task learning and trade-off analysis. Existing Pareto set learning algorithms face the following challenges when dealing with large-scale models:

1. **High computational cost**: Traditional evolutionary algorithms, hypernetwork methods, and hypervolume maximization methods can directly handle multi-objective optimization problems, but their computational complexity is high, making it difficult to scale to large-scale models.
2. **Inefficiency**: Scalarization algorithms (such as linear scalarization) require training a separate model for each combination of objectives, which is not only inefficient but also fails to effectively capture the trade-offs between objectives.

To overcome these challenges, the paper proposes a model fusion method based on Mixture of Experts (MoE), which approximates the entire Pareto set by integrating the weights of multiple task-specific models. This method is not only computationally efficient but also uses less memory and can scale to multiple objectives and large-scale models.

### Main Contributions

1. **Proposed a new method**: Utilized a Pareto Weight Ensemble Mixture of Experts (PWE MoE) structure to efficiently approximate the entire Pareto set of large-scale neural networks.
2. **Introduced two training strategies**: One based on linear scalarization and the other based on exact Pareto optimal search, which can effectively fine-tune the router of the MoE module.
3. **Explored different model expansion strategies**: Including using only MLP modules or incorporating attention blocks.
4. **Conducted extensive experiments**: Validated the effectiveness of the method using large-scale models (such as CLIP-ViT and GPT-2) in vision and language tasks.

### Experimental Results

1. **Image classification tasks**: Experiments on multiple datasets (such as SUN397, Stanford Cars, DTD, etc.) showed that the method could successfully approximate the Pareto set and provide a good trade-off between two tasks.
2. **Text generation tasks**: Experiments using the GPT-2 model also demonstrated the effectiveness and scalability of the method.
3. **Computational efficiency**: Compared to existing methods, the computational time and memory usage of this method are significantly reduced. For example, in the case of two tasks, the entire Pareto set learning can be completed within a few minutes.

In summary, the method proposed in the paper has high computational and memory efficiency in handling the multi-objective optimization problem of large-scale deep neural networks and can effectively approximate the entire Pareto set.
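
The weight-ensembling idea summarized above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the `PreferenceRouter` class, its linear-plus-softmax form, the `merge_weights` helper, and the use of task vectors (expert weights minus base weights) are all assumptions made for illustration. The sketch only shows how a small router could map a preference vector to mixing coefficients, and why the merged weights are an ordinary tensor once the preference is fixed.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


class PreferenceRouter:
    """Hypothetical router: maps a preference vector to one mixing weight per expert."""

    def __init__(self, num_objectives, num_experts, rng):
        # These few values are the only trainable parameters in this sketch.
        self.W = rng.normal(scale=0.01, size=(num_experts, num_objectives))
        self.b = np.zeros(num_experts)

    def __call__(self, preference):
        return softmax(self.W @ preference + self.b)


def merge_weights(base, experts, router, preference):
    """Return base + sum_k w_k * (expert_k - base) for the given preference.

    The result is a plain weight array, so after merging the router and the
    expert copies are no longer needed at inference time.
    """
    mix = router(preference)
    task_vectors = [expert - base for expert in experts]
    return base + sum(w * tv for w, tv in zip(mix, task_vectors))


rng = np.random.default_rng(0)
base = rng.normal(size=(4, 3))  # stand-in for one pretrained weight matrix
experts = [base + rng.normal(scale=0.1, size=base.shape) for _ in range(2)]

router = PreferenceRouter(num_objectives=2, num_experts=2, rng=rng)
preference = np.array([0.7, 0.3])  # user-chosen trade-off between two tasks
merged = merge_weights(base, experts, router, preference)

num_router_params = router.W.size + router.b.size
print(merged.shape, num_router_params)
```

In this toy setup the router has six parameters for one weight matrix. Training it (for example with a scalarized loss over sampled preference vectors) is omitted here, since the summary does not give enough detail to reproduce that step.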

Towards Efficient Pareto Set Approximation via Mixture of Experts Based Model Fusion
