Abstract: Domain generalization (DG) aims at learning generalizable models under distribution shifts to avoid redundantly overfitting massive training data. Previous works with complex loss designs and gradient constraints have not yet led to empirical success on large-scale benchmarks. In this work, we reveal the mixture-of-experts (MoE) model's generalizability on DG by leveraging it to handle multiple aspects of the predictive features across domains in a distributed manner. To this end, we propose Sparse Fusion Mixture-of-Experts (SF-MoE), which incorporates sparsity and fusion mechanisms into the MoE framework to keep the model both sparse and predictive. SF-MoE has two dedicated modules: 1) a sparse block and 2) a fusion block, which disentangle and aggregate the diverse learned signals of an object, respectively. Extensive experiments demonstrate that SF-MoE is a domain-generalizable learner on large-scale benchmarks. It outperforms state-of-the-art counterparts by more than 2% across 5 large-scale DG datasets (e.g., DomainNet), with the same or even lower computational costs. We further reveal the internal mechanism of SF-MoE from a distributed-representation perspective (e.g., visual attributes). We hope this framework could facilitate future research to push generalizable object recognition to the real world. Our code and models will be released at SF-MoE.

What problem does this paper attempt to address?

This paper addresses generalization to out-of-distribution (OOD) data in machine-learning models. Specifically, it focuses on improving generalization across domains, known as Domain Generalization (DG), through the design of the neural network's backbone architecture. Traditional DG methods mainly focus on the design of loss functions, whereas this paper takes a different perspective: improving generalization performance by improving the network architecture.

### Main contributions of the paper

1. **Proposing a new perspective**:
   - Unlike previous work, this paper theoretically explores the impact of the backbone architecture on DG for the first time. Based on algorithmic alignment theory, the authors prove that if the network architecture is aligned with the invariant correlation, the model is more robust to distribution shifts; conversely, if it is aligned with the spurious correlation, the model is more sensitive to distribution shifts.
2. **Proposing a new model**:
   - Based on this theoretical analysis, the authors propose a new model, Generalizable Mixture-of-Experts (GMoE). GMoE combines sparse Mixture-of-Experts (sparse MoEs) with vision transformers and improves performance on DG tasks through theory-guided architecture changes.
3. **Excellent performance**:
   - The authors evaluate GMoE on eight large-scale datasets in DomainBed. GMoE achieves the best performance on seven datasets in the training-validation setting, and performs strongly on all eight datasets in the leave-one-domain-out setting. Moreover, when combined with existing DG algorithms, the performance of GMoE improves further.
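The sparse mixture-of-experts routing that the model above builds on can be illustrated with a minimal sketch. This is a generic top-k gated MoE layer, not the authors' implementation: the function name `sparse_moe_forward`, the helper functions, the choice of `top_k=2`, and the use of a single linear map per expert are all assumptions made for illustration (experts in a vision transformer would normally be feed-forward blocks).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sparse_moe_forward(token, gate_weights, expert_weights, top_k=2):
    """Route one token through its top_k experts and mix their outputs.

    Hypothetical helper: each expert is simplified to one linear map.
    Returns (output_vector, chosen_expert_indices, mixing_weights).
    """
    # 1. The gate scores every expert for this token.
    logits = matvec(gate_weights, token)
    # 2. Keep only the top_k highest-scoring experts (the sparse part).
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = order[:top_k]
    # 3. Renormalise the gate scores over the chosen experts only.
    weights = softmax([logits[i] for i in chosen])
    # 4. Run only the chosen experts and take the weighted sum.
    output = [0.0] * len(expert_weights[0])
    for w, idx in zip(weights, chosen):
        expert_out = matvec(expert_weights[idx], token)
        output = [o + w * e for o, e in zip(output, expert_out)]
    return output, chosen, weights


# Toy setup: three experts over 2-d tokens (all values are made up).
gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
experts = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: doubles the input
    [[0.0, 0.0], [0.0, 0.0]],  # expert 2: never selected for this token
]
token = [2.0, 1.0]
out, chosen, weights = sparse_moe_forward(token, gate, experts, top_k=2)
```

Only the selected experts are evaluated for a given token, which is why a sparse MoE layer can add capacity without a proportional increase in per-token compute.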
### Key concepts in the paper

- **Attribute Factorization**: Describes how attributes are factorized in the data generation process, where each attribute may have a different impact on the label.
- **Algorithmic Alignment**: Measures the similarity between the network architecture and the objective function. The lower the alignment value, the more suitable the network architecture is for the task.
- **Invariant Correlation**: A correlation that holds in both the training and test datasets; it is the real relationship that the model should rely on.
- **Spurious Correlation**: A correlation that exists only in the training dataset and not in the test dataset; it is the relationship that the model should avoid relying on.

### Experimental results

- **DomainBed benchmark**: GMoE significantly outperforms existing DG methods on multiple datasets. Even without using any DG algorithm, its performance exceeds that of state-of-the-art CNN-based methods.
- **Combined with DG algorithms**: When GMoE is combined with existing DG algorithms (such as Fish and SWAD), performance improves further, indicating that GMoE is complementary to existing methods.
- **Single-source domain generalization**: GMoE also performs strongly on single-source domain generalization tasks, demonstrating its generalization ability.

In conclusion, through theoretical analysis and experimental verification, this paper proposes a new perspective and a new model that significantly improve the generalization ability of machine-learning models across domains.

Sparse Fusion Mixture-of-Experts are Domain Generalizable Learners
