Sparse Fusion Mixture-of-Experts are Domain Generalizable Learners

Ziwei Liu, Jingkang Yang, Jiawei Ren, Bo Li, Yezhen Wang
DOI: https://doi.org/10.48550/arXiv.2206.04046
Abstract: Domain generalization (DG) aims at learning generalizable models under distribution shifts to avoid redundantly overfitting massive training data. Previous works with complex loss design and gradient constraints have not yet led to empirical success on large-scale benchmarks. In this work, we reveal the mixture-of-experts (MoE) model’s generalizability on DG by leveraging it to handle multiple aspects of the predictive features across domains in a distributed manner. To this end, we propose Sparse Fusion Mixture-of-Experts (SF-MoE), which incorporates sparsity and fusion mechanisms into the MoE framework to keep the model both sparse and predictive. SF-MoE has two dedicated modules: 1) a sparse block and 2) a fusion block, which disentangle and aggregate the diverse learned signals of an object, respectively. Extensive experiments demonstrate that SF-MoE is a domain-generalizable learner on large-scale benchmarks. It outperforms state-of-the-art counterparts by more than 2% across 5 large-scale DG datasets (e.g., DomainNet), with the same or even lower computational costs. We further reveal the internal mechanism of SF-MoE from a distributed representation perspective (e.g., visual attributes). We hope this framework could facilitate future research to push generalizable object recognition to the real world. Our code and models will be released at SF-MoE.
Mathematics, Computer Science
What problem does this paper attempt to address?
This paper addresses how to make machine-learning models generalize to out-of-distribution (OOD) data. Specifically, it studies how to improve generalization across domains, known as Domain Generalization (DG), through the design of the neural network's backbone architecture. Traditional DG methods focus mainly on the design of loss functions, whereas this paper takes a different perspective: improving generalization by improving the network architecture.

### Main contributions of the paper

1. **Proposing a new perspective**:
   - Unlike previous work, this paper is the first to explore theoretically how the backbone architecture affects DG. Building on algorithmic alignment theory, the authors prove that if the network architecture is aligned with the invariant correlation, the model is more robust to distribution shift; conversely, if it is aligned with the spurious correlation, the model is more sensitive to distribution shift.
2. **Proposing a new model**:
   - Based on this theoretical analysis, the authors propose a new model, Generalizable Mixture-of-Experts (GMoE). GMoE combines sparse Mixture-of-Experts (sparse MoEs) with vision transformers and improves performance on DG tasks through theory-guided architectural changes.
3. **Excellent performance**:
   - The authors evaluate GMoE on eight large-scale datasets in DomainBed. GMoE achieves the best performance on seven datasets in the training-validation setting, and it performs strongly on all eight datasets in the leave-one-domain-out setting. Moreover, combining GMoE with existing DG algorithms improves its performance further.
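The sparse mixture-of-experts idea mentioned in the second contribution can be sketched as follows. This is a minimal, generic top-k routing example in plain Python, not the authors' GMoE or SF-MoE implementation: the function names `top_k_route` and `sparse_moe_forward`, the linear gate, and the toy experts are all assumptions made for illustration.

```python
import math


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k largest gate logits.

    Weights are a softmax restricted to the selected experts, so they sum to 1.
    """
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k highest-scoring experts (ties broken by lower index).
    chosen = sorted(range(len(gate_logits)), key=lambda i: (-gate_logits[i], i))[:k]
    # Numerically stable softmax over the chosen logits only.
    peak = max(gate_logits[i] for i in chosen)
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def sparse_moe_forward(token, experts, gate_weights, k=2):
    """Apply only the k routed experts to `token` and sum their weighted outputs."""
    # One gate logit per expert: dot product of the token with that expert's gate vector.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    output = [0.0] * len(token)
    for index, weight in top_k_route(logits, k):
        expert_out = experts[index](token)
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return output


# Hypothetical toy experts and gate vectors, chosen only to make the routing visible.
toy_experts = [
    lambda v: [2.0 * x for x in v],
    lambda v: [x + 1.0 for x in v],
    lambda v: [-x for x in v],
]
toy_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Gate logits are [3.0, 1.0, -4.0], so with k=1 only expert 0 runs.
print(sparse_moe_forward([3.0, 1.0], toy_experts, toy_gate, k=1))  # [6.0, 2.0]
```

The point of the sketch is that each token activates only `k` experts, so compute per token stays roughly constant as the number of experts grows. In a vision transformer such a layer would typically stand in for a feed-forward block, but where and how the paper places it is not specified in this summary.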
### Key concepts in the paper

- **Attribute Factorization**: Describes how the data-generation process factorizes into attributes, each of which may affect the label differently.
- **Algorithmic Alignment**: Measures how well the network architecture matches the function it is meant to learn. The lower the alignment value, the better suited the architecture is to the task.
- **Invariant Correlation**: A correlation that holds in both the training and test datasets. It is the real relationship that the model should rely on.
- **Spurious Correlation**: A correlation that holds only in the training dataset and not in the test dataset. It is the relationship that the model should avoid relying on.

### Experimental results

- **DomainBed benchmark**: GMoE significantly outperforms existing DG methods on multiple datasets. Even without any DG algorithm, its performance exceeds that of state-of-the-art CNN-based methods.
- **Combined with DG algorithms**: When GMoE is combined with existing DG algorithms (such as Fish and SWAD), performance improves further, indicating that GMoE is complementary to existing methods.
- **Single-source domain generalization**: GMoE also performs well on single-source domain generalization tasks, demonstrating strong generalization ability.

In conclusion, through theoretical analysis and experimental verification, this paper proposes a new perspective and a new model that significantly improve the generalization of machine-learning models across domains.
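The difference between invariant and spurious correlation described under the key concepts can be made concrete with a toy example. The data below is synthetic and invented purely for illustration (it does not come from the paper): the label always equals a hypothetical `shape` attribute, while a hypothetical `background` attribute matches the label in training only.

```python
def accuracy(predict, samples):
    """Fraction of (shape, background, label) samples that `predict` gets right."""
    hits = sum(
        1 for shape, background, label in samples
        if predict(shape, background) == label
    )
    return hits / len(samples)


# Synthetic samples as (shape, background, label).
# Invariant correlation: label == shape in both splits.
# Spurious correlation: label == background in training, flipped at test time.
train_samples = [(0, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 1)]
test_samples = [(0, 1, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1)]


def shape_rule(shape, background):
    """Predictor that relies on the invariant attribute."""
    return shape


def background_rule(shape, background):
    """Predictor that relies on the spurious attribute."""
    return background


for rule in (shape_rule, background_rule):
    print(rule.__name__, accuracy(rule, train_samples), accuracy(rule, test_samples))
# shape_rule 1.0 1.0
# background_rule 1.0 0.0
```

Both rules fit the training split perfectly, so training accuracy alone cannot tell them apart. Only the rule built on the invariant attribute survives the shift, which is the behaviour the summary attributes to architectures aligned with the invariant correlation.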