On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions

Huy Nguyen, Xing Han, Carl William Harris, Suchi Saria, Nhat Ho
2024-10-04
Abstract: With the growing prominence of the Mixture of Experts (MoE) architecture in developing large-scale foundation models, we investigate the Hierarchical Mixture of Experts (HMoE), a specialized variant of MoE that excels in handling complex inputs and improving performance on targeted tasks. Our investigation highlights the advantages of using varied gating functions, moving beyond softmax gating within HMoE frameworks. We theoretically demonstrate that applying tailored gating functions to each expert group allows HMoE to achieve robust results, even when optimal gating functions are applied only at select hierarchical levels. Empirical validation across diverse scenarios supports these theoretical claims. This includes large-scale multimodal tasks, image classification, and latent domain discovery and prediction tasks, where our modified HMoE models show great performance improvements.
Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper examines how the choice of gating function in the Hierarchical Mixture of Experts (HMoE) model affects overall performance. Specifically, it studies the following issues:

1. **Impact of gating function selection**: The traditional HMoE model usually uses the softmax gating function, but is this choice optimal? Through theoretical analysis and experiments, the paper argues that other gating functions (such as the Laplace gating function) can significantly improve model performance, especially on complex datasets.
2. **Expert convergence behavior**: How quickly the experts in the network specialize is a key issue. By analyzing the convergence behavior of experts in a two-layer HMoE model, the paper explores how different gating function combinations affect the speed of expert specialization.
3. **Model performance improvement**: Through experiments on multiple tasks (large-scale multimodal tasks, image classification, and latent domain discovery and prediction), the paper shows that the modified HMoE model improves performance significantly on these tasks.

### Specific research content

1. **Theoretical contributions**:
   - **Convergence of density estimation**: Using maximum likelihood estimation (MLE), the paper derives the convergence rate of density estimation, measured in Hellinger distance, under different gating function combinations.
   - **Convergence of parameter estimation**: The paper introduces a Voronoi loss function to describe the convergence behavior of parameter estimation precisely and establishes the parameter estimation convergence rate under different gating function combinations.
2. **Experimental verification**:
   - **Multimodal fusion**: The paper conducts experiments on the MIMIC-IV dataset to evaluate the HMoE model on multimodal fusion tasks. The results show that the HMoE model significantly outperforms the baseline model on tasks such as 48-hour in-hospital mortality prediction, 25-phenotype classification, and in-hospital stay prediction.
   - **Comparison of different gating function combinations**: By comparing gating function combinations (such as Softmax-Softmax, Softmax-Laplace, and Laplace-Laplace), the paper supports the advantage of the Laplace gating function in a two-layer HMoE model.

### Main conclusions

1. **Advantages of the Laplace gating function**: Through theoretical analysis and experiments, the paper shows that the Laplace gating function can accelerate expert convergence and significantly improve overall model performance.
2. **Applicability of the HMoE model**: The HMoE model performs well on complex datasets (such as multimodal data and data with an inherent hierarchical structure) and can effectively meet the specialization needs of different subgroups.
3. **Future research directions**: The paper lays a foundation for future research, in particular for further study of how different gating function combinations affect HMoE performance.

In conclusion, through theoretical analysis and experiments, this paper demonstrates the importance of the gating function choice in the HMoE model and shows the significant advantages of the Laplace gating function in improving model performance.
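
To make the gating comparison concrete, here is a minimal sketch of a two-layer HMoE forward pass in which the outer and inner gates can each be softmax or Laplace. Several details are assumptions rather than the paper's stated setup: the summary does not define the Laplace gate, so it is taken here to score each expert by the negative Euclidean distance between the input and that expert's gating vector; the experts are hypothetical linear functions; and the names `softmax_gate`, `laplace_gate`, and `hmoe_predict`, along with all toy values, are illustrative only.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Gate = Callable[[Vector, Sequence[Vector]], List[float]]
Expert = Tuple[Vector, float]  # hypothetical linear expert: (slope vector, bias)


def _normalize(scores: List[float]) -> List[float]:
    """Numerically stable exp-normalization of raw gating scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_gate(x: Vector, weights: Sequence[Vector]) -> List[float]:
    """Softmax gating: expert i is scored by the inner product <w_i, x>."""
    return _normalize([sum(wk * xk for wk, xk in zip(w, x)) for w in weights])


def laplace_gate(x: Vector, weights: Sequence[Vector]) -> List[float]:
    """Laplace gating (assumed form): expert i is scored by -||w_i - x||_2."""
    return _normalize([-math.dist(w, x) for w in weights])


def hmoe_predict(
    x: Vector,
    outer_w: Sequence[Vector],
    inner_w: Sequence[Sequence[Vector]],
    experts: Sequence[Sequence[Expert]],
    outer_gate: Gate,
    inner_gate: Gate,
) -> float:
    """Two-layer HMoE output: sum_i g_i(x) * sum_j g_{j|i}(x) * f_ij(x)."""
    output = 0.0
    for i, g_outer in enumerate(outer_gate(x, outer_w)):
        for j, g_inner in enumerate(inner_gate(x, inner_w[i])):
            slope, bias = experts[i][j]
            expert_out = sum(sk * xk for sk, xk in zip(slope, x)) + bias
            output += g_outer * g_inner * expert_out
    return output


# Toy configuration (all values hypothetical): 2 expert groups, 2 experts each.
x = [1.0, 0.0]
outer_w = [[1.0, 0.0], [-1.0, 0.0]]
inner_w = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, -1.0], [-1.0, 0.0]]]
experts = [
    [([1.0, 0.0], 0.0), ([0.0, 1.0], 1.0)],
    [([2.0, 0.0], 0.0), ([0.0, 0.0], -1.0)],
]

# Compare the gating combinations named in the summary.
for name, outer, inner in [
    ("Softmax-Softmax", softmax_gate, softmax_gate),
    ("Softmax-Laplace", softmax_gate, laplace_gate),
    ("Laplace-Laplace", laplace_gate, laplace_gate),
]:
    print(name, hmoe_predict(x, outer_w, inner_w, experts, outer, inner))
```

Passing the gates as arguments keeps the hierarchy fixed while the gating combination varies, which mirrors the kind of comparison the summary describes. The sketch only evaluates a forward pass; it does not reproduce the paper's estimation procedure or its convergence analysis.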