Turn Waste into Worth: Rectifying Top-$k$ Router of MoE
Zhiyuan Zeng, Qipeng Guo, Zhaoye Fei, Zhangyue Yin, Yunhua Zhou, Linyang Li, Tianxiang Sun, Hang Yan, Dahua Lin, Xipeng Qiu
2024-02-17
Abstract: Sparse Mixture of Experts (MoE) models are popular for training large
language models due to their computational efficiency. However, the commonly
used top-$k$ routing mechanism incurs redundant computation and memory
costs because of unbalanced routing. Some experts overflow, and the tokens
exceeding their capacity are dropped, while other experts are vacant and are
padded with zeros, which negatively impacts model performance. To address the
dropped
tokens and padding, we propose the Rectify-Router, comprising the Intra-GPU
Rectification and the Fill-in Rectification. The Intra-GPU Rectification
handles dropped tokens, efficiently routing them to experts within the GPU
where they are located to avoid inter-GPU communication. The Fill-in
Rectification addresses padding by replacing padding tokens with the tokens
that have high routing scores. Our experimental results demonstrate that the
Intra-GPU Rectification and the Fill-in Rectification effectively handle
dropped tokens and padding, respectively. Furthermore, combining the two
achieves superior performance, surpassing the accuracy of the vanilla top-1
router by 4.7%.
Machine Learning, Artificial Intelligence, Computation and Language
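
The routing problem and the two rectifications described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`top1_route`, `intra_gpu_rectify`, `fill_in_rectify`), the first-come-first-served capacity rule, the toy scores, and the one-expert-per-GPU layout are all assumptions made for illustration. In particular, the sketch assumes that a dropped token is sent to the best-scoring expert on its own GPU, and that a vacant slot is filled with the highest-scoring tokens not already assigned to that expert.

```python
def top1_route(scores, capacity):
    """Top-1 routing with a fixed expert capacity (hypothetical helper).

    scores[t][e] is the routing score of token t for expert e.
    Tokens are served in order; tokens arriving at a full expert are dropped.
    """
    num_experts = len(scores[0])
    assigned = [[] for _ in range(num_experts)]
    dropped = []
    for t, row in enumerate(scores):
        e = max(range(num_experts), key=lambda j: row[j])
        if len(assigned[e]) < capacity:
            assigned[e].append(t)
        else:
            dropped.append(t)
    return assigned, dropped


def intra_gpu_rectify(scores, dropped, token_gpu, expert_gpu):
    """Assumed rule: send each dropped token to the best-scoring expert
    located on the token's own GPU, so no inter-GPU communication is needed."""
    extra = {}
    for t in dropped:
        local = [e for e, g in enumerate(expert_gpu) if g == token_gpu[t]]
        extra[t] = max(local, key=lambda e: scores[t][e])
    return extra


def fill_in_rectify(scores, assigned, capacity):
    """Assumed rule: fill each vacant slot of an expert with the tokens that
    score highest for that expert and are not already assigned to it."""
    filled = [list(slot) for slot in assigned]
    for e, slot in enumerate(filled):
        vacant = capacity - len(slot)
        if vacant <= 0:
            continue
        candidates = [t for t in range(len(scores)) if t not in slot]
        candidates.sort(key=lambda t: scores[t][e], reverse=True)
        slot.extend(candidates[:vacant])
    return filled


# Toy setup: 4 tokens, 2 experts, capacity 2, one expert per GPU.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
token_gpu = [0, 0, 1, 1]   # GPU holding each token
expert_gpu = [0, 1]        # GPU holding each expert

assigned, dropped = top1_route(scores, capacity=2)
extra = intra_gpu_rectify(scores, dropped, token_gpu, expert_gpu)
filled = fill_in_rectify(scores, assigned, capacity=2)

print(assigned)  # [[0, 1], []]   expert 0 is full, expert 1 is vacant
print(dropped)   # [2, 3]         tokens that overflowed expert 0
print(extra)     # {2: 1, 3: 1}   dropped tokens handled by the local expert
print(filled)    # [[0, 1], [3, 2]]  padding in expert 1 replaced by tokens
```

In this toy case every token prefers expert 0, so two tokens are dropped and expert 1 would be entirely padding. Both rectifications end up using expert 1, which is the wasted capacity that the abstract refers to.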