HET-GMP: A Graph-based System Approach to Scaling Large Embedding Model Training

Xupeng Miao, Yining Shi, Hailin Zhang, Xin Zhang, Xiaonan Nie, Zhi Yang, Bin Cui
DOI: https://doi.org/10.1145/3514221.3517902
2022-01-01
Abstract: Embedding models have been recognized as an effective learning paradigm for high-dimensional data. However, a major obstacle in embedding model training is that updating and retrieving the shared large-scale embedding parameters usually dominates the distributed training cycle, leading to significant scalability issues. This paper presents HET-GMP, a distributed system for training embedding models. Uniquely, HET-GMP takes advantage of a graph-based approach to efficiently increase scalability. The key insight guiding our design is the "graph way of thinking". HET-GMP creates a bigraph abstraction to represent the access relationships between data samples and embedding vectors. This enables HET-GMP to embrace graph locality and skewness as new performance opportunities and to exploit graph-based replication/partitioning and bounded-asynchronous synchronization to reduce communication overhead. We evaluate the system on embedding models for click-through rate (CTR) prediction, which presents the most significant challenge and communication bottleneck due to heavy access concurrency to a huge embedding table. The results show that HET-GMP supports embedding model training with 10^11 parameters, achieving a reduction in communication of up to 87.5% and a speedup of up to 27.5x over state-of-the-art baseline systems.
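The bigraph abstraction described in the abstract can be illustrated with a small sketch. This is not the HET-GMP implementation: the function names (`build_bigraph`, `embedding_degrees`, `remote_accesses`), the toy samples, and the two placements are all hypothetical, chosen only to show how sample-to-embedding access edges expose skewness (a few embeddings accessed by many samples) and locality (fewer cross-worker accesses when samples and their embeddings are placed together).

```python
from collections import Counter

def build_bigraph(samples):
    """Return (sample_id, embedding_id) edges, one per embedding a sample accesses."""
    return [(s, e) for s, feats in enumerate(samples) for e in feats]

def embedding_degrees(edges):
    """Count how many samples access each embedding (skewness shows up here)."""
    return Counter(e for _, e in edges)

def remote_accesses(edges, sample_worker, embedding_worker):
    """Count edges whose sample and embedding are placed on different workers."""
    return sum(1 for s, e in edges if sample_worker[s] != embedding_worker[e])

# Toy data (assumed): each sample lists the embedding IDs it looks up.
samples = [[0, 1], [0, 2], [0, 3], [4, 5]]
edges = build_bigraph(samples)
degrees = embedding_degrees(edges)

# Samples 0-2 on worker 0, sample 3 on worker 1.
sample_worker = {0: 0, 1: 0, 2: 0, 3: 1}

# Placement A: embeddings spread by ID modulo the worker count.
hash_placement = {e: e % 2 for e in range(6)}
# Placement B: embeddings co-located with the samples that access them.
locality_placement = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1}

hash_remote = remote_accesses(edges, sample_worker, hash_placement)
locality_remote = remote_accesses(edges, sample_worker, locality_placement)
print(degrees.most_common(1), hash_remote, locality_remote)
```

In this toy setup, embedding 0 is the only high-degree vertex, and the locality-aware placement removes every cross-worker access that the modulo placement incurs. A real system would have to derive such a placement from the graph (and decide which hot embeddings to replicate) rather than take it as given.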