Abstract: Deep learning-based recommendation models (DLRMs) are widely used in several business-critical applications. Training such recommendation models efficiently is challenging because they contain billions of embedding-based parameters, leading to significant overheads from embedding access. By profiling existing systems for DLRM training, we observe that around 75% of the iteration time is spent on embedding access and model synchronization. Our key insight in this paper is that embedding access has a specific structure that can be used to accelerate training. We observe that embedding accesses are heavily skewed, with around 1% of embeddings accounting for more than 92% of total accesses. Further, we observe that during offline training we can look ahead at future batches to determine exactly which embeddings will be needed at which iteration in the future. Based on these insights, we develop Bagpipe, a system for training deep recommendation models that uses caching and prefetching to overlap remote embedding accesses with computation. We design an Oracle Cacher, a new component that uses a lookahead algorithm to generate optimal cache update decisions while providing strong consistency guarantees against staleness. We also design a logically replicated, physically partitioned cache and show that our design can reduce synchronization overheads in a distributed setting. Finally, we propose a disaggregated system architecture and show that our design can enable low-overhead fault tolerance. Our experiments using three datasets and four models show that Bagpipe provides a speedup of up to 5.6x compared to state-of-the-art baselines, while providing the same convergence and reproducibility guarantees as synchronous training.
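The lookahead idea in the abstract can be illustrated with a small sketch. It is not Bagpipe's Oracle Cacher. The function name `plan_cache_updates`, the `lookahead` parameter, and the eviction rule (drop an embedding once it does not reappear within the next `lookahead` batches) are assumptions made for illustration only. Because offline training exposes future batches, each iteration's prefetch and evict sets can be computed ahead of time.

```python
def plan_cache_updates(batches, lookahead):
    """Return one (prefetch, evict) pair of ID sets per iteration.

    Hypothetical simplification: an embedding ID is prefetched when the
    current batch needs it and it is not cached, and it is evicted after
    the iteration if it does not appear in the next `lookahead` batches.
    """
    cached = set()
    plan = []
    for i, batch in enumerate(batches):
        needed = set(batch)
        prefetch = needed - cached          # must be fetched remotely
        cached |= prefetch

        # IDs known to be used within the lookahead window after iteration i.
        upcoming = set()
        for future in batches[i + 1 : i + 1 + lookahead]:
            upcoming.update(future)

        evict = cached - upcoming           # no known reuse in the window
        cached -= evict
        plan.append((prefetch, evict))
    return plan


if __name__ == "__main__":
    toy_batches = [[1, 2], [1, 3], [4], [1]]
    for step, (prefetch, evict) in enumerate(plan_cache_updates(toy_batches, 2)):
        print(step, sorted(prefetch), sorted(evict))
```

In this toy run with a window of two batches, ID 1 stays cached from the first iteration to the last, so it is fetched only once. With a window of one batch it is evicted after the second iteration and fetched again for the fourth.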

Heterogeneous Acceleration Pipeline for Recommendation System Training

Accelerating Recommendation System Training by Leveraging Popular Choices

A social network-aware top-N recommender system using GPU

Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

Accelerating Recommender Model Training by Dynamically Skipping Stale Embeddings

NDRec: A Near-Data Processing System for Training Large-Scale Recommendation Models

Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations

BagPipe: Accelerating Deep Recommendation Model Training

Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training

HetHub: A Heterogeneous Distributed Hybrid Training System for Large-Scale Models

FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

Why is FPGA-GPU Heterogeneity the Best Option for Embedded Deep Neural Networks?

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

AtRec: Accelerating Recommendation Model Training on CPUs

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

Optimizing Inference Quality with SmartNIC for Recommendation System

Fleche: an efficient GPU embedding cache for personalized recommendations

POSTER: Pattern-Aware Sparse Communication for Scalable Recommendation Model Training

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models