Abstract: Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved only for DLRM training, driving new interest in cost- and time-saving optimizations. The systems challenges faced in this setting are unique; while typical deep learning training jobs are dominated by model execution, the most important factor in DLRM training performance is often online data ingestion. In this paper, we explore the unique characteristics of this data ingestion problem and provide insights into DLRM training pipeline bottlenecks and challenges. We study real-world DLRM data processing pipelines taken from our compute cluster at Netflix to observe the performance impacts of online ingestion and to identify shortfalls in existing pipeline optimizers. We find that current tooling yields sub-optimal performance, crashes frequently, or requires impractical cluster re-organization to adopt. Our studies lead us to design and build a new solution for data pipeline optimization, InTune. InTune employs a reinforcement learning (RL) agent to learn how to distribute the CPU resources of a trainer machine across a DLRM data pipeline to more effectively parallelize data loading and improve throughput. Our experiments show that InTune can build an optimized data pipeline configuration within only a few minutes, and can easily be integrated into existing training workflows. By exploiting the responsiveness and adaptability of RL, InTune achieves higher online data ingestion rates than existing optimizers, thus reducing idle times in model execution and increasing efficiency. We apply InTune to our real-world cluster, and find that it increases data ingestion throughput by as much as 2.29X versus state-of-the-art data pipeline optimizers while also improving both CPU and GPU utilization.

PrecisionProbe: Non-intrusive Performance Analysis Tool for Deep Learning Recommendation Models

A social network-aware top-N recommender system using GPU

DProbe: Profiling and Predicting Multi-tenant Deep Learning Workloads for GPU Resource Scaling

Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale

Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models

G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems

UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture

dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training

DeepProf: Performance Analysis for Deep Learning Applications via Mining GPU Execution Patterns

AtRec: Accelerating Recommendation Model Training on CPUs

A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability

Enhancing Performance and Scalability of Large-Scale Recommendation Systems with Jagged Flash Attention

NDRec: A Near-Data Processing System for Training Large-Scale Recommendation Models

Benchmarking Resource Usage for Efficient Distributed Deep Learning

Exploiting Structured Feature and Runtime Isolation for High-Performant Recommendation Serving