Abstract: Although GPUs and accelerators are more efficient for deep learning (DL), commercial clouds like Facebook and Amazon now rely heavily on CPUs for DL computation, because large numbers of CPUs would otherwise sit idle during off-peak periods. Following this trend, CPU vendors have not only released high-performance many-core CPUs but also developed efficient math kernel libraries. However, current DL platforms do not scale well to a large number of CPU cores, which makes many-core CPUs inefficient for DL computation. We analyze the memory access patterns of various layers and identify the root cause of the low scalability: the per-layer barriers implicitly imposed by current platforms, which assign a single instance (i.e., one batch of input data) to a whole CPU. These barriers cause severe memory bandwidth contention and CPU starvation in the access-intensive layers (such as activation and batch normalization (BN)).

This paper presents a novel approach called ParaX, which boosts the performance of DL on many-core CPUs by effectively alleviating bandwidth contention and CPU starvation. Our key idea is to assign one instance to each CPU core instead of to the entire CPU, so as to remove the per-layer barriers on the execution of the many cores. ParaX designs an ultralight scheduling policy that overlaps the access-intensive layers with the compute-intensive ones to avoid contention, and proposes a NUMA-aware gradient server mechanism for training that leverages shared memory to substantially reduce the overhead of per-iteration parameter synchronization. We have implemented ParaX on MXNet. Extensive evaluation on a two-NUMA Intel 8280 CPU shows that ParaX improves the training/inference throughput of all tested models (for image recognition and natural language processing) by 1.73× to 2.93×.
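The contention argument can be illustrated with a toy timeline model. The sketch below is not ParaX's scheduling policy, which the abstract does not specify. It assumes a hypothetical four-layer network whose layers take fixed time units, cores that loop over instances in steady state, and per-core start offsets as the only scheduling knob. It counts how many cores sit in an access-intensive layer at the same moment, first when all cores move through the layers in lockstep (the per-layer barrier case) and then when each core runs its own instance with a staggered start.

```python
# Toy model (assumption): each layer is (kind, duration in time units).
# "memory" stands for access-intensive layers such as activation or BN,
# "compute" for compute-intensive layers such as convolution.
LAYERS = [("compute", 3), ("memory", 1), ("compute", 3), ("memory", 1)]


def timeline(layers):
    """Expand layers into one entry per time unit."""
    steps = []
    for kind, units in layers:
        steps.extend([kind] * units)
    return steps


def peak_memory_cores(offsets, layers=LAYERS):
    """Peak number of cores that are in a memory-bound layer at once.

    offsets[i] is the time unit at which core i starts its instance.
    Cores are assumed to repeat the layer sequence forever (steady state).
    """
    steps = timeline(layers)
    period = len(steps)
    peak = 0
    for t in range(period):
        in_memory = sum(
            1 for off in offsets if steps[(t - off) % period] == "memory"
        )
        peak = max(peak, in_memory)
    return peak


def barrier_offsets(num_cores):
    """Lockstep execution: every core is in the same layer at the same time."""
    return [0] * num_cores


def staggered_offsets(num_cores, period):
    """Hypothetical staggering: spread core start times evenly over one period."""
    return [(core * period) // num_cores for core in range(num_cores)]


if __name__ == "__main__":
    cores = 8
    period = len(timeline(LAYERS))
    print("lockstep :", peak_memory_cores(barrier_offsets(cores)))
    print("staggered:", peak_memory_cores(staggered_offsets(cores, period)))
```

In this model, lockstep execution puts all eight cores into the access-intensive layer together, while the staggered schedule keeps at most two there at any time. The model ignores the slowdown that contention itself causes, so it only shows why independent per-core progress gives a scheduler room to overlap the two kinds of layers.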

HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs

A social network-aware top-N recommender system using GPU

Distributed Collaborative Hashing and Its Applications in Ant Financial

ParaX: boosting deep learning for big data analytics on many-core CPUs

Automatic Data Reuse for Accelerating Data Intensive Applications in the Cloud

SimpleX: A Simple and Strong Baseline for Collaborative Filtering

Heterogeneity Involved Network-based Algorithm Leads to Accurate and Personalized Recommendations

WooKong: A Ubiquitous Accelerator for Recommendation Algorithms with Custom Instruction Sets on FPGA

ParaX: Bandwidth-Efficient Instance Assignment for DL on Multi-NUMA Many-Core CPUs

Website-oriented recommendation based on heat spreading and tag-aware collaborative filtering

AtRec: Accelerating Recommendation Model Training on CPUs

High Performance Coordinate Descent Matrix Factorization for Recommender Systems

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

An FPGA-Based Accelerator for Neighborhood-Based Collaborative Filtering Recommendation Algorithms

Alleviating Bias Leads to Accurate and Personalized Recommendation

DCF: A Dataflow-Based Collaborative Filtering Training Algorithm

Enabling Efficient Large Recommendation Model Training with Near CXL Memory Processing

A Novel Multi-CPU/GPU Collaborative Computing Framework for SGD-based Matrix Factorization

Cross-Stack Workload Characterization of Deep Recommendation Systems

Semi-sparse Algorithm Based on Multi-Layer Optimization for Recommender System

Efficient Heterogeneous Collaborative Filtering Without Negative Sampling for Recommendation