Abstract: Modern Machine Learning (ML) training on large-scale datasets is a very time-consuming workload. It relies on the Stochastic Gradient Descent (SGD) optimization algorithm due to its effectiveness, simplicity, and generalization performance. Processor-centric architectures (e.g., CPUs, GPUs) commonly used for modern SGD-based ML training workloads are bottlenecked by data movement between the processor and memory units due to poor data locality when accessing large datasets. As a result, processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads. Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck by placing the computation mechanisms inside or near memory. Our goal is to understand the capabilities of popular distributed SGD algorithms on real-world PIM systems to accelerate data-intensive ML training workloads. To this end, we 1) implement several representative centralized parallel SGD algorithms on the real-world UPMEM PIM system, 2) rigorously evaluate these algorithms for ML training on large-scale datasets in terms of performance, accuracy, and scalability, 3) compare them to conventional CPU and GPU baselines, and 4) discuss implications for future PIM hardware and highlight the need for a shift toward algorithm-hardware codesign. Our results demonstrate three major findings: 1) the UPMEM PIM system can be a viable alternative to state-of-the-art CPUs and GPUs for many memory-bound ML training workloads, especially when operations and datatypes are natively supported by PIM hardware, 2) it is important to carefully choose the optimization algorithms that best fit PIM, and 3) the UPMEM PIM system does not scale approximately linearly with the number of nodes for many data-intensive ML training workloads. We open-source all our code to facilitate future research.

Epuma: A Novel Embedded Parallel DSP Platform for Predictable Computing

Scalable Parallel Computers for Real-Time Signal Processing

A High Performance Implementation of Non-Power-of-Two FFT with EPUMA Platform

Energy-Aware Loop Parallelism Maximization for Multi-core DSP Architectures

MeMPA: A Memory Mapped M-SIMD Co-Processor to Cope with the Memory Wall Issue

MPU: Towards Bandwidth-abundant SIMT Processor via Near-bank Computing

A High Definition Motion JPEG Encoder Based on Epuma Platform

AsAP: A Fine-Grained Many-Core Platform for DSP Applications

PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

DaPPA: A Data-Parallel Framework for Processing-in-Memory Architectures

A 167-Processor 65 Nm Computational Platform with Per-Processor Dynamic Supply Voltage and Dynamic Clock Frequency Scaling

High Performance and Energy Efficient Many-core DSP Systems: An Asynchronous Array of Simple Processors

PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference

Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters

PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System

YHFT-QDSP: High-Performance Heterogeneous Multi-Core DSP

A Ubiquitous Machine Learning Accelerator With Automatic Parallelization on FPGA

A Scalable and Reconfigurable Bit-Serial Compute-Near-Memory Hardware Accelerator for Solving 2-D/3-D Partial Differential Equations

High-Performance Simultaneous Multiprocessing for Heterogeneous System-on-Chip

Parallelizing Workload Execution in Embedded and High-Performance Heterogeneous Systems

MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory