Abstract: Frequent data movement between the processor and memory has become a severe performance bottleneck for deep neural network (DNN) training workloads in data centers. The 3-D stacking processing-in-memory (3D-PIM) architecture offers a viable way to address this off-chip memory access challenge. However, existing 3D-PIM designs for DNN training suffer from the limited memory bandwidth available in the base logic die. Integrating DNN-related logic near each memory bank is a promising way to overcome this obstacle, but it is challenging because naively implementing the floating-point (FP) unit and the cache in the memory die incurs a large area overhead. To address these problems, we propose DLUX, a high-performance and energy-efficient 3D-PIM accelerator for DNN training that uses the near-bank architecture. From the hardware perspective, we introduce an in-DRAM lookup table (LUT) mechanism to support the FP multiplier with low area overhead. We then propose using a small scratchpad buffer together with a lightweight transformation engine to exploit locality and enable flexible data layout without an expensive cache. From the software perspective, we split the mapping/scheduling tasks during DNN training into intralayer and interlayer phases. During the intralayer phase, we adopt a 3D-PIM customized loop tiling technique to maximize data reuse in the LUT buffer and the scratchpad buffer, achieve high concurrency, and reduce data movement among banks. During the interlayer phase, we introduce efficient techniques to ensure input–output data layout consistency and to realize the forward–backward layout transposition. Experimental results show that DLUX reduces the FP32 multiplier area overhead by 60% compared with a direct implementation. Compared with a Tesla V100 GPU, end-to-end evaluations show that DLUX provides on average a $6.3\times$ speedup and a $42\times$ energy efficiency improvement.
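
To make the LUT idea concrete, the following is a minimal sketch of how a wide integer multiplication, such as the product of two 24-bit FP32 significands, can be composed from lookups in a small table of partial products. It illustrates the general technique only: the abstract does not specify DLUX's table organization, so the 4-bit operand width and the names `LUT`, `split_nibbles`, and `lut_multiply` are assumptions made for illustration.

```python
# Hypothetical illustration, not the DLUX design: a 16x16-entry table of
# 4-bit products, combined with shifts and adds to form a wider product.
NIBBLE_BITS = 4
NIBBLE_MASK = (1 << NIBBLE_BITS) - 1

# LUT[a][b] holds a * b for every pair of 4-bit operands.
LUT = [[a * b for b in range(1 << NIBBLE_BITS)]
       for a in range(1 << NIBBLE_BITS)]


def split_nibbles(x, width):
    """Split an unsigned `width`-bit integer into 4-bit digits, low digit first."""
    return [(x >> shift) & NIBBLE_MASK for shift in range(0, width, NIBBLE_BITS)]


def lut_multiply(x, y, width=24):
    """Multiply two unsigned `width`-bit integers using only lookups, shifts, and adds.

    The default width of 24 matches the FP32 significand
    (23 stored bits plus the implicit leading one).
    """
    result = 0
    for i, xa in enumerate(split_nibbles(x, width)):
        for j, yb in enumerate(split_nibbles(y, width)):
            # Each partial product comes from the table and is weighted
            # by the positions of the two digits it was built from.
            result += LUT[xa][yb] << (NIBBLE_BITS * (i + j))
    return result
```

With these assumed parameters, `lut_multiply(0xABCDEF, 0x123456)` agrees with the native product `0xABCDEF * 0x123456`. A smaller table needs less storage but more lookups per multiplication, which is the kind of trade-off an area-constrained near-bank design has to balance.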

Violet: Architecturally Exposed Orchestration, Movement, and Placement for Generalized Deep Learning

DaDianNao: A Machine-Learning Supercomputer

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

A Hardware-Software Blueprint for Flexible Deep Learning Specialization

A Highly Configurable Hardware/Software Stack for DNN Inference Acceleration

Collage: Seamless Integration of Deep Learning Backends with Automatic Placement

CATERPILLAR: Coarse Grain Reconfigurable Architecture for Accelerating the Training of Deep Neural Networks

DLUX: A LUT-Based Near-Bank Accelerator for Data Center Deep Learning Training Workloads

Invited: Algorithm-Software-Hardware Co-Design for Deep Learning Acceleration

VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling

Enabling High Performance Deep Learning Networks on Embedded Systems

vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design

fVDB: A Deep-Learning Framework for Sparse, Large Scale, and High Performance Spatial Intelligence

An architecture-level analysis on deep learning models for low-impact computations

Does Form Follow Function? An Empirical Exploration of the Impact of Deep Neural Network Architecture Design on Hardware-Specific Acceleration

vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving

Toward Efficient Execution of Mainstream Deep Learning Frameworks on Mobile Devices: Architectural Implications

A Comprehensive Benchmark of Deep Learning Libraries on Mobile Devices

Toward matrix multiplication for deep learning inference on the Xilinx Versal

Beyond Inference: Performance Analysis of DNN Server Overheads for Computer Vision

Efficient Architecture Paradigm for Deep Learning Inference As a Service.