Abstract: Frequent data movement between the processor and memory has become a severe performance bottleneck for deep neural network (DNN) training workloads in data centers. The 3-D stacking processing-in-memory (3D-PIM) architecture offers a viable way to address this off-chip memory access challenge. However, existing 3D-PIM designs for DNN training are constrained by the limited memory bandwidth available in the base logic die. Integrating DNN-related logic near each memory bank is a promising way around this limit, but it is challenging, because naively implementing a floating-point (FP) unit and a cache in the memory die incurs a large area overhead. To address these problems, we propose DLUX, a high-performance and energy-efficient 3D-PIM accelerator for DNN training based on the near-bank architecture. On the hardware side, we introduce an in-DRAM lookup table (LUT) mechanism to support the FP multiplier with low area overhead. We then use a small scratchpad buffer together with a lightweight transformation engine to exploit locality and enable a flexible data layout without an expensive cache. On the software side, we split the mapping and scheduling tasks of DNN training into an intralayer phase and an interlayer phase. In the intralayer phase, a loop tiling technique customized for 3D-PIM maximizes data reuse in the LUT buffer and the scratchpad buffer, achieves high concurrency, and reduces data movement among banks. In the interlayer phase, we introduce efficient techniques to keep the input–output data layout consistent and to perform the forward–backward layout transposition. Experimental results show that DLUX reduces the FP32 multiplier area overhead by 60% relative to a direct implementation. In end-to-end evaluations against a Tesla V100 GPU, DLUX provides on average a $6.3\times$ speedup and a $42\times$ improvement in energy efficiency.
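The abstract does not describe how the in-DRAM LUT mechanism is organized, so the following is only a minimal sketch of the general idea behind lookup-table multiplication, not DLUX's actual design. It assumes a 24-bit operand (the significand width of FP32 including the implicit leading bit) split into 4-bit chunks, with each partial product read from a 16 x 16 table and combined by shifts and adds. The names `NIBBLE_LUT`, `split_chunks`, and `lut_multiply`, as well as the chunk size, are hypothetical choices made for illustration.

```python
# Hypothetical illustration of LUT-based multiplication (not DLUX's circuit).
# Assumption: operands are 24-bit unsigned integers, processed 4 bits at a time.

CHUNK_BITS = 4
CHUNK_MASK = (1 << CHUNK_BITS) - 1

# Precomputed table holding every 4-bit x 4-bit product (16 x 16 entries).
NIBBLE_LUT = [
    [a * b for b in range(1 << CHUNK_BITS)]
    for a in range(1 << CHUNK_BITS)
]


def split_chunks(value, width):
    """Return the 4-bit chunks of value, least significant chunk first."""
    return [(value >> shift) & CHUNK_MASK for shift in range(0, width, CHUNK_BITS)]


def lut_multiply(x, y, width=24):
    """Multiply two unsigned width-bit integers using only table lookups,
    shifts, and additions."""
    limit = 1 << width
    if not (0 <= x < limit and 0 <= y < limit):
        raise ValueError("operands must fit in the given width")
    result = 0
    for i, x_chunk in enumerate(split_chunks(x, width)):
        for j, y_chunk in enumerate(split_chunks(y, width)):
            # Each partial product comes from the table, then is shifted
            # to the position implied by the chunk indices.
            result += NIBBLE_LUT[x_chunk][y_chunk] << (CHUNK_BITS * (i + j))
    return result
```

In this sketch the table has only 256 entries regardless of operand width; the trade-off is that a 24-bit product needs 36 lookups. A real FP32 multiplier would additionally handle the sign, exponent, normalization, and rounding, which are omitted here.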

DLBooster

Accelerating End-to-End Deep Learning Workflow With Codesign of Data Preprocessing and Scheduling

A Near Memory Computing FPGA Architecture for Neural Network Acceleration

BOOST: Block Minifloat-Based On-Device CNN Training Accelerator with Transfer Learning

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Dual-pronged deep learning preprocessing on heterogeneous platforms with CPU, GPU and CSD

Pipeline-based Optimization Method for Large-Scale End-to-End Inference

BatOpt: Optimizing GPU-Based Deep Learning Inference Using Dynamic Batch Processing

DGNN-Booster: A Generic FPGA Accelerator Framework For Dynamic Graph Neural Network Inference

CoDL: efficient CPU-GPU co-execution for deep learning inference on mobile devices

DLUX: A LUT-Based Near-Bank Accelerator for Data Center Deep Learning Training Workloads

Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework

DVFO: Learning-Based DVFS for Energy-Efficient Edge-Cloud Collaborative Inference

Research on Convolutional Neural Network Inference Acceleration and Performance Optimization for Edge Intelligence

BigDL 2.0: Seamless Scaling of AI Pipelines from Laptops to Distributed Cluster

FPDeep: Scalable Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters

Work-in-Progress: Furion: Alleviating Overheads for Deep Learning Framework on Single Machine

DLAU: A Scalable Deep Learning Accelerator Unit on FPGA

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

SigDLA: A Deep Learning Accelerator Extension for Signal Processing

FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline