Abstract: Graph Convolutional Networks (GCNs) have recently shown powerful learning capabilities in graph processing tasks. Computing GCNs on conventional von Neumann architectures usually suffers from limited memory bandwidth because of irregular memory access. Recent work has proposed Processing-In-Memory (PIM) architectures that overcome the bandwidth bottleneck in Convolutional Neural Networks (CNNs) by performing in-situ matrix-vector multiplication. However, the performance improvement and computation parallelism of existing CNN-oriented PIM architectures are hindered when running GCNs because of the large scale and sparsity of graphs. To tackle these problems, this paper presents a parallelism enhancement framework for PIM-based GCN architectures. At the software level, we propose a fixed-point quantization method for GCNs, which reduces the PIM computation overhead with little accuracy loss. We also apply a vertex clustering algorithm to the graph, minimizing the inter-cluster links and enabling cluster-level parallel computing on multi-core systems. At the hardware level, we design a Resistive Random Access Memory (RRAM) based multi-core PIM architecture for GCNs, which supports cluster-level parallelism. In addition, we propose a coarse-grained pipeline dataflow to hide the RRAM write costs and improve GCN computation throughput. At the software/hardware interface level, we propose a PIM-aware GCN mapping strategy to achieve the optimal tradeoff between resource utilization and computation performance. We also propose edge dropping methods to reduce inter-core communication with little accuracy loss. We evaluate our framework on typical datasets with multiple widely used GCN models. Experimental results show that the proposed framework achieves $698\times$, $89\times$, and $41\times$ speedup with $7108\times$, $255\times$, and $31\times$ energy efficiency enhancement compared with CPUs, GPUs, and ASICs, respectively.
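As a rough illustration of two of the software-level ideas named above, the following sketch shows (i) a generic signed fixed-point quantizer and (ii) a split of graph edges into intra-cluster and inter-cluster sets for a given vertex-to-cluster assignment. It is a minimal sketch under stated assumptions, not the method of the paper: the function names, the 8-bit format with 4 fractional bits, and the hand-written cluster assignment are all hypothetical, and the abstract specifies neither the quantization scheme nor the clustering algorithm.

```python
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]


def quantize_fixed_point(values: Sequence[float],
                         total_bits: int = 8,
                         frac_bits: int = 4) -> List[float]:
    """Round values to a signed fixed-point grid and saturate out-of-range ones.

    The bit widths are illustrative assumptions, not values from the paper.
    """
    scale = 1 << frac_bits                   # number of grid steps per unit
    q_min = -(1 << (total_bits - 1))         # smallest representable integer code
    q_max = (1 << (total_bits - 1)) - 1      # largest representable integer code
    quantized = []
    for v in values:
        code = max(q_min, min(q_max, round(v * scale)))
        quantized.append(code / scale)       # map the integer code back to a real value
    return quantized


def split_edges(edges: Sequence[Edge],
                cluster_of: Dict[int, int]) -> Tuple[List[Edge], List[Edge]]:
    """Separate edges whose endpoints share a cluster from those that cross clusters."""
    intra = [(u, v) for u, v in edges if cluster_of[u] == cluster_of[v]]
    inter = [(u, v) for u, v in edges if cluster_of[u] != cluster_of[v]]
    return intra, inter


# Toy 4-vertex graph with a hand-picked (hypothetical) 2-cluster assignment.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
cluster_of = {0: 0, 1: 0, 2: 1, 3: 1}
intra_edges, inter_edges = split_edges(edges, cluster_of)
inter_ratio = len(inter_edges) / len(edges)

weights_q = quantize_fixed_point([0.3, -1.26, 100.0])
```

If each cluster were mapped to its own core, which is an assumption of this sketch, the inter-cluster edges would be the ones that require inter-core communication. A clustering that lowers `inter_ratio` therefore reduces that traffic, and edge dropping would remove some of the remaining inter-cluster edges.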
