Abstract: 3D point cloud perception has become fundamental to a wide range of applications. In particular, with the rapid development of neural networks, voxel-based networks have attracted great attention due to their excellent performance. Various accelerator designs have been proposed to improve the hardware performance of voxel-based networks, especially to speed up the map search process. However, several challenges remain: (1) massive off-chip data access volume caused by map search operations, notably in high-resolution and densely distributed cases; (2) frequent data movement for data-intensive convolution operations; and (3) imbalanced workload caused by the irregular sparsity of point data. To address these challenges, we propose Voxel-CIM, an efficient Compute-in-Memory based accelerator for voxel-based neural network processing. To reduce off-chip memory access for map search, a depth-encoding-based output-major search approach is introduced to maximize data reuse, achieving a stable $O(N)$-level data access volume in various situations. Voxel-CIM also employs the in-memory computing paradigm and introduces weight mapping strategies to efficiently process sparse 3D convolutions and 2D convolutions. Implemented in 22 nm technology and evaluated on representative benchmarks, Voxel-CIM achieves on average 4.5~7.0$\times$ higher energy efficiency (10.8 TOPS/W), a 2.4~5.4$\times$ speedup on the detection task, and a 1.2~8.1$\times$ speedup on the segmentation task compared to state-of-the-art point cloud accelerators and powerful GPUs.

What problem does this paper attempt to address?

The paper aims to improve the performance of voxel-based point cloud neural networks on hardware accelerators. It targets three major challenges:

1. **Large off-chip data access caused by map search**: Before sparse convolution can run, input-output mapping tables (IN-OUT maps) must be constructed, which usually incurs a large amount of off-chip data access, especially in high-resolution and densely distributed scenarios.
2. **Frequent data movement**: In a traditional von Neumann architecture, the "memory wall" problem means that the large amount of data movement between computing units and storage units limits the processing speed of neural networks.
3. **Unbalanced workload**: Because point cloud data is irregularly sparse and unevenly distributed, each weight corresponds to a different number of input-output pairs. Central weights usually carry a higher workload than edge weights, which leaves computational resources underutilized.

To address these challenges, the paper proposes Voxel-CIM, an efficient accelerator based on Compute-in-Memory (CIM). Its main contributions are:

- **Reducing off-chip data access**: A new search scheme, Depth-encoding-based Output Major Search (DOMS), achieves stable $O(N)$-level off-chip memory access by maximizing data reuse.
- **Designing CIM processing units and their weight mapping strategy**: The design supports efficient sparse 3D convolution (Spconv3D) and 2D convolution (Conv2D), and a Weight Workload Balanced (W2B) method addresses the workload mismatch across weights.
- **Performance evaluation**: Evaluations on detection and segmentation benchmarks show that, compared to state-of-the-art point cloud accelerators and powerful GPUs, Voxel-CIM improves energy efficiency by 4.5~7.0 times on average (10.8 TOPS/W), accelerates detection tasks by 2.4~5.4 times, and accelerates segmentation tasks by 1.2~8.1 times.

In summary, the paper addresses the key bottlenecks in hardware acceleration of voxel-based point cloud neural networks by combining a data-reuse-oriented search method with a CIM architecture, improving both performance and energy efficiency.
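To make the IN-OUT maps and the unbalanced per-weight workload concrete, here is a minimal Python sketch. It is not the paper's DOMS or W2B method: it assumes a stride-1 submanifold sparse convolution (outputs sit at the same coordinates as the active inputs) and uses a plain in-memory hash table for the search. The function names `build_in_out_maps` and `workload_per_weight` and the toy voxel coordinates are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product


def build_in_out_maps(voxels, kernel_size=3):
    """Map each kernel offset (one weight) to its (in_idx, out_idx) pairs.

    Assumption: stride-1 submanifold convolution, so every active input
    voxel is also an output voxel. This is a plain hash-table search,
    not the DOMS scheme described in the paper.
    """
    index_of = {coord: i for i, coord in enumerate(voxels)}
    r = kernel_size // 2
    offsets = list(product(range(-r, r + 1), repeat=3))
    maps = defaultdict(list)
    # Iterate over outputs and look up each candidate input neighbor.
    for out_idx, (x, y, z) in enumerate(voxels):
        for dx, dy, dz in offsets:
            in_idx = index_of.get((x + dx, y + dy, z + dz))
            if in_idx is not None:
                maps[(dx, dy, dz)].append((in_idx, out_idx))
    return dict(maps)


def workload_per_weight(maps):
    """Count the input-output pairs each weight has to process."""
    return {offset: len(pairs) for offset, pairs in maps.items()}


# Toy point cloud: a small cluster plus one isolated voxel.
voxels = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (0, 1, 0), (5, 5, 5)]
maps = build_in_out_maps(voxels)
load = workload_per_weight(maps)

for offset, count in sorted(load.items(), key=lambda kv: -kv[1]):
    print(offset, count)
```

On this toy input the central weight `(0, 0, 0)` handles one pair per active voxel, while the surrounding weights handle only a few pairs and most of the 27 weights handle none. This is the kind of imbalance a workload-balancing weight mapping has to compensate for.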

Voxel-CIM: An Efficient Compute-in-Memory Accelerator for Voxel-based Point Cloud Neural Networks
