Abstract: The growing complexity and diversity of neural networks in autonomous driving and intelligent robotics have driven research on many-core architectures, which, unlike domain-specific architectures, offer enough programming flexibility to run parallel inference for multiple DNNs with different network structures and sizes. However, under tight area and power constraints, many-core architectures typically use lightweight scalar cores without vector units and can hardly meet the high-performance computing needs of multi-DNN parallel inference. To solve this problem, we design an area- and energy-efficient many-core architecture that integrates a large number of lightweight processor cores implementing the RV32IMA ISA. The architecture leverages emerging SRAM-based computing-in-memory technology to implement vector instruction extensions by reusing memory cells in the data cache instead of adding conventional logic circuits. The data cache in each core can thus be reconfigured into a memory part and a computing part, with the latter tightly coupled to the core pipeline, enabling parallel execution of the basic RISC-V instructions and the extended multi-cycle vector instructions. Furthermore, we propose a corresponding execution framework that maps DNN models onto the many-core architecture using intra-layer and inter-layer pipelining, which can potentially support multi-DNN parallel inference. Experimental results show that the proposed architecture, MAICC, achieves 4.3× the throughput and 31.6× the energy efficiency of a CPU (Intel i9-13900K). MAICC also achieves 1.8× the energy efficiency of a GPU (RTX 4090) with only 4 MB of on-chip memory and a 28 mm² area.
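The combination of intra-layer and inter-layer pipelining can be pictured with a minimal sketch. This is not MAICC's mapping algorithm, which the abstract does not specify; it only assumes that each layer forms one pipeline stage (inter-layer), that a stage's work splits evenly across the cores assigned to it (intra-layer), and that throughput is limited by the slowest stage. The function names and the per-layer MAC counts are hypothetical.

```python
def allocate_cores(layer_macs, total_cores):
    """Assign cores to layers so the slowest pipeline stage is as fast as possible.

    Assumption (not from the paper): work per layer is its MAC count and
    divides evenly across the cores assigned to that layer.
    """
    if total_cores < len(layer_macs):
        raise ValueError("need at least one core per layer")
    # Inter-layer pipelining: every layer gets its own stage with one core.
    cores = [1] * len(layer_macs)
    # Intra-layer parallelism: give each spare core to the current bottleneck.
    for _ in range(total_cores - len(layer_macs)):
        slowest = max(range(len(layer_macs)),
                      key=lambda i: layer_macs[i] / cores[i])
        cores[slowest] += 1
    return cores


def bottleneck(layer_macs, cores):
    """Per-core work of the slowest stage, which bounds pipeline throughput."""
    return max(m / c for m, c in zip(layer_macs, cores))


if __name__ == "__main__":
    macs = [100, 300, 200]          # hypothetical MAC counts for three layers
    plan = allocate_cores(macs, 6)  # six cores available
    print(plan, bottleneck(macs, plan))
```

Under these assumptions, several such allocations could run side by side on disjoint core groups, which is one way to read the abstract's multi-DNN parallel inference.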

A Scalable Multi-TeraOPS Core for AI Training and Inference

DaDianNao: A Machine-Learning Supercomputer

A 7-nm Four-Core Mixed-Precision AI Chip With 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling

High Performance Scalable FPGA Accelerator for Deep Neural Networks

RaPiD: AI Accelerator for Ultra-low Precision Training and Inference

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU.

7.2 A 12nm Programmable Convolution-Efficient Neural-Processing-Unit Chip Achieving 825TOPS

A Scalable Multi-Chiplet Deep Learning Accelerator with Hub-Side 2.5D Heterogeneous Integration.

120 GOPS Photonic Tensor Core in Thin-film Lithium Niobate for Inference and in-situ Training

Ifpna: A Flexible and Efficient Deep Neural Network Accelerator with a Programmable Data Flow Engine in 28nm CMOS.

PAICORE: A 1.9-Million-neuron 5.181-TSOPS/W Digital Neuromorphic Processor with Unified SNN-ANN and On-Chip Learning Paradigm

22.1 A 12.4TOPS/W @ 136GOPS AI-IoT System-on-Chip with 16 RISC-V, 2-to-8b Precision-Scalable DNN Acceleration and 30%-Boost Adaptive Body Biasing

Hypermultiplexed Integrated Tensor Optical Processor

All-rounder: A flexible DNN accelerator with diverse data format support

A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference

Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10

Quartet: A 22nm 0.09mJ/Inference Digital Compute-in-Memory Versatile AI Accelerator with Heterogeneous Tensor Engines and Off-Chip-Less Dataflow

MAICC: A Lightweight Many-core Architecture with In-Cache Computing for Multi-DNN Parallel Inference

A Small-Footprint Accelerator for Large-Scale Neural Networks

A High Energy-Efficiency Multi-core Neuromorphic Architecture for Deep SNN Training