Abstract: Owing to the low power consumption of analog delay-based computation, time-domain computing-in-memory (TD-CIM) shows strong potential for energy-constrained edge and IoT scenarios that deploy convolutional neural networks (CNNs). However, the latency of delay-based computation is proportional to the number and values of multiply-and-accumulate operations (MACs), which bottlenecks the throughput of previous data-agnostic TD-CIM-based processors that compute complete convolutions with a fixed MAC mapping. Two observations motivate this work. First, some output activations in each CNN layer contribute little to the final classification result; these insignificant output activations (IOAs) can be replaced by sums of partial MACs with marginal accuracy degradation, so computing complete convolutions performs redundant MACs. Second, activations and weights vary with input images and models, so a fixed MAC mapping leaves MAC values unbalanced across delay chains, causing long idle time and latency. To address these problems, we design a data-aware TD-CIM-based CNN processor, DATIC, with three latency-reducing techniques: 1) a channel-skipping TD-CIM macro that removes redundant MACs for IOAs by keeping activations stationary in SRAM bitcells and shifting weights so that only the necessary MACs are performed; 2) a convolution-order programming unit that reduces the overhead of skipping redundant MACs for IOAs located at random positions on feature maps; and 3) an activation-weight-adaptive channel-mapping scheduler that balances the latency of delay chains by dynamically altering the convolution mapping. Implemented in TSMC 28-nm technology, DATIC achieves 622.9-GOPS throughput and 32.7-TOPS/W energy efficiency for ResNet-18 with 2-b weights and 8-b activations.
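
The load-balancing argument in the abstract can be illustrated with a toy model. The sketch below is not DATIC's scheduler: it assumes each channel contributes a single scalar delay cost (a hypothetical stand-in for its MAC values), that a delay chain's latency is the sum of the costs mapped to it, and that overall latency is set by the slowest chain. The balanced mapping uses a standard greedy longest-first heuristic, swapped in here purely for illustration; all function names are hypothetical.

```python
import heapq


def chain_latencies(costs, assignment, num_chains):
    """Sum the per-channel delay costs mapped onto each delay chain."""
    latencies = [0] * num_chains
    for cost, chain in zip(costs, assignment):
        latencies[chain] += cost
    return latencies


def fixed_mapping(costs, num_chains):
    """Data-agnostic mapping: channel i always goes to chain i mod num_chains."""
    return [i % num_chains for i in range(len(costs))]


def balanced_mapping(costs, num_chains):
    """Data-aware mapping (illustrative greedy heuristic): place the most
    expensive channels first, each onto the currently least-loaded chain."""
    heap = [(0, chain) for chain in range(num_chains)]
    heapq.heapify(heap)
    assignment = [0] * len(costs)
    for i in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, chain = heapq.heappop(heap)
        assignment[i] = chain
        heapq.heappush(heap, (load + costs[i], chain))
    return assignment


# Hypothetical per-channel delay costs; large and small values alternate,
# which is a bad case for the fixed round-robin mapping.
costs = [9, 1, 8, 1, 7, 1, 6, 1]
fixed = chain_latencies(costs, fixed_mapping(costs, 2), 2)
balanced = chain_latencies(costs, balanced_mapping(costs, 2), 2)

# Overall latency is bounded by the slowest chain; the others sit idle.
print("fixed:", fixed, "latency:", max(fixed))
print("balanced:", balanced, "latency:", max(balanced))
```

In this toy setting the total work is identical under both mappings; only its distribution across chains changes, which is why the slowest-chain latency drops when the mapping adapts to the data.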

MACA: Memory-aware Convolution Accelerating for CNN Inference on Edge Devices

MLCNN: Cross-Layer Cooperative Optimization and Accelerator Architecture for Speeding Up Deep Learning Applications

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

A High Efficient Architecture for Convolution Neural Network Accelerator

Memory System Designed for Multiply-Accumulate (MAC) Engine Based on Stochastic Computing

ABM-SpConv-SIMD: Accelerating Convolutional Neural Network Inference for Industrial IoT Applications on Edge Devices

EdgeCI: Distributed Workload Assignment and Model Partitioning for CNN Inference on Edge Clusters

Optimizing Stochastic Computing for Low Latency Inference of Convolutional Neural Networks

DATIC: A Data-Aware Time-Domain Computing-in-Memory-Based CNN Processor with Dynamic Channel Skipping and Mapping

Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs

Communication Minimized Model-Architecture Co-design for Efficient Convolution Acceleration

A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

DeeperThings: Fully Distributed CNN Inference on Resource-Constrained Edge Devices

Design of a Generic Dynamically Reconfigurable Convolutional Neural Network Accelerator with Optimal Balance

Exploration of Balanced Design in Resource-Constrained Edge Device for Efficient CNNs

IECA: An In-Execution Configuration CNN Accelerator With 30.55 GOPS/mm² Area Efficiency

A Parallel Loading Based Accelerator for Convolution Neural Network

Accelerate Convolutional Neural Network With A Customized VLIW DSP

An Energy-Efficient Mixed-Signal Parallel Multiply-Accumulate (MAC) Engine Based on Stochastic Computing

A Low-Power Hardware Architecture for Real-Time CNN Computing

Convolutional Neural Network Accelerator Architecture Design for Ultimate Edge Computing Scenario