Abstract: Field Programmable Gate Arrays (FPGAs) are versatile, programmable hardware platforms, which makes them promising candidates for accelerating Deep Neural Networks (DNNs). However, the computing energy efficiency of FPGAs is low because interconnect data movement dominates energy consumption. In this paper, we propose an all-digital Compute-In-Memory (CIM) FPGA architecture for deep learning acceleration. Furthermore, we present a bit-serial computing circuit for the digital CIM core that accelerates vector-matrix multiplication (VMM) operations. A Network-CIM-Deployer (NCIMD) is also developed to support automatic deployment and mapping of DNNs. NCIMD provides a user-friendly API for DNN models in Caffe format. In addition, we introduce a Weight-Stationary (WS) dataflow and describe how a single network layer is mapped to the CIM array in the architecture. We evaluate the proposed FPGA architecture on Deep Learning (DL) and non-DL workloads, using different architectural layouts and mapping strategies, and compare the results with a conventional FPGA architecture. The experimental results show that, compared with the conventional FPGA architecture, the proposed CIM FPGA architecture improves energy efficiency by up to 16.1× and reduces latency by up to 40%.
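The abstract names a Weight-Stationary (WS) dataflow and a single-layer-to-array mapping but does not spell out either. The following is a minimal sketch of the general idea only, not the paper's mapping method: a layer's weight matrix is cut into tiles that each fit one CIM array, the tiles stay resident, and input vectors stream past them while partial sums are accumulated. The function names, the tile dimensions `array_rows` and `array_cols`, and the plain-list matrix layout are all assumptions made for illustration.

```python
def map_layer_to_tiles(weights, array_rows, array_cols):
    """Split a (rows x cols) weight matrix into tiles of at most
    array_rows x array_cols. Each tile stands for one hypothetical CIM array
    and records the row/column offset it covers in the full matrix."""
    n_rows, n_cols = len(weights), len(weights[0])
    tiles = []
    for r0 in range(0, n_rows, array_rows):
        for c0 in range(0, n_cols, array_cols):
            block = [row[c0:c0 + array_cols]
                     for row in weights[r0:r0 + array_rows]]
            tiles.append((r0, c0, block))
    return tiles


def run_weight_stationary(tiles, inputs, n_cols):
    """Stream input vectors past fixed tiles. The tiles are never reloaded
    (weights are stationary); only inputs and partial sums move."""
    outputs = []
    for x in inputs:
        y = [0] * n_cols
        for r0, c0, block in tiles:
            for j in range(len(block[0])):
                # Partial sum contributed by this tile to output column c0 + j.
                y[c0 + j] += sum(x[r0 + i] * block[i][j]
                                 for i in range(len(block)))
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    layer = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    tiles = map_layer_to_tiles(layer, array_rows=2, array_cols=2)
    print(len(tiles), "tiles")
    print(run_weight_stationary(tiles, [[1, 0, 1], [2, 1, 0]], n_cols=3))
```

In this sketch, tiles that share a column range contribute to the same outputs, so their partial sums must be added together; how the proposed architecture routes and accumulates those sums is not described in the abstract.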

What problem does this paper attempt to address?

The paper aims to address the energy efficiency issues in deep learning acceleration, particularly for the deployment and application of Internet of Things (IoT) terminal devices. The authors propose a fully digital Compute-In-Memory (CIM) Field Programmable Gate Array (FPGA) architecture to enhance the acceleration performance of Deep Neural Networks (DNNs). The main issues include:

1. **Low energy efficiency of traditional FPGAs**: Because data movement over the interconnect dominates energy consumption, traditional FPGAs are not highly energy-efficient for deep learning acceleration.
2. **Limitations of existing hardware platforms**: Mainstream hardware platforms such as Graphics Processing Units (GPUs), Application-Specific Integrated Circuits (ASICs), and FPGAs each have their limitations. For example, while GPUs have superior computational performance, they are constrained by the von Neumann architecture. ASICs, although energy-efficient, have high customization costs and are difficult to adapt to algorithm changes.

To address these issues, the paper proposes the following solutions:

- **Fully digital CIM FPGA architecture**: Combines storage units and computing units so that computation is executed directly in memory, significantly reducing the energy consumed by data movement.
- **Bit-serial computing circuits**: Used in the digital CIM core to accelerate Vector-Matrix Multiplication (VMM) operations.
- **NCIMD toolchain**: Supports automatic deployment and mapping of deep neural networks to the proposed CIM FPGA architecture and provides a user-friendly API for DNN models in Caffe format.

In experimental tests against a conventional FPGA architecture, the proposed CIM FPGA architecture improves energy efficiency by up to 16.1 times and reduces latency by up to 40%, which indicates a significant advantage for deep learning acceleration.
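The summary does not describe the bit-serial circuit itself, so the following is only a software sketch of the general bit-serial VMM technique, not the paper's design. It assumes unsigned `n_bits`-wide inputs fed least-significant bit first, integer weights held in place, and a shift-and-add accumulator; `bit_serial_vmm` and `reference_vmm` are hypothetical names chosen for this example.

```python
def bit_serial_vmm(x, weights, n_bits=8):
    """Compute x @ weights by processing x one bit-plane per step (LSB first).

    x:       list of unsigned integers, each below 2**n_bits
    weights: list of rows; weights[i][j] multiplies x[i] into output j
    """
    if any(v < 0 or v >= (1 << n_bits) for v in x):
        raise ValueError("inputs must be unsigned n_bits-wide integers")
    n_cols = len(weights[0])
    acc = [0] * n_cols
    for b in range(n_bits):
        # Bit-plane b of the input vector: one bit per input element.
        bits = [(v >> b) & 1 for v in x]
        for j in range(n_cols):
            # With 1-bit inputs, the multiply reduces to selecting weights
            # whose input bit is set and summing them.
            partial = sum(row[j] for bit, row in zip(bits, weights) if bit)
            # Weight the partial sum by the significance of this bit-plane.
            acc[j] += partial << b
    return acc


def reference_vmm(x, weights):
    """Ordinary vector-matrix product, used only to check the sketch."""
    return [sum(xi * row[j] for xi, row in zip(x, weights))
            for j in range(len(weights[0]))]


if __name__ == "__main__":
    x = [3, 5, 2]
    w = [[1, -2], [0, 4], [7, 1]]
    print(bit_serial_vmm(x, w, n_bits=4), reference_vmm(x, w))
```

The design point this illustrates is that each step needs only 1-bit-by-weight products, which is what lets the arithmetic sit next to the stored weights; the cost is one step per input bit.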

An All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration

A Near Memory Computing FPGA Architecture for Neural Network Acceleration

DaDianNao: A Machine-Learning Supercomputer

A Reconfigurable Computing-in-Memory Accelerator with Dynamic Group-Based Dataflow and Dual-Input Macro Designs

Simulation of a Fully Digital Computing-in-Memory for Non-Volatile Memory for Artificial Intelligence Edge Applications

A Data-Center FPGA Acceleration Platform for Convolutional Neural Networks

FPGA-Based High-Performance Data Compression Deep Neural Network Accelerator

New paradigm of FPGA-based computational intelligence from surveying the implementation of DNN accelerators

Acceleration of Deep Neural Network Training Using Field Programmable Gate Arrays

Designing Deep Learning Hardware Accelerator and Efficiency Evaluation

DLUX: A LUT-Based Near-Bank Accelerator for Data Center Deep Learning Training Workloads

Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework

Adaptive design and implementation of automatic modulation recognition accelerator

A Deep Residual Networks Accelerator on FPGA

AFPR-CIM: An Analog-Domain Floating-Point RRAM-based Compute-In-Memory Architecture with Dynamic Range Adaptive FP-ADC

IDLA: an Instruction-based Adaptive CNN Accelerator

FP-DNN: an Automated Framework for Mapping Deep Neural Networks Onto FPGAs with RTL-HLS Hybrid Templates

FPGA Implementations of 3D-SIMD Processor Architecture for Deep Neural Networks Using Relative Indexed Compressed Sparse Filter Encoding Format and Stacked Filters Stationary Flow

A Design Methodology for Efficient Implementation of Deconvolutional Neural Networks on an FPGA

Deep neural network accelerator based on FPGA

DLAU: A Scalable Deep Learning Accelerator Unit on FPGA