Abstract: Block-circulant matrix (BCM) compression has garnered much attention in the hardware acceleration of convolutional neural networks (CNNs) due to its regularity and efficiency. However, because the compression parameter space is difficult to explore, existing BCM-based methods often apply a single uniform compression parameter to all layers of a CNN model, sacrificing the flexibility of the compression. Additionally, optimizing models and accelerators independently makes it challenging to achieve the optimal tradeoff between model accuracy and hardware efficiency. To this end, we propose FlexBCM, a joint exploration framework that efficiently explores both the compression parameter space and the hardware parameter space to generate customized hybrid BCM-compressed CNN and field-programmable gate array (FPGA) accelerator solutions. On the algorithmic side, leveraging the idea of neural architecture search (NAS), we design an efficient differentiable sampling method to rapidly evaluate the accuracy of candidate subnets. We also devise a hardware-friendly frequency-domain quantization scheme for BCM computation. On the hardware side, we develop an efficient, parameter-configurable convolutional core (ConvPU) alongside a BCM computing core (BCMPU). The BCMPU flexibly accommodates different compression parameters at runtime and incorporates complex-number DSP packing and conjugate symmetry optimizations. For model-to-hardware evaluation, we construct accurate latency and resource consumption models. Moreover, we design a fast hardware generation algorithm based on coarse-grained search to provide prompt hardware-evaluation feedback for the current subnet. Finally, we validate FlexBCM on the Xilinx ZCU102 FPGA and compare its compressed CNN-accelerator solutions with previous state-of-the-art works. Experimental results demonstrate that FlexBCM achieves 1.21–3.02× higher computational efficiency for the ResNet18 and ResNet34 models while maintaining an acceptable accuracy loss on the ImageNet dataset.
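
For readers unfamiliar with BCM computation, the following is a minimal sketch of the general technique, not FlexBCM's implementation. It assumes the compression parameter is the block size b, so each b×b circulant block is stored as a single length-b vector (b values instead of b²). It shows why the product can be computed in the frequency domain and how conjugate symmetry for real-valued data means only the first b/2 + 1 frequency bins need to be multiplied. All function names are hypothetical, and a naive DFT stands in for a hardware FFT.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for a hardware FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse DFT."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def circulant_matvec_direct(w, x):
    """Multiply the circulant matrix whose first column is w by x, entry by entry."""
    n = len(w)
    # Entry (i, j) of the circulant matrix is w[(i - j) % n].
    return [sum(w[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def circulant_matvec_fft(w, x):
    """Same product via the convolution theorem, exploiting conjugate symmetry."""
    n = len(w)
    W, X = dft(w), dft(x)
    half = n // 2 + 1
    # For real w and x, only the first n//2 + 1 bins need a complex multiply.
    Y = [W[k] * X[k] for k in range(half)]
    # The remaining bins follow from Y[n - k] = conj(Y[k]).
    Y += [Y[n - k].conjugate() for k in range(half, n)]
    return [v.real for v in idft(Y)]


def bcm_matvec(blocks, x, b):
    """Block-circulant product; blocks[i][j] is the length-b vector of block (i, j)."""
    out = []
    for row in blocks:
        acc = [0.0] * b
        for j, w in enumerate(row):
            part = circulant_matvec_fft(w, x[j * b:(j + 1) * b])
            acc = [a + p for a, p in zip(acc, part)]
        out.extend(acc)
    return out
```

This sketch accumulates partial results in the spatial domain for clarity. A common optimization in BCM accelerators is to accumulate in the frequency domain and apply one inverse transform per output block; whether FlexBCM does so is not stated in the abstract.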

FlexBCM: Hybrid Block-Circulant Neural Network and Accelerator Co-Search on FPGAs
