Abstract: The ever-increasing computational complexity and energy consumption of today's applications, such as machine learning (ML) algorithms, not only strain the capabilities of the underlying hardware but also significantly restrict their deployment at the edge. Addressing these challenges requires novel architectural solutions that leverage opportunities exposed by the algorithms, e.g., robustness to small-bitwidth operand quantization and high intrinsic data-level parallelism. However, traditional hardware single instruction multiple data (Hard SIMD) architectures support only a small set of operand bitwidths, which limits the achievable performance improvement. To fill this gap, this manuscript introduces a novel pipelined processor microarchitecture for arithmetic computing based on the software-defined SIMD (Soft SIMD) paradigm, which can define arbitrary SIMD modes through control instructions at run-time. The microarchitecture is optimized for parallel fine-grained fixed-point arithmetic, such as shift/add. It can also efficiently execute sequential shift-add-based multiplication over SIMD subwords, thanks to zero-skipping and canonical signed digit (CSD) coding. A lightweight repacking unit allows the subword bitwidth to be changed dynamically. These features are implemented within a tight energy and area budget. An energy consumption model derived post-synthesis is used for performance assessment. We select heterogeneously quantized (HQ) convolutional neural networks (CNNs) from the ML domain as benchmarks and map them onto our microarchitecture. Experimental results show that our approach substantially outperforms a traditional Hard SIMD multiplier-adder in both area and energy requirements. In particular, our microarchitecture occupies up to 59.9% less area than a Hard SIMD that supports fewer SIMD bitwidths, while consuming up to 50.1% less energy on average to execute HQ CNNs.
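To make the shift-add multiplication with CSD coding and zero-skipping more concrete, the following Python sketch models the idea in software. It is an illustration under simplifying assumptions, not the microarchitecture described in the manuscript: subwords are unsigned, each product is assumed to fit within its subword, and all function names (`csd_digits`, `pack`, `unpack`, `simd_shift_add_multiply`) are hypothetical. The shift direction and step ordering of the actual datapath may differ.

```python
def csd_digits(n):
    """Return the canonical signed digit (CSD) form of integer n.

    Digits are in {-1, 0, +1}, least significant first, and no two
    adjacent digits are both non-zero.
    """
    digits = []
    while n != 0:
        if n & 1:
            # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def pack(values, width):
    """Pack unsigned subwords of `width` bits into one wide word."""
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << width), "subword out of range"
        word |= v << (i * width)
    return word


def unpack(word, width, count):
    """Split a wide word back into `count` subwords of `width` bits."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]


def simd_shift_add_multiply(values, multiplier, width):
    """Multiply every subword by the same non-negative multiplier.

    Each non-zero CSD digit costs one wide shift plus one add or
    subtract that acts on all subwords at once; zero digits are skipped.
    Returns (products, number_of_add_sub_steps).
    """
    assert multiplier >= 0
    # Simplifying assumption: every product must fit in its subword.
    assert all(v * multiplier < (1 << width) for v in values)
    word = pack(values, width)
    acc = 0
    steps = 0
    for shift, digit in enumerate(csd_digits(multiplier)):
        if digit == 0:
            continue  # zero-skipping: no add/sub issued for this digit
        acc += digit * (word << shift)
        steps += 1
    return unpack(acc, width, len(values)), steps


if __name__ == "__main__":
    products, steps = simd_shift_add_multiply([3, 5, 2], 7, width=8)
    print(products, steps)
```

In this sketch, the multiplier 7 is recoded as 8 - 1, so the three packed subwords are multiplied with two add/subtract steps instead of the three additions that plain binary (111) would need. Changing `width` mimics selecting a different SIMD mode for the same wide word.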

An Energy Efficient Soft SIMD Microarchitecture and Its Application on Quantized CNNs
