Abstract: Field-programmable gate arrays (FPGAs) can efficiently implement custom applications via their embedded digital signal processor (DSP) slices, which include binary multipliers. The more binary multipliers a DSP slice contains, the more multiplication operations it can process in one clock cycle. To fully utilize the DSP resource, in this paper we propose a novel DSP slice optimization method, namely PMSDS, that achieves parallel multiplication on a single DSP slice. First, the PMSDS splits the multiplicators into two separate parts, i.e., valid bits and vacant bits, using a customized polynomial algebra method. Then, the PMSDS pre-calculates the maximum number of overflow bits using the same polynomial algebra method. Finally, it computes the total bit width of the multiplicators and processes the final multiplicators in parallel. We also propose an optimization model that finds the best parallel solution according to the performance and precision of a single DSP slice. Moreover, we implement a PMSDS-based matrix multiplication algorithm that supports dynamically changing computing precision. Experiments on a large-scale, real-world matrix multiplication show that the PMSDS achieves better latency and resource utilization than the traditional, add-tree, and full-unroll methods, and outperforms state-of-the-art methods in frequency and dynamic power consumption.
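As a rough illustration of the idea behind packing several multiplications into one wide multiplier, the following Python sketch models one common operand-packing scheme for unsigned integers. It is not taken from the paper and is not necessarily how PMSDS works: the function name `packed_multiply`, the operand widths, and the port widths `port_a_bits` and `port_b_bits` are all assumptions chosen for the example. Two small operands are placed in one wide word, separated by zero padding (playing the role of the vacant bits) that is wide enough to absorb the largest possible partial product, so a single multiplication yields both results.

```python
def packed_multiply(a0, a1, b, n_bits=8, m_bits=8,
                    port_a_bits=27, port_b_bits=18):
    """Compute a0*b and a1*b with one wide multiplication (unsigned only).

    n_bits: width of a0 and a1; m_bits: width of the shared operand b.
    port_a_bits / port_b_bits: assumed multiplier port widths (hypothetical).
    """
    if not (0 <= a0 < (1 << n_bits) and 0 <= a1 < (1 << n_bits)):
        raise ValueError("a0 and a1 must fit in n_bits")
    if not 0 <= b < (1 << m_bits):
        raise ValueError("b must fit in m_bits")

    # a0*b is always below 2**(n_bits + m_bits), so placing a1 at this
    # offset guarantees the low product never overflows into the high one.
    shift = n_bits + m_bits

    # The packed operand needs shift + n_bits bits in total.
    if shift + n_bits > port_a_bits or m_bits > port_b_bits:
        raise ValueError("operands do not fit the assumed multiplier ports")

    packed = a0 | (a1 << shift)          # valid bits of a0, padding, then a1
    product = packed * b                 # the single wide multiplication

    low = product & ((1 << shift) - 1)   # a0 * b
    high = product >> shift              # a1 * b
    return low, high


if __name__ == "__main__":
    lo, hi = packed_multiply(200, 123, 255)
    print(lo == 200 * 255, hi == 123 * 255)
```

The padding width is the quantity that has to be computed ahead of time in this sketch: too little and the low product corrupts the high one, too much and the packed word no longer fits the multiplier port. Signed operands would need an additional correction step that is omitted here.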

Matrix Multiplication Based on Scalable Macro-Pipelined FPGA Accelerator Architecture

A Low Latency High Throughput Multiply-accumulator Unit for Float Point and Integer

A reconfigurable macro-pipelined systolic accelerator architecture

Design of Field Programmable Gate Array Based Real-Time Double-Precision Floating-Point Matrix Multiplier

Floating-Point Multiply-Accumulative Processing Element on FPGAs

Towards a Multi-array Architecture for Accelerating Large-scale Matrix Multiplication on FPGAs

Fast and Practical Strassen's Matrix Multiplication using FPGAs

Design and Implementation of Floating-Point Multiply-Accumulate Processing Element under SMVM System

MALMM: A Multi-Array Architecture for Large-Scale Matrix Multiplication on FPGA

A Universal FPGA-based Floating-Point Matrix Processor for Mobile Systems

Configurable sparse matrix - matrix multiplication accelerator on FPGA: A systematic design space exploration approach with quantization effects

An Efficient Method of Parallel Multiplication on a Single DSP Slice for Embedded FPGAs

Scalable Systolic Array Multiplier Optimized by Sparse Matrix

Parallel Photonic Acceleration Processor for Matrix-Matrix Multiplication

Ethernet based multi-FPGA matrix multiplication parallel computing system design

Run-Time-Reconfigurable Multi-Precision Floating-Point Matrix Multiplier Intellectual Property Core on FPGA

Accelerating 128-bit Floating-Point Matrix Multiplication on FPGAs

A reconfigurable macro-pipelined DCT/IDCT accelerator

The design of multiple-precision floating-point multiplier with SIMD support

Research of High-Speed Pipelined Floating-Point Multiplier Design

A Scalable Matrix Computing Unit Architecture for FPGA, and SCUMO User Design Interface