Abstract: Machine learning has been widely applied in various emerging data-intensive applications, and must be optimized and accelerated by powerful engines to process very-large-scale data. Recently, instruction-set-based accelerators on Field Programmable Gate Arrays (FPGAs) have become a promising topic for machine learning applications. The customized instructions can be further scheduled to achieve higher instruction-level parallelism. In this article, we design a ubiquitous accelerator with out-of-order automatic parallelization for large-scale data-intensive applications. The accelerator accommodates four representative applications: clustering algorithms, deep neural networks, genome sequencing, and collaborative filtering. To improve coarse-grained instruction-level parallelism, the accelerator employs an out-of-order scheduling method that enables parallel dataflow computation. We use Colored Petri Net (CPN) tools to analyze the dependences in the applications, and build a hardware prototype on a real FPGA platform. For clustering applications, the accelerator supports four algorithms: K-Means, SLINK, PAM, and DBSCAN. For collaborative filtering applications, it accommodates the Tanimoto, Euclidean, cosine, and Pearson correlation similarity metrics. For deep learning applications, we implement hardware accelerators for both the training and inference processes. Finally, for genome sequencing, we design a hardware accelerator for the BWA-SW algorithm. Experimental results show that the accelerator architecture can reach a speedup of up to 25X over Intel processors with affordable hardware cost, insignificant power consumption, and high flexibility.
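
As an illustration of the four similarity metrics named in the abstract, the following is a minimal software sketch in Python using their standard textbook definitions. It is not the accelerator's hardware implementation, which the abstract does not describe; the function names and the use of plain lists of floats are assumptions made for this example.

```python
import math

# Illustrative reference definitions only (assumed names, not the paper's code).
# Inputs are equal-length sequences of numbers, e.g. two users' rating vectors.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; undefined (division by zero) for an all-zero vector."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def tanimoto(a, b):
    """Tanimoto coefficient for real-valued vectors: a.b / (|a|^2 + |b|^2 - a.b)."""
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)

def euclidean_distance(a, b):
    """Euclidean distance; smaller values mean more similar vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

if __name__ == "__main__":
    u = [1.0, 2.0, 3.0]
    v = [2.0, 4.0, 6.0]
    print(cosine(u, v), tanimoto(u, v), euclidean_distance(u, v), pearson(u, v))
```

All four metrics reduce to a handful of multiply-accumulate passes over the two vectors followed by a short scalar step, which is one plausible reason they can share a single datapath on an accelerator.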

FPGA-based Block Minifloat Training Accelerator for a Time Series Prediction Network

A Near Memory Computing FPGA Architecture for Neural Network Acceleration

BOOST: Block Minifloat-Based On-Device CNN Training Accelerator with Transfer Learning

A Deep Learning Prediction Process Accelerator Based FPGA

High-Performance FPGA-Based CNN Accelerator with Block-Floating-Point Arithmetic

Optimizing FPGA-Based DNN Accelerator with Shared Exponential Floating-Point Format

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

A Block-Floating-Point Arithmetic Based FPGA Accelerator for Convolutional Neural Networks

SPAT: FPGA-based Sparsity-Optimized Spiking Neural Network Training Accelerator with Temporal Parallel Dataflow

An Efficient FPGA-based Accelerator for Deep Forest

EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data Reshaping for Online Adaptation or Personalization

Acceleration of Deep Neural Network Training Using Field Programmable Gate Arrays

Implementation and Optimization of the Accelerator Based on FPGA Hardware for LSTM Network

An Instruction-Driven Batch-Based High-Performance Resource-Efficient LSTM Accelerator on FPGA

Training DNNs with Hybrid Block Floating Point

A Power Efficient Neural Network Implementation on Heterogeneous FPGA and GPU Devices

A Power-Efficient Accelerator Based on FPGAs for LSTM Network

A Ubiquitous Machine Learning Accelerator With Automatic Parallelization on FPGA

FPGA-based Accelerator for Convolutional Neural Network

FPGA-based Accelerator for Long Short-Term Memory Recurrent Neural Networks

FlexBlock: A Flexible DNN Training Accelerator with Multi-Mode Block Floating Point Support