Abstract: Machine learning is widely applied in emerging data-intensive applications and must be optimized and accelerated by powerful engines to process very large-scale data. Recently, instruction-set-based accelerators on Field Programmable Gate Arrays (FPGAs) have become a promising topic for machine learning applications. Their customized instructions can be further scheduled to achieve higher instruction-level parallelism. In this article, we design a ubiquitous accelerator with out-of-order automatic parallelization for large-scale data-intensive applications. The accelerator accommodates four representative application classes: clustering algorithms, deep neural networks, genome sequencing, and collaborative filtering. To improve coarse-grained instruction-level parallelism, the accelerator employs an out-of-order scheduling method that enables parallel dataflow computation. We use Colored Petri Net (CPN) tools to analyze the dependences in the applications and build a hardware prototype on a real FPGA platform. For clustering applications, the accelerator supports four algorithms: K-Means, SLINK, PAM, and DBSCAN. For collaborative filtering applications, it accommodates Tanimoto, Euclidean, Cosine, and Pearson Correlation as similarity metrics. For deep learning applications, we implement hardware accelerators for both the training and inference processes. Finally, for genome sequencing, we design a hardware accelerator for the BWA-SW algorithm. Experimental results show that the accelerator architecture can reach up to 25X speedup over Intel processors with affordable hardware cost, insignificant power consumption, and high flexibility.
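For reference, the four similarity metrics named in the abstract can be sketched in software as follows. This is a minimal illustration using common textbook definitions for real-valued vectors; it is not the accelerator's hardware implementation, and the function names and the zero-denominator handling are assumptions made for this sketch.

```python
import math


def _dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def tanimoto(a, b):
    """Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = _dot(a, b)
    denom = _dot(a, a) + _dot(b, b) - dot
    # Assumption for this sketch: return 0.0 when the denominator vanishes.
    return dot / denom if denom else 0.0


def euclidean(a, b):
    """Euclidean distance (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine of the angle between the two vectors."""
    norm = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    return _dot(a, b) / norm if norm else 0.0


def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine(centered_a, centered_b)
```

All four metrics reduce to a few dot products and accumulations over the same pair of vectors, which is one reason they can plausibly share a common datapath in a single accelerator.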

Fastlanes: An FPGA Accelerated GPU Microarchitecture Simulator

Accelerating RTL Simulation with GPUs

Accelerating GPGPU Architecture Simulation

Acceleration for Timing-Aware Gate-Level Logic Simulation with One-Pass GPU Parallelism

GPGPU-MiniBench: Accelerating GPGPU Micro-Architecture Simulation

MALOC: A Fully Pipelined FPGA Accelerator for Convolutional Neural Networks with All Layers Mapped on Chip

GPU-Accelerated Non-Linear Analog and Mixed-Signal Circuit Transient Simulation

A Ubiquitous Machine Learning Accelerator With Automatic Parallelization on FPGA

Logic Simulation Acceleration Based on GPU

Fast FPGA System for Microarchitecture Optimization on Synthesizable Modern Processor Design

QuantLaneNet: A 640-FPS and 34-GOPS/W FPGA-Based CNN Accelerator for Lane Detection

A Hardware Accelerate Simulator For Network Processor Based On FPGA

Exploring GPU-Accelerated Routing for FPGAs

FPGA-Based Feature Extraction and Tracking Accelerator for Real-Time Visual SLAM

GPU Accelerating for Rapid Multi-core Cache Simulation

CPGPUSim: A Multi-dimensional Parallel Acceleration Framework for RTL Simulation

Toward Full-Stack Acceleration of Deep Convolutional Neural Networks on FPGAs

GPU-Accelerated Evaluation Platform For High Fidelity Network Modeling

Adaptive Multidimensional Parallel Fault Simulation Framework on Heterogeneous System

Using GPU to Accelerate Cache Simulation

A Statically and Dynamically Scalable Soft GPGPU