Abstract: Machine learning is widely applied in emerging data-intensive applications and must be optimized and accelerated by powerful engines to process very large-scale data. Recently, instruction-set-based accelerators on Field Programmable Gate Arrays (FPGAs) have become a promising topic for machine learning applications. Their customized instructions can be further scheduled to achieve higher instruction-level parallelism. In this article, we design a ubiquitous accelerator with out-of-order automatic parallelization for large-scale data-intensive applications. The accelerator accommodates four representative applications: clustering algorithms, deep neural networks, genome sequencing, and collaborative filtering. To improve coarse-grained instruction-level parallelism, the accelerator employs an out-of-order scheduling method that enables parallel dataflow computation. We use Colored Petri Net (CPN) tools to analyze the dependences in the applications, and build a hardware prototype on a real FPGA platform. For clustering applications, the accelerator supports four algorithms: K-Means, SLINK, PAM, and DBSCAN. For collaborative filtering applications, it accommodates Tanimoto, Euclidean, Cosine, and Pearson Correlation as similarity metrics. For deep learning applications, we implement hardware accelerators for both the training process and the inference process. Finally, for genome sequencing, we design a hardware accelerator for the BWA-SW algorithm. Experimental results show that the accelerator architecture can reach up to 25X speedup against Intel processors with affordable hardware cost, insignificant power consumption, and high flexibility.
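
For readers unfamiliar with the four similarity metrics named in the abstract, the following is a minimal software sketch of their textbook definitions on dense rating vectors. It is not the accelerator's hardware implementation. The function names are hypothetical, the Tanimoto form is the real-valued vector generalization, and mapping Euclidean distance to a similarity as 1 / (1 + d) is one common convention assumed here, not something the abstract specifies.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: a.b / (|a| |b|). Undefined for zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors.
    Undefined when either vector is constant."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine(centered_a, centered_b)


def tanimoto(a, b):
    """Tanimoto coefficient (assumed real-valued form):
    a.b / (|a|^2 + |b|^2 - a.b). Undefined when both vectors are zero."""
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)


def euclidean_similarity(a, b):
    """Euclidean distance d mapped to a similarity in (0, 1] as 1 / (1 + d).
    The mapping is an assumed convention; other mappings are also used."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)


if __name__ == "__main__":
    # Two hypothetical users whose ratings differ only by a scale factor.
    u = [1.0, 2.0, 3.0]
    v = [2.0, 4.0, 6.0]
    for name, fn in [("cosine", cosine), ("pearson", pearson),
                     ("tanimoto", tanimoto),
                     ("euclidean", euclidean_similarity)]:
        print(name, fn(u, v))
```

All four metrics reduce to a few dot products and element-wise differences over the same pair of vectors, which is one plausible reason a single accelerator can share compute units across them.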

The Masala Machine: Accelerating Thread-Intensive And Explicit Memory Management Programs With Dynamically Reconfigurable FPGAs

Accelerating thread-intensive and explicit memory management programs with dynamic partial reconfiguration

An Energy Efficient Floating Point Computing Infrastructure Embedding Ferroelectric Field Effect Transistor Based Ternary Content Addressable Memories

CSA-CiM: Enhancing Multi-Functional Computing-in-Memory with Configurable Sense Amplifiers

A design framework for processing-in-memory accelerator

MeMPA: A Memory Mapped M-SIMD Co-Processor to Cope with the Memory Wall Issue

ePUMA Embedded Parallel DSP Processor with Unique Memory Access

A Ubiquitous Machine Learning Accelerator With Automatic Parallelization on FPGA

Compiling Halide Programs to Push-Memory Accelerators

RMP-MEM: A HW/SW Reconfigurable Multi-Port Memory Architecture for Multi-PEA Oriented CGRA

Cache-emulated Register File: an Integrated On-Chip Memory Architecture for High Performance GPGPUs

Compiler-directed scratchpad memory data transfer optimization for multithreaded applications on a heterogeneous many-core architecture

Software Programmable Data Allocation in Multi-bank Memory of SIMD Processors

Design and implementation of parallel multi-access memory interface

Thread Batching for High-performance Energy-efficient GPU Memory Design

Accelerating Attention Mechanism on FPGAs Based on Efficient Reconfigurable Systolic Array

FSHMEM: Supporting Partitioned Global Address Space on FPGAs for Large-Scale Hardware Acceleration Infrastructure

Automatic multidimensional memory partitioning for FPGA-based accelerators (abstract only)

Programmable FPGA-based Memory Controller

A Configurable Management Strategy for Parallel Access of Coarse-Grained Reconfigurable Architecture for Radar Processing

Hardware Thread Accelerating Method Based on CPU/FPGA Hybrid Architecture