Abstract: Accurate and efficient machine learning algorithms are vital to many problems, especially classification and clustering tasks, but they need a universal AI model standard. Unifying machine learning models into a common ecosystem can reduce development time and improve framework interoperability. ONNX (Open Neural Network Exchange Format) is a popular open format for representing deep learning models so that AI developers can more easily move models between state-of-the-art tools. In addition, hardware companies such as Nvidia and Intel are following this trend and producing hardware-optimized runtimes (e.g., for CPUs, GPUs, and FPGAs) that can handle open-format AI models such as ONNX. This enables developers to leverage a heterogeneous mix of hardware and use whichever AI framework they prefer. FPGAs, however, are more challenging to target, even though as a platform they have proven to address these kinds of problems very efficiently in terms of performance and power. This work builds on HLS4ML, a project at an early stage of development that was originally created for particle physics applications and automatically generates neural networks (NNs) for embedded Xilinx FPGAs. We add hardware-aware NN training and a generalized optimization scheme on top of HLS4ML that improve the performance and power efficiency of this package and add support for generating cloud FPGA firmware from any NN model. We start with FPGA-oriented training of a Keras model for image recognition, convert it into the ONNX open format, and then port and optimize it for cloud FPGAs using a novel scheme with optimizations in the host, memory, and kernels, at multiple levels of network precision. To the best of our knowledge, this is a novel approach; it achieves a speed-up of up to 102× over a single CPU in performance and up to 5.5× over a GPU in performance per watt.
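To make the phrase "multiple levels of network precision" concrete, the following is a minimal sketch of how a weight might be mapped onto signed fixed-point grids of different widths, as is common for FPGA inference. It is not taken from the paper or from HLS4ML: the function names, the example weights, and the (total bits, integer bits) pairs are hypothetical, and the sketch assumes round-to-nearest with saturation, which may differ from the rounding and overflow modes a given HLS toolchain uses by default.

```python
def quantize_fixed(x, total_bits, int_bits):
    """Map x onto a signed fixed-point grid (illustrative helper, not an HLS4ML API).

    total_bits: word width including the sign bit.
    int_bits:   integer bits including the sign bit.
    Assumes round-to-nearest (Python's round) and saturation on overflow.
    """
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits                   # number of grid steps per unit
    lowest = -(1 << (total_bits - 1))        # most negative representable code
    highest = (1 << (total_bits - 1)) - 1    # most positive representable code
    code = round(x * scale)
    code = max(lowest, min(highest, code))   # saturate instead of wrapping
    return code / scale


def max_abs_error(weights, total_bits, int_bits):
    """Largest absolute quantization error over a list of weights."""
    return max(abs(w - quantize_fixed(w, total_bits, int_bits)) for w in weights)


if __name__ == "__main__":
    # Hypothetical weights, chosen only to exercise the helper.
    weights = [0.7312, -0.1204, 0.0031, -1.5, 2.25]
    # Hypothetical precision levels as (total_bits, int_bits).
    for total_bits, int_bits in [(16, 6), (8, 3), (4, 2)]:
        err = max_abs_error(weights, total_bits, int_bits)
        print(f"<{total_bits},{int_bits}> max abs error = {err:.6f}")
```

Narrower words use fewer FPGA resources per multiply but introduce larger rounding and saturation error, which is the trade-off that hardware-aware training is meant to absorb.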

FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems

In-Network Aggregation with Transport Transparency for Distributed Training

NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration

FpgaNIC: An FPGA-based Versatile 100Gb SmartNIC for GPUs

Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training

Utilizing cloud FPGAs towards the open neural network standard

FPDeep: Scalable Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters

High Performance Scalable FPGA Accelerator for Deep Neural Networks

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud

An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs

[DL] A Survey of FPGA-based Neural Network Inference Accelerators

AddNet: Deep Neural Networks Using FPGA-Optimized Multipliers

A Survey of FPGA-Based Neural Network Accelerator

New paradigm of FPGA-based computational intelligence from surveying the implementation of DNN accelerators

An SSD-MobileNet Acceleration Strategy for FPGAs Based on Network Compression and Subgraph Fusion

Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models

FPX-NIC: An FPGA-Accelerated 4K Ultra-High-Definition Neural Video Coding System

Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators

Elastic-DF: Scaling Performance of DNN Inference in FPGA Clouds through Automatic Partitioning

Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology