Abstract: Deep learning and convolutional neural networks (CNNs) have become increasingly popular and important in both academia and industry in recent years because they provide better accuracy in classification, detection, and recognition than traditional approaches. Many popular frameworks for deep learning development are currently available, such as Caffe, TensorFlow, and PyTorch, and most of them natively support CPUs and treat GPUs as the mainline accelerator by default. FPGA devices, viewed as a potential heterogeneous platform, still lack comprehensive support for CNN development in popular frameworks, particularly for the training phase. In this paper, we first propose FeCaffe, i.e. FPGA-enabled Caffe, a hierarchical software and hardware design methodology based on Caffe that enables FPGAs to support mainline deep learning development features, e.g. training and inference with Caffe. Furthermore, we provide benchmarks of FeCaffe using several classical CNNs as examples, together with a detailed analysis of kernel execution time. Finally, we propose optimization directions at the FPGA kernel design, system pipeline, network architecture, use-case application, and heterogeneous platform levels to improve FeCaffe performance and efficiency. The results demonstrate that FeCaffe supports almost the full feature set for both CNN training and inference, with a high degree of design flexibility, extensibility, and reusability for deep learning development. Compared to prior studies, our architecture supports more networks and training settings, and the current configuration achieves 6.4x and 8.4x average execution time improvement for the forward and backward passes, respectively, for LeNet.

CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU–GPU system

US-Byte: an Efficient Communication Framework for Scheduling Unequal-Sized Tensor Blocks in Distributed Deep Learning

CoDL: efficient CPU-GPU co-execution for deep learning inference on mobile devices

Fast Coflow Scheduling Via Traffic Compression and Stage Pipelining in Datacenter Networks

A Fine-Grained End-to-End Latency Optimization Framework for Wireless Collaborative Inference

DVFO: Learning-Based DVFS for Energy-Efficient Edge-Cloud Collaborative Inference

ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG

BatOpt: Optimizing GPU-Based Deep Learning Inference Using Dynamic Batch Processing

CEFS: compute-efficient flow scheduling for iterative synchronous applications

BCEdge: SLO-Aware DNN Inference Services With Adaptive Batch-Concurrent Scheduling on Edge Devices

KubFBS: A fine‐grained and balance‐aware scheduling system for deep learning tasks based on kubernetes

Multi-user Co-inference with Batch Processing Capable Edge Server

Dynamic Space-Time Scheduling for GPU Inference

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

CPN Meets Learning: Online Scheduling for Inference Service in Computing Power Network

FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline

GreenFlow: A Carbon-Efficient Scheduler for Deep Learning Workloads

Efficient CUDA stream management for multi-DNN real-time inference on embedded GPUs

Differentiate Quality of Experience Scheduling for Deep Learning Inferences with Docker Containers in the Cloud

Energy-Efficient GPU Clusters Scheduling for Deep Learning

FeCaffe: FPGA-enabled Caffe with OpenCL for Deep Learning Training and Inference on Intel Stratix 10