Abstract: With recent advances in the Internet of Things (IoT), it has become very attractive to implement deep convolutional neural networks (DCNNs) on embedded/portable systems. Presently, executing software-based DCNNs requires high-performance server clusters in practice, restricting their widespread deployment on mobile devices. To overcome this issue, considerable research effort has been devoted to developing highly parallel, specialized DCNN hardware using GPGPUs, FPGAs, and ASICs. Stochastic Computing (SC), which uses a bit-stream to represent a number within [-1, 1] by counting the number of ones in the bit-stream, has high potential for implementing DCNNs with high scalability and an ultra-low hardware footprint. Since multiplications and additions can be calculated using AND gates and multiplexers in SC, significant reductions in power/energy and hardware footprint can be achieved compared to conventional binary arithmetic implementations. These tremendous savings in power (energy) and hardware resources open up an immense design space for enhancing the scalability and robustness of hardware DCNNs. This paper presents the first comprehensive design and optimization framework of SC-based DCNNs (SC-DCNNs). We first present the optimal designs of function blocks that perform the basic operations, i.e., inner product, pooling, and activation function. Then we propose the optimal design of four types of combinations of basic function blocks, named feature extraction blocks, which are in charge of extracting features from input feature maps. In addition, weight storage methods are investigated to reduce the area and power/energy consumption for storing weights. Finally, the whole SC-DCNN implementation is optimized, with feature extraction blocks carefully selected, to minimize area and power/energy consumption while maintaining a high network accuracy level.
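
To make the arithmetic claim in the abstract concrete, the following is a minimal software sketch of stochastic computing, not the paper's hardware design. It assumes a unipolar encoding, in which a value p in [0, 1] is the probability that a bit is 1, so that an AND gate multiplies and a 2-to-1 multiplexer computes a scaled sum. For the bipolar [-1, 1] range mentioned in the abstract, multiplication is commonly done with an XNOR gate instead. All function names here (`to_stream`, `from_stream`, `sc_multiply`, `sc_scaled_add`) are hypothetical and chosen for illustration only.

```python
import random

def to_stream(p, length, rng):
    """Unipolar encoding (assumed): each bit is 1 with probability p, p in [0, 1]."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def from_stream(bits):
    """Decode a stream by counting ones and dividing by the stream length."""
    return sum(bits) / len(bits)

def sc_multiply(a_bits, b_bits):
    """Bitwise AND of two independent unipolar streams estimates a * b."""
    return [a & b for a, b in zip(a_bits, b_bits)]

def sc_scaled_add(a_bits, b_bits, select_bits):
    """2-to-1 multiplexer: with a p = 0.5 select stream it estimates (a + b) / 2."""
    return [a if s else b for a, b, s in zip(a_bits, b_bits, select_bits)]

rng = random.Random(0)           # fixed seed so the run is repeatable
n = 4096                         # stream length; longer streams reduce the error
a = to_stream(0.5, n, rng)
b = to_stream(0.25, n, rng)
sel = to_stream(0.5, n, rng)

product = from_stream(sc_multiply(a, b))            # approximates 0.5 * 0.25
scaled_sum = from_stream(sc_scaled_add(a, b, sel))  # approximates (0.5 + 0.25) / 2
print(product, scaled_sum)
```

The results are estimates rather than exact values: the error shrinks as the stream length grows, and the multiplexer output is scaled by 1/2, so the sum stays within the representable range.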

SoCFlow: Efficient and Scalable DNN Training on SoC-Clustered Edge Servers

DaDianNao: A Machine-Learning Supercomputer

Efficient Partitioning and Communication Scheme-Based Distributed Edge Computing to Accelerate Deep Neural Network

SoC-Cluster As an Edge Server: an Application-driven Measurement Study

More is Different: Prototyping and Analyzing a New Form of Edge Server with Massive Mobile SoCs

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

swFLOW: A large-scale distributed framework for deep learning on Sunway TaihuLight supercomputer

Pflow: An end-to-end heterogeneous acceleration framework for CNN inference on FPGAs

Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity

FPDeep: Scalable Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters

A High Energy-Efficiency Multi-core Neuromorphic Architecture for Deep SNN Training

CoDL: efficient CPU-GPU co-execution for deep learning inference on mobile devices

DeepThings: Distributed Adaptive Deep Learning Inference on Resource-Constrained IoT Edge Clusters

HierTrain: Fast Hierarchical Edge AI Learning with Hybrid Parallelism in Mobile-Edge-Cloud Computing

EdgeSP: Scalable Multi-device Parallel DNN Inference on Heterogeneous Edge Clusters

EdgeCI: Distributed Workload Assignment and Model Partitioning for CNN Inference on Edge Clusters

Collaborative edge computing for distributed CNN inference acceleration using receptive field-based segmentation

SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using Stochastic Computing

Distributed Convolutional Neural Network Training on Mobile and Edge Clusters

ElasticFlow: an Elastic Serverless Training Platform for Distributed Deep Learning

Energy-Efficient GPU Clusters Scheduling for Deep Learning