Abstract: Deep neural networks (DNNs) have achieved significant results in a wide variety of domains. Several hardware platforms provide efficient solutions for deep learning tasks, including graphics processing units (GPUs), central processing units (CPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). For DNN inference workloads, CPUs nonetheless outperform other solutions, including GPUs, in many cases, supported by techniques such as the high-performance libraries that serve as the basic building blocks of DNNs. CPUs have therefore been a preferred choice for DNN inference applications, particularly in scenarios that demand low latency. However, DNN inference efficiency remains a critical issue, especially when low latency is required with limited hardware resources, as in embedded systems. At the same time, hardware features have not been fully exploited for DNNs, and there is much room for improvement. To this end, this paper conducts a series of experiments to thoroughly study the inference workload of prominent state-of-the-art DNN architectures on a single-instruction-multiple-data (SIMD) CPU platform, with findings that are widely applicable to multiple hardware platforms. The study examines DNNs in depth in three respects: the CPU kernel-instruction-level performance characteristics of DNNs, including branches, branch prediction misses, cache misses, etc., and the underlying convolutional computing mechanism at the SIMD level; the layer-wise time consumption in detail, with potential time-cost bottlenecks; and the exhaustive dynamic activation sparsity, with exact details on the redundancy of DNNs. The research provides researchers with comprehensive and insightful details, as well as crucial target areas for optimising and improving the efficiency of DNNs at both the hardware and software levels.
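
As a rough illustration of what "dynamic activation sparsity" refers to, the sketch below computes the fraction of zero-valued outputs after a ReLU activation for one input. It is a minimal example under stated assumptions, not the paper's measurement method: the function names `relu` and `activation_sparsity` and the sample values are hypothetical, and a real study would collect this statistic per layer over many inputs.

```python
from typing import List


def relu(values: List[float]) -> List[float]:
    """Apply ReLU element-wise: non-positive inputs become 0.0."""
    return [v if v > 0.0 else 0.0 for v in values]


def activation_sparsity(activations: List[float]) -> float:
    """Return the fraction of activations that are exactly zero."""
    if not activations:
        raise ValueError("activations must be non-empty")
    zeros = sum(1 for a in activations if a == 0.0)
    return zeros / len(activations)


# Hypothetical pre-activation values for a single layer and a single input.
pre_activations = [-1.5, 0.3, -0.2, 2.0, 0.0, -4.1, 0.7, -0.9]
post = relu(pre_activations)
sparsity = activation_sparsity(post)
print(f"sparsity = {sparsity:.3f}")  # 5 of 8 outputs are zero -> 0.625
```

The sparsity is called dynamic because it depends on the input: a different input yields a different set of zeroed activations, so the value has to be measured at inference time and cannot be read off the weights alone.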
