Abstract: Deep neural networks (DNNs) have achieved significant results in a wide variety of domains. For deep learning tasks, several hardware platforms provide efficient solutions, including graphics processing units (GPUs), central processing units (CPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). Nonetheless, in many cases CPUs outperform other solutions, including GPUs, for DNN inference workloads, supported by techniques such as the high-performance libraries that serve as the basic building blocks of DNNs. CPUs have therefore become a preferred choice for DNN inference applications, particularly in scenarios that demand low latency. However, DNN inference efficiency remains a critical issue, especially when low latency is required with limited hardware resources, as in embedded systems. At the same time, hardware features have not been fully exploited for DNNs, and there is much room for improvement. To this end, this paper conducts a series of experiments to study thoroughly the inference workload of prominent state-of-the-art DNN architectures on a single-instruction-multiple-data (SIMD) CPU platform, with findings that apply broadly to multiple hardware platforms. The study examines DNNs in depth along three lines: (1) the CPU kernel-instruction-level performance characteristics of DNNs, including branches, branch prediction misses, cache misses, etc., together with the underlying convolutional computing mechanism at the SIMD level; (2) detailed layer-wise time consumption, with potential time-cost bottlenecks; and (3) exhaustive dynamic activation sparsity, with exact details on the redundancy of DNNs. The research provides researchers with comprehensive and insightful details, as well as crucial target areas for optimising and improving the efficiency of DNNs at both the hardware and software levels.
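As a rough illustration of what "dynamic activation sparsity" refers to, the sketch below computes the fraction of zero-valued outputs after a ReLU for one input. It is a minimal example under stated assumptions, not the paper's measurement method: the function names `relu` and `activation_sparsity` and the sample values are hypothetical, and the abstract does not specify which activation function or layers were measured.

```python
# Minimal sketch: sparsity of one layer's activations for one input.
# All names and values here are illustrative assumptions, not from the paper.

def relu(values):
    """Apply ReLU element-wise: negative values become 0.0."""
    return [max(0.0, v) for v in values]

def activation_sparsity(activations):
    """Return the fraction of activations that are exactly zero."""
    if not activations:
        raise ValueError("activations must be non-empty")
    zeros = sum(1 for a in activations if a == 0.0)
    return zeros / len(activations)

# Hypothetical pre-activation values for a single input sample.
pre_activations = [-1.5, 0.3, -0.2, 2.0, 0.0, -4.0, 1.1, -0.7]
post = relu(pre_activations)
sparsity = activation_sparsity(post)
print(f"zero activations: {sparsity:.1%}")
```

The sparsity is "dynamic" in the sense that it depends on the input: a different input sample generally produces a different set of zeros, so it would have to be measured at inference time rather than read off the trained weights.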

An Efficient Hardware Architecture for Activation Function in Deep Learning Processor

Design Space Exploration of Neural Network Activation Function Circuits

A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network

Exploring the Programmability for Deep Learning Processors: from Architecture to Tensorization

Leveraging Bit-Serial Architectures for Hardware-Oriented Deep Learning Accelerators with Column-Buffering Dataflow

Research on Convolutional Neural Network Inference Acceleration and Performance Optimization for Edge Intelligence

Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework

Efficient Hardware Optimization Strategies For Deep Neural Networks Acceleration Chip

HAO: Hardware-aware neural Architecture Optimization for Efficient Inference

Implementation and Optimization of the Accelerator Based on FPGA Hardware for LSTM Network

An architecture-level analysis on deep learning models for low-impact computations

Hardware-Efficient Template-Based Deep CNNs Accelerator Design

Hardware Accelerated Optimization of Deep Learning Model on Artificial Intelligence Chip

Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology

An Energy-Efficient Deep Belief Network Processor Based on Heterogeneous Multi-Core Architecture With Transposable Memory and On-Chip Learning

Accelerating DNN Inference with Heterogeneous Multi-DPU Engines

High-Performance Method and Architecture for Attention Computation in DNN Inference

A Compact and Configurable Long Short-Term Memory Neural Network Hardware Architecture

A Programmable Artificial Neural Network Coprocessor for Handwritten Digit Recognition

WPU: A FPGA-based Scalable, Efficient and Software/Hardware Co-design Deep Neural Network Inference Acceleration Processor

OCEAN: an On-Chip Incremental-Learning Enhanced Processor with Gated Recurrent Neural Network Accelerators