Abstract: Deep neural networks (DNNs) have made significant achievements in a wide variety of domains. For deep learning tasks, multiple hardware platforms provide efficient solutions, including graphics processing units (GPUs), central processing units (CPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). Nonetheless, for DNN inference workloads, CPUs outperform other solutions, including GPUs, in many cases, supported by techniques such as high-performance libraries that serve as the basic building blocks of DNNs. Thus, CPUs have been a preferred choice for DNN inference applications, particularly in scenarios that demand low latency. However, DNN inference efficiency remains a critical issue, especially when low latency is required with limited hardware resources, as in embedded systems. At the same time, hardware features have not been fully exploited for DNNs, and there is much room for improvement. To this end, this paper conducts a series of experiments to study thoroughly the inference workload of prominent state-of-the-art DNN architectures on a single-instruction-multiple-data (SIMD) CPU platform, with a scope that is widely applicable to multiple hardware platforms. The study examines DNNs in depth: the CPU kernel-instruction-level performance characteristics of DNNs, including branches, branch prediction misses, cache misses, etc., and the underlying convolutional computing mechanism at the SIMD level; the layer-wise time consumption details, with potential time-cost bottlenecks; and the exhaustive dynamic activation sparsity, with exact details on the redundancy of DNNs. The research provides researchers with comprehensive and insightful details, as well as crucial target areas for optimising and improving the efficiency of DNNs at both the hardware and software levels.

A Comprehensive Analysis of Low-Impact Computations in Deep Learning Workloads

An architecture-level analysis on deep learning models for low-impact computations

Survey on Energy-Efficient Deep Neural Networks for Computer Vision

DACO: Pursuing Ultra-low Power Consumption Via DNN-Adaptive CPU-GPU CO-optimization on Mobile Devices

Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework

Energy-efficient Deployment of Deep Learning Applications on Cortex-M based Microcontrollers using Deep Compression

Model Compression for Deep Neural Networks: A Survey

To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference

Mobile or FPGA? A Comprehensive Evaluation on Energy Efficiency and a Unified Optimization Framework

Low Rank Optimization for Efficient Deep Learning: Making A Balance between Compact Architecture and Fast Training

Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications

EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices

Enabling High Performance Deep Learning Networks on Embedded Systems

Efficient Hardware Optimization Strategies For Deep Neural Networks Acceleration Chip

Minimizing Area and Energy of Deep Learning Hardware Design Using Collective Low Precision and Structured Compression

Exploiting Neural-Network Statistics for Low-Power DNN Inference

Optimization of deep learning models: benchmark and analysis

Toward Efficient Execution of Mainstream Deep Learning Frameworks on Mobile Devices: Architectural Implications

Performance Analysis of DNN Inference/Training with Convolution and non-Convolution Operations

FastDeepIoT: Towards Understanding and Optimizing Neural Network Execution Time on Mobile and Embedded Devices

Progressive DNN Compression: A Key to Achieve Ultra-High Weight Pruning and Quantization Rates using ADMM