Abstract: Deep neural networks (DNNs) have achieved significant results across a wide variety of domains. Several hardware platforms provide efficient solutions for deep learning tasks, including graphics processing units (GPUs), central processing units (CPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). For DNN inference workloads, CPUs outperform other solutions, including GPUs, in many cases, supported by techniques such as high-performance libraries that serve as the basic building blocks of DNNs. CPUs have therefore been a preferred choice for DNN inference applications, particularly in scenarios that demand low latency. However, DNN inference efficiency remains a critical issue, especially when low latency is required with limited hardware resources, as in embedded systems. At the same time, hardware features have not been fully exploited for DNNs, and there is much room for improvement. To this end, this paper conducts a series of experiments to study in depth the inference workload of prominent state-of-the-art DNN architectures on a single-instruction-multiple-data (SIMD) CPU platform, with a scope that is widely applicable to multiple hardware platforms. The study covers three aspects: (1) the CPU kernel-instruction-level performance characteristics of DNNs, including branches, branch prediction misses, cache misses, etc., together with the underlying convolutional computing mechanism at the SIMD level; (2) the layer-wise time consumption, with potential time-cost bottlenecks; and (3) the dynamic activation sparsity, with exact details on the redundancy of DNNs. The research provides researchers with comprehensive and insightful details, as well as crucial target areas for optimising and improving the efficiency of DNNs at both the hardware and software levels.
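As a rough illustration of what "dynamic activation sparsity" refers to, the sketch below measures the fraction of exactly-zero activations after each ReLU layer for one specific input. It is a minimal, pure-Python toy, not the paper's measurement method: the helper names (`relu`, `dense_layer`, `activation_sparsity`, `layerwise_sparsity`) and the tiny fully connected network are assumptions introduced here only for illustration.

```python
def relu(values):
    """Element-wise ReLU: negative values become 0.0."""
    return [max(0.0, v) for v in values]


def dense_layer(inputs, weights, biases):
    """Toy fully connected layer: one output per weight row."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def activation_sparsity(activations):
    """Fraction of activations that are exactly zero (0.0 for an empty list)."""
    if not activations:
        return 0.0
    zeros = sum(1 for a in activations if a == 0.0)
    return zeros / len(activations)


def layerwise_sparsity(inputs, layers):
    """Run one input through (weights, biases) layers, each followed by ReLU,
    and record the post-ReLU sparsity of every layer.

    The result depends on the input, which is why this kind of sparsity
    is called dynamic.
    """
    report = []
    x = inputs
    for weights, biases in layers:
        x = relu(dense_layer(x, weights, biases))
        report.append(activation_sparsity(x))
    return report


if __name__ == "__main__":
    # Hypothetical two-layer network with hand-picked weights.
    layers = [
        ([[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0]),
        ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),
    ]
    print(layerwise_sparsity([2.0, 5.0], layers))
```

Zero activations contribute nothing to the multiply-accumulate operations of the following layer, so a high per-layer sparsity hints at computation that could in principle be skipped.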

Analysis of Performance and Optimization in MindSpore on Ascend NPUs

Performance Evaluation of MindSpore and PyTorch Based on Ascend NPU

MOC: Multi-Objective Mobile CPU-GPU Co-Optimization for Power-Efficient DNN Inference

Multi-core Chip Dynamic Power Management Framework Based on Reinforcement Learning

Performance Comparison between Pytorch and Mindspore

Machine Learning-enabled Performance Model for DNN Applications and AI Accelerator

MindSpore Quantum: A User-Friendly, High-Performance, and AI-Compatible Quantum Computing Framework

An architecture-level analysis on deep learning models for low-impact computations

Fast Sparse Deep Neural Network Inference with Flexible SpMM Optimization Space Exploration

Core Placement Optimization of Many-core Brain-Inspired Near-Storage Systems for Spiking Neural Network Training

AIbench: a Tool for Benchmarking Huawei Ascend AI Processors

EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices

Efficient Hardware Optimization Strategies For Deep Neural Networks Acceleration Chip

VPU-EM: An Event-based Modeling Framework to Evaluate NPU Performance and Power Efficiency at Scale

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs

P/D-Serve: Serving Disaggregated Large Language Model at Scale

A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability

Benchmarking Keyword Spotting Efficiency on Neuromorphic Hardware

A Heterogeneous Full-stack AI Platform for Performance Monitoring and Hardware-specific Optimizations

A Survey on the Optimization of Neural Network Accelerators for Micro-AI On-Device Inference