Efficient Deep Learning Inference Based on Model Compression.

Qing Zhang, Mengru Zhang, Mengdi Wang, Wanchen Sui, Chen Meng, Jun Yang, Weidan Kong, Xiaoyuan Cui, Wei Lin
DOI: https://doi.org/10.1109/CVPRW.2018.00221
2018-01-01
Abstract: Deep neural networks (DNNs) have evolved remarkably over the last decade and achieved great success in many machine learning tasks. As deep learning (DL) methods have evolved, the computational complexity and resource consumption of DL models have continued to increase. This makes efficient deployment challenging, especially on devices with limited memory or in applications with strict latency requirements. In this paper, we introduce a DL inference optimization pipeline that consists of a series of model compression methods: Tensor Decomposition (TD), Graph Adaptive Pruning (GAP), Intrinsic Sparse Structures (ISS) in Long Short-Term Memory (LSTM), Knowledge Distillation (KD), and low-bit model quantization. We test the pipeline with these methods in different modeling scenarios, and the results are promising: inference becomes more efficient with only a marginal loss of model accuracy.
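To make the last method in that list concrete, here is a minimal sketch of low-bit quantization in plain Python. It is not the paper's implementation: the abstract does not give the bit width or the scheme, so the symmetric per-tensor int8 mapping and the names `quantize_int8` and `dequantize` are assumptions chosen only for illustration.

```python
# Illustrative sketch only: symmetric per-tensor int8 quantization.
# The bit width (8) and the scheme are assumptions, not taken from the paper.

def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Guard against an all-zero (or empty) tensor to avoid division by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest code bounds the per-weight error by scale / 2.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as an 8-bit code instead of a 32-bit float, at the cost of a rounding error of at most half a quantization step. That is the general efficiency-for-accuracy trade the abstract describes.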