Abstract: Many long short-term memory (LSTM) applications need fast yet compact models. Neural network compression approaches, such as the grow-and-prune paradigm, have proved promising for cutting down network complexity by skipping insignificant weights. However, current compression strategies are mostly hardware-agnostic, and network complexity reduction does not always translate into execution efficiency. In this work, we propose a hardware-guided symbiotic training methodology for compact, accurate, yet execution-efficient inference models. It is based on our observation that hardware may introduce substantial non-monotonic behavior, which we call the latency hysteresis effect, when evaluating network size vs. inference latency. This observation raises questions about the mainstream smaller-dimension-is-better compression strategy, which often leads to a sub-optimal model architecture. By leveraging the hardware-impacted hysteresis effect and sparsity, we are able to achieve the symbiosis of model compactness and accuracy with execution efficiency, thus reducing LSTM latency while increasing its accuracy. We have evaluated our algorithms on language modeling and speech recognition applications. Relative to the traditional stacked LSTM architecture obtained for the Penn Treebank dataset, we reduce the number of parameters by 18.0x (30.5x) and measured run-time latency by up to 2.4x (5.2x) on Nvidia GPUs (Intel Xeon CPUs) without any accuracy degradation. For the DeepSpeech2 architecture obtained for the AN4 dataset, we reduce the number of parameters by 7.0x (19.4x), word error rate from 12.9% to 9.9% (10.4%), and measured run-time latency by up to 1.7x (2.4x) on Nvidia GPUs (Intel Xeon CPUs). Thus, our method yields compact, accurate, yet execution-efficient inference models.
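
The non-monotonic size-vs-latency behavior described in the abstract can be illustrated with a minimal sketch. It assumes a table of measured latencies per hidden dimension is available; the function name `undominated_sizes` and all numbers below are hypothetical and are not taken from the paper. The sketch keeps only those dimensions for which no larger dimension runs at equal or lower latency, which is one simple way to see why a smaller dimension is not always the better choice.

```python
from typing import Dict, List


def undominated_sizes(latency_ms: Dict[int, float]) -> List[int]:
    """Return hidden sizes for which no larger size has equal or lower latency."""
    keep = []
    best_larger = float("inf")  # lowest latency among strictly larger sizes
    # Walk from the largest size down so each size is compared against
    # the best latency achieved by any larger size.
    for size in sorted(latency_ms, reverse=True):
        if latency_ms[size] < best_larger:
            keep.append(size)
            best_larger = latency_ms[size]
    return sorted(keep)


# Hypothetical measurements (illustrative values only, not from the paper).
measured = {256: 1.00, 320: 1.40, 384: 1.35, 448: 1.90, 512: 1.35, 576: 2.10}
print(undominated_sizes(measured))  # prints [256, 512, 576]
```

In this made-up table, sizes 320, 384, and 448 are dropped because size 512 is at least as fast, so a compression target of 512 would offer more capacity at no latency cost. The actual selection procedure in the paper may differ.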

Hardware-Guided Symbiotic Training for Compact, Accurate, yet Execution-Efficient LSTM
