LauWS: Local Adaptive Unstructured Weight Sparsity of Load Balance for DNN in Near-Data Processing

Zixu Li, Wang, Xin Zhong, Manni Li, Jiayu Yang, Yinyin Lin, Guhyun Kim, Yosub Song, Chengchen Wang, Xiankui Xiong
DOI: https://doi.org/10.1109/iscas58744.2024.10558554
2024-01-01
Abstract: The memory wall has become the dominant bottleneck of future systems, owing to the explosive parameter growth and low computing density of large language models (LLMs). Near-data processing (NDP) can reduce data traffic and energy consumption, but the storage demand of LLMs remains enormous. Weight sparsity helps reduce the volume of stored data. Unstructured sparsity sacrifices less accuracy than structured sparsity, but the random distribution of non-zero values leads to load imbalance among the parallel processing units in NDP. Here we propose LauWS, which can be seamlessly combined with various prior sparsity methods. LauWS follows the local characteristics of the feature distribution in the weight matrix of each model, preserving even tiny features and discarding as many non-feature values as possible, region by region. This is the key to LauWS achieving a trade-off between a high prune ratio (PR) and low accuracy loss (AL). Evaluations are carried out on a GDDR6-based bank-NDP system. Typical improvements over the unpruned baseline include a 38% speedup at 0.8 PR with no AL for MLP, a 22.7% speedup at 0.5 PR with no AL for GPT-2, and a 23.6% speedup at 0.5 PR with the lowest perplexity for OPT-125m.
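The load-imbalance problem described in the abstract can be illustrated with a small sketch. It is not the LauWS algorithm, whose details the abstract does not give. It only contrasts plain global magnitude pruning with a simplified region-by-region variant that applies the same prune ratio inside each region. The bank layout (two rows per bank), the demo matrix, and all function names are assumptions made for illustration.

```python
def build_demo_matrix():
    """Hypothetical 8x8 weight matrix: 4 banks of 2 rows each.
    Banks 0 and 2 hold large weights, banks 1 and 3 hold small ones."""
    scales = [1.0, 0.01, 1.0, 0.01]
    rows = []
    for bank, scale in enumerate(scales):
        for r in range(2):
            row = []
            for c in range(8):
                k = r * 8 + c
                # Small per-bank offset keeps all magnitudes distinct (no ties).
                magnitude = scale * (k + 1) + 0.001 * bank
                row.append(magnitude if k % 2 == 0 else -magnitude)
            rows.append(row)
    return rows


def global_prune(weights, prune_ratio):
    """Zero the smallest-magnitude fraction of weights across the whole matrix."""
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * prune_ratio)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = flat[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def regional_prune(weights, prune_ratio, rows_per_region):
    """Apply magnitude pruning separately inside each block of rows.
    Assumption: a fixed ratio per region, a simplification of 'local' pruning."""
    pruned = []
    for start in range(0, len(weights), rows_per_region):
        region = weights[start:start + rows_per_region]
        pruned.extend(global_prune(region, prune_ratio))
    return pruned


def nonzeros_per_bank(weights, rows_per_bank):
    """Count the non-zero weights each bank would have to process."""
    return [
        sum(1 for row in weights[s:s + rows_per_bank] for w in row if w != 0.0)
        for s in range(0, len(weights), rows_per_bank)
    ]


def imbalance(counts):
    """Ratio of the busiest bank's load to the mean load (1.0 = balanced)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 1.0


W = build_demo_matrix()
global_counts = nonzeros_per_bank(global_prune(W, 0.5), 2)
regional_counts = nonzeros_per_bank(regional_prune(W, 0.5, 2), 2)
print(global_counts, imbalance(global_counts))      # [16, 0, 16, 0] 2.0
print(regional_counts, imbalance(regional_counts))  # [8, 8, 8, 8] 1.0
```

In this toy case both schemes reach 0.5 PR, but global pruning leaves two banks idle while the other two carry the full load, whereas per-region pruning gives every bank the same number of non-zero weights. How LauWS adapts the pruning to the local feature distribution is beyond what this sketch shows.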