DP-FFN: Block-Based Dynamic Pooling for Accelerating Feed-Forward Layers in Transformers

Jie Tang, Shuai Wang, Song Chen, Yi Kang
DOI: https://doi.org/10.1109/iscas58744.2024.10558119
2024-01-01
Abstract: Feed-forward networks (FFNs) constitute two-thirds of the parameters in a Transformer model and account for over 60% of its computational cost. Recent work has aimed to compress FFNs to reduce computational and memory overhead during inference. Proposed methods include evaluating tokens to apply mixed precision to the FFN, and evaluating input vectors together with the FFN to compress its parameters. These approaches often require real-time evaluation, and sometimes a specialized hardware architecture for mixed precision. Evaluating inputs against all FFN parameters can also add significant overhead in practical applications. Inspired by the observation that FFN activations are sparse, we introduce DP-FFN, a method that splits an FFN into several functional partitions and computes the FFN on the basis of these partitions. DP-FFN is a two-stage approach: the first stage constructs functional partitions by grouping frequently activated neurons, and the second performs fine-grained computation using only the activated functional partitions to maintain model performance. Experimental results show that DP-FFN achieves a 1.71× speedup over a baseline with about 2% accuracy loss while using only 20% of the FFN parameters. Compared to a state-of-the-art reference, it achieves a 1.4× speedup with almost the same accuracy and the same number of FFN parameters.
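The two-stage idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a ReLU FFN, and it assumes a simple grouping heuristic (rank neurons by activation frequency on calibration data, then cut the ranking into equal blocks) that the abstract does not specify. The abstract also does not say how activated partitions are chosen at run time, so the sketch takes them as an argument (`active_ids`). All function and variable names are hypothetical.

```python
import numpy as np

def build_partitions(W1, b1, calib_x, num_partitions):
    """Stage 1 (assumed heuristic, not from the paper): rank hidden neurons by
    how often ReLU fires on calibration inputs, then split into equal blocks."""
    acts = np.maximum(calib_x @ W1 + b1, 0.0)      # (n_samples, d_ff)
    freq = (acts > 0).mean(axis=0)                 # firing frequency per neuron
    order = np.argsort(-freq, kind="stable")       # most frequently active first
    return np.array_split(order, num_partitions)   # list of neuron-index arrays

def partitioned_ffn(x, W1, b1, W2, b2, partitions, active_ids):
    """Stage 2: evaluate the FFN using only neurons in the selected partitions.
    How active_ids is chosen is left to the caller (an assumption of this sketch)."""
    idx = np.concatenate([partitions[i] for i in active_ids])
    h = np.maximum(x @ W1[:, idx] + b1[idx], 0.0)  # only selected hidden units
    return h @ W2[idx, :] + b2

# Toy dimensions, random weights: purely illustrative data.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 40
W1 = rng.standard_normal((d_model, d_ff))
b1 = rng.standard_normal(d_ff)
W2 = rng.standard_normal((d_ff, d_model))
b2 = rng.standard_normal(d_model)
calib_x = rng.standard_normal((256, d_model))
x = rng.standard_normal((4, d_model))

parts = build_partitions(W1, b1, calib_x, num_partitions=5)
dense = np.maximum(x @ W1 + b1, 0.0) @ W2 + b2
full = partitioned_ffn(x, W1, b1, W2, b2, parts, active_ids=range(5))
sparse = partitioned_ffn(x, W1, b1, W2, b2, parts, active_ids=[0])
```

With all five partitions active, `full` matches the dense FFN up to floating-point rounding; with one of five active, `sparse` touches 8 of the 40 hidden neurons, that is 20% of the FFN weights in this toy setup. The sketch says nothing about the accuracy or speedup figures reported in the abstract.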