PP-Transformer: Enable Efficient Deployment of Transformers Through Pattern Pruning

Jialin Cao, Xuanda Lin, Manting Zhang, Kejia Shi, Jun Yu, Kun Wang
DOI: https://doi.org/10.1109/iccad57390.2023.10323836
2023-01-01
Abstract: Transformer models have been widely adopted in Natural Language Processing (NLP) and Computer Vision (CV). However, their excellent performance comes at the cost of heavy memory footprints and enormous computational complexity. To deploy Transformers on resource-constrained platforms such as FPGAs, diverse weight pruning strategies have been proposed. However, pattern pruning, an alternative pruning method, has not been well explored for Transformers. In this paper, we propose PP-Transformer, a framework designed specifically to deploy Transformer models efficiently on FPGAs using pattern pruning. At the algorithm level, we leverage pattern pruning, a coarse-grained structured pruning strategy, to reduce parameter storage. At the hardware level, we develop a dedicated architecture featuring a custom computing engine tailored to the pattern pruning algorithm. Experimental results demonstrate that our algorithm achieves up to $2.26\times$ reduction in parameter storage with acceptable accuracy degradation. Additionally, our hardware implementation achieves $839.72\times$ and $5.72\times$ speedups over CPU and GPU implementations, respectively.
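To make "pattern pruning" concrete, here is a minimal sketch in plain Python. It is not the PP-Transformer method. The block shape (2x2), the number of weights kept per block, the pattern library (every 2-of-4 mask), the selection rule (keep the pattern that preserves the most squared magnitude), and the storage model (one pattern index plus the kept values per block, 16-bit values) are all illustrative assumptions. The names `all_patterns`, `pattern_prune`, and `storage_bits` are hypothetical. The sketch does not reproduce the paper's $2.26\times$ figure.

```python
from itertools import combinations
from typing import List, Tuple

Matrix = List[List[float]]
Pattern = Tuple[Tuple[int, int], ...]  # (row, col) offsets kept inside one block


def all_patterns(bh: int, bw: int, keep: int) -> List[Pattern]:
    """Hypothetical pattern library: every way to keep `keep` of the bh*bw block positions."""
    cells = [(r, c) for r in range(bh) for c in range(bw)]
    return list(combinations(cells, keep))


def pattern_prune(w: Matrix, bh: int, bw: int,
                  library: List[Pattern]) -> Tuple[Matrix, List[int]]:
    """Prune each (bh x bw) block of `w` with the library pattern that keeps the most energy.

    Returns the pruned matrix and the chosen pattern index for each block (row-major order).
    """
    rows, cols = len(w), len(w[0])
    if rows % bh or cols % bw:
        raise ValueError("matrix shape must be divisible by the block shape")
    pruned = [[0.0] * cols for _ in range(rows)]
    choices: List[int] = []
    for br in range(0, rows, bh):
        for bc in range(0, cols, bw):
            # Score each candidate pattern by the sum of squared weights it would preserve.
            scores = [sum(w[br + r][bc + c] ** 2 for r, c in p) for p in library]
            best = max(range(len(library)), key=scores.__getitem__)
            choices.append(best)
            # Copy only the weights selected by the winning pattern; the rest stay zero.
            for r, c in library[best]:
                pruned[br + r][bc + c] = w[br + r][bc + c]
    return pruned, choices


def storage_bits(rows: int, cols: int, bh: int, bw: int, keep: int,
                 n_patterns: int, value_bits: int = 16) -> Tuple[int, int]:
    """Dense vs. pattern-pruned storage, assuming each block stores a pattern index plus its kept values."""
    dense = rows * cols * value_bits
    index_bits = (n_patterns - 1).bit_length()  # bits needed to address n_patterns patterns
    n_blocks = (rows // bh) * (cols // bw)
    return dense, n_blocks * (keep * value_bits + index_bits)


w = [
    [0.9, -0.1, 0.0, 0.5],
    [0.2, -0.8, 0.7, 0.1],
    [0.05, 0.3, -0.6, 0.0],
    [0.4, 0.02, 0.2, 0.9],
]
lib = all_patterns(2, 2, keep=2)            # 6 patterns, 50% sparsity in every block
pruned, choices = pattern_prune(w, 2, 2, lib)
dense, sparse = storage_bits(4, 4, 2, 2, keep=2, n_patterns=len(lib))
print(choices)                              # [2, 3, 3, 2]
print(dense, sparse, round(dense / sparse, 2))  # 256 140 1.83
```

Because every block keeps the same number of weights and is described by a small index, the compressed layout is regular. That regularity is the general reason pattern pruning is described as coarse-grained and hardware-friendly compared with unstructured sparsity. A real pattern library would typically be a small curated subset rather than every combination, since the number of combinations grows quickly with block size.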
What problem does this paper attempt to address?