Accelerating NLP Tasks on FPGA with Compressed BERT and a Hardware-Oriented Early Exit Method.

Binjing Li, Siyuan Lu, Keli Xie, Zhongfeng Wang
DOI: https://doi.org/10.1109/isvlsi54635.2022.00092
2022-01-01
Abstract: In recent years, Natural Language Processing (NLP) has become a hot research topic, and Transformer-based pretrained models (of which BERT is the most widely used) have achieved state-of-the-art results in many NLP tasks. However, because Transformer-based pretrained models contain a very large number of parameters and are computationally expensive, it is difficult to deploy them on resource-limited embedded platforms or mobile devices. To resolve this issue, we use the ALBERT model with an improved early exit method (ELBERT) and propose an efficient VLSI architecture for it through algorithm and hardware co-design. First, by using quantization and encoder-level parameter sharing, the storage space required for BERT is reduced from 1208.88 MB to 20.99 MB with little accuracy loss, which makes it possible to store all the weights on-chip. Second, we present a hardware-friendly design for the improved early exit method. Compared with the original ALBERT implementation, the accelerator combining the ALBERT model with improved early exit obtains a speed-up of 4.59x. Third, owing to our efficient hardware design, experiments demonstrate that our FPGA accelerator achieves a performance-per-watt of 1.29 fps/W, which is 25.97x that of an NVIDIA 2080 Ti GPU. Overall, we introduce an efficient FPGA accelerator with compressed BERT and a hardware-oriented early exit method, which addresses the challenge of deploying BERT to resource-constrained edge platforms with strict latency and memory requirements.
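To make the combination of encoder-level parameter sharing and early exit more concrete, the following is a minimal software sketch, not the paper's method or hardware design. It assumes that a single shared encoder layer is applied repeatedly and that inference stops once the largest softmax probability reaches a threshold; the abstract does not state the actual exit criterion used by ELBERT, and the names `early_exit_inference`, `shared_layer`, `classifier`, `num_layers`, and `threshold` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_inference(hidden, shared_layer, classifier,
                         num_layers=12, threshold=0.9):
    """Apply one shared layer up to num_layers times, exiting when confident.

    Assumption: the exit test compares the top softmax probability with
    `threshold`. This is an illustrative criterion, not the one from the paper.
    Returns (predicted_label, layers_executed).
    """
    for depth in range(1, num_layers + 1):
        # The same layer (same weights) is reused at every depth,
        # mirroring encoder-level parameter sharing.
        hidden = shared_layer(hidden)
        probs = softmax(classifier(hidden))
        if max(probs) >= threshold:
            break  # confident enough: skip the remaining layers
    return probs.index(max(probs)), depth


# Toy stand-ins for the encoder layer and the classifier head (hypothetical).
def toy_layer(h):
    return [2.0 * x for x in h]


def toy_classifier(h):
    return list(h)


label, layers_used = early_exit_inference([0.5, 0.0], toy_layer, toy_classifier)
```

In this toy setting, easy inputs leave after a few applications of the shared layer while hard inputs run the full depth, which is the source of the latency savings that an early exit scheme aims for.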