Exploring Approximation and Dataflow Co-Optimization for Scalable Transformer Inference Architecture on the Edge

Liu He, Yujin Wang, Zongle Huang, Shupei Fan, Chen Tang, Shuyuan Zhang, Luchang Lei, Huazhong Yang, Yongpan Liu, Hongyang Jia
DOI: https://doi.org/10.1109/socc62300.2024.10737793
2024-01-01
Abstract: Transformer-based neural networks (NNs) prevail in today’s artificial intelligence applications, including autonomous driving, natural language processing and generative modeling, showing superior accuracy and generalization over traditional deep-learning models. However, the quadratically scaling computation and complex dataflow of self-attention pose challenges to the efficient deployment of Transformer-based NNs on edge and edge-server devices, where the latency of single-batch inference is a critical concern. The lack of data parallelism necessitates exploring more dimensions of tensor parallelism, specifically sequence parallelism in Transformer inference, to achieve strong scaling in domain-specific accelerator (DSA) design; this is non-trivial because of the temporal dependency of max-finding in softmax operators. This work formulates these challenges as an on-chip buffering problem and then puts forward a hardware-software co-design approach that exploits a max-finding-free approximation for softmax operators, which removes the blocking of the inference pipeline and thus alleviates on-chip buffering pressure. An example architecture design shows up to $2.83\times$ and $28.02\times$ speedup over the baseline DSA designs, respectively, with negligible algorithmic performance loss.
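The max-finding dependency mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which approximation the paper uses, so the fixed `offset` below is purely an assumption made for illustration, and the function names are hypothetical. The first function follows the usual numerically stable softmax, which must see every score in a row before it can exponentiate any of them. The second replaces the row maximum with a constant chosen ahead of time, so each chunk of scores can be consumed as it arrives and only two running sums need to be kept.

```python
import math

def attention_row_safe(scores, values):
    """One attention output with the usual stable softmax.

    The row maximum must be known before any exponent is computed,
    so the whole row of scores has to be buffered first.
    """
    m = max(scores)                              # blocking pass over the full row
    weights = [math.exp(s - m) for s in scores]  # second pass
    denom = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / denom

def attention_row_streaming(chunks, offset):
    """Max-finding-free sketch (an assumption, not the paper's method).

    `offset` is a hypothetical constant picked in advance, for example
    from calibration data. Scores far above it can overflow exp(), and
    scores far below it can underflow, so it has to be chosen with care.
    """
    num = 0.0  # running sum of exp(score - offset) * value
    den = 0.0  # running sum of exp(score - offset)
    for scores, values in chunks:                # chunks arrive one at a time
        for s, v in zip(scores, values):
            w = math.exp(s - offset)
            num += w * v
            den += w
    return num / den

scores = [1.0, 2.0, 3.0, 0.5]
values = [10.0, 20.0, 30.0, 40.0]
chunks = [(scores[:2], values[:2]), (scores[2:], values[2:])]

reference = attention_row_safe(scores, values)
streamed = attention_row_streaming(chunks, offset=4.0)
```

Because softmax is invariant to a constant shift of its inputs, the two results agree up to floating-point rounding as long as the exponentials stay within range. In this toy version the offset only affects numerical range; an approximation in low-precision hardware would presumably also introduce some error, which is consistent with the abstract's mention of a negligible algorithmic performance loss.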