Abstract: Transformer neural networks (TNNs) are widely used in a diverse range of applications, including natural language processing (NLP), machine translation, and computer vision (CV). Their adoption has been driven primarily by the exceptional performance of the multi-head self-attention block, which extracts key features from sequential data. The multi-head self-attention block is followed by feedforward neural networks, which introduce the non-linearity that helps the model learn complex patterns. Despite the popularity of TNNs, few hardware accelerators target these two critical blocks. Most prior works have concentrated on sparse architectures that are not flexible enough for popular TNN variants. This paper introduces \textit{ProTEA}, a runtime-programmable accelerator tailored to the dense computations of most state-of-the-art transformer encoders. \textit{ProTEA} is designed to reduce latency by maximizing parallelism. We introduce an efficient tiling scheme for large matrices that distributes memory and computing resources across different hardware components within the FPGA. We provide runtime evaluations of \textit{ProTEA} on a Xilinx Alveo U55C high-performance data center accelerator card. Experimental results demonstrate that \textit{ProTEA} can host a wide range of popular transformer networks and achieve near-optimal performance with a tile size of 64 in the multi-head self-attention block and 6 in the feedforward network block when configured with 8 parallel attention heads, 12 layers, and an embedding dimension of 768 on the U55C. Comparative results show that \textit{ProTEA} is 2.5$\times$ faster than an NVIDIA Titan XP GPU and achieves a 1.3--2.8$\times$ speedup over current state-of-the-art custom-designed FPGA accelerators.
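
The tiling idea mentioned in the abstract can be illustrated with a minimal software sketch. This is not the \textit{ProTEA} hardware design: the function name `tiled_matmul`, its loop order, and the use of plain Python lists are assumptions made purely for illustration. The sketch only shows how a large dense matrix product can be split into fixed-size blocks so that each block fits a small local buffer and blocks can, in principle, be processed in parallel.

```python
def tiled_matmul(a, b, tile):
    """Multiply a (n x k) by b (k x m), visiting one tile-sized block at a time.

    Hypothetical helper for illustration only; it mirrors the general
    blocking idea, not the accelerator's actual dataflow.
    """
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # block row of the output
        for j0 in range(0, m, tile):          # block column of the output
            for k0 in range(0, k, tile):      # step along the shared dimension
                # Only a tile-sized slice of a and b is touched here, which is
                # what lets an accelerator keep the operands in small buffers.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = out[i][j]
                        for t in range(k0, min(k0 + tile, k)):
                            acc += a[i][t] * b[t][j]
                        out[i][j] = acc
    return out


def naive_matmul(a, b):
    """Reference product without tiling, used to check the tiled version."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]
```

Under these assumptions, an embedding dimension of 768 with a tile size of 64 splits that dimension into 768 / 64 = 12 tiles. The tile size trades local buffer size against the number of block iterations.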

A Hardware-efficient Accelerator for Encoding Stage of Text-to-speech Synthesis

Efficient Decoding Self-Attention for End-to-end Speech Synthesis

EfficientSpeech: An On-Device Text to Speech Model

Efficient Binary Weight Convolutional Network Accelerator for Speech Recognition

FastSpeech: Fast, Robust and Controllable Text to Speech

Bidirectional Decoding Tacotron for Attention Based Neural Speech Synthesis

FPGA-based Accelerator for Long Short-Term Memory Recurrent Neural Networks

A hardware accelerator for speech recognition applications

Accelerator-Aware Training for Transducer-Based Speech Recognition

TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture

ProTEA: Programmable Transformer Encoder Acceleration on FPGA

A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs

Neural Speech Synthesis with Transformer Network

Close to Human Quality TTS with Transformer

Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

TMA: Tera-MACs/W Neural Hardware Inference Accelerator with a Multiplier-less Massive Parallel Processor

A Spiking LSTM Accelerator for Automatic Speech Recognition Application Based on FPGA

An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs

Systolic-Array Deep-Learning Acceleration Exploring Pattern-Indexed Coordinate-Assisted Sparsity for Real-Time On-Device Speech Processing

Energon: Toward Efficient Acceleration of Transformers Using Dynamic Sparse Attention