Abstract: This work introduces a highly efficient implementation of the transformer architecture on a Field-Programmable Gate Array (FPGA) by using the \texttt{hls4ml} tool. Given the demonstrated effectiveness of transformer models in addressing a wide range of problems, their application in experimental triggers within particle physics becomes a subject of significant interest. In this work, we have implemented critical components of a transformer model, such as multi-head attention and softmax layers. To evaluate the effectiveness of our implementation, we have focused on a particle physics jet flavor tagging problem, employing a public dataset. We recorded latency under 2 $\mu$s on the Xilinx UltraScale+ FPGA, which is compatible with hardware trigger requirements at the CERN Large Hadron Collider experiments.

What problem does this paper attempt to address?

This paper addresses how to implement the Transformer architecture efficiently for low-latency machine-learning inference in particle physics experiments, especially in the hardware trigger system of the Large Hadron Collider (LHC). Specifically, it implements the key components of the Transformer model on an FPGA, such as the Multi-Head Attention (MHA) mechanism and the Softmax layer, and applies them to the jet flavor tagging task in particle physics: distinguishing heavy-quark jets (bottom quark b and charm quark c) from light-quark or gluon jets.

### Main problems:

1. **Low-latency requirement**: The hardware trigger system of the LHC must complete event selection within an extremely short time (usually a few microseconds), so an efficient low-latency inference algorithm is required.
2. **Limited computing resources**: The computing resources of an FPGA are limited, so implementing a complex Transformer model within them is a challenge.
3. **High-precision requirement**: In particle physics experiments, the accuracy of the model directly affects the identification of physical signals and the suppression of background noise. A balance therefore needs to be found between low latency and high precision.

### Solutions:

- **Efficient implementation of multi-head attention mechanism**: The paper describes in detail how to implement the MHA layer on an FPGA. A staged pipeline design optimizes performance and reduces latency.
- **Quantization technology**: Reducing the numerical precision of model parameters and inputs, and using a fixed-point number representation, lowers the consumption of computing resources while maintaining relatively high model performance.
- **Parallelization optimization**: The "reuse factor" parameter of the hls4ml tool controls the degree of parallelization and balances computing-resource utilization against latency.

### Experimental verification:

- **Dataset**: The publicly available dataset of the CMS experiment, which contains jets produced by top-quark-pair decays. These jets are labeled as bottom-quark jets, charm-quark jets, and light-quark or gluon jets.
- **Performance evaluation**: The model was implemented on a Xilinx UltraScale+ FPGA, and an inference latency of less than 2 microseconds was recorded, meeting the strict timing requirements of the LHC hardware trigger system.

### Conclusion:

The paper implements the Transformer architecture on an FPGA and balances low latency against high precision through these optimization techniques. The result applies to the hardware trigger system of the LHC and can be extended to other real-time detection systems that require low-latency, high-throughput inference.
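The quantization and reuse-factor ideas summarized above can be illustrated with a small, self-contained Python sketch. It is not the paper's implementation and does not call hls4ml. The helper names (`quantize_fixed`, `multipliers_needed`, `softmax`), the 16-bit width with 6 integer bits, the round-to-nearest and saturation behavior, and the simple "multipliers = products / reuse factor" cost model are all assumptions chosen for illustration.

```python
import math


def quantize_fixed(x, total_bits=16, int_bits=6):
    """Round x onto a signed fixed-point grid and saturate at the range limits.

    Hypothetical helper: it mimics the value grid of a signed fixed-point type
    with `total_bits` bits, of which `int_bits` (including the sign bit) sit
    before the binary point. Rounding and saturation are choices of this
    sketch; a real HLS type may be configured to truncate or wrap instead.
    """
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits                 # number of steps per unit
    lowest = -(1 << (total_bits - 1))      # most negative integer code
    highest = (1 << (total_bits - 1)) - 1  # most positive integer code
    code = math.floor(x * scale + 0.5)     # round half up
    code = max(lowest, min(highest, code))  # saturate
    return code / scale


def multipliers_needed(n_in, n_out, reuse_factor):
    """Simplified cost model (an assumption): each multiplier is time-shared
    `reuse_factor` times, so a dense n_in x n_out product needs about
    n_in * n_out / reuse_factor physical multipliers."""
    return math.ceil(n_in * n_out / reuse_factor)


def softmax(scores):
    """Numerically stable floating-point softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Coarse 8-bit type with 4 fractional bits: 0.1 lands on the nearest step.
    print(quantize_fixed(0.1, total_bits=8, int_bits=4))  # 0.125

    # Quantizing softmax outputs keeps them close to a probability vector.
    probs = [quantize_fixed(p) for p in softmax([1.0, 2.0, 3.0])]
    print(probs, sum(probs))

    # A larger reuse factor trades multipliers (resources) for more cycles.
    for rf in (1, 4, 16):
        print(rf, multipliers_needed(16, 16, rf))
```

In this toy model, a reuse factor of 1 corresponds to a fully parallel layer, while larger values share each multiplier across more operations, which lowers resource use at the cost of latency. Narrower fixed-point types shrink each operation further but increase the rounding error visible in the quantized softmax output.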

Ultra Fast Transformers on FPGAs for Particle Physics Experiments
