Ultra Fast Transformers on FPGAs for Particle Physics Experiments

Zhixing Jiang, Dennis Yin, Elham E Khoda, Vladimir Loncar, Ekaterina Govorkova, Eric Moreno, Philip Harris, Scott Hauck, Shih-Chieh Hsu
2024-02-02
Abstract: This work introduces a highly efficient implementation of the transformer architecture on a Field-Programmable Gate Array (FPGA) by using the `hls4ml` tool. Given the demonstrated effectiveness of transformer models in addressing a wide range of problems, their application in experimental triggers within particle physics becomes a subject of significant interest. In this work, we have implemented critical components of a transformer model, such as multi-head attention and softmax layers. To evaluate the effectiveness of our implementation, we have focused on a particle physics jet flavor tagging problem, employing a public dataset. We recorded latency under 2 $\mu$s on the Xilinx UltraScale+ FPGA, which is compatible with hardware trigger requirements at the CERN Large Hadron Collider experiments.
Machine Learning, Hardware Architecture, High Energy Physics - Experiment
What problem does this paper attempt to address?
This paper addresses how to implement the Transformer architecture efficiently enough for low-latency machine-learning inference in particle physics experiments, in particular in the hardware trigger systems of the Large Hadron Collider (LHC). It implements the key components of a Transformer model on an FPGA, such as the multi-head attention (MHA) mechanism and the softmax layer, and applies them to jet flavor tagging: distinguishing heavy-quark jets (bottom quark b and charm quark c) from light-quark or gluon jets.

### Main problems:

1. **Low-latency requirement**: The LHC hardware trigger system must complete event selection within an extremely short time (usually a few microseconds), so inference must be both efficient and low-latency.
2. **Limited computing resources**: An FPGA has limited computing resources, and fitting a complex Transformer model within them is a challenge.
3. **High-accuracy requirement**: In particle physics experiments, model accuracy directly affects how well physics signals are identified and background is suppressed, so low latency has to be balanced against high accuracy.

### Solutions:

- **Efficient implementation of the multi-head attention mechanism**: The paper describes in detail how the MHA layer is implemented on the FPGA. A staged pipeline design optimizes performance and reduces latency.
- **Quantization**: Reducing the numerical precision of model parameters and inputs and using a fixed-point representation lowers resource consumption while maintaining relatively high model performance.
- **Parallelization**: The "reuse factor" parameter of the hls4ml tool controls the degree of parallelization, trading resource utilization against latency.

### Experimental verification:

- **Dataset**: A publicly available CMS dataset containing jets produced in top-quark-pair decays. The jets are labeled as bottom-quark jets, charm-quark jets, or light-quark or gluon jets.
- **Performance evaluation**: The model was implemented on a Xilinx UltraScale+ FPGA, and an inference latency of less than 2 microseconds was recorded, meeting the strict timing requirements of the LHC hardware trigger system.

### Conclusion:

The paper implements the Transformer architecture on an FPGA and uses these optimization techniques to balance low latency against high accuracy. The result applies to the LHC hardware trigger system and could be extended to other real-time detection systems that require low-latency, high-throughput inference.
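To make the quantization and reuse-factor ideas from the summary more concrete, here is a minimal pure-Python sketch. It is not the paper's implementation and does not use the hls4ml API. The helper names (`quantize`, `attention`, `multipliers_needed`), the 16-bit width with 6 integer bits, and the toy inputs are all assumptions chosen for illustration. The fixed-point model is simplified to round-to-nearest with saturation, and the reuse-factor formula is only a rough estimate of how many parallel multipliers a dense matrix product would need.

```python
import math


def quantize(x, total_bits=16, int_bits=6):
    """Snap x onto a signed fixed-point grid (simplified: round-to-nearest, saturating).

    int_bits counts the sign bit; the bit widths are illustrative assumptions.
    """
    frac_bits = total_bits - int_bits          # bits after the binary point
    scale = 1 << frac_bits                     # grid step is 1 / scale
    lo = -(1 << (total_bits - 1))              # most negative integer code
    hi = (1 << (total_bits - 1)) - 1           # most positive integer code
    code = max(lo, min(hi, round(x * scale)))  # saturate instead of wrapping
    return code / scale


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, quant=None):
    """Single-head scaled dot-product attention on lists of row vectors.

    If quant is given, inputs, scores, weights, and outputs are each passed
    through it to imitate reduced-precision arithmetic.
    """
    f = quant if quant is not None else (lambda x: x)
    d = len(q[0])
    out = []
    for qi in q:
        scores = [f(sum(f(a) * f(b) for a, b in zip(qi, kj)) / math.sqrt(d))
                  for kj in k]
        weights = [f(w) for w in softmax(scores)]
        out.append([f(sum(w * f(vj[c]) for w, vj in zip(weights, v)))
                    for c in range(len(v[0]))])
    return out


def multipliers_needed(n_in, n_out, reuse_factor):
    """Rough count of parallel multipliers for an n_in x n_out dense product
    when each multiplier is time-shared reuse_factor times (an approximation)."""
    return math.ceil(n_in * n_out / reuse_factor)


# Toy inputs (hypothetical): two tokens, embedding size two.
q = [[0.5, -0.25], [0.1, 0.9]]
k = [[0.3, 0.7], [-0.6, 0.2]]
v = [[1.0, 0.0], [0.0, 1.0]]

ref = attention(q, k, v)                  # floating-point reference
fx = attention(q, k, v, quant=quantize)   # simulated 16-bit fixed point
max_err = max(abs(a - b) for r, s in zip(ref, fx) for a, b in zip(r, s))
print(f"max abs error from simulated fixed point: {max_err:.5f}")

# Larger reuse factors use fewer multipliers at the cost of more clock cycles.
for rf in (1, 4, 16):
    print(rf, multipliers_needed(16, 16, rf))
```

Comparing `ref` with `fx` shows how little the attention output moves at this bit width for small inputs, which is the kind of check one would run before committing to a precision. The loop at the end illustrates the trade-off the summary attributes to the reuse factor: fewer multipliers in exchange for more sequential work.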