Scaling Particle Collision Data Analysis
Hengkui Wu, Panpan Chi, Yongfeng Zhu, Liujiang Liu, Shuyang Hu, Yuexin Wang, Chen Zhou, Qihao Wang, Yingsi Xin, Bruce Liu, Dahao Liang, Xinglong Jia, Manqi Ruan
2024-11-28
Abstract: For decades, researchers have developed task-specific models to address scientific challenges across diverse disciplines. Recently, large language models (LLMs) have shown strong capabilities in handling general tasks; however, these models struggle with real-world scientific problems, particularly in domains involving large-scale numerical data analysis, such as experimental high-energy physics. This limitation is primarily due to the inefficacy of byte-pair encoding (BPE) tokenization on numerical data. In this paper, we propose a task-agnostic architecture, BBT-Neutron, which employs a binary tokenization method to facilitate pretraining on a mixture of textual and large-scale numerical experimental data. The project code is available at https://github.com/supersymmetry-technologies/bbt-neutron. We demonstrate the application of BBT-Neutron to Jet Origin Identification (JoI), a critical classification challenge in high-energy physics that distinguishes jets originating from different quarks or from gluons. Our results indicate that BBT-Neutron achieves performance comparable to state-of-the-art task-specific JoI models. Furthermore, we examine how BBT-Neutron's performance scales with increasing data volume. The observed scaling suggests that BBT-Neutron could serve as a foundation model for particle physics data analysis, with possible extensions to a broad spectrum of scientific computing applications for Big Science experiments, industrial manufacturing, and spatial computing.
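As a rough illustration of why byte-level input can suit numerical data, the sketch below maps text to its raw UTF-8 bytes, so that every digit becomes exactly one token from a fixed 256-entry vocabulary. This is an assumption made for illustration only: the abstract does not specify the details of BBT-Neutron's binary tokenization, and the function names `byte_tokenize` and `byte_detokenize` as well as the sample record are hypothetical.

```python
def byte_tokenize(text: str) -> list[int]:
    """Map a string to its UTF-8 byte values (one token per byte, range 0..255).

    Hypothetical helper: a generic byte-level scheme, not necessarily the
    tokenization used by BBT-Neutron.
    """
    return list(text.encode("utf-8"))


def byte_detokenize(tokens: list[int]) -> str:
    """Invert byte_tokenize by decoding the byte values back to text."""
    return bytes(tokens).decode("utf-8")


# Hypothetical record mixing labels and numerical values.
sample = "pt=45.1273 eta=-1.02"
tokens = byte_tokenize(sample)

# Each ASCII character, including every digit, is exactly one token,
# so numbers are never merged into variable-length subword pieces.
print(len(sample), len(tokens))
print(byte_detokenize(tokens) == sample)
```

Unlike a learned subword vocabulary, this mapping is lossless and splits every number the same way regardless of how often it appeared in the training data, at the cost of longer token sequences.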
Machine Learning; High Energy Physics - Experiment; Data Analysis, Statistics and Probability