Scaling Particle Collision Data Analysis

Hengkui Wu, Panpan Chi, Yongfeng Zhu, Liujiang Liu, Shuyang Hu, Yuexin Wang, Chen Zhou, Qihao Wang, Yingsi Xin, Bruce Liu, Dahao Liang, Xinglong Jia, Manqi Ruan
2024-11-28
Abstract: For decades, researchers have developed task-specific models to address scientific challenges across diverse disciplines. Recently, large language models (LLMs) have shown enormous capabilities in handling general tasks; however, these models encounter difficulties in addressing real-world scientific problems, particularly in domains involving large-scale numerical data analysis, such as experimental high energy physics. This limitation is primarily due to BPE tokenization's inefficacy with numerical data. In this paper, we propose a task-agnostic architecture, BBT-Neutron, which employs a binary tokenization method to facilitate pretraining on a mixture of textual and large-scale numerical experimental data. The project code is available at https://github.com/supersymmetry-technologies/bbt-neutron. We demonstrate the application of BBT-Neutron to Jet Origin Identification (JoI), a critical categorization challenge in high-energy physics that distinguishes jets originating from various quarks or gluons. Our results indicate that BBT-Neutron achieves comparable performance to state-of-the-art task-specific JoI models. Furthermore, we examine the scaling behavior of BBT-Neutron's performance with increasing data volume, suggesting the potential for BBT-Neutron to serve as a foundational model for particle physics data analysis, with possible extensions to a broad spectrum of scientific computing applications for Big Science experiments, industrial manufacturing and spatial computing.
Machine Learning; High Energy Physics - Experiment; Data Analysis, Statistics and Probability
What problem does this paper attempt to address?
The paper addresses the difficulty that existing large language models (LLMs) have in processing large-scale numerical data, especially in scientific fields such as experimental high-energy physics. Specifically, it identifies three limitations:

1. **Limitations of mainstream number representation methods**: The mainstream Byte-Pair Encoding (BPE) tokenization method handles numerical data poorly. BPE splits numerical values into arbitrary segments, which obscures their meaning, and the same value may be tokenized differently in different contexts, which further complicates downstream numerical tasks.
2. **Limitations of task-specific models**: For decades, researchers have developed many task-specific models to address scientific challenges. However, these models cannot benefit from transfer learning across tasks and generalize poorly to out-of-distribution data, both of which hinder scientific discovery.
3. **Limitations of foundation models in scientific tasks**: Existing foundation models are trained mainly on symbolic data, such as text and scientific symbols (DNA sequences, mathematical formulas). Numerical experimental data generated by large-scale scientific projects (for example in particle physics and astronomy) is usually excluded, because no general architecture has been available that integrates text and experimental data.

To address these problems, the paper proposes a new task-agnostic architecture, Big Bang Transformer-Neutron (BBT-Neutron). The architecture uses a binary tokenization method that directly handles mixed inputs of text and large-scale numerical experimental data, and it achieves performance comparable to existing state-of-the-art task-specific models on the Jet Origin Identification (JoI) task.
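
The idea of tokenizing at the byte level can be illustrated with a minimal sketch. It assumes plain UTF-8 byte encoding as a stand-in for the paper's binary tokenization; the function names and the sample record are hypothetical, and the actual BBT-Neutron scheme may differ in its details. The point it shows is that a number maps to the same token sequence wherever it appears, and that the mapping is lossless.

```python
# Illustrative sketch only: UTF-8 bytes stand in for "binary tokenization".
# Function names and the sample record are hypothetical, not from the paper.

def byte_tokenize(text: str) -> list[int]:
    """Map a string to token IDs in 0..255, one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def byte_detokenize(tokens: list[int]) -> str:
    """Invert byte_tokenize; no information is lost in the round trip."""
    return bytes(tokens).decode("utf-8")


def contains_run(haystack: list[int], needle: list[int]) -> bool:
    """Return True if needle occurs as a contiguous run inside haystack."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


# A made-up record mixing text and numerical values.
record = "jet pt=45.72 eta=-1.203"
tokens = byte_tokenize(record)

# The number alone yields the same byte tokens it has inside the record,
# so its representation does not depend on the surrounding context.
number_tokens = byte_tokenize("45.72")

print(number_tokens)                         # [52, 53, 46, 55, 50]
print(contains_run(tokens, number_tokens))   # True
print(byte_detokenize(tokens) == record)     # True
```

A fixed vocabulary of 256 byte values avoids learned merges altogether, at the cost of longer sequences than a subword tokenizer would produce for ordinary text.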
### Specific problems and solutions

- **Numerical data processing**: BBT-Neutron encodes input data into byte sequences through binary tokenization. This preserves the internal structure and quantitative integrity of numerical data and avoids the ambiguity caused by splitting or merging numerical and textual information.
- **Task-agnostic architecture**: As a task-agnostic large language model architecture, BBT-Neutron can be pretrained on multimodal datasets and can support multiple scientific tasks, such as classification, clustering, and regression.
- **Scalability and adaptability**: BBT-Neutron's performance improves as the amount of training data increases, especially on large-scale experimental data, which suggests broad application potential.

### Conclusion

By introducing binary tokenization and a task-agnostic architecture, BBT-Neutron offers a new approach to large-scale numerical data analysis in scientific computing, and marks a step toward foundation models suited to scientific research.