Abstract: For decades, researchers have developed task-specific models to address scientific challenges across diverse disciplines. Recently, large language models (LLMs) have shown enormous capabilities in handling general tasks; however, these models encounter difficulties in addressing real-world scientific problems, particularly in domains involving large-scale numerical data analysis, such as experimental high energy physics. This limitation is primarily due to BPE tokenization's inefficacy with numerical data. In this paper, we propose a task-agnostic architecture, BBT-Neutron, which employs a binary tokenization method to facilitate pretraining on a mixture of textual and large-scale numerical experimental data. The project code is available at https://github.com/supersymmetry-technologies/bbt-neutron. We demonstrate the application of BBT-Neutron to Jet Origin Identification (JoI), a critical categorization challenge in high-energy physics that distinguishes jets originating from various quarks or gluons. Our results indicate that BBT-Neutron achieves comparable performance to state-of-the-art task-specific JoI models. Furthermore, we examine the scaling behavior of BBT-Neutron's performance with increasing data volume, suggesting the potential for BBT-Neutron to serve as a foundational model for particle physics data analysis, with possible extensions to a broad spectrum of scientific computing applications for Big Science experiments, industrial manufacturing and spatial computing.

What problem does this paper attempt to address?

The paper addresses the difficulty that existing large language models (LLMs) have with large-scale numerical data, especially in scientific fields such as experimental high-energy physics. Specifically, it identifies three limitations:

1. **Limitations of mainstream number representation methods**: The mainstream Byte-Pair Encoding (BPE) tokenization method handles numerical data poorly. BPE splits numerical values into arbitrary segments, which obscures their meaning, and the same numerical value may be tokenized differently in different contexts, which further complicates downstream numerical tasks.
2. **Limitations of task-specific models**: For decades, researchers have developed task-specific models to address scientific challenges. These models cannot benefit from transfer learning across tasks and generalize poorly out of distribution, which hinders scientific discovery.
3. **Limitations of foundation models in scientific tasks**: Existing foundation models are trained mainly on symbolic data, such as text and scientific symbols (DNA sequences, mathematical formulas). Numerical experimental data generated by large-scale scientific projects (for example in particle physics and astronomy) are usually excluded, because there is no general architecture that can integrate text and experimental data.

To address these problems, the paper proposes a new task-agnostic architecture, Big Bang Transformer-Neutron (BBT-Neutron). The architecture uses a binary tokenization method that directly handles mixed inputs of text and large-scale numerical experimental data. On the Jet Origin Identification (JoI) task, it achieves performance comparable to existing state-of-the-art task-specific models.
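The first limitation, context-dependent splitting of numerals under BPE, can be illustrated with a toy encoder. This is a minimal sketch, not the tokenizer of any real LLM: the merge table `MERGES` is hand-written for illustration (real merge tables are learned from a corpus), and `bpe_encode` is a hypothetical helper that applies merges one pair at a time in rank order.

```python
from typing import List, Tuple

# Hypothetical, hand-written merge table (highest priority first).
# Real BPE merge tables are learned from corpus statistics.
MERGES: List[Tuple[str, str]] = [
    ("=", "1"),
    ("1", "2"),
    ("3", "4"),
    ("12", "34"),
]


def bpe_encode(text: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Greedily apply the highest-priority merge until none applies."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    tokens = list(text)  # start from single characters
    while len(tokens) > 1:
        # Rank every adjacent pair; unknown pairs get infinite rank.
        candidates = [
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no applicable merge is left
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


standalone = bpe_encode("1234", MERGES)
in_context = bpe_encode("pt=1234", MERGES)
print(standalone)  # ['1234']
print(in_context)  # ['p', 't', '=1', '2', '34']
```

With this merge table the value `1234` is a single token on its own, but after `=` its leading digit is absorbed into the `=1` token and the remaining digits are split as `2` and `34`. The same number therefore maps to different token sequences depending on what precedes it.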
### Specific problems and solutions

- **Numerical data processing**: BBT-Neutron encodes input data into byte sequences through binary tokenization. This preserves the internal structure and quantitative integrity of numerical data and avoids the ambiguity caused by splitting or merging numerical and textual information.
- **Task-agnostic architecture**: As a task-agnostic LLM architecture, BBT-Neutron can be pretrained on multimodal datasets and can support multiple scientific tasks, such as classification, clustering, and regression.
- **Scalability and adaptability**: BBT-Neutron's performance improves as the amount of data increases, especially on large-scale experimental data, which suggests broad application potential.

### Conclusion

By introducing binary tokenization and a task-agnostic architecture, BBT-Neutron offers a new approach to large-scale numerical data analysis in scientific computing, and is a step toward foundation models suited to scientific research.
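The byte-sequence encoding described in this summary can be sketched as follows. This is a minimal illustration that assumes plain UTF-8 bytes as the token IDs; the summary does not specify BBT-Neutron's exact binary format, and `byte_tokenize` and `byte_detokenize` are hypothetical helper names, not functions from the project repository.

```python
from typing import List


def byte_tokenize(text: str) -> List[int]:
    """Map a string to its UTF-8 byte values (vocabulary size 256)."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids: List[int]) -> str:
    """Invert byte_tokenize; the round trip is lossless."""
    return bytes(ids).decode("utf-8")


standalone = byte_tokenize("1234")
in_context = byte_tokenize("pt=1234")

print(standalone)       # [49, 50, 51, 52]
print(in_context[-4:])  # [49, 50, 51, 52]
print(byte_detokenize(in_context))  # pt=1234
```

Because each byte is its own token and no merges are applied, the digits of `1234` map to the same four IDs whether the number stands alone or follows other text. The trade-off is a longer token sequence than a subword tokenizer would produce.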

Scaling Particle Collision Data Analysis
