Abstract: The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data. Given a tokenizer's merge list along with example data for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o and Mistral NeMo's tokenizers are much more multilingual than their predecessors, training on 39% and 47% non-English language data, respectively; Llama 3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
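The key insight, that the order of BPE merges tracks pair frequencies in the training data, can be seen in a toy setting. The following is a minimal sketch, not the paper's implementation: the helper names (`apply_merge`, `train_bpe`, `mix`), the character-level setup, and the tiny "English" and "code" word lists are all assumptions made for illustration.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_counts, num_merges):
    """Learn an ordered merge list from a {word: count} dict (toy character-level BPE)."""
    corpus = Counter({tuple(word): count for word, count in word_counts.items()})
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # BPE greedily merges the most frequent pair; ties are broken
        # lexicographically here (an arbitrary choice for determinism).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged_corpus = Counter()
        for symbols, count in corpus.items():
            merged_corpus[apply_merge(symbols, best)] += count
        corpus = merged_corpus
    return merges

# Hypothetical two-category data: an "English" sample and a "code" sample.
ENGLISH = {"the": 10, "that": 5}
CODE = {"if": 12, "elif": 4}

def mix(english_weight, code_weight):
    """Build a training set with the given relative weights of the two categories."""
    data = {w: english_weight * c for w, c in ENGLISH.items()}
    data.update({w: code_weight * c for w, c in CODE.items()})
    return data

print(train_bpe(mix(3, 1), 2))  # English-heavy: [('t', 'h'), ('th', 'e')]
print(train_bpe(mix(1, 3), 2))  # code-heavy:    [('i', 'f'), ('t', 'h')]
```

Changing only the mixture weights changes which pair is merged first, so a published merge list carries information about the mixture it was trained on.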

What problem does this paper attempt to address?

This paper addresses the problem of inferring the data mixture proportions in the pretraining data of large language models (LMs). Specifically, the authors propose a novel attack that recovers the proportions of different categories (such as natural languages, programming languages, or data sources) in the training data by analyzing the widely used byte-pair encoding (BPE) tokenizer. The main contributions of the paper are:

1. **Proposing the data mixture inference task**: Unlike previous membership inference attacks, this task targets the proportions of different categories in the training dataset.
2. **Using the BPE tokenizer for inference**: The ordered list of merge rules learned by a BPE tokenizer reveals information about token frequencies in its training data. Given the merge list and sample data for each category, a linear program solves for the proportion of each category.
3. **Experimental validation**: The method is evaluated in controlled experiments and then applied to the tokenizers of existing commercial models to estimate their data mixture proportions.

In experiments on tokenizers trained with known mixture proportions, the method recovers those proportions with high precision. The authors then apply it to off-the-shelf tokenizers, including those of the GPT series, Llama, and others, and report new findings. For example, the tokenizers of GPT-4o and Mistral NeMo are much more multilingual than their predecessors, while the tokenizer of GPT-3.5 is trained on predominantly code (~60%). Additionally, the study reports that the training data of every tokenizer examined includes 7%-26% book data.

In summary, the paper aims to shed light on the mixture proportions in the pretraining data of today's strongest language models by analyzing their BPE tokenizers, offering a new perspective on the design practices behind these models and encouraging further research on data mixture inference.
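As a rough sketch of how such a linear program might be written (the notation is assumed here for illustration and is not quoted from the paper; its exact formulation may differ): let $\alpha_i$ be the unknown proportion of category $i$, let $m^{(t)}$ be the $t$-th merge in the tokenizer's merge list, and let $c_{i,p}^{(t)}$ be the normalized count of pair $p$ in the sample data for category $i$ after the first $t-1$ merges have been applied. Because BPE always merges the most frequent pair, the chosen pair $m^{(t)}$ should be at least as frequent under the true mixture as any other pair $p$. Sample data only approximates the real training data, so non-negative slack variables $v_p^{(t)}$ absorb violations and their total is minimized:

```latex
\begin{aligned}
\min_{\alpha,\, v}\quad & \sum_{t} \sum_{p \neq m^{(t)}} v_{p}^{(t)} \\
\text{s.t.}\quad & \sum_{i=1}^{n} \alpha_i\, c_{i,m^{(t)}}^{(t)} + v_{p}^{(t)} \;\ge\; \sum_{i=1}^{n} \alpha_i\, c_{i,p}^{(t)}
  \quad \text{for all } t \text{ and all } p \neq m^{(t)}, \\
& \sum_{i=1}^{n} \alpha_i = 1, \qquad \alpha_i \ge 0, \qquad v_{p}^{(t)} \ge 0 .
\end{aligned}
```

Every constraint is linear in $\alpha$ and $v$, which is what makes the problem a linear program; the solution $\alpha$ is the estimated mixture.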

Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

RegMix: Data Mixture as Regression for Language Model Pre-training

BiMix: Bivariate Data Mixing Law for Language Model Pretraining

Understanding and Mitigating Tokenization Bias in Language Models

Byte BPE Tokenization as an Inverse string Homomorphism

Adaptive Pre-training Data Detection for Large Language Models via Surprising Tokens

Large Language Model Tokenizer Bias: A Case Study and Solution on GPT-4o

TextMixer: Mixing Multiple Inputs for Privacy-Preserving Inference

Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies

Retrofitting (Large) Language Models with Dynamic Tokenization

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

Mole-BERT: Rethinking Pre-training Graph Neural Networks for Molecules

Efficient Online Data Mixing For Language Model Pre-Training

Getting the most out of your tokenizer for pre-training and domain adaptation

Byte Pair Encoding is Suboptimal for Language Model Pretraining

Batching BPE Tokenization Merges

Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers

LBPE: Long-token-first Tokenization to Improve Large Language Models

Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning