Abstract: Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for scaling to long sequences. This work studies how tokenization impacts model performance by analyzing and comparing the stochastic behavior of tokenized models with their byte-level, or token-free, counterparts. We discover that, even when the two models are statistically equivalent, their predictive distributions over the next byte can be substantially different, a phenomenon we term "tokenization bias". To fully characterize this phenomenon, we introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution. From this result, we develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization. In other words, this enables zero-shot conversion of tokenized LMs into statistically equivalent token-free ones. We demonstrate its broad applicability with two use cases: fill-in-the-middle (FIM) tasks and model ensembles. In FIM tasks, where input prompts may terminate mid-token and lead to out-of-distribution tokenization, our method mitigates performance degradation and achieves an approximately 18% improvement on FIM coding benchmarks, consistently outperforming the standard token healing fix. For model ensembles where each model employs a distinct vocabulary, our approach enables seamless integration, resulting in improved performance (up to 3.7%) over individual models across various standard baselines in reasoning, knowledge, and coding.
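
To make the notion of tokenization bias concrete, the following is a minimal toy sketch, not the paper's algorithm. It assumes a hypothetical vocabulary {"a", "b", "aa"} with greedy longest-match tokenization and a uniform distribution over the four two-character strings, stored as a hypothetical table `TOKEN_LM`. The function names `naive_next_byte` and `exact_next_byte` are illustrative only. The exact version uses brute-force enumeration over all token sequences, which is only feasible at this toy scale.

```python
from collections import defaultdict

# Hypothetical toy token-level LM: probability of each complete token sequence.
# The strings "aa", "ab", "ba", "bb" are equally likely; greedy longest-match
# tokenization always encodes "aa" as the single token "aa".
TOKEN_LM = {
    ("aa",): 0.25,
    ("a", "b"): 0.25,
    ("b", "a"): 0.25,
    ("b", "b"): 0.25,
}


def naive_next_byte(prompt_tokens):
    """Condition on the token prefix and read the first byte of the next token."""
    n = len(prompt_tokens)
    dist = defaultdict(float)
    for seq, p in TOKEN_LM.items():
        if seq[:n] == tuple(prompt_tokens) and len(seq) > n:
            dist[seq[n][0]] += p
    total = sum(dist.values())
    return {byte: p / total for byte, p in dist.items()}


def exact_next_byte(prefix):
    """Marginalize over every token sequence whose decoded text starts with prefix."""
    dist = defaultdict(float)
    for seq, p in TOKEN_LM.items():
        text = "".join(seq)
        if text.startswith(prefix) and len(text) > len(prefix):
            dist[text[len(prefix)]] += p
    total = sum(dist.values())
    return {byte: p / total for byte, p in dist.items()}


# After the token "a", the token model never emits another "a" (that string
# would have been encoded as "aa"), so the naive prediction is biased.
print(naive_next_byte(["a"]))   # {'b': 1.0}
print(exact_next_byte("a"))     # {'a': 0.5, 'b': 0.5}
```

In this toy setting both views describe the same distribution over strings, yet conditioning on the token "a" rules out a following "a", while the byte-level conditional assigns it probability 0.5. Summing over all token sequences consistent with the byte prefix recovers the byte-level answer.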

From Language Models over Tokens to Language Models over Characters

On the Proper Treatment of Tokenization in Psycholinguistics

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

Revisiting Character-level Adversarial Attacks for Language Models

Language models scale reliably with over-training and on downstream tasks

Character Eyes: Seeing Language through Character-Level Taggers

Evaluating Character Understanding of Large Language Models via Character Profiling from Fictional Works

Bridging the Gap for Tokenizer-Free Language Models

Evaluating Language Model Character Traits

Language Model Inversion

To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer

Tokenization Falling Short: On Subword Robustness in Large Language Models

Retrofitting (Large) Language Models with Dynamic Tokenization

Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization

Tokenizer Choice For LLM Training: Negligible or Crucial?

Counterfactual Token Generation in Large Language Models

Character is Destiny: Can Large Language Models Simulate Persona-Driven Decisions in Role-Playing?

Thinking Tokens for Language Modeling

Adapting Language Models via Token Translation

Sub-Character Tokenization for Chinese Pretrained Language Models