Abstract: Background & Objective: Biomedical text data are increasingly available for research. Tokenization is an initial step in many biomedical text mining pipelines. It is the process of parsing an input biomedical sentence (represented as a digital character sequence) into a discrete set of word/token symbols, each of which conveys focused semantic/syntactic meaning. The objective of this study is to explore variation in tokenizer outputs when applied across a series of challenging biomedical sentences.

Method: Diaz [2015] introduces 24 challenging example biomedical sentences for comparing tokenizer performance. In this study, we descriptively explore variation in the outputs of eight tokenizers applied to each example biomedical sentence. The tokenizers compared are the NLTK whitespace tokenizer, the NLTK Penn Treebank tokenizer, the spaCy and scispaCy tokenizers, the Stanza and Stanza-Craft tokenizers, the UDPipe tokenizer, and R-tokenizers.

Results: For many examples, the tokenizers performed similarly; for certain examples, however, there was meaningful variation in the returned outputs. The whitespace tokenizer often performed differently from the other tokenizers. We observed performance similarities between tokenizers implementing rule-based systems (e.g. pattern matching and regular expressions) and tokenizers implementing neural architectures for token classification. The challenging tokens that produced the greatest variation in outputs were often words that convey substantive and focused biomedical/clinical meaning (e.g. x-ray, IL-10, TCR/CD3, CD4+ CD8+, and (Ca2+)-regulated).

Conclusion: When state-of-the-art, open-source tokenizers from Python and R were applied to a series of challenging biomedical example sentences, we observed subtle variation in the returned outputs.
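To make the kind of variation described above concrete, the following is a minimal sketch that contrasts a whitespace split with a toy regular-expression tokenizer. It is not the study's code and uses none of the eight tokenizers compared: the function names, the regular expression, and the example sentence are all assumptions made for illustration, and the sentence is not one of the 24 examples from Diaz [2015]. It only reuses token types mentioned in the abstract (IL-10, TCR/CD3, x-ray).

```python
import re

# Hypothetical sentence written for this sketch; it is NOT one of the
# 24 example sentences from Diaz [2015].
SENTENCE = "IL-10 and TCR/CD3 were measured after the x-ray."


def whitespace_tokenize(text):
    """Split on runs of whitespace only; punctuation stays attached to words."""
    return text.split()


def rule_based_tokenize(text):
    """Toy rule-based tokenizer (an assumption, not any library's rules):
    each run of ASCII letters/digits is one token, and every other
    non-whitespace character is its own token."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def differing_tokens(tokens_a, tokens_b):
    """Return the token types produced by exactly one of the two tokenizers."""
    return sorted(set(tokens_a) ^ set(tokens_b))


if __name__ == "__main__":
    ws = whitespace_tokenize(SENTENCE)
    rb = rule_based_tokenize(SENTENCE)
    print("whitespace:", ws)
    print("rule-based:", rb)
    print("differing :", differing_tokens(ws, rb))
```

In this sketch the whitespace split keeps IL-10 and TCR/CD3 intact but leaves the final period attached to x-ray, while the toy rule-based tokenizer separates the period and also breaks each hyphenated or slashed term into pieces. Real tokenizers apply far richer rules or learned models, so their outputs on such tokens may differ from both of these.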

Two Counterexamples to "Tokenization and the Noiseless Channel"

Tokenization Is More Than Compression

Understanding and Mitigating Tokenization Bias in Language Models

Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

When Every Token Counts: Optimal Segmentation for Low-Resource Language Models

Getting the most out of your tokenizer for pre-training and domain adaptation

Retrofitting (Large) Language Models with Dynamic Tokenization

Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Byte BPE Tokenization as an Inverse string Homomorphism

Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers

To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer

The Foundations of Tokenization: Statistical and Computational Concerns

ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model

Greed is All You Need: An Evaluation of Tokenizer Inference Methods

Tokenization as Finite-State Transduction

Language Model Tokenizers Introduce Unfairness Between Languages

Are Some Words Worth More than Others?

Comparing Variation in Tokenizer Outputs Using a Series of Problematic and Challenging Biomedical Sentences

Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance

Tokenization Falling Short: On Subword Robustness in Large Language Models

NAST: Noise Aware Speech Tokenization for Speech Language Models