Abstract: Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model - Segment any Text (SaT) - to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines - including strong LLMs - across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://github.com/segment-any-text/wtpsplit under the MIT license.

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Finding Better Subword Segmentation for Neural Machine Translation

The NLP Engine: A Universal Turing Machine for NLP

MorphPiece: A Linguistic Tokenizer for Large Language Models

Tokenization as Finite-State Transduction

The boundaries of meaning: a case study in neural machine translation

Semantic Tokenizer for Enhanced Natural Language Processing

Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

Pivot-Based Semantic Splicing for Neural Machine Translation

Enhancing Neural Machine Translation with Semantic Units

Neural Monkey: An Open-source Tool for Sequence Learning

Vec-Tok Speech: speech vectorization and tokenization for neural speech generation

knn-seq: Efficient, Extensible kNN-MT Framework

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese

SubCharacter Chinese-English Neural Machine Translation with Wubi encoding

Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models

Guiding Neural Machine Translation with Retrieved Translation Pieces

Joint tokenization, parsing, and translation

Design of an Open-Source Architecture for Neural Machine Translation