Abstract: Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model, Segment any Text (SaT), to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines, including strong LLMs, across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://github.com/segment-any-text/wtpsplit under the MIT license.
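
As a rough illustration of the robustness problem the abstract describes, the sketch below implements a deliberately naive punctuation-based splitter. It is not the SaT model and does not use the wtpsplit API; `naive_split` and the sample strings are hypothetical names and data invented for this example. It only shows how a splitter that relies on sentence-final punctuation returns a single segment once that punctuation is missing.

```python
import re


def naive_split(text: str) -> list[str]:
    """Punctuation-only baseline: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# Well-formatted input: the punctuation rule finds every boundary.
punctuated = "SaT segments text. It is fast! Does it adapt?"

# Same content without punctuation or casing: the rule finds no boundary at all.
unpunctuated = "sat segments text it is fast does it adapt"

print(naive_split(punctuated))
print(naive_split(unpunctuated))
```

The first call yields three sentences, while the second returns the whole input as one segment, which is the failure mode that motivates reducing reliance on punctuation.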

An automatic approach for efficient text segmentation

An adaptive method for text domain similarity calculation

Text structure partition for automatic summarization

Study on Topic Partition in Automatic Abstracting System

Study on Topic Partition Based on Sequential Paragraphic Similarity

Multi-documents Automatic Abstracting Based on Text Clustering and Semantic Analysis

Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

Using multiple discriminant analysis approach for linear text segmentation

Segmentation of Chinese Discourse in Content-Based Information Retrieval

A Mixed Approach to Book Splitting

OntoSeg: a Novel Approach to Text Segmentation using Ontological Similarity

Optimizing temporal topic segmentation for intelligent text visualization.

Automatic summarization oriented Chinese word extraction and statistics system

Advances in domain independent linear text segmentation

Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation, Corpus, and Benchmark

Automatic Labeling of Topic Models Using Text Summaries

Unsupervised segmentation of chinese corpus using accessor variety

Clustering Text Data Streams

Topic Decomposition and Summarization

TopWORDS-Seg: Simultaneous Text Segmentation and Word Discovery for Open-Domain Chinese Texts via Bayesian Inference

Recent Trends in Linear Text Segmentation: a Survey