Abstract: Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model - Segment any Text (SaT) - to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines - including strong LLMs - across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://github.com/segment-any-text/wtpsplit under the MIT license.

What problem does this paper attempt to address?

The paper addresses the limitations of existing sentence segmentation methods on text that lacks punctuation, is inconsistently formatted, or mixes multiple languages. Specifically, it identifies the following problems with current methods:

1. **Dependence on Punctuation**: Existing sentence segmentation tools rely mainly on punctuation to identify sentence boundaries, so they perform poorly on text where punctuation is missing.
2. **Poor Adaptability to New Domains**: Performance drops significantly on text from other domains, such as lyrics or legal documents.
3. **Low Efficiency**: Some existing methods are slow on large-scale text, which limits their practical use.
4. **Insufficient Multilingual Support**: Most existing methods require the language code of the text to be specified, which limits their use on multilingual and code-switching text.

To overcome these problems, the paper proposes a new model, Segment any Text (SaT), with the following goals:

- **Robustness**: Reduce the dependence on punctuation and improve performance on text with missing punctuation or inconsistent formatting.
- **Adaptability**: Adapt effectively to text from different domains, including lyrics and legal documents.
- **High Efficiency**: Process large-scale text faster through changes to the model architecture.
- **Multilingual Support**: Segment sentences in multiple languages without requiring a language code.

The paper supports these claims with a series of experiments, in which SaT performs especially well relative to the baselines on poorly formatted text.
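The punctuation-dependence problem can be made concrete with a minimal sketch. The splitter below is a hypothetical, deliberately naive rule-based baseline written for illustration; it is not the paper's method and not the implementation of any particular existing tool. The function name `naive_split` and the example strings are assumptions introduced here.

```python
import re
from typing import List

# Hypothetical rule for illustration only: a sentence boundary is any run of
# whitespace that directly follows '.', '!' or '?'.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text: str) -> List[str]:
    """Split text at whitespace that follows sentence-final punctuation."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


punctuated = "This is a test. This is another test."
unpunctuated = "this is a test this is another test"

# With punctuation, the rule finds the boundary between the two sentences.
print(naive_split(punctuated))

# Without punctuation, the rule has nothing to anchor on and returns the
# whole input as a single segment.
print(naive_split(unpunctuated))
```

A rule of this kind has no signal to work with once punctuation and casing are gone, which is the situation the robustness goal above targets.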

Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation

Subtitles to Segmentation: Improving Low-Resource Speech-to-Text Translation Pipelines

Scalable and Domain-General Abstractive Proposition Segmentation

An automatic approach for efficient text segmentation

Lexically Grounded Subword Segmentation

From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions

A Masked Segmental Language Model for Unsupervised Natural Language Segmentation

Building a Word Segmenter for Sanskrit Overnight

TAMS: Translation-Assisted Morphological Segmentation

GiTS: Gist-driven Text Segmentation

Segment and Caption Anything

Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora

Separator-Transducer-Segmenter: Streaming Recognition and Segmentation of Multi-party Speech

TreeSeg: Hierarchical Topic Segmentation of Large Transcripts

Automating Easy Read Text Segmentation

SEGMENT+: Long Text Processing with Short-Context Language Models

BiSECT: Learning to Split and Rephrase Sentences with Bitexts

Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation

Text Segmentation as a Supervised Learning Task

Weakly-Supervised Text Instance Segmentation