Abstract: The current trend in Natural Language Processing (NLP) is to use ever more extra data to build the best possible models. This implies higher computational costs and longer training times, makes deployment harder, and raises concerns about these models' carbon footprint, a problem that will become critical in the future. Against this trend, our goal is to develop NLP models that require no extra data and minimize training time. To do so, in this paper, we explore Markov chain models, namely the Hidden Markov Chain (HMC) and the Pairwise Markov Chain (PMC), for NLP segmentation tasks. We apply these models to three classic applications: POS Tagging, Named-Entity Recognition, and Chunking. We develop an original method to adapt these models to the specific challenges of text segmentation and obtain relevant performance with very short training and execution times. PMC achieves results equivalent to those obtained by Conditional Random Fields (CRF), one of the most widely applied models for these tasks when no extra data are used. Moreover, PMC training times are 30 times shorter than those of CRF, which validates this model given our objectives.

What problem does this paper attempt to address?

The main problem this paper addresses is how to build natural language processing (NLP) models that train quickly and run quickly without using additional data. Specifically, it focuses on three classic text segmentation tasks: part-of-speech tagging (POS Tagging), named-entity recognition (NER), and chunking. The current trend is to use more and more additional data to build the best models, which leads to higher computational costs, longer training times, and concerns about the carbon footprint of these models. To address these issues, the paper proposes a method based on the pairwise Markov chain (PMC) that aims to reduce training time and make the model more lightweight.

### Main Objectives of the Paper:

1. **Reduce Training Time**: By using the PMC model, the paper aims to significantly shorten training time while maintaining high performance.
2. **Reduce Computational Costs**: Lower the demand for computational resources, making the model easier to deploy.
3. **Environment-Friendly**: Reduce the carbon footprint of the model.

### Specific Methods:

- **Model Selection**: The paper selects the pairwise Markov chain (PMC) model and compares it with the traditional hidden Markov chain (HMC) model.
- **Task Adaptation**: The paper proposes an original method to adapt the PMC model to the specific challenges of text segmentation tasks in order to obtain relevant performance.
- **Experimental Verification**: Experiments are conducted on several standard datasets, including CoNLL 2000, CoNLL 2003, and UD English, to verify the effectiveness of the PMC model.

### Experimental Results:

- **Performance Comparison**: The PMC model achieves results comparable to conditional random fields (CRF) in most experiments, and performs especially well on known vocabulary.
- **Training Time**: The training time of the PMC model is about 30 times shorter than that of the CRF model, and its execution time is also shorter.
- **Unknown Vocabulary Processing**: Although the PMC performs slightly worse on unknown vocabulary, its overall performance is still satisfactory.

### Conclusion:

The PMC model achieves performance comparable to the CRF model without using additional data, while significantly reducing training and execution times. This makes the PMC model a lightweight and efficient text segmentation solution, especially suitable for resource-constrained scenarios.
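
To make the PMC idea concrete, here is a minimal Python sketch, not the paper's method. In a PMC the pair (label, word) forms a Markov chain, so the model is driven by p(label_t, word_t | label_{t-1}, word_{t-1}); the HMC is the special case where this factorizes as p(label_t | label_{t-1}) · p(word_t | label_t). The sketch assumes plain count-based estimation with add-alpha smoothing and Viterbi decoding. The names `train_pmc`, `viterbi`, `START`, and the value of `alpha` are hypothetical, and the paper's own adaptation for text segmentation (for example, its handling of unknown words) is not reproduced here.

```python
import math
from collections import Counter

START = ("<s>", "<s>")  # hypothetical start-of-sentence pair

def train_pmc(tagged_sentences):
    """One counting pass over sentences given as lists of (word, label)."""
    pair_counts = Counter()     # counts of (previous pair, current pair)
    context_counts = Counter()  # counts of the previous pair alone
    labels = set()
    for sentence in tagged_sentences:
        prev = START
        for word, label in sentence:
            pair = (label, word)
            pair_counts[(prev, pair)] += 1
            context_counts[prev] += 1
            labels.add(label)
            prev = pair
    return pair_counts, context_counts, sorted(labels)

def viterbi(words, model, alpha=0.1):
    """Most likely label sequence under the smoothed pair transitions."""
    if not words:
        return []
    pair_counts, context_counts, labels = model
    # Smoothing denominator size: distinct pairs seen, plus one unseen slot
    # (an assumption made for this sketch).
    n_pairs = len({pair for (_, pair) in pair_counts}) + 1

    def log_p(prev, pair):
        num = pair_counts[(prev, pair)] + alpha
        den = context_counts[prev] + alpha * n_pairs
        return math.log(num / den)

    scores = {x: log_p(START, (x, words[0])) for x in labels}
    backpointers = []
    for t in range(1, len(words)):
        new_scores, pointers = {}, {}
        for x in labels:
            cur = (x, words[t])
            best = max(
                labels,
                key=lambda xp: scores[xp] + log_p((xp, words[t - 1]), cur),
            )
            new_scores[x] = scores[best] + log_p((best, words[t - 1]), cur)
            pointers[x] = best
        scores = new_scores
        backpointers.append(pointers)

    # Follow the back-pointers from the best final label.
    path = [max(labels, key=scores.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_pmc(corpus)
print(viterbi(["the", "dog", "sleeps"], model))
```

Training here is a single counting pass with no iterative optimization, which is one plausible reason a generative Markov model can train much faster than a CRF. Words never seen in training get only the smoothing mass, which mirrors the weaker unknown-vocabulary behavior noted above.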

Highly Fast Text Segmentation With Pairwise Markov Chains
