Highly Fast Text Segmentation With Pairwise Markov Chains

Elie Azeraf, Emmanuel Monfrini, Emmanuel Vignon, Wojciech Pieczynski
DOI: https://doi.org/10.1109/CiSt49399.2021.9357304
2021-02-18
Abstract: Natural Language Processing (NLP) models' current trend consists of using increasingly more extra-data to build the best models as possible. It implies more expensive computational costs and training time, difficulties for deployment, and worries about these models' carbon footprint reveal a critical problem in the future. Against this trend, our goal is to develop NLP models requiring no extra-data and minimizing training time. To do so, in this paper, we explore Markov chain models, Hidden Markov Chain (HMC) and Pairwise Markov Chain (PMC), for NLP segmentation tasks. We apply these models for three classic applications: POS Tagging, Named-Entity-Recognition, and Chunking. We develop an original method to adapt these models for text segmentation's specific challenges to obtain relevant performances with very short training and execution times. PMC achieves equivalent results to those obtained by Conditional Random Fields (CRF), one of the most applied models for these tasks when no extra-data are used. Moreover, PMC has training times 30 times shorter than the CRF ones, which validates this model given our objectives.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper asks how to build natural language processing (NLP) models that train and run quickly without using any additional data. The authors focus on three classic text segmentation tasks: part-of-speech (POS) tagging, named-entity recognition (NER), and chunking. The current trend is to use more and more additional data to build the best possible models, which raises computational costs, lengthens training, complicates deployment, and increases concern about the models' carbon footprint. To counter this trend, the authors propose a method based on the Pairwise Markov Chain (PMC) that aims to shorten training time and keep the model lightweight.

### Main Objectives of the Paper:

1. **Reduce Training Time**: By using the PMC model, the authors aim to shorten training significantly while maintaining high performance.
2. **Reduce Computational Costs**: Lower the amount of computational resources required, making the model easier to deploy.
3. **Environment-Friendly**: Reduce the model's carbon footprint.

### Specific Methods:

- **Model Selection**: The authors select the Pairwise Markov Chain (PMC) model and compare it with the traditional Hidden Markov Chain (HMC) model.
- **Task Adaptation**: The authors propose an original method to adapt these models to the specific challenges of text segmentation in order to obtain relevant performance.
- **Experimental Verification**: The authors run experiments on several standard datasets, including CoNLL 2000, CoNLL 2003, and UD English, to verify the effectiveness of the PMC model.

### Experimental Results:

- **Performance Comparison**: The PMC model achieves results comparable to Conditional Random Fields (CRF) in most experiments, particularly on known vocabulary.
- **Training Time**: The PMC model's training time is about 30 times shorter than that of the CRF model, and its execution time is also shorter.
- **Unknown Vocabulary Processing**: Although the PMC performs slightly worse on unknown vocabulary, its overall performance remains satisfactory.

### Conclusion:

The PMC model achieves performance comparable to the CRF model without using additional data, while significantly reducing training and execution times. This makes the PMC model a lightweight and efficient text segmentation solution, especially suitable for resource-constrained scenarios.
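
To make the speed argument more concrete, the following is a minimal sketch of the HMC baseline mentioned above, not the authors' PMC adaptation. It assumes the parameters are estimated by counting tag and word frequencies in labeled sentences and that tags are decoded with the Viterbi algorithm. The function names (`train_hmc`, `log_prob`, `viterbi`), the add-alpha smoothing, and the toy data are illustrative assumptions and do not come from the paper. A PMC relaxes the HMC assumptions by treating the pair (tag, word) as the Markov process, so its transition term would also condition on the previous word.

```python
from collections import Counter, defaultdict
import math


def train_hmc(tagged_sentences):
    """Estimate HMC parameters by counting (a single pass over the data)."""
    init = Counter()               # counts of tags that start a sentence
    trans = defaultdict(Counter)   # trans[prev_tag][tag] transition counts
    emit = defaultdict(Counter)    # emit[tag][word] emission counts
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            if prev is None:
                init[tag] += 1
            else:
                trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return init, trans, emit


def log_prob(counter, key, n_outcomes, alpha=1.0):
    """Add-alpha smoothed log-probability (smoothing is an assumption here)."""
    total = sum(counter.values())
    return math.log((counter[key] + alpha) / (total + alpha * n_outcomes))


def viterbi(words, init, trans, emit, tags, vocab_size):
    """Return the most probable tag sequence for `words` under the HMC."""
    n_tags = len(tags)
    # scores[t]: best log-probability of any tag path ending in t so far
    scores = {
        t: log_prob(init, t, n_tags) + log_prob(emit[t], words[0], vocab_size)
        for t in tags
    }
    back = []  # one dict of back-pointers per position after the first
    for word in words[1:]:
        new_scores, pointers = {}, {}
        for t in tags:
            best_prev = max(
                tags, key=lambda p: scores[p] + log_prob(trans[p], t, n_tags)
            )
            new_scores[t] = (
                scores[best_prev]
                + log_prob(trans[best_prev], t, n_tags)
                + log_prob(emit[t], word, vocab_size)
            )
            pointers[t] = best_prev
        back.append(pointers)
        scores = new_scores
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: scores[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy labeled data (hypothetical, for illustration only)
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
init, trans, emit = train_hmc(train)
tags = sorted(emit)
# Reserve one extra vocabulary slot for unknown words
vocab_size = len({w for s in train for w, _ in s}) + 1
predicted = viterbi(["the", "cat", "barks"], init, trans, emit, tags, vocab_size)
```

Because training here is a single counting pass with no iterative optimization, it is cheap compared with the gradient-based fitting a CRF typically requires, which is consistent with the short training times reported above.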