Abstract: Part-of-speech (POS) tagging, though considered preliminary to any Natural Language Processing (NLP) task, is crucial to account for, especially in a low-resource language like Khasi that lacks any form of formal corpus. Because POS tagging is context sensitive, the task is challenging. In this paper, we investigate a deep learning approach to the POS tagging problem in Khasi. A deep learning model called Robustly Optimized BERT Pretraining Approach (RoBERTa) is pretrained on a language modelling task. We then create RoBERTa for POS (RoPOS) tagging, a model that performs POS tagging by fine-tuning the pretrained RoBERTa and leveraging its embeddings for downstream POS tagging. The existing tagset designed specifically for the Khasi language is employed for this work, and the corresponding tagged dataset is taken as our base corpus. Further, we propose additional tags for this tagset to meet the requirements of the language, and we increase the size of the existing Khasi POS corpus. Other machine learning and deep learning models are also evaluated on the same task, and a comparative analysis of the models is presented. Two different setups are used for the RoPOS model, and the best testing accuracy achieved is 92 per cent. The comparative analysis indicates that RoPOS outperforms the other models when used for inference on texts outside the domain of the POS-tagged training dataset.

POS Tagging Method Based on BiLSTM_CRF Model

Incorporating External POS Tagger for Punctuation Restoration

Experimental Study of Hidden Markov Model Based Part-of-speech Tagging for Chinese Texts

Bidirectional LSTM-CRF Models for Sequence Tagging

A Chinese Part-of-speech Tagging Approach Using Conditional Random Fields

POS Tag-enhanced Coarse-to-fine Attention for Neural Machine Translation

Deep Learning Model for Tamil Part-of-Speech Tagging

Vietnamese Part of Speech Tagging Based on Multi-category Words Disambiguation Model

Cross-Register Projection for Headline Part of Speech Tagging

Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger

Improving accuracy of Part-of-Speech (POS) tagging using hidden markov model and morphological analysis for Myanmar Language

Leveraging Bidirectionl LSTM with CRFs for Pashto tagging

Yunshan Cup 2020: Overview of the Part-of-Speech Tagging Task for Low-resourced Languages

Towards Accurate and Efficient Chinese Part-of-Speech Tagging

Probing a pretrained RoBERTa on Khasi language for POS tagging

POS-tagging to highlight the skeletal structure of sentences

Is POS Tagging Necessary or Even Helpful for Neural Dependency Parsing?

Combining Context Features by Canonical Belief Network for Chinese Part-Of-Speech Tagging

Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction

Deep Learning based UPoS Tagger for Assamese Religious Text

Study on Japanese Word Segmentation and POS Tagging Based on Rules and Statistics