Abstract: Part-of-speech (POS) tagging, though often considered a preliminary step in Natural Language Processing (NLP), is crucial, especially for a low-resource language like Khasi that lacks any formal corpus. POS tagging is context sensitive, which makes the task challenging. In this paper, we investigate a deep learning approach to POS tagging in Khasi. A deep learning model, the Robustly Optimized BERT Pretraining Approach (RoBERTa), is pretrained on a language modelling task. We then create RoBERTa for POS (RoPOS) tagging, a model that performs POS tagging by fine-tuning the pretrained RoBERTa and leveraging its embeddings for downstream POS tagging. The existing tagset custom-designed for the Khasi language is employed in this work, and the corresponding tagged dataset is taken as our base corpus. Further, we propose additional tags to this existing tagset to meet the requirements of the language, and we increase the size of the existing Khasi POS corpus. Other machine learning and deep learning models are also evaluated on the same task, and a comparative analysis of the models is presented. Two different setups are used for the RoPOS model, and the best testing accuracy achieved is 92 per cent. The comparative analysis indicates that RoPOS outperforms the other models when used for inference on texts outside the domain of the POS-tagged training dataset.

Deep Clustering of Text Representations for Supervision-Free Probing of Syntax

Do Syntactic Probes Probe Syntax? Experiments with Jabberwocky Probing

Bird's Eye: Probing for Linguistic Graph Structures with a Simple Information-Theoretic Approach

Universal and Independent: Multilingual Probing Framework for Exhaustive Model Interpretation and Evaluation

Wave to Syntax: Probing spoken language models for syntax

When Does Syntax Mediate Neural Language Model Performance? Evidence from Dropout Probes

Cluster-norm for Unsupervised Probing of Knowledge

A Latent-Variable Model for Intrinsic Probing

Probing a pretrained RoBERTa on Khasi language for POS tagging

Decoding Probing: Revealing Internal Linguistic Structures in Neural Language Models using Minimal Pairs

Probing Pretrained Language Models for Lexical Semantics

A Matter of Framing: The Impact of Linguistic Formalism on Probing Results

Modelling the Lexicon in Unsupervised Part of Speech Induction

Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training

Syntax-augmented Multilingual BERT for Cross-lingual Transfer

Dynamic Syntax Mapping: A New Approach to Unsupervised Syntax Parsing

A Cascaded Unsupervised Model for PoS Tagging

On Eliciting Syntax from Language Models via Hashing

Latent Causal Probing: A Formal Perspective on Probing with Causal Models of Data

Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance?

The Better Your Syntax, the Better Your Semantics? Probing Pretrained Language Models for the English Comparative Correlative