Abstract: Natural language processing (NLP) has advanced rapidly with the rise of deep learning, which significantly outperforms traditional rule-based methods. By capturing hidden patterns and underlying structure in data, deep learning has improved performance across a range of NLP tasks and overcome many limitations of rule-based systems. However, most NLP research and development has concentrated on a select few languages, primarily those with large numbers of speakers or financial significance, leaving many others underexplored. This gap is often attributed to the scarcity of adequately annotated datasets, which are essential for training deep learning models. Despite this challenge, there is potential in leveraging the linguistic similarities between underexplored and well-studied languages, particularly those in close geographic and linguistic proximity. This thesis investigates transfer learning for Part-of-Speech (POS) tagging between Hindi and Nepali, two highly similar languages of the Indo-Aryan family. Specifically, it examines whether jointly training a POS tagging model on both languages improves performance. It also assesses whether multitask learning in Hindi, with gender tagging and number (singular/plural) tagging as auxiliary tasks, improves POS tagging accuracy. The deep learning architecture employed is the BLSTM-CNN-CRF model, trained with three types of word embeddings: monolingual, vector-mapped, and jointly trained Hindi-Nepali embeddings. Dropout rates from 0.25 to 0.5 and two optimizers (Adam and AdaDelta) are also evaluated. Results indicate that jointly trained Hindi-Nepali word embeddings improve performance across all models compared to monolingual and vector-mapped embeddings.

Can Perplexity Predict Fine-Tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali

Fine-Tuning Small Embeddings for Elevated Performance

Development of Pre-Trained Transformer-based Models for the Nepali Language

A Systematic Analysis of Vocabulary and BPE Settings for Optimal Fine-tuning of NMT: A Case Study of In-domain Translation

Whisper Finetuning on Nepali Language

When Every Token Counts: Optimal Segmentation for Low-Resource Language Models

CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code

Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages

Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages

SymNoise: Advancing Language Model Fine-tuning with Symmetric Noise

Tokenization Falling Short: On Subword Robustness in Large Language Models

On Significance of Subword tokenization for Low Resource and Efficient Named Entity Recognition: A case study in Marathi

Retrofitting (Large) Language Models with Dynamic Tokenization

Understanding and Mitigating Tokenization Bias in Language Models

Getting the most out of your tokenizer for pre-training and domain adaptation

A Kernel-Based View of Language Model Fine-Tuning

Empirical Analysis of Efficient Fine-Tuning Methods for Large Pre-Trained Language Models

Exploring transfer learning for Deep NLP systems on rarely annotated languages

Performance Evaluation of Tokenizers in Large Language Models for the Assamese Language

Optimizing Segmentation Granularity for Neural Machine Translation

Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models