Enriching Urdu NER with BERT Embedding, Data Augmentation, and Hybrid Encoder-CNN Architecture

Anil Ahmed, Degen Huang, Syed Yasser Arafat, Imran Hameed
DOI: https://doi.org/10.1145/3648362
IF: 1.471
2024-02-15
ACM Transactions on Asian and Low-Resource Language Information Processing
Abstract: Named Entity Recognition (NER) is an indispensable component of Natural Language Processing (NLP), which aims to identify and classify entities within text data. While Deep Learning (DL) models have excelled in NER for well-resourced languages like English, Spanish, and Chinese, they face significant hurdles when dealing with low-resource languages like Urdu. These challenges stem from the intricate linguistic characteristics of Urdu, including morphological diversity, context-dependent lexicon, and the scarcity of training data. This study addresses these issues by focusing on Urdu Named Entity Recognition (U-NER) and introducing three key contributions. First, various pre-trained embedding methods are employed, encompassing Word2vec (W2V), GloVe, FastText, Bidirectional Encoder Representations from Transformers (BERT), and Embeddings from Language Models (ELMo). In particular, fine-tuning is performed on BERT-Base and ELMo using Urdu Wikipedia and news articles. Second, a novel generative Data Augmentation (DA) technique replaces Named Entities (NEs) with mask tokens, employing pre-trained masked language models to predict masked tokens, effectively expanding the training dataset. Finally, the study introduces a novel hybrid model combining a Transformer Encoder with a Convolutional Neural Network (CNN) to capture the intricate morphology of Urdu. These modules enable the model to handle polysemy, extract short and long-range dependencies, and enhance learning capacity. Empirical experiments demonstrate that the proposed model, incorporating BERT embeddings and an innovative DA approach, attains the highest F1-Score of 93.99%, highlighting its efficacy for the U-NER task.
computer science, artificial intelligence
What problem does this paper attempt to address?
The paper aims to address the challenges in Urdu Named Entity Recognition (U-NER). Despite the excellent performance of deep learning models in resource-rich languages such as English, Spanish, and Chinese, they face significant obstacles when dealing with low-resource languages like Urdu. These challenges stem from Urdu's complex morphological features, context-dependent vocabulary, and the scarcity of training data. To tackle these issues, the study proposes the following three main contributions:

1. **Pre-trained Embedding Methods**: Utilizing various pre-trained embedding methods, including Word2vec, GloVe, FastText, BERT, and ELMo, and fine-tuning BERT and ELMo using Urdu Wikipedia and news article datasets.
2. **Data Augmentation Techniques**: Proposing a novel data augmentation method by replacing named entities with mask tokens and using a pre-trained masked language model to predict the mask tokens, thereby effectively expanding the training dataset.
3. **Hybrid Encoder-CNN Architecture**: Introducing a new hybrid model that combines Transformer encoders and Convolutional Neural Networks (CNN) to capture Urdu's complex morphological features. This architecture enables the model to handle polysemy, extract short-range and long-range dependencies, and enhance learning capabilities.

Experimental results show that the proposed model, combined with BERT embeddings and innovative data augmentation methods, achieved the highest F1 score (93.99%) on the U-NER task, highlighting its effectiveness.
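
The masking-based data augmentation idea can be illustrated with a minimal sketch. It is not the authors' implementation: the BIO tag scheme, the `[MASK]` string, the `augment` and `toy_fill_mask` names, masking one entity token at a time, and reusing the original tags for the predicted tokens are all assumptions made for illustration. The predictor is a hypothetical stand-in; in practice a pre-trained masked language model would be plugged in at that point.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"  # assumed mask token; the real one depends on the model used

Sentence = List[Tuple[str, str]]  # (token, BIO tag) pairs


def augment(sentence: Sentence,
            fill_mask: Callable[[List[str], int], str]) -> Sentence:
    """Return a new sentence in which every entity token is re-predicted.

    Each token tagged as part of an entity (any tag other than "O") is
    replaced by MASK in turn, and `fill_mask` proposes a substitute.
    Tags are copied unchanged, which assumes the predicted token is an
    entity of the same type.
    """
    tokens = [token for token, _ in sentence]
    tags = [tag for _, tag in sentence]
    new_tokens = list(tokens)
    for i, tag in enumerate(tags):
        if tag == "O":
            continue  # only named-entity positions are masked
        masked = list(new_tokens)
        masked[i] = MASK
        new_tokens[i] = fill_mask(masked, i)
    return list(zip(new_tokens, tags))


def toy_fill_mask(masked_tokens: List[str], index: int) -> str:
    """Hypothetical stand-in for a masked language model prediction."""
    assert masked_tokens[index] == MASK
    return "Sara" if index == 0 else "Karachi"


# Toy English sentence used only to keep the example readable.
example: Sentence = [("Ali", "B-PER"), ("lives", "O"),
                     ("in", "O"), ("Lahore", "B-LOC")]
augmented = augment(example, toy_fill_mask)
print(augmented)
```

The augmented sentence would be added to the training set alongside the original. A real pipeline would likely also need to filter predictions that are not valid entities of the original type, since a masked language model gives no such guarantee.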