Enriching Urdu NER with BERT Embedding, Data Augmentation, and Hybrid Encoder-CNN Architecture

Anil Ahmed, Degen Huang, Syed Yasser Arafat, Imran Hameed

DOI: https://doi.org/10.1145/3648362

IF: 1.471

2024-02-15

ACM Transactions on Asian and Low-Resource Language Information Processing

Abstract: Named Entity Recognition (NER) is an indispensable component of Natural Language Processing (NLP), which aims to identify and classify entities within text data. While Deep Learning (DL) models have excelled in NER for well-resourced languages like English, Spanish, and Chinese, they face significant hurdles when dealing with low-resource languages like Urdu. These challenges stem from the intricate linguistic characteristics of Urdu, including morphological diversity and a context-dependent lexicon, and from the scarcity of training data. This study addresses these issues by focusing on Urdu Named Entity Recognition (U-NER) and introducing three key contributions. First, various pre-trained embedding methods are employed, encompassing Word2vec (W2V), GloVe, FastText, Bidirectional Encoder Representations from Transformers (BERT), and Embeddings from Language Models (ELMo). In particular, fine-tuning is performed on BERT BASE and ELMo using Urdu Wikipedia and news articles. Second, a novel generative Data Augmentation (DA) technique replaces Named Entities (NEs) with mask tokens, employing pre-trained masked language models to predict masked tokens, effectively expanding the training dataset. Finally, the study introduces a novel hybrid model combining a Transformer Encoder with a Convolutional Neural Network (CNN) to capture the intricate morphology of Urdu. These modules enable the model to handle polysemy, extract short and long-range dependencies, and enhance learning capacity. Empirical experiments demonstrate that the proposed model, incorporating BERT embeddings and an innovative DA approach, attains the highest F1-Score of 93.99%, highlighting its efficacy for the U-NER task.
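The augmentation step described in the abstract (mask each named entity, then let a masked language model propose a replacement) can be illustrated with a minimal sketch. This is not the authors' implementation: the BIO tag scheme, the `[MASK]` token, the `Predictor` signature, and the `toy_predict` stand-in are all assumptions made here so that the example runs without any model. In practice `predict` would wrap a pre-trained masked language model, and some filtering of the predictions would likely be needed, since a model may propose a token that is not an entity of the same type.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"  # assumed mask token; the real one depends on the masked language model

# Hypothetical predictor signature: (tokens containing one MASK, mask index) -> replacement token.
Predictor = Callable[[List[str], int], str]


def entity_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for each entity in a BIO tag sequence."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:      # a new entity begins right after another one
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))   # the open entity ends here
            start = None
        # "I-" tags simply extend the open entity (stray "I-" tags are ignored)
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def augment(tokens: List[str], tags: List[str], predict: Predictor) -> Tuple[List[str], List[str]]:
    """Replace every entity token with a predicted token; the tags are kept unchanged."""
    new_tokens = list(tokens)
    for start, end in entity_spans(tags):
        for i in range(start, end):
            masked = list(new_tokens)
            masked[i] = MASK           # mask one entity token at a time
            new_tokens[i] = predict(masked, i)
    return new_tokens, list(tags)


def toy_predict(masked: List[str], i: int) -> str:
    """Stand-in for a masked language model, so the sketch runs without any model."""
    return "Karachi" if i > 0 and masked[i - 1] == "in" else "Sara"


tokens = ["Ali", "lives", "in", "Lahore"]
tags = ["B-PER", "O", "O", "B-LOC"]
new_tokens, new_tags = augment(tokens, tags, toy_predict)
print(new_tokens)  # ['Sara', 'lives', 'in', 'Karachi']
print(new_tags)    # ['B-PER', 'O', 'O', 'B-LOC']
```

Each augmented sentence keeps the original label sequence, so it can be added to the training set as a new labelled example.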

computer science, artificial intelligence

What problem does this paper attempt to address?

The paper aims to address the challenges in Urdu Named Entity Recognition (U-NER). Despite the excellent performance of deep learning models in resource-rich languages such as English, Spanish, and Chinese, they face significant obstacles when dealing with low-resource languages like Urdu. These challenges stem from Urdu's complex morphological features, context-dependent vocabulary, and the scarcity of training data. To tackle these issues, the study makes three main contributions:

1. **Pre-trained Embedding Methods**: Utilizing various pre-trained embedding methods, including Word2vec, GloVe, FastText, BERT, and ELMo, and fine-tuning BERT and ELMo using Urdu Wikipedia and news article datasets.
2. **Data Augmentation Techniques**: Proposing a novel data augmentation method that replaces named entities with mask tokens and uses a pre-trained masked language model to predict the masked tokens, thereby effectively expanding the training dataset.
3. **Hybrid Encoder-CNN Architecture**: Introducing a new hybrid model that combines Transformer encoders and Convolutional Neural Networks (CNN) to capture Urdu's complex morphological features. This architecture enables the model to handle polysemy, extract short-range and long-range dependencies, and enhance learning capabilities.

Experimental results show that the proposed model, combined with BERT embeddings and the data augmentation method, achieved the highest F1 score (93.99%) on the U-NER task, highlighting its effectiveness.
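For readers unfamiliar with the reported metric, the following minimal sketch shows how an F1 score can be computed for NER output, as the harmonic mean of precision P and recall R, F1 = 2PR / (P + R). It assumes entity-level exact matching over BIO tags; the summary does not state which evaluation variant the paper uses, so this is an illustration of the metric and not the authors' evaluation script. The function names and the toy tag sequences are hypothetical.

```python
from typing import List, Set, Tuple


def extract_entities(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Collect (label, start, end) triples, end exclusive, from a BIO tag sequence."""
    entities: Set[Tuple[str, int, int]] = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing entity
        inside = tag.startswith("I-") and start is not None and tag[2:] == label
        if not inside:
            if start is not None:
                entities.add((label, start, i))
                start, label = None, None
            if tag.startswith("B-"):  # stray "I-" tags are ignored in this sketch
                start, label = i, tag[2:]
    return entities


def f1_score(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: an entity counts only if its label and both boundaries match."""
    gold_entities = extract_entities(gold)
    pred_entities = extract_entities(pred)
    true_positives = len(gold_entities & pred_entities)
    precision = true_positives / len(pred_entities) if pred_entities else 0.0
    recall = true_positives / len(gold_entities) if gold_entities else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
score = f1_score(gold, pred)
print(round(score, 4))  # 0.6667 (precision 1.0, recall 0.5)
```

In the toy example the prediction finds one of the two gold entities and makes no false predictions, so precision is 1.0, recall is 0.5, and F1 is 2/3.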

A deep learning approach for Named Entity Recognition in Urdu language

Using Data Augmentation and Bidirectional Encoder Representations from Transformers for Improving Punjabi Named Entity Recognition

Enhancing Low Resource NER Using Assisting Language And Transfer Learning

Classical Arabic Named Entity Recognition Using Variant Deep Neural Network Architectures and BERT

UTRNet: High-Resolution Urdu Text Recognition In Printed Documents

ET-Network: A novel efficient transformer deep learning model for automated Urdu handwritten text recognition

Bidirectional Encoder–Decoder Model for Arabic Named Entity Recognition

What Matters for Neural Cross-Lingual Named Entity Recognition: An Empirical Analysis

A deep neural network-based model for named entity recognition for Hindi language

A Novel Deep Auto-Encoder Based Linguistics Clustering Model for Social Text

Contextually Enriched Meta-Learning Ensemble Model for Urdu Sentiment Analysis

mucAI at WojoodNER 2024: Arabic Named Entity Recognition with Nearest Neighbor Search

Neural Named Entity Recognition from Subword Units

Clinical presentation and mutation analysis of VHL disease in a large Chinese family

Efficient Urdu Caption Generation using Attention based LSTM

BE-BLC: BERT-ELMO-Based Deep Neural Network Architecture for English Named Entity Recognition Task

Towards Lingua Franca Named Entity Recognition with BERT

Sentiment Analysis Based on Urdu Reviews Using Hybrid Deep Learning Models

Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations

Improving Feature Extraction Using a Hybrid of CNN and LSTM for Entity Identification