Using Data Augmentation and Bidirectional Encoder Representations from Transformers for Improving Punjabi Named Entity Recognition

Hamza Khalid,Ghulam Murtaza,Qaiser Abbas
DOI: https://doi.org/10.1145/3595861
IF: 1.471
2023-05-04
ACM Transactions on Asian and Low-Resource Language Information Processing
Abstract:Named entity recognition (NER) is a task of proper noun identification from natural language text and classification into various types such as location, person, and organization. Due to NER's applications in different natural language processing (NLP) tasks, numerous NER approaches and benchmark datasets have been proposed. However, developing NER techniques for low-resource languages is still limited due to the absence of substantial training datasets. Punjabi is a classic example of low resource language. Although various researchers have worked on Punjabi, they focused on the Gurmukhi script. To overcome the challenges in developing NER for the Shahmukhi script, we present an improved technique for Punjabi NER for the Shahmukhi script in this paper. We firstly extend the existing dataset by adding new NER classes by leveraging a novel Pool of Words data augmentation strategy. Our extended dataset has 11,31,509 tokens and 1,25,789 labeled entities with more named entities (NEs) than the older dataset. In the next step, we fine-tuned a transformer model known as Bidirectional Encoder Representations from Transformers (BERT) for the NER task. We performed experiments using the proposed approach on a new and older dataset version, showing that our method achieved competitive results.
computer science, artificial intelligence
What problem does this paper attempt to address?