End to End Urdu Abstractive Text Summarization With Dataset and Improvement in Evaluation Metric

Hassan Raza, Waseem Shahzad
DOI: https://doi.org/10.1109/access.2024.3377463
IF: 3.9
2024-03-22
IEEE Access
Abstract: Urdu, although widely spoken in South Asia, has received far less attention in language processing than more advanced languages. In Natural Language Processing (NLP), text summarization is an important task because it requires comprehending textual content and generating a concise summary. Summarization can be either extractive or abstractive. Considerable effort has gone into extractive summarization techniques, and this paper examines and explains their limitations. Abstractive summarization for Urdu, by contrast, remains largely unexplored; the paper also addresses the challenges and underlying factors that have impeded progress in this domain. The paper focuses specifically on abstractive summarization of Urdu using supervised learning, which requires a labeled dataset of Urdu text paired with abstractive summaries; such a dataset has been prepared for this purpose. Summary generation results are reported in terms of ROUGE scores. A Transformer encoder-decoder network was employed to generate abstractive summaries in Urdu, yielding a ROUGE-1 score of 25.18. Moreover, a novel context-aware evaluation metric called the "disconnection rate" is introduced to improve the assessment of a summary, known as the Context Aware RoBERTa Score.
computer science, information systems; telecommunications; engineering, electrical & electronic
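
The ROUGE-1 figure quoted in the abstract measures unigram overlap between a generated summary and a reference summary. The following is a minimal sketch of that idea, not the authors' evaluation code: the function name `rouge1` is hypothetical, tokenization is plain whitespace splitting (an assumption; the paper's Urdu tokenization and exact ROUGE variant are not stated here), and the values are returned on a 0 to 1 scale.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall and F1 (illustrative sketch).

    Assumption: tokens are separated by whitespace. A real Urdu pipeline
    would likely need a proper word segmenter and text normalization.
    """
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())

    # Clipped overlap: each token counts at most as often as it
    # appears in both the reference and the candidate.
    overlap = sum((ref_counts & cand_counts).values())

    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())

    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy English strings for readability; Urdu strings work the same way
    # as long as words are whitespace-separated.
    print(rouge1("the cat sat on the mat", "the cat lay on a mat"))
```

Because this sketch matches surface tokens only, a paraphrase that preserves meaning but changes wording scores poorly, which is the kind of limitation a context-aware metric is meant to address.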