EnML: Multi-label Ensemble Learning for Urdu Text Classification

Faiza Mehmood, Rehab Shahzadi, Hina Ghafoor, Muhammad Nabeel Asim, Muhammad Usman Ghani, Waqar Mahmood, Andreas Dengel
DOI: https://doi.org/10.1145/3616111
IF: 1.471
2023-08-28
ACM Transactions on Asian and Low-Resource Language Information Processing
Abstract: The exponential growth of electronic data requires advanced multi-label classification approaches for the development of natural language processing (NLP) applications such as recommendation systems, drug reaction detection, hate speech detection, and opinion recognition/mining. To date, several machine learning and deep learning based multi-label classification methodologies have been proposed for English, French, German, Chinese, Arabic, and other developed languages. Urdu is the 11th largest language in the world, yet it has no computer-aided multi-label textual news classification approach. Unlike other languages, Urdu lacks multi-label text classification datasets that can be used to benchmark the performance of existing machine and deep learning methodologies. With the aim of accelerating research on Urdu multi-label text classification applications, this paper makes four contributions. First, it provides a manually annotated multi-label textual news classification dataset for the Urdu language. Second, it benchmarks the performance of traditional machine learning approaches, in particular by adapting three data transformation approaches along with three top-performing machine learning classifiers and four algorithm adaptation based approaches. Third, it benchmarks the performance of 16 existing deep learning approaches and the 4 most widely used language models. Finally, it provides an ensemble approach that reaps the benefits of three different deep learning architectures to precisely predict the different classes associated with a particular Urdu textual document. Experimental results reveal that the proposed ensemble approach's performance values (87% accuracy, 92% F1 score, and 8% Hamming loss) are significantly better than those of the adapted machine and deep learning based approaches.
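The three reported metrics can be illustrated with a minimal sketch. The abstract does not state which accuracy or F1 variant was used, so exact-match (subset) accuracy and micro-averaged F1 are assumptions here, and the class names and label sets below are invented toy data, not drawn from the paper's dataset.

```python
# Toy illustration of multi-label metrics; all data below is hypothetical.
CLASSES = ["sports", "politics", "business", "health"]

def to_indicator(label_sets, classes):
    """Turn a list of label sets into 0/1 rows, one column per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]

def hamming_loss(y_true, y_pred):
    """Fraction of individual (document, class) decisions that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total

def subset_accuracy(y_true, y_pred):
    """Fraction of documents whose whole label set is predicted exactly."""
    exact = sum(row_t == row_p for row_t, row_p in zip(y_true, y_pred))
    return exact / len(y_true)

def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives pooled over all classes."""
    tp = fp = fn = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            tp += t == 1 and p == 1
            fp += t == 0 and p == 1
            fn += t == 1 and p == 0
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Four hypothetical news documents: gold labels versus predicted labels.
Y_TRUE = to_indicator(
    [{"sports"}, {"politics", "business"}, {"health"}, {"sports", "business"}],
    CLASSES)
Y_PRED = to_indicator(
    [{"sports"}, {"politics"}, {"health", "business"}, {"sports", "business"}],
    CLASSES)

print(hamming_loss(Y_TRUE, Y_PRED))     # lower is better
print(subset_accuracy(Y_TRUE, Y_PRED))  # higher is better
print(micro_f1(Y_TRUE, Y_PRED))         # higher is better
```

Hamming loss counts errors per label decision, so lower values are better, whereas accuracy and F1 improve as they rise.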
computer science, artificial intelligence