Discovering Coherent Topics from Urdu Text.

Mubashar Mustafa, Feng Zeng, Hussain Ghulam, Wenjia Li
2021-01-01
Abstract: Topic modeling (TM), the detection of themes or aspects in documents, is an important text processing method in natural language processing (NLP) that helps users gain insights from a large number of documents. In recent years, many unsupervised models have been used for TM, but these models often produce aspects that are not interpretable. To address this issue, a few semi-supervised methods have been developed that allow users to input prior domain knowledge to produce coherent aspects. Most of them are well adapted to English corpora, but there is very little work on Urdu. TM is challenging for Urdu, which has its own morphological structure, semantics, and syntax. In this paper, we first propose an effective semi-supervised topic model, "Seeded-Urdu Latent Dirichlet Allocation (seeded-ULDA)", for the Urdu language. The model is designed to produce coherent topics while handling the morphological structure of Urdu. The proposed Urdu topic model, seeded-ULDA, combines preprocessing, seeded-LDA, and Gibbs sampling. Second, we introduce word2vec word embeddings for Urdu and discover topics by clustering the semantic space. This work aims to evaluate and compare various topic modeling frameworks on an Urdu news dataset. After comprehensive experiments and evaluation, the results show that word embedding is unable to extract coherent topics in Urdu. On the coherence measure, the proposed seeded-ULDA model improves on the existing ULDA model by more than 39%.
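
To make the combination of seeded-LDA and Gibbs sampling more concrete, the following is a minimal sketch of a seeded LDA fitted with collapsed Gibbs sampling. It is an illustration under stated assumptions, not the authors' seeded-ULDA implementation: the function names (`seeded_lda`, `top_words`), the hyperparameters (`alpha`, `beta`, `seed_boost`), and the toy English tokens standing in for preprocessed Urdu tokens are all hypothetical. Seed words are assumed to act by raising the topic-word prior of their assigned topic, which is one common way to inject prior domain knowledge.

```python
import random


def seeded_lda(docs, seed_words, n_topics, n_iter=200,
               alpha=0.1, beta=0.01, seed_boost=1.0, rng_seed=0):
    """Collapsed Gibbs sampling for LDA with a seed-biased topic-word prior.

    docs: list of token lists (assumed already preprocessed).
    seed_words: dict mapping topic id -> list of seed tokens (hypothetical input).
    Returns (topic-word count matrix, vocabulary list).
    """
    rng = random.Random(rng_seed)
    vocab = sorted({w for doc in docs for w in doc})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Asymmetric prior: beta everywhere, plus seed_boost on each topic's seed words.
    prior = [[beta] * V for _ in range(n_topics)]
    for k, words in seed_words.items():
        for w in words:
            if w in w2i:
                prior[k][w2i[w]] += seed_boost
    prior_sum = [sum(row) for row in prior]

    n_dk = [[0] * n_topics for _ in docs]      # document-topic counts
    n_kw = [[0] * V for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                       # tokens assigned to each topic

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][w2i[w]] += 1
            n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi, k = w2i[w], z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][wi] -= 1
                n_k[k] -= 1
                # Unnormalized full conditional p(z = t | all other assignments).
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][wi] + prior[t][wi]) / (n_k[t] + prior_sum[t])
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][wi] += 1
                n_k[k] += 1
    return n_kw, vocab


def top_words(n_kw, vocab, topic, n=3):
    """Return the n most frequently assigned words for one topic."""
    order = sorted(range(len(vocab)), key=lambda i: -n_kw[topic][i])
    return [vocab[i] for i in order[:n]]


# Toy corpus: English placeholders for preprocessed Urdu news tokens.
docs = [
    ["cricket", "match", "team", "cricket"],
    ["election", "vote", "party", "election"],
    ["team", "match", "score"],
    ["party", "vote", "minister"],
]
seeds = {0: ["cricket", "match"], 1: ["election", "vote"]}
n_kw, vocab = seeded_lda(docs, seeds, n_topics=2)
print(top_words(n_kw, vocab, 0), top_words(n_kw, vocab, 1))
```

With `seed_boost` set to zero the sampler reduces to ordinary unsupervised LDA, so the same sketch can be used to compare seeded and unseeded topics on the same corpus.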