Comparison of Topic Modelling Approaches in the Banking Context

Bayode Ogunleye, Tonderai Maswera, Laurence Hirsch, Jotham Gaudoin, Teresa Brunsdon
DOI: https://doi.org/10.3390/app13020797
2024-02-06
Abstract: Topic modelling is a prominent task for automatic topic extraction in many applications such as sentiment analysis and recommendation systems. The approach is vital for service industries to monitor their customer discussions. Traditional approaches such as Latent Dirichlet Allocation (LDA) have shown great performance for topic discovery; however, their results are not consistent, as these approaches suffer from data sparseness and an inability to model the word order in a document. Thus, this study presents the use of Kernel Principal Component Analysis (KernelPCA) and K-means Clustering in the BERTopic architecture. We have prepared a new dataset using tweets from customers of Nigerian banks and we use this to compare the topic modelling approaches. Our findings showed that KernelPCA and K-means in the BERTopic architecture produced coherent topics with a coherence score of 0.8463.
Information Retrieval, Artificial Intelligence, Machine Learning, Computation
What problem does this paper attempt to address?
This paper compares topic modelling methods in a banking context. Traditional methods such as Latent Dirichlet Allocation (LDA) have performed well in the past, but their results are inconsistent because they suffer from data sparsity and cannot take word order into account. To address these issues, the authors propose using Kernel Principal Component Analysis (KernelPCA) and K-means clustering within the BERTopic architecture to extract topics.

To compare the methods experimentally, the authors created a new dataset of tweets from customers of Nigerian banks. The topics generated by KernelPCA and K-means clustering within the BERTopic architecture were coherent, with a coherence score of 0.8463.

The paper also reviews the history of topic modelling, from Latent Semantic Indexing (LSI) to transformer-based language models such as BERT, and highlights the advantages and limitations of each method, particularly in handling data sparsity in social media posts and other short texts. Finally, it describes the experimental method, including data preprocessing, the algorithms used, and evaluation metrics such as the coherence score, and reports the performance of the different models.
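
To make the coherence score mentioned above more concrete, the following is a minimal sketch of one common coherence measure, the average normalised pointwise mutual information (NPMI) over a topic's word pairs. It is an illustration under stated assumptions, not the authors' implementation: the text above does not say which coherence variant produced 0.8463, so the choice of NPMI, the function name `npmi_coherence`, and the toy tokenised tweets are all hypothetical.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of one topic.

    Probabilities are estimated from document-level co-occurrence:
    p(w) is the fraction of documents that contain w. The result lies
    in [-1, 1]; higher values mean the topic words tend to appear
    together. This is an assumed variant, chosen for illustration.
    """
    if len(topic_words) < 2:
        raise ValueError("a topic needs at least two words")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            # The pair never co-occurs: lowest NPMI by convention.
            scores.append(-1.0)
        elif p12 == 1.0:
            # Both words occur in every document, so PMI is 0 and the
            # normaliser is 0 as well; scoring 0 is a choice made here.
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical, already-tokenised tweets (not taken from the paper's dataset).
docs = [
    ["atm", "card", "declined"],
    ["atm", "card", "blocked"],
    ["transfer", "failed", "app"],
    ["app", "login", "failed"],
]

coherent_topic = npmi_coherence(["atm", "card"], docs)
mixed_topic = npmi_coherence(["atm", "card", "app"], docs)
print(coherent_topic, mixed_topic)
```

In this toy corpus the words of the first topic always appear together, so it scores higher than the mixed topic, whose words are drawn from unrelated tweets. Scores from different coherence variants are not directly comparable, so a figure such as 0.8463 should only be compared against scores computed with the same measure.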