Multilingual transformer and BERTopic for short text topic modeling: The case of Serbian

Darija Medvecki, Bojana Bašaragin, Adela Ljajić, Nikola Milošević
DOI: https://doi.org/10.1007/978-3-031-50755-7_16
2024-02-05
Abstract: This paper presents the results of the first application of BERTopic, a state-of-the-art topic modeling technique, to short text written in a morphologically rich language. We applied BERTopic with three multilingual embedding models on two levels of text preprocessing (partial and full) to evaluate its performance on partially preprocessed short text in Serbian. We also compared it to LDA and NMF on fully preprocessed text. The experiments were conducted on a dataset of tweets expressing hesitancy toward COVID-19 vaccination. Our results show that with adequate parameter setting, BERTopic can yield informative topics even when applied to partially preprocessed short text. When the same parameters are applied in both preprocessing scenarios, the performance drop on partially preprocessed text is minimal. Compared to LDA and NMF, judging by the keywords, BERTopic offers more informative topics and gives novel insights when the number of topics is not limited. The findings of this paper can be significant for researchers working with other morphologically rich low-resource languages and short text.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses two questions:

1. **The influence of preprocessing on the quality of BERTopic topic representations**: The authors examine how preprocessing steps affect the quality of the topics that BERTopic generates for a morphologically rich language, Serbian. Specifically, they test whether BERTopic still produces meaningful topics under minimal preprocessing (for example, without lemmatization). Skipping lemmatization reduces manual intervention and avoids errors introduced by inaccurate lemmatization algorithms, which helps preserve the integrity of the topic structure.
2. **Comparison between BERTopic and traditional topic models (LDA and NMF) on fully preprocessed text**: The authors also run comparative experiments to evaluate BERTopic against the traditional topic models LDA and NMF on fully preprocessed text. They focus on whether BERTopic yields richer, more informative topic keywords and whether it reveals new insights.

Answering these two questions matters for research on short-text topic modeling in morphologically rich, low-resource languages. The results can serve as a reference for other researchers working with similar languages and short-text datasets.
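The difference between the two preprocessing levels can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the exact cleaning steps, the stop-word set `STOP_WORDS`, and the lookup table `LEMMAS` are hypothetical stand-ins chosen for illustration, and a real setup would use a full Serbian stop-word list and a proper lemmatizer. The sketch assumes that partial preprocessing means light tweet cleaning only, while full preprocessing adds stop-word removal and lemmatization.

```python
import re

# Hypothetical toy resources (assumptions for illustration only); a real
# pipeline would use a full Serbian stop-word list and a proper lemmatizer.
STOP_WORDS = {"je", "da", "se", "i", "u", "na", "za"}
LEMMAS = {
    "vakcinu": "vakcina",
    "vakcine": "vakcina",
    "želim": "želeti",
    "primim": "primiti",
}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_RE = re.compile(r"[^\w\s]")


def partial_preprocess(tweet: str) -> str:
    """Light cleaning: drop URLs, mentions, and punctuation, then lowercase."""
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return " ".join(text.lower().split())


def full_preprocess(tweet: str) -> str:
    """Partial cleaning plus stop-word removal and toy dictionary lemmatization."""
    tokens = partial_preprocess(tweet).split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Tokens missing from the toy table are kept unchanged.
    return " ".join(LEMMAS.get(t, t) for t in tokens)


tweet = "@korisnik Ne želim da primim vakcinu! https://example.com"
print(partial_preprocess(tweet))  # inflected forms are kept
print(full_preprocess(tweet))     # stop word removed, known forms lemmatized
```

With partial preprocessing the inflected forms reach the topic model unchanged, so any lemmatizer mistakes are avoided at the cost of a larger vocabulary. Full preprocessing collapses those forms, but only as accurately as the lemmatizer allows.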