Abstract: Probabilistic topic models are unsupervised generative models that describe document content as a two-step generation process: documents are modeled as mixtures of latent concepts, or topics, and topics are probability distributions over vocabulary words. Recently, significant research effort has gone into transferring probabilistic topic modeling from monolingual to multilingual settings, and novel topic models have been designed to work with parallel and comparable texts. We define multilingual probabilistic topic modeling (MuPTM) and present the first full overview of the current research, methodology, advantages and limitations in MuPTM. As a representative example, we choose bilingual LDA (BiLDA), a natural extension of the widely used LDA model to multilingual settings, and give a thorough overview of it, from its high-level modeling assumptions down to its mathematical foundations. We demonstrate how the output of a multilingual probabilistic topic model, namely the sets of (i) per-topic word distributions and (ii) per-document topic distributions, can serve as a data representation in various real-life cross-lingual tasks involving different languages, without any external language-pair-dependent translation resource: (1) cross-lingual event-centered news clustering, (2) cross-lingual document classification, (3) cross-lingual semantic similarity, and (4) cross-lingual information retrieval. We also briefly review several other applications from the relevant literature, and introduce and illustrate two related modeling concepts: topic smoothing and topic pruning. In summary, this article covers the current research in multilingual probabilistic topic modeling. By presenting a series of potential applications, we show the importance of the language-independent and language-pair-independent data representations obtained by means of MuPTM. We give clear directions for future research in the field through a systematic overview of how to link and transfer aspect knowledge across corpora written in different languages via the shared space of latent cross-lingual topics, that is, how to effectively employ the learned per-topic word distributions and per-document topic distributions of any multilingual probabilistic topic model in various cross-lingual applications.
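
As a rough illustration of the cross-lingual semantic similarity use case mentioned in the abstract, the sketch below compares words from two languages through a shared set of latent topics. It is not the article's method. The vocabularies, the probability values, the uniform topic prior, and the helper names (`topic_vector`, `cosine`, `most_similar`) are all assumptions made for illustration, and cosine similarity is only one of several measures that could be used here.

```python
import math

# Hypothetical per-topic word distributions P(word | topic) for K = 2 shared
# cross-lingual topics. The vocabularies and all numbers are invented.
phi_en = [
    {"team": 0.6, "goal": 0.3, "election": 0.1},    # topic 0 (sports-like)
    {"team": 0.1, "goal": 0.1, "election": 0.8},    # topic 1 (politics-like)
]
phi_es = [
    {"equipo": 0.7, "gol": 0.2, "elecciones": 0.1},    # topic 0
    {"equipo": 0.05, "gol": 0.15, "elecciones": 0.8},  # topic 1
]


def topic_vector(word, phi):
    """Return P(topic | word), assuming a uniform prior over topics."""
    scores = [topic[word] for topic in phi]
    total = sum(scores)
    return [s / total for s in scores]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, phi_src, phi_tgt):
    """Pick the target-language word whose topic vector is closest to the source word's."""
    src = topic_vector(word, phi_src)
    candidates = list(phi_tgt[0].keys())
    return max(candidates, key=lambda w: cosine(src, topic_vector(w, phi_tgt)))


print(most_similar("election", phi_en, phi_es))
print(most_similar("team", phi_en, phi_es))
```

Because both languages are represented in the same K-dimensional topic space, no bilingual dictionary is consulted at any point. The only shared resource is the set of topics itself.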

Multilingual transformer and BERTopic for short text topic modeling: The case of Serbian

BERTić -- The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian

Novi jezički modeli za srpski jezik [New Language Models for Serbian]

Advancing natural language processing (NLP) applications of morphologically rich languages with bidirectional encoder representations from transformers (BERT): an empirical case study for Turkish

Unveiling the Potential of BERTopic for Multilingual Fake News Analysis -- Use Case: Covid-19

Multilingual and Multimodal Topic Modelling with Pretrained Embeddings

Cross-lingual Transfer of Sentiment Classifiers

Multilingual text categorization and sentiment analysis: a comparative analysis of the utilization of multilingual approaches for classifying twitter data

Probing Pretrained Language Models for Lexical Semantics

Multi-task Learning for Cross-Lingual Sentiment Analysis

A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts

Towards Fully Bilingual Deep Language Modeling

Probabilistic Topic Modeling in Multilingual Settings: an Overview of Its Methodology and Applications.

A Survey of Resources and Methods for Natural Language Processing of Serbian Language

Synthetic Dataset Creation and Fine-Tuning of Transformer Models for Question Answering in Serbian

Empowering Interdisciplinary Research with BERT-Based Models: An Approach Through SciBERT-CNN with Topic Modeling

New Textual Corpora for Serbian Language Modeling

T-BERT -- Model for Sentiment Analysis of Micro-blogs Integrating Topic Model and BERT

Evaluating Transferability of BERT Models on Uralic Languages

Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining

Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario