Abstract: Topic modelling is a useful technique for discovering latent topics in text collections. To understand the text content correctly and generate a meaningful topic list, however, semantics are important. Most existing topic modelling approaches ignore semantics, that is, they do not attempt to grasp the meaning of words, and can therefore generate meaningless topic words. Even existing semantic-based approaches usually interpret the meanings of words without considering their context and related words. In this article, we introduce a semantic-based topic model called semantic-LDA that captures the semantics of words in a text collection using concepts from an external ontology. We introduce a new method that identifies and quantifies concept–word relationships by matching words from the input text collection with concepts from the ontology, without relying on the ontology's pre-calculated values for those relationships. These pre-calculated values may not reflect the actual relationships between words and concepts in the input collection, because they are derived from the datasets used to build the ontology rather than from the input collection itself. Quantifying the relationships from the word distribution in the input collection is instead more realistic and more beneficial to the semantic capture process. Furthermore, an ambiguity handling mechanism is introduced to interpret unmatched words, that is, words for which there are no matching concepts in the ontology. The main contribution of this article is thus a semantic-based topic model that calculates word–concept relationships directly from the input text collection. The proposed model and an enhanced version with the disambiguation mechanism were evaluated against a set of state-of-the-art baseline systems, and both outperformed the baselines in topic quality and information filtering evaluations.
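The idea of deriving concept–word weights from the input collection rather than from the ontology can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy ontology, the function name `concept_word_weights`, and in particular the weighting by relative word frequency, since the abstract does not state the formula that semantic-LDA actually uses. Words with no matching concept are collected separately, as a stand-in for the input to the ambiguity handling step.

```python
from collections import Counter

# Hypothetical toy ontology: each concept maps to the words linked to it.
ONTOLOGY = {
    "finance": {"bank", "loan", "interest"},
    "river": {"bank", "water", "shore"},
}

# Hypothetical tokenised input collection.
DOCS = [
    ["bank", "loan", "bank", "interest"],
    ["water", "bank", "shore", "fishing"],
]


def concept_word_weights(documents, ontology):
    """Weight each matched word under a concept by its share of the
    corpus occurrences of all words matched to that concept.

    This relative-frequency weighting is an illustrative assumption,
    not the formula from the article.
    """
    # Word distribution of the input collection itself.
    counts = Counter(word for doc in documents for word in doc)

    weights = {}
    unmatched = set(counts)
    for concept, words in ontology.items():
        # Keep only ontology words that actually occur in the collection.
        matched = {w: counts[w] for w in words if w in counts}
        total = sum(matched.values())
        weights[concept] = (
            {w: c / total for w, c in matched.items()} if total else {}
        )
        unmatched -= set(matched)

    # Words left over have no matching concept in the ontology.
    return weights, unmatched


weights, unmatched = concept_word_weights(DOCS, ONTOLOGY)
```

In this toy collection the word "bank" matches both concepts, which is the kind of case where context would be needed to choose a sense, and "fishing" ends up in `unmatched` because no concept in the toy ontology covers it.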

A Semantics-enhanced Topic Modelling Technique: Semantic-LDA