Abstract: Topic modelling is a technique for discovering latent topics in text collections. However, semantics are essential to understanding text content correctly and generating a meaningful topic list. Most existing topic modelling approaches ignore semantics, that is, they do not attempt to grasp the meaning of words, and can therefore generate meaningless topic words. Even existing semantic-based approaches usually interpret the meanings of words without considering the context and related words. In this article, we introduce a semantic-based topic model called semantic-LDA that captures the semantics of words in a text collection using concepts from an external ontology. A new method is introduced to identify and quantify concept–word relationships by matching words from the input text collection with concepts from an ontology, without using the ontology's pre-calculated values that quantify the relationships between words and concepts. These pre-calculated values may not reflect the actual relationships between words and concepts for the input collection, because they are derived from the datasets used to build the ontology rather than from the input collection itself. Quantifying the relationships based on the word distribution in the input collection is instead more realistic and more useful for capturing semantics. Furthermore, an ambiguity handling mechanism is introduced to interpret unmatched words, that is, words for which there are no matching concepts in the ontology. The main contribution of this article is thus a semantic-based topic model that calculates the word–concept relationships directly from the input text collection. The proposed model and an enhanced version with the disambiguation mechanism were evaluated against a set of state-of-the-art baseline systems, and both outperformed the baselines in topic quality and information filtering evaluations.
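
The idea of deriving concept–word relationships from the input collection, rather than from values stored in the ontology, can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything here is an assumption made for illustration: the relative-frequency weighting, the toy ontology, and the function names `concept_word_weights` and `unmatched_words` are hypothetical and should not be read as the semantic-LDA method itself.

```python
from collections import Counter

# Hypothetical toy ontology (assumption): concept -> words that can express it.
ONTOLOGY = {
    "finance": {"bank", "loan", "interest"},
    "river": {"bank", "water", "shore"},
}


def word_counts(documents):
    """Count whitespace-separated, lower-cased words over the whole collection."""
    return Counter(word for doc in documents for word in doc.lower().split())


def concept_word_weights(documents, ontology):
    """Weight each matched (concept, word) pair by the word's share of the
    concept's matched occurrences in the input collection.

    Assumed weighting for illustration; the paper's formula may differ.
    """
    counts = word_counts(documents)
    weights = {}
    for concept, words in ontology.items():
        # Keep only ontology words that actually occur in the collection.
        matched = {w: counts[w] for w in words if counts[w] > 0}
        total = sum(matched.values())
        weights[concept] = {w: c / total for w, c in matched.items()} if total else {}
    return weights


def unmatched_words(documents, ontology):
    """Return collection words that match no concept in the ontology."""
    known = set().union(*ontology.values())
    return set(word_counts(documents)) - known
```

In this sketch the weights depend only on how often each word occurs in the given documents, so the same ontology yields different concept–word weights for different collections. Words returned by `unmatched_words` are the ones a disambiguation step would still need to interpret.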
