Abstract:We address two challenges of probabilistic topic modelling in order to better estimate the probability of a word in a given context, i.e., P(word|context): (1) No Language Structure in Context: Probabilistic topic models ignore word order by summarizing a given context as a "bag-of-word" and consequently the semantics of words in the context is lost. The LSTM-LM learns a vector-space representation of each word by accounting for word order in local collocation patterns and models complex characteristics of language (e.g., syntax and semantics), while the TM simultaneously learns a latent representation from the entire document and discovers the underlying thematic structure. We unite two complementary paradigms of learning the meaning of word occurrences by combining a TM (e.g., DocNADE) and a LM in a unified probabilistic framework, named as ctx-DocNADE. (2) Limited Context and/or Smaller training corpus of documents: In settings with a small number of word occurrences (i.e., lack of context) in short text or data sparsity in a corpus of few documents, the application of TMs is challenging. We address this challenge by incorporating external knowledge into neural autoregressive topic models via a language modelling approach: we use word embeddings as input of a LSTM-LM with the aim to improve the word-topic mapping on a smaller and/or short-text corpus. The proposed DocNADE extension is named as ctx-DocNADEe. We present novel neural autoregressive topic model variants coupled with neural LMs and embeddings priors that consistently outperform state-of-the-art generative TMs in terms of generalization (perplexity), interpretability (topic coherence) and applicability (retrieval and classification) over 6 long-text and 8 short-text datasets from diverse domains.

Deep Autoencoding Topic Model With Scalable Hybrid Bayesian Inference

Hierarchical Topic Modeling with Nested Hierarchical Dirichlet Process

Max-Margin Deep Diverse Latent Dirichlet Allocation With Continual Learning

Deep Latent Dirichlet Allocation with Topic-Layer-Adaptive Stochastic Gradient Riemannian MCMC

Mining Coherent Topics in Documents Using Word Embeddings and Large-Scale Text Data

A Deep and Autoregressive Approach for Topic Modeling of Multimodal Data

Nonparametric Topic Modeling with Neural Inference

Scalable Training of Hierarchical Topic Models

A Bayesian Nonparametric Topic Model with Variational Auto-Encoders

textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior

Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings

Scalable Inference in Max-Margin Topic Models

Incorporating Knowledge Graph Embeddings into Topic Modeling

Neural Topic Modeling with Bidirectional Adversarial Training

Improving neural topic modeling via Sinkhorn divergence

Constructing Dynamic Topic Models Based on Variational Autoencoder and Factor Graph

Deep de Finetti: Recovering Topic Distributions from Large Language Models

Neural Topic Modeling with Cycle-Consistent Adversarial Training

A Supervised Neural Autoregressive Topic Model for Simultaneous Image Classification and Annotation

Scaling Up Dynamic Topic Models

Topic Modeling with Wasserstein Autoencoders