Abstract: Large-scale topic models serve as basic tools for feature extraction and dimensionality reduction in many practical applications. As a natural extension of flat topic models, hierarchical topic models (HTMs) are able to learn topics at different levels of abstraction, which leads to deeper understanding and better generalization than their flat counterparts. However, existing scalable systems for flat topic models cannot handle HTMs, due to their complicated data structures, such as trees and concurrent dynamically growing matrices, as well as their susceptibility to local optima. In this paper, we study the hierarchical latent Dirichlet allocation (hLDA) model, a powerful nonparametric Bayesian HTM. We propose an efficient partially collapsed Gibbs sampling algorithm for hLDA, as well as an initialization strategy to deal with the local optima introduced by tree-structured models. We also identify new system challenges in building scalable systems for HTMs, and propose an efficient data layout for vectorizing HTMs, as well as distributed data structures including dynamic matrices and trees. Empirical studies show that our system is 87 times more efficient than the previous open-source implementation of hLDA, and can scale to thousands of CPU cores. We demonstrate its scalability on a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than previously used corpora. Our distributed implementation can extract 1,722 topics from the corpus with 50 machines in just 7 hours.
