Abstract: Topic modeling can reveal the latent structure of text data and is useful for knowledge discovery, search relevance ranking, document classification, and so on. One of the major challenges in topic modeling is to deal with large datasets and large numbers of topics in real-world applications. In this paper, we investigate techniques for scaling up the non-probabilistic topic modeling approaches such as RLSI and NMF. We propose a general topic modeling method, referred to as Group Matrix Factorization (GMF), to enhance the scalability and efficiency of the non-probabilistic approaches. GMF assumes that the text documents have already been categorized into multiple semantic classes, and there exist class-specific topics for each of the classes as well as shared topics across all classes. Topic modeling is then formalized as a problem of minimizing a general objective function with regularizations and/or constraints on the class-specific topics and shared topics. In this way, the learning of class-specific topics can be conducted in parallel, and thus the scalability and efficiency can be greatly improved. We apply GMF to RLSI and NMF, obtaining Group RLSI (GRLSI) and Group NMF (GNMF) respectively. Experiments on a Wikipedia dataset and a real-world web dataset, each containing about 3 million documents, show that GRLSI and GNMF can greatly improve RLSI and NMF in terms of scalability and efficiency. The topics discovered by GRLSI and GNMF are coherent and have good readability. Further experiments on a search relevance dataset, containing 30,000 labeled queries, show that the use of topics learned by GRLSI and GNMF can significantly improve search relevance.
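The idea of shared and class-specific topics can be illustrated with a small NumPy sketch. This is not the authors' algorithm: it assumes a plain squared Frobenius loss with non-negativity constraints and standard multiplicative updates, and it omits the regularization terms the abstract mentions. The function names `group_nmf` and `reconstruction_error`, and all parameter names, are hypothetical. Each class matrix `D_p` (terms by documents) is approximated as `[U0, U_p] @ V_p`, where `U0` holds the shared topics and `U_p` the topics specific to class `p`.

```python
import numpy as np


def group_nmf(class_mats, n_shared, n_specific, n_iter=50, seed=0, eps=1e-9):
    """Illustrative group NMF: D_p ~= [U0, U_p] @ V_p for every class p.

    Assumption: squared Frobenius loss, non-negativity only, multiplicative
    updates. The paper's objective also allows regularization/constraints.
    """
    rng = np.random.default_rng(seed)
    n_terms = class_mats[0].shape[0]
    U0 = rng.random((n_terms, n_shared))                       # shared topics
    Us = [rng.random((n_terms, n_specific)) for _ in class_mats]  # per-class topics
    Vs = [rng.random((n_shared + n_specific, D.shape[1])) for D in class_mats]

    for _ in range(n_iter):
        num = np.zeros_like(U0)
        den = np.zeros_like(U0)
        # Given U0, each class is updated independently of the others,
        # so this loop could be run in parallel across classes.
        for p, D in enumerate(class_mats):
            W = np.hstack([U0, Us[p]])
            Vs[p] *= (W.T @ D) / (W.T @ W @ Vs[p] + eps)
            Vc = Vs[p][n_shared:]                  # rows for class-specific topics
            Us[p] *= (D @ Vc.T) / (W @ Vs[p] @ Vc.T + eps)
            # Accumulate the statistics needed for the shared-topic update.
            W = np.hstack([U0, Us[p]])
            V0 = Vs[p][:n_shared]                  # rows for shared topics
            num += D @ V0.T
            den += W @ Vs[p] @ V0.T
        # Shared topics are updated once per iteration from all classes.
        U0 *= num / (den + eps)
    return U0, Us, Vs


def reconstruction_error(class_mats, U0, Us, Vs):
    """Sum of squared Frobenius reconstruction errors over all classes."""
    return sum(
        np.linalg.norm(D - np.hstack([U0, U]) @ V) ** 2
        for D, U, V in zip(class_mats, Us, Vs)
    )
```

In this sketch the only step that couples the classes is the update of `U0`, which needs one pair of accumulated matrices per class. That is the structural property that makes parallel learning of class-specific topics possible.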

Graph Regularized Non-negative Matrix Factorization with Long-tail Constraint

Hierarchical Topic Modeling with Nested Hierarchical Dirichlet Process

Modeling Both Coarse-Grained and Fine-Grained Topics in Massive Text Data

Two to Five Truths in Non-Negative Matrix Factorization

Topic Modeling with Network Regularization

Regularized Extraction of Non-Negative Latent Factors from High-Dimensional Sparse Matrices

Probabilistic Non-Negative Matrix Factorization and Its Robust Extensions for Topic Modeling

Topic Splitting: A Hierarchical Topic Model Based on Non-Negative Matrix Factorization

A diversifying hidden units method based on NMF for document representation

Group Matrix Factorization for Scalable Topic Modeling

GSLDA: Supervised topic model with graph regularization

External Information Enhancing Topic Model Based on Graph Neural Network

Affinity Regularized Non-Negative Matrix Factorization for Lifelong Topic Modeling

Topic Model for Graph Mining Based on Hierarchical Dirichlet Process

STMLRC: Sparse Topic Model with Low Rank Constraint

An Improved Regularized Latent Semantic Indexing with L1/2 Regularization and Non-negative Constraints

Deep NMF topic modeling

Topics Modeling Based on Selective Zipf Distribution

GraphBTM: Graph Enhanced Autoencoded Variational Inference for Biterm Topic Model

Multi-Dimension Topic Mining Based on Hierarchical Semantic Graph Model

Contrastive Topic Evolution Discovery Via Nonnegative Matrix Factorization