Abstract: Probabilistic Latent Semantic Analysis (PLSA) is a fundamental text analysis technique that models each word in a document as a sample from a mixture of topics. PLSA is the precursor of probabilistic topic models, including Latent Dirichlet Allocation (LDA). PLSA, LDA, and their numerous extensions have been successfully applied to many text mining and retrieval tasks. One important extension of LDA is supervised LDA (sLDA), which distinguishes itself from most topic models in that it is supervised. However, to the best of our knowledge, no prior work extends PLSA in the way that sLDA extends LDA, namely by jointly modeling the contents and the responses of documents. In this paper, we propose supervised PLSA (sPLSA), which can efficiently infer latent topics and their factorized response values from the contents and the responses of documents. The major challenge lies in estimating a document's topic distribution, which is a constrained probability dictated by both the content and the response of the document. To tackle this challenge, we introduce an auxiliary variable that transforms the constrained optimization problem into an unconstrained one. This allows us to derive an efficient Expectation-Maximization (EM) algorithm for parameter estimation. Compared to sLDA, sPLSA converges much faster and requires less hyperparameter tuning, while performing similarly on topic modeling and better on response factorization. This makes sPLSA an appealing choice for latent response analysis, such as ranking latent topics by their factorized response values. We apply the proposed sPLSA model to analyze the controversy of bills from the United States Congress, and we demonstrate the effectiveness of our model by identifying contentious legislative issues.
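
For orientation, the mixture assumption and the EM procedure mentioned in the abstract can be illustrated with a minimal sketch of standard, unsupervised PLSA. This is not the sPLSA algorithm proposed in the paper: the abstract does not give its update rules, so the response variable and the auxiliary-variable trick are left out. The function name `plsa_em`, the toy count matrix, and the iteration count are all illustrative assumptions.

```python
import numpy as np


def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """Fit plain (unsupervised) PLSA with EM. Illustrative sketch only.

    counts: (n_docs, n_words) array of term counts n(d, w).
    Returns P(z | d) with shape (n_docs, n_topics) and
    P(w | z) with shape (n_topics, n_words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    eps = 1e-12  # guards against division by zero

    # Random initialisation, normalised so each row is a distribution.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
        # joint has shape (n_docs, n_topics, n_words).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)

        # M-step: re-estimate both distributions from the expected
        # counts n(d, w) * P(z | d, w).
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

    return p_z_d, p_w_z


# Toy corpus (assumed data): 4 documents over a 4-word vocabulary.
counts = np.array(
    [[4, 3, 0, 0],
     [5, 2, 0, 0],
     [0, 0, 3, 4],
     [0, 0, 2, 5]],
    dtype=float,
)
p_z_d, p_w_z = plsa_em(counts, n_topics=2)
```

Each row of `p_z_d` is one document's topic distribution. In the supervised setting the abstract describes, this is the quantity that must also account for the document's response, which is what makes its estimation a constrained problem.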

Supervised probabilistic latent semantic analysis with applications to controversy analysis of legislative bills