Abstract: Short texts are prevalent on today's Web, especially with the emergence of social media, and discovering the topics of streaming short texts has become an important task for many content analysis applications. Conventional topic models such as Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) suffer from a sparsity problem when used to infer latent topics from short texts, because they derive topics from document-level word co-occurrence by modeling each document as a mixture of topics. In contrast, the Biterm Topic Model (BTM) discovers topics in short texts by directly modeling the generation of word co-occurrence patterns over the whole corpus. However, short texts still lack semantic information. In this paper, to alleviate the sparsity problem, preserve the semantic information of documents, and obtain the latent topic information of streaming short texts immediately, we propose a joint topic model for Chinese streaming short texts (CTM) based on the online algorithms of LDA and BTM. Experiments on short texts from Sina Weibo show that our joint topic model discovers more precise topics and supports more applications. In addition, because preprocessing Chinese text differs from preprocessing English text and key-phrase extraction is prone to errors, we use a combined-word method to extend the length of short texts and reduce errors in extracting key phrases.
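The corpus-level co-occurrence idea behind BTM can be illustrated with a minimal sketch. It only shows how unordered word pairs (biterms) might be collected from already-segmented short texts and pooled over the whole corpus; it is not the CTM model proposed here. The function names `extract_biterms` and `corpus_biterm_counts` and the toy documents are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text.

    Assumes `tokens` is an already-segmented list of words; each pair is
    sorted so that (a, b) and (b, a) count as the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Pool biterms over the whole corpus instead of per document."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


# Toy corpus (hypothetical): two short, pre-segmented texts.
docs = [
    ["weibo", "topic", "model"],
    ["topic", "model", "stream"],
]
counts = corpus_biterm_counts(docs)
```

Pooling the pairs across documents is what lets a biterm-based model sidestep document-level sparsity: even a three-word text contributes three co-occurrence observations to the shared counts.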

Applications of Statistical Models in Chinese Text Mining

A Method for Chinese Text Classification Based on Three-Dimensional Vector Space Model

A Joint Model of Extended LDA and IBTM over Streaming Chinese Short Texts

A Comparative Study on Chinese Word Segmentation Using Statistical Models

Topic Discovery for Streaming Short Texts with CTM

Topic Detection Technology for Chinese Text Based on Statistics and Semantic Information

TSSE-DMM: Topic Modeling for Short Texts Based on Topic Subdivision and Semantic Enhancement

A hybrid Chinese word segmentation model for quality management-related texts based on transfer learning

A discriminative model selection approach and its application to text classification

A Self-adaptive Sliding Window Based Topic Model for Non-uniform Texts

Bayesian Text Classification and Summarization via a Class-Specified Topic Model

Toward a Unified Approach to Statistical Language Modeling for Chinese

Research on Chinese Text Classification Based on WAE and SVM

Chinese News Text Classification Based on Machine Learning Algorithm

A Joint Model for Unsupervised Chinese Word Segmentation

Text Metric Method on Statistical Manifold Learning

Research on Correlation of China Word Attribution

On the Unsupervised Analysis of Domain-Specific Chinese Texts

Experimental Study of Chinese Free-Text IE Algorithm Based on WCA-Selection Using Hidden Markov Model

Automatic Labeling of Topic Models Using Text Summaries

Clustering of Chinese Sentences Using the SMM Model