Abstract: Analyzing texts from social media poses many challenges because such texts are short, massive in volume, and dynamic. Short texts do not provide enough context, which causes traditional statistical models to fail. Furthermore, many applications face massive and dynamic collections of short texts, which pose computational challenges for current batch learning algorithms. This paper presents a novel framework, Bag of Biterms Modeling (BBM), for modeling massive, dynamic, and short text collections. BBM comprises two main ingredients: (1) the concept of Bag of Biterms (BoB) for representing documents, and (2) a simple way to help statistical models incorporate BoB. Our framework can be easily deployed for a large class of probabilistic models, and we demonstrate its usefulness with two well-known models: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP). By exploiting both terms (words) and biterms (pairs of words), BBM offers two major advantages: (1) it increases the effective length of the documents and makes the context more coherent by emphasizing word connotation and co-occurrence via Bag of Biterms, and (2) it inherits the inference and learning algorithms of the base models, which makes it straightforward to design online and streaming algorithms for short texts. Extensive experiments suggest that BBM outperforms several state-of-the-art models. We also point out that the BoB representation performs better than traditional representations (e.g., Bag of Words, tf-idf) even for normal texts.
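To make the Bag of Biterms idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a biterm is any unordered pair of tokens that co-occur in the same document; the abstract does not specify details such as windowing or the handling of repeated words, so those choices, along with the function names `bag_of_biterms` and `terms_and_biterms`, are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def bag_of_biterms(tokens):
    """Count unordered word pairs (biterms) co-occurring in one document.

    Assumption: every pair of token positions in the document forms a biterm.
    """
    # Sort each pair so ("pie", "apple") and ("apple", "pie") are the same biterm.
    return Counter(tuple(sorted(pair)) for pair in combinations(tokens, 2))


def terms_and_biterms(tokens):
    """Return both representations mentioned in the abstract: terms and biterms."""
    return Counter(tokens), bag_of_biterms(tokens)


# Hypothetical three-word short text.
doc = "apple pie recipe".split()
terms, biterms = terms_and_biterms(doc)
```

Under this assumption a document of n tokens yields n(n-1)/2 biterms, which is one way to read the claim that BoB lengthens short documents: the three-word example above has three biterms, and a ten-word text would have forty-five.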
