Abstract: Developing text classifiers often requires a large number of labeled documents as training examples. However, manually labeling documents is costly and time-consuming. Recently, a few methods have been proposed to label documents using only a small set of relevant keywords for each category, a setting known as dataless text classification. In this paper, we propose a Seed-Guided Topic Model (STM) for the dataless text classification task. Given a collection of unlabeled documents and, for each category, a small set of seed words that are relevant to the semantic meaning of the category, STM predicts the category labels of the documents through topic influence. STM models two kinds of topics: category-topics and general-topics. Each category-topic is associated with one specific category and represents its semantic meaning. The general-topics capture the global semantic information underlying the whole document collection. STM assumes that each document is associated with a single category-topic and a mixture of general-topics. A novelty of the model is that STM learns the topics by exploiting the explicit word co-occurrence patterns between the seed words and regular words (i.e., non-seed words) in the document collection. A document is then labeled, or classified, based on its posterior category-topic assignment. Experiments on two widely used datasets show that STM consistently outperforms the state-of-the-art dataless text classifiers. In some tasks, STM also achieves classification accuracy comparable to, or even better than, that of state-of-the-art supervised learning solutions. Our experimental results further show that STM is insensitive to its tuning parameters: stable performance with little variation is achieved over a broad range of parameter settings, making it a desirable choice for real applications.
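The idea of using co-occurrence between seed words and regular words to label documents can be illustrated with a small sketch. This is not the STM generative model or its inference procedure, which the abstract does not specify; it is a simplified heuristic written for illustration. The function names `cooccurrence_relevance` and `label_documents`, the document-level co-occurrence estimate, and the additive scoring rule are all assumptions introduced here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_relevance(docs, seeds):
    """Estimate how strongly each word is tied to each category.

    rel[w][c] is the mean, over the seed words s of category c, of the
    fraction of documents containing w that also contain s.
    (Illustrative heuristic only; not the estimator used by STM.)
    """
    df = Counter()     # number of documents containing each word
    co_df = Counter()  # number of documents containing each word pair
    for doc in docs:
        vocab = set(doc)
        df.update(vocab)
        for a, b in combinations(sorted(vocab), 2):
            co_df[(a, b)] += 1

    def cond(seed, word):
        # Fraction of documents containing `word` that also contain `seed`.
        if seed == word:
            return 1.0
        pair = (seed, word) if seed < word else (word, seed)
        return co_df[pair] / df[word]

    return {
        w: {c: sum(cond(s, w) for s in ss) / len(ss) for c, ss in seeds.items()}
        for w in df
    }


def label_documents(docs, seeds):
    """Assign each document the category with the highest summed relevance."""
    rel = cooccurrence_relevance(docs, seeds)
    labels = []
    for doc in docs:
        scores = {c: sum(rel[w][c] for w in doc) for c in seeds}
        labels.append(max(scores, key=scores.get))
    return labels
```

Because relevance is propagated from seed words to the regular words that co-occur with them, a document containing no seed word at all can still receive a label, which is the intuition the abstract attributes to seed-guided learning. A document with no evidence for any category falls back to the first category in `seeds`, a limitation of this sketch.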

Automatic Labelling Of Topic Models Using Word Vectors And Letter Trigram Vectors

Mining Coherent Topics in Documents Using Word Embeddings and Large-Scale Text Data

Automatic Labeling Of Topic Models Using Graph-Based Ranking

Automatic Labelling of Topics with Neural Embeddings

Automatic Topic Labeling Using Graph-Based Pre-Trained Neural Embedding

A Topic Label Extraction Method for the University BBS

Labeled Phrase Latent Dirichlet Allocation

Automatic Labeling of Topic Models Using Text Summaries

A novel label-based multimodal topic model for social media analysis

Topic2Vec: Learning distributed representations of topics

Automatic Labelling Of Topic Models Learned From Twitter By Summarisation

Automatic Labeling Hierarchical Topics

LLM Reading Tea Leaves: Automatically Evaluating Topic Models with Large Language Models

Integrating Topic Modeling with Word Embeddings by Mixtures of vMFs

A Weighted Topic Modeling Approach Based on Word Embedding

Using Topic Labels for Text Summarization

A LDA Model Based Topic Detection Method

A Feature-Word-topic Model for Image Annotation and Retrieval

Exploring Topic Discriminating Power of Words in Latent Dirichlet Allocation

More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models

Effective Document Labeling with Very Few Seed Words: A Topic Model Approach