Abstract: In traditional text clustering methods, documents are represented as "bags of words" without considering the semantic information of each document. For instance, if two documents use different collections of core words to represent the same topic, they may be falsely assigned to different clusters due to the lack of shared core words, even though the core words they use are probably synonyms or otherwise semantically associated. The most common way to solve this problem is to enrich document representation with the background knowledge in an ontology. There are two major issues with this approach: (1) the coverage of the ontology is limited, even for WordNet or MeSH, and (2) using ontology terms as replacement or additional features may cause information loss or introduce noise. In this paper, we present a novel text clustering method to address these two issues by enriching document representation with Wikipedia concept and category information. We develop two approaches, exact-match and relatedness-match, to map text documents to Wikipedia concepts, and further to Wikipedia categories. The text documents are then clustered based on a similarity metric that combines document content information, concept information, and category information. Experimental results using the proposed clustering framework on three datasets (20-newsgroup, TDT2, and LA Times) show that clustering performance improves significantly when document representation is enriched with Wikipedia concepts and categories.
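
The combined similarity metric mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so the form below is an assumption: a weighted sum of three cosine similarities, one each over word, concept, and category vectors. The function names `cosine` and `combined_similarity`, the weights `alpha` and `beta`, and the toy documents are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_similarity(d1, d2, alpha=0.5, beta=0.5):
    """Assumed form: word similarity plus weighted concept and category terms.

    alpha and beta are hypothetical weights, not values from the paper.
    """
    return (
        cosine(d1["words"], d2["words"])
        + alpha * cosine(d1["concepts"], d2["concepts"])
        + beta * cosine(d1["categories"], d2["categories"])
    )


# Toy documents: no shared words, but the same (hypothetical) mapped
# Wikipedia concept and category.
doc_a = {
    "words": Counter(["car", "engine"]),
    "concepts": Counter(["Automobile"]),
    "categories": Counter(["Vehicles"]),
}
doc_b = {
    "words": Counter(["automobile", "motor"]),
    "concepts": Counter(["Automobile"]),
    "categories": Counter(["Vehicles"]),
}

word_only = cosine(doc_a["words"], doc_b["words"])
enriched = combined_similarity(doc_a, doc_b)
print(word_only, enriched)
```

In this toy case the bag-of-words similarity is zero because the two documents share no terms, while the enriched score is positive because both map to the same concept and category. This is the situation the abstract describes, where synonymous core words would otherwise push documents into different clusters.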

Document Topic Extraction Based on Wikipedia Category

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

Wikipedia Based Approach for Clustering Keyword of Reviews

Exploiting Wikipedia As External Knowledge For Document Clustering

Exploring simultaneous keyword and key sentence extraction: improve graph-based ranking using Wikipedia

Incorporating Wikipedia Concepts and Categories As Prior Knowledge into Topic Models

Extracting Domain-Relevant Term Using Wikipedia Based on Random Walk Model

Multi-Level Topical Text Categorization with Wikipedia

Learning Topic Hierarchies For Wikipedia Categories

Wiki3C: exploiting Wikipedia for context-aware concept categorization

TWAG: A Topic-Guided Wikipedia Abstract Generator

Cross-media topic mining on Wikipedia

Keyphrase extraction based on topic relevance and term association

Exploiting Level-Wise Category Links for Semantic Relatedness Computing

A Topic-sensitive Extractive Method for Multi-document Summarization

Concept over Time: the Combination of Probabilistic Topic Model with Wikipedia Knowledge

TopicMine: User-Guided Topic Mining by Category-Oriented Embedding

Catriple: Extracting Triples from Wikipedia Categories

Unsupervised Relation Extraction by Mining Wikipedia Texts Using Information from the Web

Extracting Semantics Relationships Between Wikipedia Categories

Keyword extraction and clustering for document recommendation in conversations