Abstract: Finding the best way to utilize external/domain knowledge to enhance traditional text mining has been a challenging task. The difficulty centers on the lack of a means of representing a document with external/domain knowledge integrated. Graphs are powerful and versatile tools, useful in various subfields of science and engineering for their simple illustration of complicated problems. However, the graph-based approach to knowledge representation and discovery remains relatively unexplored. In this thesis, I propose a graph-based text mining system to incorporate semantic knowledge, document section knowledge, document linkage knowledge, and document category knowledge into the tasks of text clustering and topic analysis. I design a novel term-level graph knowledge representation and a graph-based clustering algorithm to incorporate semantic and document section knowledge for biomedical literature clustering and topic analysis. I present a Markov Random Field (MRF) with a Relaxation Labeling (RL) algorithm to incorporate document linkage knowledge. I evaluate different types of linkage among documents, including explicit linkage such as hyperlinks and citation links, implicit linkage such as coauthor links and co-citation links, and pseudo linkage such as similarity links. I develop a novel semantic-based method to integrate Wikipedia concepts and categories as external knowledge into traditional document clustering. To support these new approaches, I develop two automated algorithms to extract multiword phrases and ontological concepts, respectively. Evaluations on a news collection, a web dataset, and biomedical literature demonstrate the effectiveness of the proposed methods. In the document clustering experiments, the proposed term-level graph-based method not only outperforms the baseline k-means algorithm in all configurations but is also superior in terms of efficiency. The MRF-based algorithm significantly improves spherical k-means and model-based k-means clustering on the datasets containing explicit or implicit linkage; the Wikipedia knowledge-based clustering also improves on clustering based on document content alone. On the task of topic analysis, the proposed graph representation, subgraph detection, and graph ranking algorithm can effectively identify corpus-level topic terms and cluster-level topic terms.
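As a rough illustration of how a term-level graph can surface topic terms, the sketch below builds a term co-occurrence graph and ranks its nodes. It is not the thesis's algorithm: the abstract does not specify how the graph is built or ranked, so this example assumes document-level co-occurrence for the edges and substitutes a generic weighted PageRank for the ranking step. The function names `build_cooccurrence_graph` and `rank_terms`, the damping value, and the toy documents are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(docs):
    """Link two terms when they share a document (assumed edge rule).

    Edge weight = number of documents in which both terms appear.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def rank_terms(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration (a stand-in ranking method)."""
    nodes = list(graph)
    n = len(nodes)
    score = {t: 1.0 / n for t in nodes}
    # Total edge weight leaving each node; always > 0 because a term only
    # enters the graph together with at least one neighbour.
    out_weight = {t: sum(graph[t].values()) for t in nodes}
    for _ in range(iterations):
        new_score = {}
        for t in nodes:
            incoming = sum(
                score[u] * w / out_weight[u] for u, w in graph[t].items()
            )
            new_score[t] = (1.0 - damping) / n + damping * incoming
        score = new_score
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpus: each document is already reduced to a list of terms.
docs = [
    ["gene", "protein", "cell"],
    ["gene", "protein"],
    ["protein", "cell", "tumor"],
]
ranked = rank_terms(build_cooccurrence_graph(docs))
for term, value in ranked:
    print(f"{term}\t{value:.3f}")
```

Here the highest-ranked term is the one most strongly connected to the rest of the toy corpus. Restricting `docs` to the members of a single cluster would give a cluster-level ranking under the same assumptions.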

Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods

Exploring simultaneous keyword and key sentence extraction: improve graph-based ranking using Wikipedia

Exploiting Wikipedia As External Knowledge For Document Clustering

A Graph-Based Biomedical Literature Clustering Approach Utilizing Term's Global and Local Importance Information

A knowledge-driven approach to biomedical document conceptualization

Graph-based Semantical Extractive Text Analysis

GRAPH: A domain ontology-driven semantic graph auto extraction system

An Exploration Of Semantic Relations In Neural Word Embeddings Using Extrinsic Knowledge

Ranking Narrative Query Graphs for Biomedical Document Retrieval (Technical Report)

Graph-Based Approach For Cross Domain Text Linking

Semantic Text Mining with Linked Data

Utilization of global ranking information in Graph-Based biomedical literature clustering

Improved Automatic Keyword Extraction Given More Semantic Knowledge

Graph-Based Text Similarity Measurement by Exploiting Wikipedia As Background Knowledge

A Text Clustering System Based on K-Means Type Subspace Clustering and Ontology

Multi-Dimension Topic Mining Based on Hierarchical Semantic Graph Model

An Unsupervised Approach for Keyphrase Extraction Using Within-Collection Resources

Graph-Community Detection for Cross-Document Topic Segment Relationship Identification

Effective Medical Archives Processing Using Knowledge Graphs

Document Keyword Extraction Based on Semantic Hierarchical Graph Model

Keyphrase Extraction Using Knowledge Graphs