Abstract: Many intelligence systems now contain texts from multiple sources, e.g., bulletin board system (BBS) posts, tweets, and news. These texts can be "comparative": they may be semantically correlated and thus offer different perspectives on the same topics or events. To better organize multi-sourced texts and obtain more comprehensive knowledge, we propose to study the novel problem of Mutual Clustering on Comparative Texts (MCCT), which aims to cluster the comparative texts simultaneously and collaboratively. The MCCT problem is difficult to address because 1) comparative texts usually come in different data formats and structures and are thus hard to organize, and 2) no effective method exists for connecting semantically correlated comparative texts so that they can be clustered in a unified way. To this end, we propose HINT, a Heterogeneous Information Network-based Text clustering framework. HINT first models multi-sourced texts (e.g., news and tweets) as heterogeneous information networks by introducing shared "anchor texts" that connect the comparative texts. Next, two similarity matrices based on HINT, as well as a transition matrix for cross-text-source knowledge transfer, are constructed. The comparative texts are then clustered using the constructed matrices. Finally, a mutual clustering algorithm is proposed to further unify the separate clustering results of the comparative texts by introducing a clustering consistency constraint. We conduct extensive experiments on three tweets-news datasets, and the results demonstrate the effectiveness and robustness of the proposed method in addressing the MCCT problem.
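
The pipeline outlined in the abstract (anchor texts, per-source similarity matrices, a cross-source transition matrix, and a consistency constraint) can be illustrated with a toy sketch. This is not the HINT implementation: the abstract gives no formulas, so everything below is an assumption made for illustration. Anchors are taken to be tokens shared by both sources, similarity is cosine similarity over anchor counts, the transition matrix is a row-normalized cross-source similarity, and the consistency penalty is a squared Frobenius norm. All function names and the example texts are hypothetical.

```python
import numpy as np

def doc_anchor_matrix(docs, anchors):
    """Count matrix: rows are documents, columns are anchor terms."""
    index = {a: j for j, a in enumerate(anchors)}
    m = np.zeros((len(docs), len(anchors)))
    for i, doc in enumerate(docs):
        for tok in doc.lower().split():
            if tok in index:
                m[i, index[tok]] += 1.0
    return m

def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    an = np.linalg.norm(a, axis=1, keepdims=True)
    bn = np.linalg.norm(b, axis=1, keepdims=True)
    an[an == 0] = 1.0  # avoid division by zero for documents with no anchors
    bn[bn == 0] = 1.0
    return (a / an) @ (b / bn).T

def row_normalize(m):
    """Scale each row to sum to 1 (all-zero rows are left as zeros)."""
    s = m.sum(axis=1, keepdims=True)
    s[s == 0] = 1.0
    return m / s

def consistency_penalty(T, y_src, y_dst, k):
    """Assumed constraint: squared Frobenius distance between the labels
    transferred through T and the destination labels (one-hot encoded)."""
    Y_src = np.eye(k)[y_src]
    Y_dst = np.eye(k)[y_dst]
    return float(np.linalg.norm(T @ Y_src - Y_dst) ** 2)

# Hypothetical toy data for two text sources.
news = [
    "election vote results announced today",
    "team wins championship final match",
]
tweets = [
    "cannot believe the election vote count",
    "what a final match for the team",
    "vote early in the election",
]

# Assumption: anchor texts are the tokens that occur in both sources.
news_vocab = {t for d in news for t in d.lower().split()}
tweet_vocab = {t for d in tweets for t in d.lower().split()}
anchors = sorted(news_vocab & tweet_vocab)

A_news = doc_anchor_matrix(news, anchors)
A_tweets = doc_anchor_matrix(tweets, anchors)

# Two within-source similarity matrices.
S_news = cosine_similarity(A_news, A_news)
S_tweets = cosine_similarity(A_tweets, A_tweets)

# Transition matrix from tweets to news for cross-source transfer.
T = row_normalize(cosine_similarity(A_tweets, A_news))

# Transfer a given news clustering to the tweets through T.
news_labels = np.array([0, 1])
tweet_labels = (T @ np.eye(2)[news_labels]).argmax(axis=1)

print(anchors)
print(tweet_labels, consistency_penalty(T, news_labels, tweet_labels, k=2))
```

In this sketch the news clustering is given and simply pushed to the tweets. A mutual clustering method in the sense of the abstract would instead update both clusterings jointly, for example by adding a penalty of this kind to each source's clustering objective.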

A Comparative Study on Text Clustering Methods

A kind of practical fuzzy clustering

Comparison study of using semantic and syntactic network characteristics to do text clustering

A Linguistic Feature Based Text Clustering Method

A Comparative Study on Representing Units in Chinese Text Clustering

Mutual Clustering on Comparative Texts via Heterogeneous Information Networks

A New Text Clustering Method Using Hidden Markov Model

Comparison of Spectral Clustering, K-clustering and Hierarchical Clustering on E-Nose Datasets: Application to the Recognition of Material Freshness, Adulteration Levels and Pretreatment Approaches for Tomato Juices

Thematic Concentration As a Discriminating Feature of Text Types

A Comparative Study on Chinese Word Clustering

A Text Clustering Algorithm to Detect Basic Level Categories in Texts

An Evaluation on Feature Selection for Text Clustering

Adaptive Approach to Fuzzy Clustering

A Comparative Study of A Practical Stochastic Clustering Method with Traditional Methods

A Comparative Study on Feature Weight in Text Categorization

The interelation in fuzzy clustering

Federated Learning for Short Text Clustering

A Comparative study Between Fuzzy Clustering Algorithm and Hard Clustering Algorithm

K-means clustering versus validation measures

Automatic Text Summarization Method Based on Improved TextRank Algorithm and K-Means Clustering

Algorithm and Experiment Research of Textual Document Clustering Based on Improved K-means