Discovering Canonical Correlations between Topical and Topological Information in Document Networks

Yuan He, Cheng Wang, Changjun Jiang
DOI: https://doi.org/10.1145/2806416.2806518
2018-01-01
Abstract: A document network is an intriguing kind of dataset that provides both topical (textual content) and topological (relational link) information. A key point in viably modeling such datasets is to discover the common denominators underlying the two different types of data, text and link. Most previous work assumes that documents closely linked with each other share common latent topics. However, this assumption neglects the heterophily (i.e., the tendency to link to dissimilar others) of nodes, which is pervasive in social networks. In this paper, we simultaneously incorporate community detection and topic modeling in a unified framework, and appeal to Canonical Correlation Analysis (CCA) to capture the latent semantic correlations between the two heterogeneous latent factors, community and topic. Whether nodes exhibit homophily (i.e., the tendency to link to similar others) or heterophily, CCA can capture the inherent correlations that fit the dataset itself, without any prior hypothesis. A logistic normal prior is also employed in modeling the network to better capture the community correlations. We derive efficient inference and learning algorithms based on variational EM methods. The effectiveness of our proposed model is comprehensively verified on three different types of datasets, namely hyperlinked networks of web pages, social networks of friends, and coauthor networks of publications. Experimental results show that our approach achieves significant improvements in both topic modeling and community detection compared with the current state of the art. Our model is also effective at discovering correlations between the extracted topics and communities.
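To illustrate the kind of correlation CCA recovers between two sets of latent factors, the following is a minimal sketch of classical (linear) CCA in NumPy. It is not the paper's model, which embeds CCA in a joint community and topic framework fitted with variational EM. The synthetic data, the dimensions, the `reg` ridge term, and all names (`cca`, `community`, `topic`) are assumptions made for illustration only. One community factor is deliberately made negatively correlated with one topic factor, to show that the strength of the association is recovered regardless of its sign.

```python
import numpy as np


def cca(X, Y, reg=1e-6):
    """Classical CCA via whitening and SVD.

    Returns the canonical correlations (descending) and the projection
    matrices A (for X) and B (for Y). `reg` is a small ridge term added
    for numerical stability (an illustrative choice).
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = Xc.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(Xc.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Yc.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky, full_matrices=False)
    return s, Kx @ U, Ky @ Vt.T


# Synthetic stand-ins for per-document latent factors (hypothetical data).
rng = np.random.default_rng(0)
n_docs = 2000
shared = rng.normal(size=n_docs)

community = rng.normal(size=(n_docs, 3))  # 3 community factors
topic = rng.normal(size=(n_docs, 4))      # 4 topic factors
community[:, 0] = shared + 0.3 * rng.normal(size=n_docs)
topic[:, 1] = -shared + 0.3 * rng.normal(size=n_docs)  # opposite sign

rho, A, B = cca(community, topic)
print("canonical correlations:", np.round(rho, 3))
```

In this sketch the first canonical correlation is large while the remaining ones stay close to zero, reflecting the single shared factor planted in the data. No assumption about the direction of the relationship is built into the procedure.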