Fast and Flexible Top-k Similarity Search on Large Networks

Jing Zhang, Jie Tang, Cong Ma, Hanghang Tong, Yu Jing, Juanzi Li, Walter Luyten, Marie-Francine Moens
DOI: https://doi.org/10.1145/3086695
2017-01-01
Abstract: Similarity search is a fundamental problem in network analysis with many applications, such as collaborator recommendation in coauthor networks, friend recommendation in social networks, and relation prediction in medical information networks. In this article, we propose a sampling-based method that uses random paths to efficiently estimate similarities based on both common neighbors and structural contexts in very large homogeneous or heterogeneous information networks. We give a theoretical guarantee that the sample size depends on the error bound ε, the confidence level (1-δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return the top-k similar vertices for any vertex in a network 300× faster than state-of-the-art methods. We also develop a prototype system that recommends similar authors to demonstrate the effectiveness of our method.
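To make the idea of sampling random paths concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the similarity between two vertices is estimated as the fraction of sampled paths on which both occur; the abstract does not spell out the estimator, so this is an illustrative assumption. The function names `sample_paths` and `top_k_similar`, the toy graph, and the hand-picked values of `num_paths` and `path_len` are all hypothetical. In particular, the number of paths is fixed by hand here rather than derived from ε, δ, and T.

```python
import random
from collections import defaultdict


def sample_paths(adj, num_paths, path_len, rng):
    """Sample random walks of at most path_len steps from uniformly chosen start vertices."""
    vertices = sorted(adj)
    paths = []
    for _ in range(num_paths):
        v = rng.choice(vertices)
        path = [v]
        for _ in range(path_len):
            neighbors = adj[v]
            if not neighbors:  # dead end: stop this walk early
                break
            v = rng.choice(neighbors)
            path.append(v)
        paths.append(path)
    return paths


def top_k_similar(paths, query, k):
    """Rank vertices by the fraction of sampled paths they share with query (assumed estimator)."""
    counts = defaultdict(int)
    for path in paths:
        members = set(path)  # count each vertex at most once per path
        if query in members:
            for u in members:
                if u != query:
                    counts[u] += 1
    total = len(paths)
    # Sort by descending count, breaking ties by vertex id for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(u, c / total) for u, c in ranked[:k]]


# Hypothetical toy graph: two triangles joined by the edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
rng = random.Random(0)  # fixed seed so the run is repeatable
paths = sample_paths(adj, num_paths=2000, path_len=3, rng=rng)
result = top_k_similar(paths, query=0, k=3)
print(result)
```

In this sketch the paths are sampled once and can then be reused for any query vertex, which is one plausible reason a sampling-based approach can answer top-k queries quickly; the cost of a query depends on the number of sampled paths rather than on the size of the network.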