Efficient and Accurate SimRank-Based Similarity Joins: Experiments, Analysis, and Improvement
Qian Ge, Yu Liu, Yinghao Zhao, Yuetian Sun, Lei Zou, Yuxing Chen, Anqun Pan
DOI: https://doi.org/10.14778/3636218.3636219
2024-01-01
Abstract: SimRank-based similarity joins, which mainly include threshold-based and top-k similarity joins, are important types of all-pair SimRank queries. Although a line of related algorithms has been proposed recently, they still fall short of providing approximation guarantees and suffer from scalability issues on medium and large graphs. Meanwhile, we also lack an extensive analysis of existing techniques in terms of accuracy and efficiency. Motivated by these challenges, we first conduct a detailed analysis of state-of-the-art algorithms and provide additional theoretical results. Second, to address the limitations of existing techniques, we propose simple yet effective algorithm frameworks for both queries that theoretically guarantee the approximation bound, and present a more efficient all-pair algorithm inspired by the randomized local push of Personalized PageRank. Next, we analyze the algorithmic complexity of threshold-based and top-k similarity joins by leveraging a reasonable assumption on the SimRank distribution. Through extensive experiments, we find that our proposed methods substantially outperform existing ones with respect to query efficiency, approximation guarantee, and practical accuracy, while our theoretical analysis closely matches the empirical study.
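To make the two query types concrete, the following is a minimal brute-force sketch of what a threshold-based join and a top-k join return. It is not the algorithm proposed in the paper: it uses the standard iterative SimRank definition on every node pair, which is exactly the kind of computation that does not scale. The function names, the decay factor `c = 0.6`, the iteration count, and the toy graph are all assumptions chosen for illustration.

```python
from itertools import combinations


def simrank(in_neighbors, c=0.6, iterations=10):
    """Naive all-pair SimRank by fixed-point iteration (illustrative baseline).

    in_neighbors maps each node to the list of its in-neighbors.
    c is the decay factor (assumed value, not taken from the paper).
    """
    nodes = list(in_neighbors)
    # s(u, u) = 1 and s(u, v) = 0 for u != v as the starting point.
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
        for u, v in combinations(nodes, 2):
            iu, iv = in_neighbors[u], in_neighbors[v]
            if not iu or not iv:
                continue  # a node without in-neighbors has similarity 0 to others
            total = sum(sim[a][b] for a in iu for b in iv)
            new[u][v] = new[v][u] = c * total / (len(iu) * len(iv))
        sim = new
    return sim


def threshold_join(sim, theta):
    """Return all unordered node pairs whose SimRank score is at least theta."""
    return sorted((u, v) for u, v in combinations(sim, 2) if sim[u][v] >= theta)


def topk_join(sim, k):
    """Return the k unordered node pairs with the highest SimRank scores."""
    pairs = [(sim[u][v], u, v) for u, v in combinations(sim, 2)]
    pairs.sort(key=lambda t: (-t[0], t[1], t[2]))  # ties broken by node names
    return [(u, v) for _, u, v in pairs[:k]]


# Toy graph (hypothetical): a -> b, a -> c, b -> d, c -> d.
graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
sim = simrank(graph)
print(threshold_join(sim, 0.5))
print(topk_join(sim, 1))
```

In this toy graph only `b` and `c` share an in-neighbor, so they form the single pair that both joins report. The nested sum over in-neighbor pairs inside a loop over all node pairs is what makes this baseline impractical on larger graphs, which is the scalability problem the abstract refers to.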