Efficient index-free SimRank similarity search in large graphs by discounting path lengths
Mingxi Zhang,Liuqian Yang,Hangfei Hu,Tianxing Liu,Jinhua Wang
DOI: https://doi.org/10.1016/j.eswa.2022.117746
IF: 8.5
2022-01-01
Expert Systems with Applications
Abstract: Link-based similarity search aims to find nodes similar to a given query node in a graph, and it arises in numerous applications, including web spam detection, social network analysis, and web search. Among existing methods, SimRank is a well-known similarity model that provides an effective and trustworthy function for similarity search. Many techniques for SimRank similarity search have been proposed recently; they compute similarity scores by traversing the paths between the query and candidate nodes. However, the number of paths increases exponentially with path length, which makes the computation expensive and prevents fast similarity search over large graphs. In this paper, we propose an efficient index-free SimRank similarity search approach, namely DisSim, which reduces the computational cost by discounting path length. We observe that SimRank converges rapidly to a stable state and that the results change little after a few iterations. Based on this fast convergence, the similarity between nodes is defined as the SimRank score at the second iteration. To compute DisSim, we divide the similarity into one-step and two-step first-meeting probabilities. The one-step first-meeting probabilities are computed by path traversals from the query to candidate nodes, which reduces computational cost by skipping unnecessary nodes. The two-step first-meeting probabilities are computed by integrating the repeated parts of the paths. To further speed up query processing, we develop a pruning algorithm that prunes unpromising path traversals by setting a threshold, and the accuracy loss under the threshold is given through mathematical analysis. Extensive experiments on real graphs demonstrate the performance of DisSim in comparison with state-of-the-art algorithms.
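
To illustrate what "the SimRank score at the second iteration" means, the following is a minimal sketch of the standard SimRank recurrence stopped after two iterations. It is not the DisSim algorithm described in the abstract (no first-meeting decomposition, no pruning, and it computes all pairs rather than a single query). The function name `simrank_k`, the toy graph, and the decay factor `c = 0.8` are assumptions made for this example; the abstract does not specify them.

```python
from itertools import product


def simrank_k(in_nbrs, c=0.8, iterations=2):
    """Brute-force SimRank truncated after a fixed number of iterations.

    in_nbrs maps every node to the list of its in-neighbors; every
    in-neighbor must itself be a key. c is the decay factor (assumed 0.8).
    """
    nodes = list(in_nbrs)
    # Iteration 0: a node is similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
                continue
            ia, ib = in_nbrs[a], in_nbrs[b]
            if not ia or not ib:
                # A node without in-neighbors has similarity 0 to others.
                new[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, decayed by c.
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new[(a, b)] = c * total / (len(ia) * len(ib))
        sim = new
    return sim


# Hypothetical toy graph: u -> a, u -> b, a -> c, b -> d.
in_nbrs = {"u": [], "a": ["u"], "b": ["u"], "c": ["a"], "d": ["b"]}
scores = simrank_k(in_nbrs, c=0.8, iterations=2)
print(scores[("a", "b")], scores[("c", "d")])
```

In this toy graph, nodes c and d become similar only at the second iteration, because their shared ancestor u is two steps away. Paths longer than two steps would contribute only at later iterations, which is the part the truncated definition discards.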