• A task-parallel framework for local structural vertex similarity calculation.
• A threshold-adaptive technique to execute a similarity calculation task.
• A Spark-based implementation with three optimization techniques.
• Comprehensive performance evaluation of the vertex similarity calculation strategies.

Abstract: In many graph analytical applications, local structural vertex similarity calculation is an essential prerequisite for advanced graph mining. The calculation finds all vertex pairs whose local structural similarity scores (such as the number of common neighbors or the Jaccard index of their adjacency sets) are above a given threshold. Real-world applications use a wide range of similarity thresholds. However, the existing distributed methods for the problem optimize only for either high thresholds (> 0.7) or low thresholds (< 0.1). To overcome this drawback, we propose VSIM, a new distributed vertex similarity calculation framework that is efficient under a broad range of thresholds. VSIM processes static undirected graphs with local structural similarity scores, which measure the similarity between vertices based on first-order topology information. VSIM generates a similarity calculation task for every vertex in the graph and runs all the tasks in parallel on a distributed computing platform backed by a distributed key-value store. Each task finds the vertices similar to a given center vertex and supports two execution modes, optimized for high and low thresholds, respectively. Each task adaptively picks the suitable mode according to cost estimation models. We also propose an efficient implementation of VSIM on Apache Spark, with three optimization techniques that reduce communication costs and balance workloads on power-law graphs. The experimental evaluation shows that VSIM outperforms the state-of-the-art distributed methods with speedups of up to 67x. VSIM achieves near-linear node scalability in low-threshold and small-cache scenarios.
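To make the problem statement concrete, the following is a minimal single-machine sketch of one per-vertex task using the Jaccard index of adjacency sets. It is not VSIM's algorithm: it uses neither the two execution modes, nor the cost models, nor Spark or a key-value store. The function names (`build_adjacency`, `similar_to_center`, `all_similar_pairs`) and the toy edge list are hypothetical, and vertex identifiers are assumed to be comparable so that each pair is reported once.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build adjacency sets for a static undirected graph (self-loops ignored)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def similar_to_center(adj, center, threshold):
    """One 'task': find vertices whose Jaccard score with `center` is above `threshold`.

    Candidates are the two-hop neighbors of `center`, since only vertices that
    share at least one neighbor can have a non-zero Jaccard score.
    """
    common = defaultdict(int)  # candidate vertex -> number of common neighbors
    for w in adj.get(center, ()):
        for v in adj[w]:
            if v != center:
                common[v] += 1

    result = {}
    for v, c in common.items():
        # |A ∪ B| = |A| + |B| - |A ∩ B|
        union = len(adj[center]) + len(adj[v]) - c
        score = c / union
        if score > threshold:
            result[v] = score
    return result


def all_similar_pairs(adj, threshold):
    """Run one task per vertex (sequentially here) and report each pair once."""
    pairs = {}
    for center in adj:
        for v, score in similar_to_center(adj, center, threshold).items():
            if center < v:  # assumes comparable vertex ids; avoids duplicate pairs
                pairs[(center, v)] = score
    return pairs


# Hypothetical toy graph, for illustration only.
edges = [(1, 3), (1, 4), (2, 3), (2, 4), (2, 5)]
adj = build_adjacency(edges)
print(sorted(all_similar_pairs(adj, 0.6)))  # [(1, 2), (3, 4)]
```

In this sketch the tasks run one after another; the abstract describes running them in parallel, which is possible because each task depends only on its center vertex and the adjacency sets it looks up.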

VSIM: Distributed Local Structural Vertex Similarity Calculation on Big Graphs
