Shuffle-Efficient Distributed Locality Sensitive Hashing On Spark

Wanxin Zhang, Dongsheng Li, Ying Xu, Yiming Zhang
DOI: https://doi.org/10.1109/INFCOMW.2016.7562179
2016-01-01
Abstract: Locality Sensitive Hashing (LSH) is an important indexing technique for approximate similarity search in high-dimensional spaces. A clear limitation of existing LSH approaches is that they lack the capacity and scalability to handle massive data. This paper proposes a distributed variant of LSH called Spark-LSH, which is implemented on Apache Spark, a well-known distributed computing framework. We design a shuffle-efficient indexing scheme for Spark-LSH that reduces data shuffling and improves network efficiency when constructing the hash table indices. Furthermore, we propose a location-aware querying scheme to improve query performance. Experiments show that Spark-LSH markedly reduces network shuffle overhead and significantly accelerates queries.
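For readers unfamiliar with LSH, the basic idea of hashing similar vectors into the same bucket can be sketched as follows. This is a minimal single-machine illustration using random hyperplane (sign) hashing, chosen here only as a common LSH family; it is not the Spark-LSH indexing or querying scheme from the paper, and all function names (`make_hyperplanes`, `lsh_key`, `build_index`, `query`) are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def build_index(points, hyperplanes):
    """Build one hash table that maps each bucket key to the ids stored in it."""
    table = defaultdict(list)
    for point_id, vector in points.items():
        table[lsh_key(vector, hyperplanes)].append(point_id)
    return table


def query(table, hyperplanes, vector):
    """Return candidate ids that share the query vector's bucket."""
    return table.get(lsh_key(vector, hyperplanes), [])


# Illustrative data (assumed, not from the paper).
planes = make_hyperplanes(num_bits=8, dim=3)
points = {"a": [1.0, 2.0, 3.0], "b": [-1.0, -2.0, -3.0]}
index = build_index(points, planes)
candidates = query(index, planes, [2.0, 4.0, 6.0])
```

Vectors that point in the same direction always receive the same key, so a query only has to inspect one bucket instead of the whole dataset. In a distributed setting, the bucket key would typically also decide which partition holds an entry, which is where shuffle cost arises during index construction.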