Abstract: Hashing has been widely applied to the large-scale approximate nearest neighbor search problem owing to its high efficiency and low storage requirement. Most investigations concentrate on learning hashing methods in a centralized setting. However, in existing big data systems, data is often stored across different nodes, and in some situations it is even collected in a distributed manner. A straightforward solution is to aggregate all the data at a fusion center and compute the search result there (the aggregating method). However, this strategy is not feasible because of its prohibitive communication cost. Although a few distributed hashing methods have been proposed to reduce this cost, they focus only on designing a distributed algorithm for a specific global optimization objective and do not consider scalability. Moreover, existing distributed hashing methods aim to find a distributed solution to hashing while avoiding accuracy loss, rather than to improve accuracy. To address these challenges, we propose a Scalable Distributed Hashing (SDisH) model in which most existing hashing methods can be extended to process distributed data with no changes. Furthermore, to improve accuracy, we use the search radius as a global variable shared across nodes, so that every iteration achieves a globally optimal search result. In addition, we present a voting algorithm that combines the results produced by multiple iterations to further reduce search errors. Theoretical analyses of communication, computation, and accuracy demonstrate the superiority of the proposed model. Numerical simulations on three large-scale and two relatively small benchmark datasets also show that the SDisH model achieves accuracy gains of up to 44.75% over the aggregating method and up to 10.23% over state-of-the-art distributed hashing methods.
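
The two ideas named in the abstract, a search radius shared across nodes and a vote over several iterations, can be illustrated with a minimal Python sketch. This is not the SDisH algorithm itself: the abstract does not specify the hash function, the radius rule, or the voting rule, so everything below is an assumption made for illustration. In particular, random-hyperplane (sign) hashing stands in for "any existing hashing method", the global radius is taken to be the smallest Hamming distance reported by any node, and the names `make_hasher`, `search_once`, and `vote_search` are hypothetical.

```python
import random
from collections import Counter


def make_hasher(dim, n_bits, seed):
    """Return an encoder based on random hyperplanes (an assumed stand-in
    for whichever hashing method a deployment would actually plug in)."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def encode(vec):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(
            1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
            for plane in planes
        )

    return encode


def hamming(a, b):
    """Number of positions where two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def search_once(nodes, query, encode):
    """One iteration: nodes agree on a global radius, then answer within it.

    `nodes` is a list of dicts {point_id: vector}; each dict models the data
    held by one node. At least one node must be non-empty.
    """
    q = encode(query)
    # Each node computes Hamming distances on its own data only.
    local = [{pid: hamming(encode(v), q) for pid, v in node.items()} for node in nodes]
    # Each node reports one integer (its best local distance); the smallest
    # of these is used as the global search radius (assumed rule).
    radius = min(min(d.values()) for d in local if d)
    # Each node returns only the ids that fall inside the global radius.
    return [pid for d in local for pid, dist in d.items() if dist <= radius]


def vote_search(nodes, query, n_bits=16, n_iters=9, top_k=1):
    """Repeat the search with independent hash functions and vote."""
    votes = Counter()
    for it in range(n_iters):
        encode = make_hasher(len(query), n_bits, seed=it)
        votes.update(search_once(nodes, query, encode))
    return [pid for pid, _ in votes.most_common(top_k)]


if __name__ == "__main__":
    nodes = [
        {"a": (1.0, 0.0, 0.0), "b": (-1.0, 0.0, 0.0)},  # data on node 0
        {"c": (0.0, 1.0, 0.0)},                         # data on node 1
    ]
    print(vote_search(nodes, (1.0, 0.0, 0.0)))
```

In this sketch each node sends a single integer and a short list of candidate ids per iteration, rather than its raw vectors, which is the kind of communication saving the abstract contrasts with the aggregating method.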

Shuffle-Efficient Distributed Locality Sensitive Hashing on Spark

Distributed High-Dimension Matrix Operation Optimization on Spark

Learning-based distributed locality sensitive hashing

Efficient Locality-Sensitive Hashing over High-Dimensional Streaming Data

Efficient, Scalable and Robust Data Shuffle Service for Distributed MapReduce Computing on Cloud

Distribution-Aware Locality Sensitive Hashing

Data Independent Method of Constructing Distributed LSH for Large-Scale Dynamic High-Dimensional Indexing

CLSH: Cluster-based Locality-Sensitive Hashing

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

Data-oriented locality sensitive hashing

Improving Locality Sensitive Hashing by Efficiently Finding Projected Nearest Neighbors

Preserving-Ignoring Transformation Based Index for Approximate k Nearest Neighbor Search

A New Design of High-Performance Large-Scale GIS Computing at a Finer Spatial Granularity: A Case Study of Spatial Join with Spark for Sustainability

An Improved Method of Locality Sensitive Hashing for Indexing Large-Scale and High-Dimensional Features

LazyLSH: Approximate Nearest Neighbor Search for Multiple Distance Functions with a Single Index

Implementing and Evaluating E2LSH on Storage

Scalable Distributed Hashing for Approximate Nearest Neighbor Search

Frequency Based Locality Sensitive Hashing

SparkRDF: Elastic Discreted RDF Graph Processing Engine with Distributed Memory

DB-LSH: Locality-Sensitive Hashing with Query-based Dynamic Bucketing

Entropy-based Outlier Detection Using Spark