Abstract: Nearest neighbor (NN) search in high-dimensional space is an essential query in many multimedia retrieval applications. Due to the curse of dimensionality, existing index structures may perform even worse than a simple sequential scan of the data when answering exact NN queries. To improve the efficiency of NN search, locality sensitive hashing (LSH) and its variants have been proposed to find approximate NNs. They adopt hash functions that preserve the Euclidean distance, so that similar objects have a high probability of colliding in the same bucket. Given a query object, candidates for the query result are obtained by accessing the points located in the same bucket. To improve precision, each hash table is associated with m hash functions that recursively hash the data points into smaller buckets and remove false positives. On the other hand, multiple hash tables are required to guarantee high retrieval recall. Thus, tuning a good tradeoff between precision and recall becomes the main challenge for LSH. Recently, the locality sensitive B-tree (LSB-tree) has been proposed to ensure both quality and efficiency. However, the index uses random I/O access. When the multimedia database is large, it incurs considerable disk I/O cost to obtain an approximation ratio that works in practice. In this paper, we propose a novel index structure, named HashFile, for efficient retrieval of multimedia objects. It combines the advantages of random projection and linear scan. Unlike the LSH family, in which each bucket is associated with a concatenation of m hash values, we recursively partition only the dense buckets and organize them as a tree structure. Given a query point q, the search algorithm explores the buckets near the query object in a top-down manner. The candidate buckets in each node are stored sequentially in increasing order of hash value and can be efficiently loaded into memory for linear scan. HashFile supports both exact and approximate NN queries. Experimental results show that HashFile performs better than existing indexes in answering both types of NN queries.
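The idea of hashing with a random projection and recursively partitioning only the dense buckets can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `make_hash`, `build`, and `candidates`, the `capacity` threshold, the `max_depth` guard, and the choice to halve the bucket width `w` at each level are all assumptions made for illustration. The lookup also visits only the single bucket matching the query, whereas the abstract describes exploring several buckets near the query.

```python
import math
import random


def make_hash(dim, w, rng):
    """Return h(v) = floor((a . v + b) / w), a random-projection hash.

    a is drawn from a Gaussian and b uniformly from [0, w]; this is the
    common Euclidean LSH form, assumed here rather than taken from the paper.
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(v):
        return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)

    return h


def build(points, dim, w, capacity, rng, depth=0, max_depth=8):
    """Hash points with one projection; re-partition only overfull buckets."""
    h = make_hash(dim, w, rng)
    buckets = {}
    for p in points:
        buckets.setdefault(h(p), []).append(p)

    node = {"hash": h, "buckets": []}
    for key in sorted(buckets):  # keep buckets in increasing hash-value order
        members = buckets[key]
        if len(members) > capacity and depth < max_depth:
            # Dense bucket: split it again with a fresh, finer projection
            # (halving w is an illustrative assumption).
            child = build(members, dim, w / 2, capacity, rng, depth + 1, max_depth)
            node["buckets"].append((key, child))
        else:
            # Sparse bucket (or depth limit reached): store the points directly.
            node["buckets"].append((key, members))
    return node


def candidates(node, q):
    """Descend top-down, following the bucket whose hash value matches q."""
    key = node["hash"](q)
    for k, content in node["buckets"]:
        if k == key:
            if isinstance(content, dict):
                return candidates(content, q)
            return content
    return []
```

For example, `build(points, dim=2, w=0.5, capacity=10, rng=random.Random(1))` returns a nested structure in which sparse buckets are stored as plain lists and dense ones as child nodes; the candidate list returned by `candidates` would then be scanned linearly to pick the nearest point.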

LHS: A Novel Method of Information Retrieval Avoiding an Index Using Linear Hashing with Key Groups in Deduplication.

A Fast Duplicate Chunk Identifying Method Based on Hierarchical Indexing Structure

Similarity and Locality Based Indexing for High Performance Data Deduplication.

CPI: A Collaborative Partial Indexing Design for Large-Scale Deduplication Systems

Zero-Chunk: An Efficient Cache Algorithm to Accelerate the I/O Processing of Data Deduplication

HashFile: An efficient index structure for multimedia data

An Efficient and Compact Indexing Scheme for Large-Scale Data Store.

A Novel Optimization Method to Improve De-duplication Storage System Performance

Using Multi-Threads to Hide Deduplication I/O Latency with Low Synchronization Overhead

Deduplication Model Based on File-Similarity Clustering

Utilizing the column imprints to accelerate no‐partitioning hash joins in large‐scale edge systems

Towards the design of efficient hash-based indexing scheme for growing databases on non-volatile memory

Data-oriented locality sensitive hashing.

LSHBloom: Memory-efficient, Extreme-scale Document Deduplication

SK-LSH: An Efficient Index Structure for Approximate Nearest Neighbor Search

SiLo: a Similarity-Locality Based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput

Optimization for Data De-Duplication Algorithm Based on File Content

Multiple-Loads Deduplication Method Based on Improved Sparse Indexing

SHHC: A Scalable Hybrid Hash Cluster for Cloud Backup Services in Data Centers

A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System

Dynamic Clustering-based Sharding in Distributed Deduplication Systems.