Abstract:
Motivation: Many tasks in sequence analysis require identifying biologically related sequences in a large set. The edit distance, being a sensible model for both evolution and sequencing error, is widely used as the distance measure in these tasks. The resulting computational problem, recognizing all pairs of sequences within a small edit distance, turns out to be exceedingly difficult, since the edit distance is notoriously expensive to compute and all-versus-all comparison is infeasible with millions or billions of sequences. Among many attempts to meet this challenge, we recently proposed locality-sensitive bucketing (LSB) functions. Formally, a (d1,d2)-LSB function sends sequences into multiple buckets with the guarantee that pairs of sequences of edit distance at most d1 can be found in a common bucket, while those of edit distance at least d2 do not share any bucket. LSB functions generalize locality-sensitive hashing (LSH) functions and admit favorable properties, a notable highlight being that optimal LSB functions exist for certain (d1,d2). LSB functions hold the potential of solving the above problem optimally, but the existence of LSB functions for more general (d1,d2) remains unclear, let alone their construction for practical use.
Results: In this work, we aim to utilize machine learning techniques to train LSB functions. With the development of a novel loss function and insights into neural network structures that can potentially extend beyond this specific task, we obtained LSB functions that exhibit nearly perfect accuracy for certain (d1,d2), matching our theoretical results, and high accuracy for many others. Compared to the state-of-the-art LSH method Order Min Hash, the trained LSB functions achieve a 2- to 5-fold improvement in the sensitivity of recognizing similar sequences. An experiment on analyzing erroneous cell barcode data is also included to demonstrate the application of the trained LSB functions.
Availability and implementation: The code for the training process and the structure of trained models are freely available at https://github.com/Shao-Group/lsb-learn.
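
Illustration: the (d1,d2)-LSB guarantee defined in the abstract can be checked by brute force on a small set of sequences. The sketch below is not taken from the paper or its repository; the function names (`edit_distance`, `deletion_buckets`, `check_lsb`) are hypothetical, and the toy bucketing function, which sends a sequence to all strings obtained by deleting one character, is a hand-built example rather than a trained model. On equal-length sequences it should behave as a (1,3)-LSB function: a single substitution is undone by deleting the substituted position from both sequences, and any two sequences sharing a one-deletion bucket are at most two edits apart.

```python
from itertools import combinations, product


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def deletion_buckets(s):
    """Toy bucketing function (an assumption for illustration only):
    one bucket per string obtained by deleting a single character of s."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}


def check_lsb(bucket_fn, seqs, d1, d2):
    """Brute-force check of the (d1,d2)-LSB guarantee on seqs.

    Returns (missed, collided):
      missed   - pairs with edit distance <= d1 that share no bucket
      collided - pairs with edit distance >= d2 that share a bucket
    Pairs with distance strictly between d1 and d2 are unconstrained.
    """
    buckets = {s: bucket_fn(s) for s in seqs}
    missed, collided = [], []
    for s, t in combinations(seqs, 2):
        dist = edit_distance(s, t)
        shared = bool(buckets[s] & buckets[t])
        if dist <= d1 and not shared:
            missed.append((s, t))
        elif dist >= d2 and shared:
            collided.append((s, t))
    return missed, collided


if __name__ == "__main__":
    # All DNA sequences of length 3 (64 sequences, 2016 pairs).
    seqs = ["".join(p) for p in product("ACGT", repeat=3)]
    missed, collided = check_lsb(deletion_buckets, seqs, d1=1, d2=3)
    print("violations as (1,3)-LSB:", len(missed), len(collided))
    missed, collided = check_lsb(deletion_buckets, seqs, d1=1, d2=2)
    print("violations as (1,2)-LSB:", len(missed), len(collided))
```

Running the same check with d2=2 reports collisions (for example, AAC and ACA are at edit distance 2 yet share the bucket AC), which shows why narrowing the gap between d1 and d2 is the hard part that the trained functions address. The quadratic all-pairs loop is only for verification on tiny inputs; it is exactly the comparison that bucketing is meant to avoid at scale.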

Learning locality-sensitive bucketing functions