Abstract: The goal in similarity search is to find objects similar to a specified query object given a certain similarity criterion. Although useful in many areas, such as multimedia retrieval, pattern recognition, and computational biology, to name but a few, similarity search is not yet well supported by commercial DBMSs. This may be due to the complex data types involved and the need for flexible similarity criteria seen in real applications. We propose an efficient disk-based metric access method, the Space-filling curve and Pivot-based B+-tree (SPB-tree), to support a wide range of data types and similarity metrics. The SPB-tree uses a small set of so-called pivots to significantly reduce the number of distance computations, uses a space-filling curve to cluster the data into compact regions, thus improving storage efficiency, and utilizes a B+-tree with minimum bounding box information as the underlying index. The SPB-tree also employs a separate random access file to efficiently manage large and complex data. By design, it is easy to integrate the SPB-tree into an existing DBMS. We present efficient similarity search algorithms and corresponding cost models based on the SPB-tree. Extensive experiments using real and synthetic data show that, compared with competing techniques, the SPB-tree has much lower construction cost and smaller storage size, supports more efficient similarity queries, and comes with highly accurate cost models. Moreover, the SPB-tree scales sublinearly with growing dataset size.
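The pivot and space-filling-curve ideas described in the abstract can be illustrated with a minimal in-memory sketch. This is not the SPB-tree itself: the abstract does not say which curve is used, so the Z-order curve here is an assumption, a sorted Python list stands in for the B+-tree, and the function names (`pivot_map`, `z_order`, `build_index`, `range_query`) are hypothetical. The sketch only shows how precomputed pivot distances allow objects to be discarded through the triangle inequality without computing their distance to the query.

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as an example metric on strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def pivot_map(obj: str, pivots: Sequence[str],
              dist: Callable[[str, str], int]) -> Tuple[int, ...]:
    """Map an object to the vector of its distances to the pivots."""
    return tuple(dist(obj, p) for p in pivots)


def z_order(coords: Sequence[int], bits: int = 8) -> int:
    """Interleave coordinate bits into one integer key (Z-order curve).

    Assumes every coordinate is a non-negative integer below 2**bits.
    """
    key = 0
    n = len(coords)
    for bit in range(bits):
        for dim, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * n + dim)
    return key


def build_index(objects: Sequence[str], pivots: Sequence[str],
                dist: Callable[[str, str], int]):
    """Sorted list of (curve key, pivot vector, object); a stand-in for a B+-tree."""
    entries = []
    for obj in objects:
        vec = pivot_map(obj, pivots, dist)
        entries.append((z_order(vec), vec, obj))
    entries.sort(key=lambda e: e[0])
    return entries


def range_query(query: str, radius: int, index, pivots: Sequence[str],
                dist: Callable[[str, str], int]) -> Tuple[List[str], int]:
    """Return objects within `radius` of `query` and the number of verified candidates."""
    q_vec = pivot_map(query, pivots, dist)
    results, verified = [], 0
    for _key, vec, obj in index:
        # Triangle inequality: |d(q, p) - d(o, p)| <= d(q, o) for every pivot p,
        # so a lower bound above the radius rules the object out.
        if max(abs(qv - ov) for qv, ov in zip(q_vec, vec)) > radius:
            continue
        verified += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, verified


words = ["apple", "apply", "ample", "maple", "banana", "bandana", "cherry"]
pivots = ["apple", "banana"]  # pivots chosen arbitrarily for illustration
index = build_index(words, pivots, edit_distance)
matches, verified = range_query("aple", 1, index, pivots, edit_distance)
print(sorted(matches), "verified candidates:", verified, "of", len(words))
```

A disk-based index would additionally use the curve keys and bounding boxes to skip whole tree nodes; the linear scan above keeps the example short and leaves that part out.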

Efficient Similarity Search for Tree-Structured Data

An Efficient Framework for Exact Set Similarity Search Using Tree Structure Indexes

Extend Tree Edit Distance for Effective Object Identification

A unified framework for string similarity search with edit-distance constraint

Similarity Metric for XML Documents

Effective Indices for Efficient Approximate String Search and Similarity Join

Approximate top-k structural similarity search over XML documents

Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search

Measuring Similarity of Web Pages on Maximum Isomorphic Subtree

Efficient Similarity Join and Search on Multi-Attribute Data

SFTM: Fast Comparison of Web Documents using Similarity-based Flexible Tree Matching

Fast Comparative Analysis of Merge Trees Using Locality Sensitive Hashing

Efficient Graph Similarity Search over Large Graph Databases

Trie-join: a Trie-Based Method for Efficient String Similarity Joins

Evaluate Structure Similarity In Xml Documents With Merge-Edit-Distance

Top-k String Similarity Search with Edit-Distance Constraints

Finding Dependency Trees from Binary Data

Accelerating Sequence Searching: Dimensionality Reduction Method

Efficient Metric Indexing for Similarity Search

A Partition-Based Method for String Similarity Joins with Edit-Distance Constraints

Efficient Computation of the Tree Edit Distance