Abstract: Similarity search is a crucial task in multimedia retrieval and data mining. Most existing work models this problem as the nearest neighbor (NN) problem, which considers the distance between the query object and the data objects over a fixed set of features. Such an approach has two drawbacks: 1) it leaves many partial similarities undiscovered; 2) the distance is often affected by a few dimensions with high dissimilarity. To overcome these drawbacks, we propose the k-n-match problem in this paper. The k-n-match problem models similarity search as matching between the query object and the data objects in n dimensions, where n is a given integer smaller than the dimensionality d, and these n dimensions are determined dynamically so that the query object and the data objects returned in the answer set match best. The k-n-match query is expected to be superior to the kNN query in discovering partial similarities; however, it may not be as good at identifying full similarity, since a single value of n may correspond only to a particular aspect of an object rather than its entirety. To address this problem, we further introduce the frequent k-n-match problem, which finds the set of objects that appear most frequently in the k-n-match answers over a range of n values. Moreover, we propose search algorithms for both problems. We prove that our proposed algorithm is optimal in terms of the number of individual attributes retrieved, which is especially useful for information retrieval from multiple systems. We can also apply the proposed algorithmic strategy to obtain a disk-based algorithm for the (frequent) k-n-match query.
Through a thorough experimental study using both real and synthetic data sets, we show that: 1) the k-n-match query yields better results than the kNN query in identifying similar objects by partial similarities; 2) our proposed method (for processing the frequent k-n-match query) outperforms existing techniques for similarity search in terms of both effectiveness and efficiency.
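
To make the two query types concrete, the following is a minimal brute-force sketch in Python. It is not the optimal or disk-based algorithm mentioned in the abstract. It assumes that the "n-match difference" between an object and the query is the n-th smallest per-dimension absolute difference, which is one natural reading of "matching in n dimensions determined dynamically"; the abstract itself does not state this definition. The function names and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def n_match_difference(point, query, n):
    """Assumed definition: the n-th smallest per-dimension absolute difference."""
    diffs = sorted(abs(p - q) for p, q in zip(point, query))
    return diffs[n - 1]


def k_n_match(data, query, k, n):
    """Indices of the k objects with the smallest n-match difference (ties by index)."""
    order = sorted(
        range(len(data)),
        key=lambda i: (n_match_difference(data[i], query, n), i),
    )
    return order[:k]


def frequent_k_n_match(data, query, k, n_values):
    """Indices of the k objects that appear most often in the k-n-match answers."""
    counts = Counter()
    for n in n_values:
        counts.update(k_n_match(data, query, k, n))
    # Rank by descending frequency, breaking ties by index for determinism.
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    return ranked[:k]


# Toy data (hypothetical): object 1 matches the query exactly in two of the
# three dimensions but is far away in the third.
query = (0, 0, 0)
data = [(1, 1, 1), (0, 0, 100), (5, 5, 5)]

partial = k_n_match(data, query, k=1, n=2)   # considers the best 2 dimensions
full = k_n_match(data, query, k=1, n=3)      # considers all 3 dimensions
frequent = frequent_k_n_match(data, query, k=1, n_values=range(1, 4))
```

On this toy data, the query with n = 2 selects object 1, which a full-dimensional distance would rank last because of its single outlying dimension, while n = 3 selects object 0. Aggregating over n = 1..3 selects object 1, since it appears in two of the three answer sets. The sketch sorts every object for every n, so it only illustrates the semantics, not the attribute-retrieval cost that the paper's algorithm is proved to minimize.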

SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints

Matching Images Based on Consistency Graph and Region Adjacency Graphs.

An Efficient Partition Based Method for Exact Set Similarity Joins

Overlap Set Similarity Joins with Theoretical Guarantees.

Similarity Search: A Matching Based Approach.

A Constrained Clustering Based Approach for Matching a Collection of Feature Sets

Enhanced Fast Boolean Matching Based on Sensitivity Signatures Pruning

Large Scale Instance Matching Via Multiple Indexes and Candidate Selection

A faster algorithm for the limited-capacity many-to-many point matching in one dimension

Subgraph Matching with Set Similarity in a Large Graph Database

SimiSketch: Efficiently Estimating Similarity of streaming Multisets

Measuring semantic relatedness between Flickr images: from a social tag based view.

Layered Graph Matching by Composite Cluster Sampling with Collaborative and Competitive Interactions

No-But-Semantic-Match: Computing Semantically Matched XML Keyword Search Results

Editorial: Efficient discovery of similarity constraints for matching dependencies

Tight Correlated Item Sets And Their Efficient Discovery

Match²: A Matching over Matching Model for Similar Question Identification

A Faster Combinatorial Algorithm for Maximum Bipartite Matching

On seeded subgraph-to-subgraph matching: The ssSGM Algorithm and matchability information theory

A Similarity Measure for Weaving Patterns in Textiles

Semantic Matching of Documents from Heterogeneous Collections: A Simple and Transparent Method for Practical Applications