Abstract: Similarity search is a crucial task in multimedia retrieval and data mining. Most existing work has modelled this problem as the nearest neighbor (NN) problem, which considers the distance between the query object and the data objects over a fixed set of features. Such an approach has two drawbacks: 1) it leaves many partial similarities undiscovered; 2) the distance is often affected by a few dimensions with high dissimilarity. To overcome these drawbacks, we propose the k-n-match problem in this paper. The k-n-match problem models similarity search as matching between the query object and the data objects in n dimensions, where n is a given integer smaller than the dimensionality d, and these n dimensions are determined dynamically so that the query object and the data objects returned in the answer set match best. The k-n-match query is expected to be superior to the kNN query in discovering partial similarities; however, it may not be as good at identifying full similarity, since a single value of n may correspond only to a particular aspect of an object rather than its entirety. To address this problem, we further introduce the frequent k-n-match problem, which finds a set of objects that appear in the k-n-match answers most frequently for a range of n values. Moreover, we propose search algorithms for both problems. We prove that our proposed algorithm is optimal in terms of the number of individual attributes retrieved, which is especially useful for information retrieval from multiple systems. We can also apply the proposed algorithmic strategy to obtain a disk-based algorithm for the (frequent) k-n-match query.
Through a thorough experimental study using both real and synthetic data sets, we show that: 1) the k-n-match query yields better results than the kNN query in identifying similar objects by partial similarities; 2) our proposed method (for processing the frequent k-n-match query) outperforms existing techniques for similarity search in terms of both effectiveness and efficiency.
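
The two query types described in the abstract can be illustrated with a minimal brute-force sketch. It rests on assumptions the abstract does not spell out: the "n-match difference" between an object and the query is taken to be the n-th smallest per-dimension absolute difference, and ties are broken by object index. The function names (`n_match_difference`, `k_n_match`, `frequent_k_n_match`) are hypothetical, and this exhaustive scan is not the attribute-optimal algorithm the abstract refers to; it only shows what the queries are meant to return.

```python
from collections import Counter
from typing import List, Sequence


def n_match_difference(p: Sequence[float], q: Sequence[float], n: int) -> float:
    """Assumed definition: the n-th smallest per-dimension absolute difference."""
    if not 1 <= n <= len(q) or len(p) != len(q):
        raise ValueError("n must be in [1, d] and both points must have d dimensions")
    diffs = sorted(abs(a - b) for a, b in zip(p, q))
    return diffs[n - 1]


def k_n_match(data: Sequence[Sequence[float]], query: Sequence[float],
              k: int, n: int) -> List[int]:
    """Indices of the k objects with the smallest n-match difference to the query."""
    ranked = sorted(
        range(len(data)),
        key=lambda i: (n_match_difference(data[i], query, n), i),  # index breaks ties
    )
    return ranked[:k]


def frequent_k_n_match(data: Sequence[Sequence[float]], query: Sequence[float],
                       k: int, n_lo: int, n_hi: int) -> List[int]:
    """Indices of the k objects appearing most often in the k-n-match answers
    for every n in [n_lo, n_hi] (inclusive)."""
    counts: Counter = Counter()
    for n in range(n_lo, n_hi + 1):
        counts.update(k_n_match(data, query, k, n))
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    return ranked[:k]


if __name__ == "__main__":
    data = [
        [1.0, 1.0, 100.0],  # matches the query exactly in two of three dimensions
        [2.0, 2.0, 2.0],    # moderately close in every dimension
        [5.0, 5.0, 5.0],    # far in every dimension
    ]
    query = [1.0, 1.0, 1.0]
    print(k_n_match(data, query, k=1, n=2))
    print(k_n_match(data, query, k=1, n=3))
    print(frequent_k_n_match(data, query, k=2, n_lo=1, n_hi=3))
```

In this toy data set, object 0 is the best partial match for n = 2 even though a single highly dissimilar dimension would push it far away under a full-dimensional distance, which is the second drawback the abstract attributes to NN search. Aggregating over a range of n values, as in the frequent variant, favours objects that rank well across many choices of n.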

Combining an order-semisensitive text similarity and closest fit approach to textual missing values in knowledge discovery

A method for finding groups of related herbs in traditional Chinese medicine

Semantic Similarity Measures to Disambiguate Terms in Medical Text.

Multi-label text categorization using k-nearest neighbor approach with m-similarity

Assessment of approximate string matching in a biomedical text retrieval problem

Missing Data Exploration: Highlighting Graphical Presentation of Missing Pattern.

Combining similarity measures in content-based image retrieval guided by mutual information

Combining data discretization and missing value imputation for incomplete medical datasets

Top-K Spatio-Textual Similarity Search

CTextEM: Using Consolidated Textual Data for Entity Matching

Document Similarity for Texts of Varying Lengths via Hidden Topics

A survey on the techniques, applications, and performance of short text semantic similarity

Hybrid Missing Value Imputation Algorithms Using Fuzzy C-Means and Vaguely Quantified Rough Set

A New Retrieval Model Based on TextTiling for Document Similarity Search

Missing data imputation by K nearest neighbours based on grey relational structure and mutual information

The Short Text Matching Model Enhanced with Knowledge via Contrastive Learning

An Ensemble Semantic Textual Similarity Measure Based on Multiple Evidences for Biomedical Documents

Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data

Beyond topical similarity: a structural similarity measure for retrieving highly similar documents

Multi-Intent Attribute-Aware Text Matching in Searching

Similarity Search: A Matching Based Approach.