Abstract: The problem of similarity queries has received much attention in recent years because of its wide applications in many new and emerging areas. The objective of this thesis is to develop and analyze novel algorithms to support similarity queries using the vector model. We first discuss supporting similarity queries in multidimensional Non-ordered Discrete Data Spaces (NDDS), which are very important for application areas such as data mining and bioinformatics. Existing indexing methods developed for Continuous Data Spaces (CDS) cannot be directly applied to an NDDS because an NDDS lacks some essential geometric concepts and properties. To solve this problem, we establish discrete geometrical concepts that have similar counterparts in a CDS. Based on these concepts, we have developed two novel indexing structures, the ND-tree and the NSP-tree. The ND-tree is the first index structure of its kind; its construction algorithms are designed around the special properties of an NDDS and use a data-partitioning approach. The NSP-tree is also based on the special properties of an NDDS, but it uses space-partitioning techniques together with new strategies, such as partitioning the actual data space instead of the whole space and using more than one minimum bounding rectangle per node. Our extensive studies show that the performance of the ND-tree and the NSP-tree is significantly better than that of existing methods. The NSP-tree is shown to be particularly efficient for large skewed datasets. We have also proposed the NDh-tree to support similarity queries in Hybrid Data Spaces (HDS), which contain both continuous and non-ordered discrete dimensions. As an extension of the ND-tree, the NDh-tree is based on geometrical concepts defined for an HDS and is capable of handling continuous dimensions efficiently. Our experimental results show that the NDh-tree is a promising indexing structure for HDSs.
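To make the idea of discrete geometrical concepts more concrete, the following is a minimal sketch, not the thesis's implementation. It assumes that vectors in an NDDS are compared with the Hamming distance and that a discrete minimum bounding rectangle can be represented as one set of values per dimension; the abstract does not state either detail, and all function names are hypothetical.

```python
from typing import List, Sequence, Set


def hamming_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the dimensions in which two discrete vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(1 for x, y in zip(a, b) if x != y)


def discrete_mbr(vectors: Sequence[Sequence[str]]) -> List[Set[str]]:
    """Assumed representation of a discrete bounding rectangle:
    for each dimension, the set of values that occur in the given vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dims = len(vectors[0])
    return [{v[i] for v in vectors} for i in range(dims)]


def min_distance_to_mbr(query: Sequence[str], mbr: Sequence[Set[str]]) -> int:
    """Lower bound on the Hamming distance from the query to any vector
    inside the rectangle: dimensions whose set does not contain the query value."""
    return sum(1 for q, values in zip(query, mbr) if q not in values)


if __name__ == "__main__":
    data = ["ACGT", "AGGT"]          # e.g. short genome fragments
    box = discrete_mbr(data)
    print(hamming_distance(data[0], data[1]))
    print(min_distance_to_mbr("TCGA", box))
```

Under these assumptions, a lower bound of this kind is what allows a tree-structured index to skip a subtree whose rectangle is already farther from the query than the current search radius.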
The thesis also addresses the problem of choosing a suitable distance measure for similarity queries using the vector model. Standard criteria for selecting an appropriate distance measure have yet to be established. In this thesis, however, we provide a basis for comparing distance measures for similarity queries. We do this by introducing a theoretical model to analyze the relationship between two commonly used distance measures, the Euclidean distance and the cosine angle distance, in multidimensional data spaces. A methodology similar to the one proposed for the model can be used to analyze other distance measures, such as the Manhattan distance. We believe that this work provides a fundamental basis for understanding and comparing distance measures for similarity queries.
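As a small illustration of how the two distance measures relate, the sketch below checks a standard identity: for vectors normalized to unit length, the squared Euclidean distance equals twice the cosine distance. This is not the theoretical model from the thesis. The cosine angle distance is assumed here to mean 1 - cos(theta), which the abstract does not specify, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_angle_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed definition: 1 - cos(theta), where theta is the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_x * norm_y)


def normalize(x: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in x))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in x]


if __name__ == "__main__":
    x, y = normalize([3.0, 4.0]), normalize([4.0, 3.0])
    d_euclid = euclidean_distance(x, y)
    d_cosine = cosine_angle_distance(x, y)
    # For unit vectors: d_euclid**2 == 2 * d_cosine (up to rounding).
    print(math.isclose(d_euclid ** 2, 2.0 * d_cosine))
```

On unit vectors the two measures therefore rank neighbors identically; they can disagree only when vector lengths differ, which is where a comparison of the two measures becomes interesting.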

A Space-Partitioning-based Indexing Method for Multidimensional Non-Ordered Discrete Data Spaces

Dynamic Indexing for Multidimensional Non-Ordered Discrete Data Spaces Using a Data-Partitioning Approach

The ND-Tree: A Dynamic Indexing Technique for Multidimensional Non-ordered Discrete Data Spaces.

Space-Partitioning-Based Bulk-Loading for the NSP-Tree in Non-ordered Discrete Data Spaces

The C-ND Tree: a Multidimensional Index for Hybrid Continuous and Non-Ordered Discrete Data Spaces

Principles and applications for supporting similarity queries in non-ordered-discrete and continuous data spaces

Bulk-Loading the ND-Tree in Non-Ordered Discrete Data Spaces

A Study of Indexing Strategies for Hybrid Data Spaces

An Efficient Peer-to-peer Indexing Tree Structure for Multidimensional Data.

Efficient Metric Indexing for Similarity Search

Indexing High-Dimensional Data in Dual Distance Spaces

Indexing high-dimensional data in dual distance spaces: a symmetrical encoding approach

Z Tree: an Index Structure for High-dimensional Data

Towards a Painless Index for Spatial Objects

iDistance: An adaptive B+-tree based indexing method for nearest neighbor search

Dynamic High Dimensional Data Mapping for Efficient Similarity Query Processing

The SS+-Tree: An Improved Index Structure for Similarity Searches in a High-Dimensional Feature Space

SDI: a Swift Tree Structure for Multi-Dimensional Data Indexing in Peer-to-peer Networks

Parallel indexing technique for spatio-temporal data

KSR-Tree: A Clustering Based High-Dimensional Indexing Approach

DPsIR+: A Distributed and Parallel Spatial Index Tree Based on Dynamic Spatial Slot