Random mappings designed for commercial search engines

Roger Donaldson, Arijit Gupta, Yaniv Plan, Thomas Reimer
DOI: https://doi.org/10.48550/arXiv.1507.05929
2015-07-22
Abstract: We give a practical random mapping that takes any set of documents represented as vectors in Euclidean space and maps them to a sparse subset of the Hamming cube while retaining the ordering of inter-vector inner products. Once the documents are represented in the sparse space, it is natural to index them using commercial text-based search engines, which are specialized to take advantage of this sparse and discrete structure for large-scale document retrieval. We give a theoretical analysis of the mapping scheme, characterizing its exact asymptotic behavior and giving non-asymptotic bounds, which we verify through numerical simulations. We balance the theoretical treatment with several practical considerations; these allow a substantial speedup of the method. We further illustrate the use of this method for search over two real data sets: a corpus of images represented by their color histograms, and a corpus of daily stock market index values.
Information Retrieval, Information Theory
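
To make the idea in the abstract concrete, here is a minimal sketch of one way a dense vector could be turned into a sparse binary code whose overlaps track inner products. It is an illustration only and is not taken from the paper: the construction (a Gaussian random projection followed by keeping the k largest coordinates), the function name `sparse_binary_map`, and the parameters `m`, `k`, and `seed` are all assumptions, and the vectors are synthetic.

```python
import numpy as np


def sparse_binary_map(X, m=2048, k=32, seed=0):
    """Map each row of X (n x d) to a k-sparse binary code of length m.

    Assumed scheme (not the paper's stated construction): project with a
    Gaussian matrix, then set the k largest coordinates of each row to 1.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[1], m))      # random projection matrix
    Z = X @ A                                     # n x m projected values
    top = np.argpartition(Z, -k, axis=1)[:, -k:]  # indices of the k largest per row
    codes = np.zeros(Z.shape, dtype=np.uint8)
    np.put_along_axis(codes, top, 1, axis=1)      # mark those coordinates
    return codes


# Synthetic example: a query, a slightly perturbed copy, and an unrelated vector.
rng = np.random.default_rng(1)
query = rng.standard_normal(64)
near = query + 0.1 * rng.standard_normal(64)
far = rng.standard_normal(64)

codes = sparse_binary_map(np.stack([query, near, far]))
overlap = codes.astype(int) @ codes.astype(int).T  # shared active coordinates
print("overlap(query, near):", overlap[0, 1])
print("overlap(query, far): ", overlap[0, 2])
```

Each active coordinate can then be treated as a token (for example, the string form of its index), so a document becomes a short bag of tokens that a text search engine can index and match by token overlap.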