Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on Multi-component Key Indexes

Alexander B. Veretennikov
DOI: https://doi.org/10.48550/arXiv.2101.03327
2021-01-09
Abstract:Proximity full-text search is commonly implemented in contemporary full-text search systems. Let us assume that the search query is a list of words. It is natural to consider a document as relevant if the queried words are near each other in the document. The proximity factor is even more significant for the case where the query consists of frequently occurring words. Proximity full-text search requires the storage of information for every occurrence in documents of every word that the user can search. For every occurrence of every word in a document, we employ additional indexes to store information about nearby words, that is, the words that occur in the document at distances from the given word of less than or equal to the MaxDistance parameter. We showed in previous works that these indexes can be used to improve the average query execution time by up to 130 times for queries that consist of words occurring with high-frequency. In this paper, we consider how both the search performance and the search quality depend on the value of MaxDistance and other parameters. Well-known GOV2 text collection is used in the experiments for reproducibility of the results. We propose a new index schema after the analysis of the results of the experiments. This is a pre-print of a contribution published in Supplementary Proceedings of the XXII International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2020), Voronezh, Russia, October 13-16, 2020, P. 336-350, published by CEUR Workshop Proceedings. The final authenticated version is available online at: <a class="link-external link-http" href="http://ceur-ws.org/Vol-2790/" rel="external noopener nofollow">this http URL</a>
Information Retrieval
What problem does this paper attempt to address?