Information Retrieval in long documents: Word clustering approach for improving Semantics

Paul Mbate Mekontchou,Armel Fotsoh,Bernabe Batchakui,Eddy Ella
DOI: https://doi.org/10.48550/arXiv.2302.10150
2023-02-21
Abstract:In this paper, we propose an alternative to deep neural networks for semantic information retrieval for the case of long documents. This new approach exploiting clustering techniques to take into account the meaning of words in Information Retrieval systems targeting long as well as short documents. This approach uses a specially designed clustering algorithm to group words with similar meanings into clusters. The dual representation (lexical and semantic) of documents and queries is based on the vector space model proposed by Gerard Salton in the vector space constituted by the formed clusters. The originalities of our proposal are at several levels: first, we propose an efficient algorithm for the construction of clusters of semantically close words using word embedding as input, then we define a formula for weighting these clusters, and then we propose a function allowing to combine efficiently the meanings of words with a lexical model widely used in Information Retrieval. The evaluation of our proposal in three contexts with two different datasets SQuAD and TREC-CAR has shown that is significantly improves the classical approaches only based on the keywords without degrading the lexical aspect.
Information Retrieval
What problem does this paper attempt to address?