A Comparison Between Term-Based and Embedding-Based Methods for Initial Retrieval

Tonglei Guo, Jiafeng Guo, Yixing Fan, Yanyan Lan, Jun Xu, Xueqi Cheng
DOI: https://doi.org/10.1007/978-3-030-01012-6_3
2018-01-01
Abstract: The initial retrieval stage of information retrieval aims to generate as many relevant candidate documents as possible in a simple yet efficient way. Traditional term-based retrieval methods such as BM25 rely on a Bag-of-Words (BoW) representation, so they capture only exact (i.e., syntactic) matching and ignore semantically related words. This causes the typical vocabulary mismatch problem and reduces recall. Advances in distributed representations (i.e., embeddings) of words and documents provide an efficient way to measure the semantic relevance between words. Since embeddings can alleviate the vocabulary mismatch problem, they are suitable for the initial retrieval task. We conduct several experiments to compare term-based models with embedding-based models in terms of recall on three representative retrieval tasks (Web-QA, ad-hoc retrieval, and CQA). The results show that embedding-based and term-based methods are complementary, and that higher recall can be achieved by combining the two types of models based on either scores or ranking positions. We find that combination based on ranking position usually performs better than combination based on score. Furthermore, since queries and documents take different forms in different application scenarios, we observe that the relative performance of the two types of models is almost the same across scenarios, whereas their absolute performance differs significantly.
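The two combination strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact fusion formulas, so everything here is an assumption made for illustration: score-based combination is shown as a linear interpolation of min-max normalized scores, rank-based combination is shown as reciprocal rank fusion, and the function names, the parameters `alpha` and `k`, and the toy scores are all hypothetical.

```python
from typing import Dict, List, Set


def minmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so two retrievers become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_by_score(term: Dict[str, float], emb: Dict[str, float],
                  alpha: float = 0.5) -> List[str]:
    """Assumed score-based fusion: weighted sum of normalized scores.
    A document missing from one retriever contributes 0 for that retriever."""
    t, e = minmax(term), minmax(emb)
    docs = set(t) | set(e)
    fused = {d: alpha * t.get(d, 0.0) + (1 - alpha) * e.get(d, 0.0) for d in docs}
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(docs, key=lambda d: (-fused[d], d))


def fuse_by_rank(term: Dict[str, float], emb: Dict[str, float],
                 k: int = 60) -> List[str]:
    """Assumed rank-based fusion: reciprocal rank fusion, which uses only
    each document's position in the two rankings, not the raw scores."""
    fused: Dict[str, float] = {}
    for scores in (term, emb):
        ranking = sorted(scores, key=lambda d: (-scores[d], d))
        for pos, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + pos)
    return sorted(fused, key=lambda d: (-fused[d], d))


def recall_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top-k candidates."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


# Hypothetical toy scores: BM25-like values for the term-based retriever and
# cosine-like values for the embedding-based retriever.
term_scores = {"d1": 12.0, "d2": 9.0, "d3": 1.0}
emb_scores = {"d4": 0.9, "d3": 0.8, "d1": 0.2}

score_ranking = fuse_by_score(term_scores, emb_scores)
rank_ranking = fuse_by_rank(term_scores, emb_scores)
```

Both fused lists contain candidates that only one of the two retrievers returned, which is the sense in which the methods are complementary. `recall_at_k` can then be applied to either list against a set of judged relevant documents; the toy data is far too small to say anything about which strategy is better.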