HSS: A Hierarchical Semantic Similarity Hard Negative Sampling Method for Dense Retrievers.

Xinjia Xie, Feng Liu, Shun Gai, Zhen Huang, Minghao Hu, Ankun Wang
DOI: https://doi.org/10.1007/978-3-031-27818-1_25
2023-01-01
Abstract: Dense Retriever (DR) for open-domain textual question answering (OpenQA), which aims to retrieve passages from large data sources like Wikipedia or Google, has gained wide attention in recent years. Although DR models continually advance state-of-the-art performance, their improvement relies on negative sampling during training. Existing sampling strategies mainly focus on developing complex algorithms based on computer science and ignore the abundant semantic features of datasets. We discover that there exist obvious changes in semantic similarity and present a three-level hierarchy of semantic similarity: same topic, same class, and other class, whose rationality is further demonstrated by an ablation study. Based on this, we propose a hard negative sampling strategy named Hierarchical Semantic Similarity (HSS). Our HSS model performs negative sampling at the semantic levels of topic and class, and experimental results on four datasets show that it achieves comparable or better retrieval performance than existing competitive baselines. The code is available at https://github.com/redirecttttt/HSS.
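The idea of sampling negatives at the topic and class levels can be illustrated with a minimal sketch. This is not the authors' implementation: the `Passage` type, the `topic` and `cls` labels, the function name, and the per-level counts are all hypothetical, and the abstract does not say how topics and classes are assigned or how many negatives are drawn per level.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    pid: int
    topic: str  # hypothetical fine-grained label (assumption)
    cls: str    # hypothetical coarse label (assumption)


def sample_hierarchical_negatives(positive, corpus, n_topic, n_class, rng):
    """Draw negatives from the 'same topic' level, then the 'same class' level.

    Passages in other classes are never returned here; they are treated
    as too dissimilar to count as hard negatives in this sketch.
    """
    # Level 1: same topic as the positive, excluding the positive itself.
    same_topic = [p for p in corpus
                  if p.pid != positive.pid and p.topic == positive.topic]
    # Level 2: same class as the positive but a different topic.
    same_class = [p for p in corpus
                  if p.cls == positive.cls and p.topic != positive.topic]
    topic_negs = rng.sample(same_topic, min(n_topic, len(same_topic)))
    class_negs = rng.sample(same_class, min(n_class, len(same_class)))
    return topic_negs + class_negs


# Toy corpus with made-up labels, for illustration only.
corpus = [
    Passage(0, "Apollo 11", "spaceflight"),
    Passage(1, "Apollo 11", "spaceflight"),
    Passage(2, "Apollo 11", "spaceflight"),
    Passage(3, "Voyager 1", "spaceflight"),
    Passage(4, "Voyager 1", "spaceflight"),
    Passage(5, "Mozart", "music"),
]
negatives = sample_hierarchical_negatives(
    corpus[0], corpus, n_topic=2, n_class=1, rng=random.Random(0)
)
```

In this toy setup, the two same-topic passages are always selected, one same-class passage is drawn at random, and the passage from the other class is never used as a hard negative.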