NV-Retriever: Improving text embedding models with effective hard-negative mining

Gabriel de Souza P. Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, Even Oldridge
2024-07-23
Abstract: Text embedding models have been popular for information retrieval applications such as semantic search and question-answering systems based on Retrieval-Augmented Generation (RAG). Those models are typically Transformer models that are fine-tuned with contrastive learning objectives. Many papers have introduced new embedding model architectures and training approaches; however, one of the key ingredients, the process of mining negative passages, remains poorly explored or described. One of the challenging aspects of fine-tuning embedding models is the selection of high-quality hard-negative passages for contrastive learning. In this paper we propose a family of positive-aware mining methods that leverage the positive relevance score for more effective removal of false negatives. We also provide a comprehensive ablation study of hard-negative mining methods and their configurations, exploring different teacher and base models. We demonstrate the efficacy of our proposed methods by introducing the NV-Retriever-v1 model, which scores 60.9 on the MTEB Retrieval (BEIR) benchmark, 0.65 points higher than previous methods. The model placed 1st when it was published to MTEB Retrieval on July 07, 2024.
Information Retrieval, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to select high-quality hard-negative samples when fine-tuning text embedding models, so that those models perform better on information retrieval tasks. Specifically, the authors focus on improving contrastive-learning-based text embedding models through more effective hard-negative mining. Contrastive learning typically trains on triples consisting of a query, a positive passage, and one or more negative passages, and the choice of negatives strongly affects model quality. Selecting negatives at random yields examples that are too easy to be informative, while naively taking the highest-scoring retrieved passages risks including false negatives, that is, passages that are actually relevant to the query but are not labeled as positive. The paper therefore proposes a family of positive-aware hard-negative mining methods, which use the relevance score of the positive passage to better identify and remove potential false negatives and thereby improve contrastive training. With this approach, the authors aim to achieve higher retrieval accuracy in applications such as semantic search and question-answering systems based on retrieval-augmented generation (RAG).
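
The idea of using the positive relevance score to filter out likely false negatives can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mine_hard_negatives`, the parameter `max_fraction_of_positive`, the value 0.95, and the toy scores are all assumptions chosen for illustration. It assumes a teacher model has already scored each candidate passage and the positive passage against the query, and that scores are non-negative similarities.

```python
from typing import List, Sequence


def mine_hard_negatives(
    candidate_scores: Sequence[float],
    positive_score: float,
    top_k: int = 4,
    max_fraction_of_positive: float = 0.95,  # illustrative value, an assumption
) -> List[int]:
    """Return indices of up to top_k hard negatives (hypothetical helper).

    Candidates scoring above max_fraction_of_positive * positive_score are
    treated as likely false negatives and skipped. Assumes non-negative scores.
    """
    threshold = positive_score * max_fraction_of_positive
    # Keep only candidates at or below the positive-aware threshold.
    kept = [i for i, s in enumerate(candidate_scores) if s <= threshold]
    # Order the survivors from hardest (highest score) to easiest.
    kept.sort(key=lambda i: candidate_scores[i], reverse=True)
    return kept[:top_k]


# Toy teacher scores for five candidate passages and one positive passage.
scores = [0.91, 0.88, 0.72, 0.65, 0.30]
positive = 0.90
hard_negatives = mine_hard_negatives(scores, positive, top_k=2)
print(hard_negatives)  # [2, 3]
```

In this toy case a naive top-2 selection would pick candidates 0 and 1, whose scores are close to or above the positive's score and which are therefore the most likely to be unlabeled relevant passages. The positive-aware threshold skips them and returns the next hardest candidates instead.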