Towards Robust Text Retrieval with Progressive Learning

Tong Wu, Yulei Qin, Enwei Zhang, Zihan Xu, Yuting Gao, Ke Li, Xing Sun

2023-11-20

Abstract: Retrieval augmentation has become an effective way to equip large language models (LLMs) with external, verified knowledge from a database, which mitigates the limitations and hallucinations of LLMs when handling up-to-date and domain-specific information. However, existing embedding models for text retrieval usually have three non-negligible limitations. First, the number and diversity of samples in a batch are too restricted to supervise the modeling of textual nuances at scale. Second, the high proportion of noise is detrimental to the semantic correctness and consistency of embeddings. Third, treating easy and difficult samples equally leads to sub-optimal convergence of embeddings and poorer generalization. In this paper, we propose PEG, progressively learned embeddings for robust text retrieval. Specifically, we increase the number of in-batch negative samples during training to 80,000, and for each query we extract five hard negatives. In addition, we incorporate a progressive learning mechanism that enables the model to dynamically modulate its attention to the samples throughout the entire training process. PEG is trained on more than 100 million data samples, encompassing a wide range of domains (e.g., finance, medicine, and tourism) and covering various tasks (e.g., question answering, machine reading comprehension, and similarity matching). Extensive experiments conducted on C-MTEB and DuReader demonstrate that PEG surpasses state-of-the-art embeddings in retrieving true positives, highlighting its significant potential for applications in LLMs. Our model is publicly available at https://huggingface.co/TownsWu/PEG.
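The combination of in-batch negatives and per-query hard negatives described in the abstract can be illustrated with a standard InfoNCE-style contrastive loss. The abstract does not state the exact loss or its hyperparameters, so the following is only a minimal sketch under assumptions: the function name `info_nce_loss`, the `temperature` value, and the use of L2-normalised vectors with dot-product similarity are all illustrative choices, not details taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(queries, positives, hard_negatives, temperature=0.05):
    """Mean InfoNCE loss over a batch (illustrative, not the paper's code).

    queries[i] and positives[i] form a matched pair. Every other entry of
    `positives` serves as an in-batch negative for query i, and
    hard_negatives[i] is a list of extra negative vectors mined for query i.
    Vectors are assumed to be L2-normalised, so the dot product equals the
    cosine similarity. The temperature value is an assumption.
    """
    losses = []
    for i, q in enumerate(queries):
        # Candidate i is the true positive; the rest are negatives.
        candidates = positives + hard_negatives[i]
        logits = [dot(q, c) / temperature for c in candidates]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

In this formulation a larger batch directly enlarges the pool of negatives each query is contrasted against, which is why scaling the number of in-batch negatives changes the training signal rather than only the throughput.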

Information Retrieval, Artificial Intelligence

What problem does this paper attempt to address?

The paper targets several weaknesses of contrastive learning for text retrieval, with the goal of making embedding models more robust and better at generalizing when they supply retrieved knowledge to large language models (LLMs). Specifically, existing methods have the following three main limitations:

1. **Insufficient sample quantity and diversity**: The limited number and diversity of samples in a batch cannot effectively supervise the modeling of textual nuances at scale.
2. **High noise ratio**: A high proportion of noise is detrimental to the semantic correctness and consistency of embedding vectors.
3. **Equal treatment of easy and hard samples**: Treating easy and hard samples equally leads to sub-optimal convergence of embedding vectors and weaker generalization.

To address these issues, the authors propose PEG, progressively learned embeddings for robust text retrieval. PEG increases the number of in-batch negative samples during training to 80,000 and extracts five hard negatives for each query. Additionally, PEG introduces a progressive learning mechanism that allows the model to dynamically adjust its attention to the samples throughout the training process.

PEG was trained on more than 100 million data samples covering a wide range of domains (such as finance, medicine, and tourism) and tasks (such as question answering, machine reading comprehension, and similarity matching). Experiments on C-MTEB and DuReader show that PEG surpasses state-of-the-art embeddings in retrieving true positives.
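The idea of shifting the model's attention across samples during training can be illustrated with a difficulty-aware weighting schedule. The summary does not give the paper's actual weighting rule, so everything in this sketch is hypothetical: the function names, the use of per-sample loss as a proxy for difficulty, the linear schedule, and the `max_sharpness` parameter are assumptions made only to show the general shape of such a mechanism.

```python
import math


def progressive_weights(per_sample_losses, step, total_steps, max_sharpness=2.0):
    """Hypothetical difficulty-aware weights (not the paper's formula).

    At step 0 all samples receive equal weight. As training progresses,
    the weights shift toward samples with higher loss, which are treated
    here as the "harder" ones. The weights always sum to 1.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    sharpness = max_sharpness * progress
    scores = [sharpness * loss for loss in per_sample_losses]
    # Softmax over the scaled losses, stabilised by subtracting the max.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_loss(per_sample_losses, weights):
    """Combine per-sample losses using the given weights."""
    return sum(w * l for w, l in zip(weights, per_sample_losses))
```

With per-sample losses such as `[0.1, 1.0, 3.0]`, the weights are uniform at the first step and concentrate on the third sample by the last step. Whether emphasis should move toward harder or easier samples over time is a design choice that this sketch fixes arbitrarily.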

Retrieve Anything To Augment Large Language Models

Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding

A Multi-Task Embedder For Retrieval Augmented LLMs

Making Large Language Models A Better Foundation For Dense Retrieval

Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark

Robust Textual Embedding Against Word-level Adversarial Attacks

Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation

PEFA: Parameter-Free Adapters for Large-scale Embedding-based Retrieval Models

LLMEmbed: Rethinking Lightweight LLM's Genuine Function in Text Classification

Efficient fine-tuning methodology of text embedding models for information retrieval: contrastive learning penalty (clp)

TELLMe: Teaching and Exploiting Large Language Models for Model Selection in Text Retrieval

PEAR: Position-Embedding-Agnostic Attention Re-weighting Enhances Retrieval-Augmented Generation with Zero Inference Overhead

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Augmented Embeddings for Custom Retrievals

Making Text Embedders Few-Shot Learners

UATVR: Uncertainty-Adaptive Text-Video Retrieval

QAEA-DR: A Unified Text Augmentation Framework for Dense Retrieval