Abstract: Large language models (LLMs) face significant challenges stemming from their inherent limitations in knowledge, memory, alignment, and action. These challenges cannot be addressed by LLMs alone, but should rely on assistance from the external world, such as knowledge bases, memory stores, demonstration examples, and tools. Retrieval augmentation stands as a vital mechanism for bridging the gap between LLMs and the external assistance. However, conventional methods encounter two pressing issues. On the one hand, the general-purpose retrievers are not properly optimized for the retrieval augmentation of LLMs. On the other hand, the task-specific retrievers lack the required versatility, hindering their performance across the diverse retrieval augmentation scenarios. In this work, we present a novel approach, the LLM-Embedder, which comprehensively supports the diverse retrieval augmentation needs of LLMs with one unified embedding model. Training such a unified model is non-trivial, as various retrieval tasks aim to capture distinct semantic relationships, often subject to mutual interference. To address this challenge, we systematically optimize our training methodology. This includes reward formulation based on LLMs' feedback, the stabilization of knowledge distillation, multi-task fine-tuning with explicit instructions, and homogeneous in-batch negative sampling. These optimization strategies contribute to the outstanding empirical performance of the LLM-Embedder. Notably, it yields remarkable enhancements in retrieval augmentation for LLMs, surpassing both general-purpose and task-specific retrievers in various evaluation scenarios. Our checkpoint and source code are publicly available at https://github.com/FlagOpen/FlagEmbedding.

What problem does this paper attempt to address?

The paper targets several core challenges that stem from the inherent limitations of large language models (LLMs): limits on knowledge capacity, memory, and the ability to act. Specifically:

1. **Knowledge Boundary**: Because model parameters are finite, LLMs cannot fully internalize the world's extensive knowledge. Their internal knowledge is also static and hard to keep in step with a changing world. In addition, LLMs are trained mainly on publicly available, high-frequency data, which can lead to inaccuracies on domain-specific or long-tail knowledge.
2. **Memory Boundary**: LLMs face severe memory limitations, mainly because of their limited context length. Although the maximum context length keeps growing, it still falls short of what lifelong interaction with human users would require. LLMs with extended context can also demand excessive computational and storage resources during training and deployment, which makes it impractical to expand their memory significantly.
3. **Ability Boundary**: The abilities of LLMs are limited in terms of action and autonomy. First, they are confined to the "language space" and cannot interact meaningfully with the physical world. Second, they depend heavily on human guidance and need clear user instructions and appropriate demonstration examples to perform specific tasks effectively.

To overcome these limitations, external assistance is introduced through retrieval-augmented generation. However, existing methods have two main problems. On the one hand, general-purpose retrievers are not properly optimized for the retrieval augmentation of LLMs. On the other hand, task-specific retrievers lack the required versatility, which limits their performance across different retrieval-augmentation scenarios.

For this reason, the paper proposes a new method named LLM-Embedder, which aims to support the diverse retrieval-augmentation needs of LLMs with one unified embedding model. Its training methodology is systematically optimized through reward formulation based on LLMs' feedback, stabilization of knowledge distillation, multi-task fine-tuning with explicit instructions, and homogeneous in-batch negative sampling. Experimental results show that LLM-Embedder outperforms both general-purpose and task-specific retrievers in various evaluation scenarios, effectively improving retrieval augmentation for LLMs.
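Two of the training ingredients named above, explicit task instructions and homogeneous in-batch negative sampling, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "homogeneous" means every batch is drawn from a single task, so that in-batch negatives share the query's task, and it uses a standard InfoNCE-style contrastive loss as a stand-in for the actual objective. The instruction strings and all function names (`with_instruction`, `homogeneous_batches`, `in_batch_contrastive_loss`) are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical instruction strings; the source does not give the actual wording.
INSTRUCTIONS = {
    "qa": "Represent this query for retrieving relevant documents: ",
    "tool": "Represent this request for retrieving a useful tool: ",
}


def with_instruction(task, text):
    """Prefix the input with its task instruction before it is embedded."""
    return INSTRUCTIONS[task] + text


def homogeneous_batches(examples, batch_size):
    """Group (task, query, positive) triples so each batch holds a single task.

    Assumption: "homogeneous" means in-batch negatives share the query's task.
    """
    by_task = defaultdict(list)
    for example in examples:
        by_task[example[0]].append(example)
    batches = []
    for task in sorted(by_task):
        items = by_task[task]
        for start in range(0, len(items), batch_size):
            batches.append(items[start:start + batch_size])
    return batches


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_contrastive_loss(query_vecs, key_vecs, temperature=0.05):
    """Mean InfoNCE loss: key i is the positive for query i, the rest are negatives."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [dot(q, k) / temperature for k in key_vecs]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(query_vecs)
```

In a real pipeline the vectors would come from the embedding model applied to `with_instruction(task, query)` and to the candidate texts, and the loss would be computed batch by batch over the output of `homogeneous_batches`. Keeping each batch within one task means a query is never contrasted against candidates from an unrelated task, which is one plausible way to reduce the mutual interference the abstract mentions.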

Retrieve Anything To Augment Large Language Models

A Multi-Task Embedder For Retrieval Augmented LLMs
