Abstract: Large language models (LLMs) face significant challenges stemming from their inherent limitations in knowledge, memory, alignment, and action. LLMs cannot address these challenges alone; they need assistance from the external world, such as knowledge bases, memory stores, demonstration examples, and tools. Retrieval augmentation is a vital mechanism for bridging the gap between LLMs and this external assistance. However, conventional methods encounter two pressing issues. On the one hand, general-purpose retrievers are not properly optimized for the retrieval augmentation of LLMs. On the other hand, task-specific retrievers lack the required versatility, which hinders their performance across diverse retrieval augmentation scenarios.
In this work, we present a novel approach, the LLM-Embedder, which comprehensively supports the diverse retrieval augmentation needs of LLMs with one unified embedding model. Training such a unified model is non-trivial, as various retrieval tasks aim to capture distinct semantic relationships, often subject to mutual interference. To address this challenge, we systematically optimize our training methodology. This includes reward formulation based on LLMs' feedback, the stabilization of knowledge distillation, multi-task fine-tuning with explicit instructions, and homogeneous in-batch negative sampling. These optimization strategies contribute to the outstanding empirical performance of the LLM-Embedder. Notably, it yields remarkable enhancements in retrieval augmentation for LLMs, surpassing both general-purpose and task-specific retrievers in various evaluation scenarios. Our checkpoint and source code are publicly available at <a class="link-external link-https" href="https://github.com/FlagOpen/FlagEmbedding" rel="external noopener nofollow">this https URL</a>.
What problem does this paper attempt to address?
This paper addresses several core challenges faced by large language models (LLMs), which stem from their inherent limitations in knowledge, memory, and action. Specifically:
1. **Knowledge Boundary**: Because model parameters are finite, LLMs cannot fully internalize the world's vast body of knowledge. Their internal knowledge is also static and hard to keep up to date as the world changes. Furthermore, LLMs are mainly trained on publicly available, high-frequency data, which may lead to inaccuracies when dealing with domain-specific or long-tail knowledge.
2. **Memory Boundary**: LLMs also face severe memory limitations, mainly because of their limited context length. Although the maximum context length keeps growing, it still falls short of what lifelong interaction with human users would require. Meanwhile, LLMs with extended context may require excessive computational and storage resources during training and deployment, making it impractical to expand their memory significantly.
3. **Ability Boundary**: The abilities of LLMs are limited in terms of action and autonomy. First, they are confined to the "language space" and cannot interact meaningfully with the physical world. Second, they depend heavily on human guidance and need clear user instructions and appropriate demonstration examples to perform specific tasks effectively.
To overcome these limitations, external assistance is introduced through retrieval augmentation. However, existing methods have two main problems: on the one hand, general-purpose retrievers are not properly optimized for the retrieval augmentation of LLMs; on the other hand, task-specific retrievers lack the required versatility, which limits their performance across different retrieval augmentation scenarios.
For this reason, the paper proposes a new method named LLM-Embedder, which aims to comprehensively support the diverse retrieval augmentation needs of LLMs with one unified embedding model. Because different retrieval tasks capture distinct semantic relationships and can interfere with one another, the authors systematically optimize the training methodology: reward formulation based on LLMs' feedback, stabilization of knowledge distillation, multi-task fine-tuning with explicit instructions, and homogeneous in-batch negative sampling. Experimental results show that LLM-Embedder outperforms both general-purpose and task-specific retrievers in various evaluation scenarios, effectively improving retrieval augmentation for LLMs.
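Two of the training strategies named above, explicit task instructions and homogeneous in-batch negative sampling, can be illustrated with a small sketch. It is not the authors' implementation. It assumes one plausible reading of "homogeneous": every training batch is drawn from a single task, so the negatives shared within a batch come from that same task. The task names, the instruction strings, and the helper functions `with_instruction`, `homogeneous_batches`, and `in_batch_negatives` are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical task names and instruction strings (not taken from the paper).
INSTRUCTIONS = {
    "qa": "Represent this query for retrieving relevant documents: ",
    "icl": "Represent this input for retrieving demonstration examples: ",
    "tool": "Represent this request for retrieving useful tools: ",
}


def with_instruction(task, text):
    """Prefix the text with its task instruction so one model can tell tasks apart."""
    return INSTRUCTIONS[task] + text


def homogeneous_batches(examples, batch_size, seed=0):
    """Return a shuffled list of batches in which every example shares one task.

    examples: list of (task, query, positive) tuples.
    """
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for example in examples:
        by_task[example[0]].append(example)

    batches = []
    for items in by_task.values():
        rng.shuffle(items)
        # Slice each task's examples separately so no batch mixes tasks.
        for start in range(0, len(items), batch_size):
            batches.append(items[start:start + batch_size])
    rng.shuffle(batches)  # Interleave tasks across training steps.
    return batches


def in_batch_negatives(batch):
    """Build (instructed query, positive, negatives) triples for one batch.

    The negatives of each example are the positives of the other examples
    in the same batch, which all belong to the same task.
    """
    triples = []
    for i, (task, query, positive) in enumerate(batch):
        negatives = [p for j, (_, _, p) in enumerate(batch) if j != i]
        triples.append((with_instruction(task, query), positive, negatives))
    return triples
```

In a real training loop, the triples would be encoded by the embedding model and scored with a contrastive or distillation loss; that part is omitted here because the summary above does not specify it.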