Abstract: Large language models (LLMs) face significant challenges stemming from their inherent limitations in knowledge, memory, alignment, and action. LLMs cannot address these challenges alone; they need assistance from the external world, such as knowledge bases, memory stores, demonstration examples, and tools. Retrieval augmentation is a vital mechanism for bridging the gap between LLMs and this external assistance. However, conventional methods encounter two pressing issues. On the one hand, general-purpose retrievers are not properly optimized for the retrieval augmentation of LLMs. On the other hand, task-specific retrievers lack the required versatility, which hinders their performance across diverse retrieval augmentation scenarios. In this work, we present a novel approach, the LLM-Embedder, which comprehensively supports the diverse retrieval augmentation needs of LLMs with one unified embedding model. Training such a unified model is non-trivial, as the various retrieval tasks aim to capture distinct semantic relationships and are often subject to mutual interference. To address this challenge, we systematically optimize our training methodology. This includes reward formulation based on LLMs' feedback, the stabilization of knowledge distillation, multi-task fine-tuning with explicit instructions, and homogeneous in-batch negative sampling. These optimization strategies contribute to the strong empirical performance of the LLM-Embedder. Notably, it yields substantial improvements in retrieval augmentation for LLMs, surpassing both general-purpose and task-specific retrievers in various evaluation scenarios. Our checkpoint and source code are publicly available at https://github.com/FlagOpen/FlagEmbedding.
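
Two of the training ingredients named in the abstract, knowledge distillation from LLM feedback and in-batch negative sampling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a softmax over LLM rewards as the teacher distribution, the KL objective, and the temperature parameter are all assumptions made for illustration. "Homogeneous" negatives are represented here only by assuming that every score in a batch comes from the same task.

```python
import math


def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def distillation_loss(llm_rewards, retriever_scores, temperature=1.0):
    """KL(teacher || student) over one query's candidate list.

    Assumption: the teacher distribution is a softmax over rewards that an
    LLM assigns to each candidate; the student distribution is a softmax
    over the retriever's similarity scores for the same candidates.
    """
    teacher_log = log_softmax(llm_rewards, temperature)
    student_log = log_softmax(retriever_scores, temperature)
    return sum(
        math.exp(t_log) * (t_log - s_log)
        for t_log, s_log in zip(teacher_log, student_log)
    )


def in_batch_contrastive_loss(similarities, positive_index):
    """Cross-entropy of the positive against all in-batch candidates.

    Assumption: `similarities` holds one query's scores against every
    candidate in the batch, all drawn from the same task.
    """
    return -log_softmax(similarities)[positive_index]


if __name__ == "__main__":
    # Hypothetical scores for one query and three candidates.
    rewards = [2.0, 0.5, -1.0]
    scores = [1.2, 0.9, -0.3]
    print(distillation_loss(rewards, scores))
    print(in_batch_contrastive_loss(scores, positive_index=0))
```

The distillation term is zero when the retriever ranks candidates with exactly the distribution implied by the LLM rewards and grows as the two diverge, while the contrastive term pushes the positive candidate above the other candidates in the batch.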

Multi-Lingual Malaysian Embedding: Leveraging Large Language Models for Semantic Representations

Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding

MaLLaM -- Malaysia Large Language Model

MMMModal -- Multi-Images Multi-Audio Multi-turn Multi-Modal

Personal Intelligence System UniLM: Hybrid On-Device Small Language Model and Server-Based Large Language Model for Malay Nusantara

MINERS: Multilingual Language Models as Semantic Retrievers

Language Models are Universal Embedders

Retrofitting Multilingual Sentence Embeddings with Abstract Meaning Representation

Bridging the Gap: Transfer Learning from English PLMs to Malaysian English

EmbedLLM: Learning Compact Representations of Large Language Models

LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content

Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems

Llama2Vec: Unsupervised Adaptation of Large Language Models for Dense Retrieval

Bilingual Adaptation of Monolingual Foundation Models

Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings

Improving Bilingual Capabilities of Language Models to Support Diverse Linguistic Practices in Education

Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models

Deep Learning Paradigm with Transformed Monolingual Word Embeddings for Multilingual Sentiment Analysis

Retrieve Anything To Augment Large Language Models

SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages

LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding