Abstract: Existing Retrieval-Augmented Generation (RAG) systems face significant challenges in terms of cost and effectiveness. On one hand, they need to encode the lengthy retrieved contexts before responding to the input tasks, which imposes substantial computational overhead. On the other hand, directly using generic Large Language Models (LLMs) often leads to sub-optimal answers, while task-specific fine-tuning may compromise the LLMs' general capabilities. To address these challenges, we introduce a novel approach called FlexRAG (Flexible Context Adaptation for RAG). In this approach, the retrieved contexts are compressed into compact embeddings before being encoded by the LLMs. Simultaneously, these compressed embeddings are optimized to enhance downstream RAG performance. A key feature of FlexRAG is its flexibility, which enables effective support for diverse compression ratios and selective preservation of important contexts. Thanks to these technical designs, FlexRAG achieves superior generation quality while significantly reducing running costs. Comprehensive experiments on various question-answering datasets validate our approach as a cost-effective and flexible solution for RAG systems.

What problem does this paper attempt to address?

This paper addresses the cost and effectiveness challenges faced by existing Retrieval-Augmented Generation (RAG) systems. Specifically, the main problems are:

1. **High computational cost**: Existing RAG systems must encode the long retrieved texts before answering, which imposes a heavy computational burden. For example, multi-hop question-answering tasks may require processing a series of related documents, and general language modeling tasks may require iteratively retrieving diverse knowledge sources.
2. **Poor answer quality**: Directly using general-purpose large language models (LLMs) usually leads to sub-optimal answers, especially in complex and noisy contexts. Fine-tuning for specific tasks can improve answer quality, but it may degrade the LLM's performance on other general tasks.

To solve these problems, the paper proposes a new method, FlexRAG (Flexible Context Adaptation for RAG). FlexRAG improves on existing RAG systems in the following ways:

- **Compressed retrieved context**: The retrieved context is compressed into compact embeddings, which significantly reduces the computational cost of encoding it with the LLM.
- **Flexible compression ratio**: Diverse compression ratios are supported, and key information is selectively retained according to the importance of each part of the context.
- **Optimized performance**: A two-stage training process ensures that the compression module cooperates well with the downstream LLM, while the LLM's original parameters stay unchanged to avoid degrading its general capabilities.

In summary, FlexRAG aims to make RAG systems more efficient, more flexible, and more cost-effective, improving generation quality while reducing running cost.
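The idea of a flexible compression ratio with selective retention can be illustrated with a minimal sketch. It is not the paper's implementation: FlexRAG uses a learned compression module, whereas here plain mean pooling stands in for it. The function names, the `keep_fraction` parameter, and the per-segment importance scores are all hypothetical, chosen only to show how important segments can be compressed lightly while the rest is compressed heavily.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a group of embeddings dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def compress_segment(embeddings: List[Vector], ratio: int) -> List[Vector]:
    """Pool every `ratio` consecutive embeddings into a single vector.

    Assumption: mean pooling is a stand-in for a learned compressor.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    return [
        mean_pool(embeddings[i:i + ratio])
        for i in range(0, len(embeddings), ratio)
    ]


def compress_context(
    segments: List[List[Vector]],
    importance: List[float],
    low_ratio: int,
    high_ratio: int,
    keep_fraction: float = 0.5,
) -> List[Vector]:
    """Compress the most important segments at `low_ratio`, the rest at `high_ratio`.

    `importance` holds one hypothetical score per segment (higher = more important).
    """
    if len(segments) != len(importance):
        raise ValueError("need exactly one importance score per segment")
    n_keep = max(1, math.ceil(len(segments) * keep_fraction))
    ranked = sorted(range(len(segments)), key=lambda i: importance[i], reverse=True)
    important = set(ranked[:n_keep])

    compressed: List[Vector] = []
    for i, segment in enumerate(segments):
        ratio = low_ratio if i in important else high_ratio
        compressed.extend(compress_segment(segment, ratio))
    return compressed


# Two segments of 8 two-dimensional embeddings each (16 vectors in total).
segments = [
    [[float(t), 1.0] for t in range(8)],
    [[float(t), 2.0] for t in range(8)],
]
compressed = compress_context(segments, importance=[0.9, 0.1], low_ratio=2, high_ratio=8)
print(len(compressed))  # 5: four vectors for the important segment, one for the other
```

In this toy setting the downstream model would attend over 5 vectors instead of 16. Changing `low_ratio`, `high_ratio`, or `keep_fraction` trades cost against how much of the context is preserved in detail.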

Lighter And Better: Towards Flexible Context Adaptation For Retrieval Augmented Generation

LightRAG: Simple and Fast Retrieval-Augmented Generation

Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection

SFR-RAG: Towards Contextually Faithful LLMs

RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems

xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token

LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering

Context Embeddings for Efficient Answer Generation in RAG

Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation

In Defense of RAG in the Era of Long-Context Language Models

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach

FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering

LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

Context Tuning for Retrieval Augmented Generation

Boosting Conversational Question Answering with Fine-Grained Retrieval-Augmentation and Self-Check

Sufficient Context: A New Lens on Retrieval Augmented Generation Systems

CoFE-RAG: A Comprehensive Full-chain Evaluation Framework for Retrieval-Augmented Generation with Enhanced Data Diversity

Refiner: Restructure Retrieval Content Efficiently to Advance Question-Answering Capabilities

AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models