VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun

2024-10-14

Abstract: Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are based solely on text, making it impossible to utilize vision information such as layout and images, which play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded as an image using a VLM and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 25-39% end-to-end performance gain over a traditional text-based RAG pipeline. Further analysis reveals that VisRAG is effective in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag.

Information Retrieval, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to use visual information for retrieval-augmented generation (RAG) over multimodal documents. Traditional RAG systems are text-based and cannot use visual information such as layout and images, which limits them on real-world documents. The paper introduces VisRAG, which uses a vision-language model (VLM) to embed each document directly as an image, retrieves the relevant document images, and passes them to a VLM for generation. This avoids the information loss that a text-parsing step introduces. The main contributions of VisRAG are:

1. **Retention of visual information**: Unlike traditional text-parsing methods, VisRAG takes document images directly as input, avoiding the information loss introduced during parsing.
2. **Multimodal document processing**: VisRAG can process multimodal documents that contain both text and images, improving its ability to handle complex documents.
3. **Performance improvement**: Experiments show that VisRAG outperforms traditional text-based RAG in both the retrieval and generation stages, with an end-to-end performance gain of 25% to 39%.
4. **Data efficiency and generalization ability**: Further analysis shows that VisRAG uses training data effectively and demonstrates strong generalization capability.

With these results, the authors position VisRAG as a promising solution for RAG on multi-modality documents.
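The embed, retrieve, and generate flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`retrieve_pages`, `vis_rag_answer`), the `embed_query`, `embed_page`, and `generate` callables, and the choice of cosine similarity for ranking are all assumptions made for the example. In a real system the two embedding callables would wrap a VLM encoder and `generate` would wrap a VLM that reads the retrieved page images.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def l2_normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumed here as the ranking score."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve_pages(query_vec: Vector, page_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Return the ids of the k pages whose embeddings are closest to the query."""
    ranked = sorted(
        page_vecs,
        key=lambda page_id: cosine(query_vec, page_vecs[page_id]),
        reverse=True,
    )
    return ranked[:k]


def vis_rag_answer(
    question: str,
    page_images: Dict[str, object],                 # page id -> image (no text parsing)
    embed_query: Callable[[str], Vector],           # hypothetical VLM query encoder
    embed_page: Callable[[object], Vector],         # hypothetical VLM page-image encoder
    generate: Callable[[str, List[object]], str],   # hypothetical VLM generator
    k: int = 2,
) -> str:
    """Embed page images, retrieve the top-k for the question, then generate."""
    page_vecs = {pid: embed_page(img) for pid, img in page_images.items()}
    top_ids = retrieve_pages(embed_query(question), page_vecs, k)
    return generate(question, [page_images[pid] for pid in top_ids])


# Toy stand-ins so the sketch runs without any model (all values are made up).
toy_pages = {"p1": "chart.png", "p2": "table.png", "p3": "photo.png"}
toy_vectors = {
    "chart.png": [1.0, 0.0, 0.0],
    "table.png": [0.0, 1.0, 0.0],
    "photo.png": [0.6, 0.8, 0.0],
}

answer = vis_rag_answer(
    "What does the chart show?",
    toy_pages,
    embed_query=lambda q: [0.9, 0.1, 0.0],
    embed_page=lambda img: toy_vectors[img],
    generate=lambda q, imgs: f"answer based on {imgs}",
    k=2,
)
print(answer)
```

Passing the encoders and the generator as callables keeps the sketch independent of any particular model. The key point it illustrates is that pages enter the pipeline as images at both the retrieval and generation steps, with no intermediate text extraction.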
