VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun
2024-10-14
Abstract: Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 25-39% end-to-end performance gain over traditional text-based RAG pipeline. Further analysis reveals that VisRAG is effective in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag.
Information Retrieval, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to use visual information for retrieval-augmented generation (RAG) over multimodal documents. Traditional RAG systems are text-based, so they cannot use visual information such as layout and images, which is a significant limitation in practice. The paper introduces VisRAG, which uses a vision-language model (VLM) to process each document as an image. This retains as much of the original document's information as possible and avoids the information loss that a text-parsing step can introduce. The main contributions of VisRAG are:

1. **Retention of visual information**: Unlike traditional text-parsing methods, VisRAG takes document images directly as input, avoiding the information loss introduced during parsing.
2. **Multimodal document processing**: VisRAG can process multimodal documents that contain both text and images, which improves its handling of complex documents.
3. **Performance improvement**: Experiments show that VisRAG outperforms traditional text-based RAG in both the retrieval and generation stages, with an end-to-end performance gain of 25% to 39%.
4. **Data efficiency and generalization ability**: Further analysis shows that VisRAG uses training data effectively and generalizes well.

Together, these results position VisRAG as a promising approach to RAG on multimodal documents.
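The pipeline described above (embed each page as an image, retrieve the closest pages, then generate from them) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: `embed_query`, `embed_page`, and `generate` are hypothetical callables standing in for the VLM retriever and VLM generator, and the use of cosine similarity for ranking is an assumption that the abstract does not confirm.

```python
import math
from typing import Any, Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_pages(query_vec: Vector, page_vecs: List[Vector], k: int) -> List[int]:
    """Return the indices of the k pages most similar to the query."""
    ranked = sorted(
        range(len(page_vecs)),
        key=lambda i: cosine(query_vec, page_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def answer_with_page_images(
    query: str,
    page_images: List[Any],
    embed_query: Callable[[str], Vector],   # hypothetical: VLM-based query encoder
    embed_page: Callable[[Any], Vector],    # hypothetical: VLM-based page-image encoder
    generate: Callable[[str, List[Any]], str],  # hypothetical: VLM generator
    k: int = 3,
) -> str:
    """Embed pages as images, retrieve the top-k, and generate from them.

    No text parsing happens anywhere: pages stay images end to end.
    """
    page_vecs = [embed_page(image) for image in page_images]
    top = retrieve_pages(embed_query(query), page_vecs, k)
    return generate(query, [page_images[i] for i in top])
```

In a real system the page embeddings would be computed once and stored in an index rather than recomputed for each query; the sketch recomputes them only to stay self-contained.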