Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications

Monica Riedler, Stefan Langer
2024-10-29
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in answering questions, but they lack domain-specific knowledge and are prone to hallucinations. Retrieval Augmented Generation (RAG) is one approach to address these challenges, while multimodal models are emerging as promising AI assistants for processing both text and images. In this paper we describe a series of experiments aimed at determining how to best integrate multimodal models into RAG systems for the industrial domain. The purpose of the experiments is to determine whether including images alongside text from documents within the industrial domain increases RAG performance and to find the optimal configuration for such a multimodal RAG system. Our experiments include two approaches for image processing and retrieval, as well as two LLMs (GPT4-Vision and LLaVA) for answer synthesis. These image processing strategies involve the use of multimodal embeddings and the generation of textual summaries from images. We evaluate our experiments with an LLM-as-a-Judge approach. Our results reveal that multimodal RAG can outperform single-modality RAG settings, although image retrieval poses a greater challenge than text retrieval. Additionally, leveraging textual summaries from images presents a more promising approach compared to the use of multimodal embeddings, providing more opportunities for future advancements.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses two issues:

1. **Performance comparison between unimodal and multimodal RAG systems**: The researchers test experimentally whether combining text and images improves the performance of Retrieval Augmented Generation (RAG) systems in the industrial domain.
2. **Optimal configuration of multimodal RAG systems**: The researchers look for the best configuration of a multimodal RAG system, in particular the best methods for image processing and retrieval.

It frames these goals as two main questions:

1. **Can combining text and images improve the performance of RAG systems in the industrial domain?**
2. **How can the performance of multimodal RAG systems in the industrial domain be optimized?**

To answer these questions, the researchers designed a series of experiments covering unimodal (text-only or image-only) and multimodal (text and image) RAG systems, with two image processing strategies: multimodal embeddings and image summarization. They used two multimodal large language models (GPT-4-Vision and LLaVA) for answer synthesis and evaluated the results with an LLM-as-a-Judge framework.
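The image-summarization strategy can be illustrated with a small sketch. It is not the authors' implementation: the `Entry` type, the `embed` and `retrieve` helpers, the file name, and the example texts are all hypothetical, and a toy bag-of-words vector stands in for a real text embedding model. The idea shown is only that each image is represented by a textual summary (assumed here to have been written by a multimodal model beforehand), so that images and text chunks can be searched through the same text index.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    text: str                         # text that gets embedded (chunk or image summary)
    image_path: Optional[str] = None  # set only for entries derived from an image


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real text embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list, k: int = 2) -> list:
    """Return the k entries whose text is most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q, embed(e.text)), reverse=True)
    return ranked[:k]


# Made-up illustrative data: two text chunks and one image summary in one index.
index = [
    Entry("The pump must be serviced every 500 operating hours."),
    Entry("Wiring diagram showing the motor terminal connections",
          image_path="fig_wiring.png"),
    Entry("Safety instructions for handling hydraulic fluid."),
]

hits = retrieve("Which terminal connections does the motor use?", index, k=1)
for hit in hits:
    kind = "image" if hit.image_path else "text"
    print(kind, "|", hit.text, "|", hit.image_path)
```

Each hit carries both the summary text and the path of the source image, so a downstream answer-synthesis step could use either one. Which of the two the paper's pipeline passes to the model is not stated in this summary, so the sketch leaves that choice open.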