Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications

Monica Riedler, Stefan Langer

2024-10-29

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in answering questions, but they lack domain-specific knowledge and are prone to hallucinations. Retrieval Augmented Generation (RAG) is one approach to address these challenges, while multimodal models are emerging as promising AI assistants for processing both text and images. In this paper we describe a series of experiments aimed at determining how to best integrate multimodal models into RAG systems for the industrial domain. The purpose of the experiments is to determine whether including images alongside text from documents within the industrial domain increases RAG performance and to find the optimal configuration for such a multimodal RAG system. Our experiments include two approaches for image processing and retrieval, as well as two LLMs (GPT4-Vision and LLaVA) for answer synthesis. These image processing strategies involve the use of multimodal embeddings and the generation of textual summaries from images. We evaluate our experiments with an LLM-as-a-Judge approach. Our results reveal that multimodal RAG can outperform single-modality RAG settings, although image retrieval poses a greater challenge than text retrieval. Additionally, leveraging textual summaries from images presents a more promising approach compared to the use of multimodal embeddings, providing more opportunities for future advancements.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper attempts to address the following issues:

1. **Performance comparison between unimodal and multimodal RAG systems**: The researchers aim to determine experimentally whether combining text and images improves the performance of RAG (Retrieval Augmented Generation) systems in the industrial domain.
2. **Optimal configuration of multimodal RAG systems**: The researchers aim to find the best configuration for a multimodal RAG system, particularly the best methods for image processing and retrieval.

Specifically, the paper pursues these goals through two main questions:

1. **Can combining text and images improve the performance of RAG systems in the industrial domain?**
2. **How can the performance of multimodal RAG systems in the industrial domain be optimized?**

To answer these questions, the researchers designed a series of experiments covering unimodal (text-only or image-only) and multimodal (text and image) RAG systems, and used two different image processing strategies: multimodal embeddings and image summarization. They also used two multimodal large language models (GPT-4-Vision and LLaVA) for answer synthesis and evaluated the experimental results with an LLM-as-a-Judge framework.
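To make the image-summarization strategy more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each image has already been converted to a textual summary (in the paper this is done by a multimodal LLM; here the summary is hard-coded) and that summaries are indexed alongside ordinary text chunks. The `Chunk` class, the `embed` and `retrieve` functions, the sample contents, and the `manual.pdf#...` references are all hypothetical, and the bag-of-words embedding merely stands in for a real text embedding model.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str   # "text" or "image"
    content: str  # raw text, or a textual summary of an image
    ref: str      # hypothetical pointer back to the page or figure


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real text embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Rank text chunks and image summaries together in one text index."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c.content)), reverse=True)
    return ranked[:k]


# Hypothetical index: two text chunks and one image represented by its summary.
chunks = [
    Chunk("text", "Reset the controller by holding the power button for ten seconds", "manual.pdf#p3"),
    Chunk("text", "The warranty covers parts and labour for two years", "manual.pdf#p9"),
    Chunk("image", "Wiring diagram showing the controller terminal connections to the motor", "manual.pdf#fig2"),
]

top = retrieve("how do I connect the motor wiring to the controller", chunks, k=2)
for chunk in top:
    print(chunk.source, chunk.ref)
```

In a full pipeline, the retrieved text chunks and the images behind the retrieved summaries would then be passed to the answer-synthesis model. The multimodal-embedding strategy would instead embed images and queries directly into a shared vector space, which this sketch does not cover.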

IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues

MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training

Deploying Large Language Models With Retrieval Augmented Generation

Retrieval-Augmented Generation for Large Language Models: A Survey

MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text

Enhancing Multilingual Information Retrieval in Mixed Human Resources Environments: A RAG Model Implementation for Multicultural Enterprise

RAG based Question-Answering for Contextual Response Prediction System

DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation

AssistRAG: Boosting the Potential of Large Language Models with an Intelligent Information Assistant

Towards Optimizing a Retrieval Augmented Generation using Large Language Model on Academic Data

Adopting RAG for LLM-Aided Future Vehicle Design

Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs

Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

Faculty Perspectives on the Potential of RAG in Computer Science Higher Education

UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models

Eliciting Critical Reasoning in Retrieval-Augmented Language Models via Contrastive Explanations

BiomedRAG: A Retrieval Augmented Large Language Model for Biomedicine

ERATTA: Extreme RAG for Table To Answers with Large Language Models

MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models