Analyzing the Efficacy of an LLM-Only Approach for Image-based Document Question Answering

Nidhi Hegde, Sujoy Paul, Gagan Madan, Gaurav Aggarwal

2023-09-25

Abstract: Recent document question answering models consist of two key components: the vision encoder, which captures layout and visual elements in images, and a Large Language Model (LLM) that helps contextualize questions to the image and supplements them with external world knowledge to generate accurate answers. However, the relative contributions of the vision encoder and the language model in these tasks remain unclear. This is especially interesting given the effectiveness of instruction-tuned LLMs, which exhibit remarkable adaptability to new tasks. To this end, we explore the following aspects in this work: (1) the efficacy of an LLM-only approach on document question answering tasks, (2) strategies for serializing textual information within document images and feeding it directly to an instruction-tuned LLM, thus bypassing the need for an explicit vision encoder, and (3) a thorough quantitative analysis of the feasibility of such an approach. Our comprehensive analysis encompasses six diverse benchmark datasets, utilizing LLMs of varying scales. Our findings reveal that a strategy exclusively reliant on the LLM yields results that are on par with or closely approach state-of-the-art performance across a range of datasets. We posit that this evaluation framework will serve as a guiding resource for selecting appropriate datasets for future research endeavors that emphasize the fundamental importance of layout and image content information.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper explores how effective it is to handle image-based document question answering with a large language model (LLM) alone. Specifically, it addresses the following issues:

1. **Evaluating the performance of LLMs in document question-answering tasks**: Investigating whether document question answering can be completed effectively without a vision encoder, relying solely on an LLM.
2. **Text serialization strategy**: Exploring how to serialize the text in document images and feed it directly to an instruction-tuned LLM, thereby bypassing the need for an explicit vision encoder.
3. **Quantitative analysis of feasibility**: Conducting a comprehensive quantitative analysis of this LLM-only approach, measuring its performance on six benchmark datasets with LLMs of varying scales, and comparing it with existing methods.

Through these studies, the paper aims to show how far an LLM-only method can go in document question answering and to provide an evaluation framework that helps future research choose datasets where layout and image content genuinely matter.
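The text serialization idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `OcrToken` type, the `line_tolerance` heuristic, and the prompt wording are all assumptions made for illustration. The sketch takes OCR tokens with positions, groups them into lines by vertical proximity, orders each line left to right, and places the resulting text in a plain-text prompt for an instruction-tuned LLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrToken:
    """Hypothetical OCR output: a word plus the top-left corner and height of its box."""
    text: str
    x: float
    y: float
    height: float


def serialize_reading_order(tokens: List[OcrToken], line_tolerance: float = 0.5) -> str:
    """Serialize tokens top to bottom, left to right.

    A token joins the current line when its top edge is within
    line_tolerance * height of the first token on that line (an assumed heuristic).
    """
    lines: List[List[OcrToken]] = []
    for tok in sorted(tokens, key=lambda t: (t.y, t.x)):
        if lines and abs(tok.y - lines[-1][0].y) <= line_tolerance * lines[-1][0].height:
            lines[-1].append(tok)
        else:
            lines.append([tok])
    # Within each line, order words by their horizontal position.
    return "\n".join(
        " ".join(t.text for t in sorted(line, key=lambda t: t.x)) for line in lines
    )


def build_prompt(document_text: str, question: str) -> str:
    """Assemble a text-only prompt; the wording here is illustrative only."""
    return (
        "Answer the question using the document text below.\n\n"
        f"Document:\n{document_text}\n\n"
        f"Question: {question}\nAnswer:"
    )


tokens = [
    OcrToken("Total", x=10, y=100, height=10),
    OcrToken("Invoice", x=10, y=10, height=10),
    OcrToken("$42", x=80, y=102, height=10),
    OcrToken("#123", x=70, y=11, height=10),
]
serialized = serialize_reading_order(tokens)
prompt = build_prompt(serialized, "What is the total?")
```

A simple top-to-bottom, left-to-right ordering like this discards multi-column and table structure, which is exactly the kind of layout information a vision encoder would otherwise supply.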

