Abstract: Vision Language Models (VLMs), which extend Large Language Models (LLMs) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limited by the number of query tokens, potentially curtailing the recognition of scenes with text-rich context. To address this limitation, the present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings from InstructBLIP and also directly projects encoded patch embeddings into the LLM, a technique inspired by LLaVA. This approach helps the model capture intricate details potentially missed during the query decoding process. Empirical evidence demonstrates that our model, BLIVA, significantly enhances performance on text-rich VQA benchmarks (up to 17.76% on the OCR-VQA benchmark) and on general (not particularly text-rich) VQA benchmarks (up to 7.9% on the Visual Spatial Reasoning benchmark), and achieves a 17.72% overall improvement on a comprehensive multimodal LLM benchmark (MME), compared to our baseline InstructBLIP. BLIVA demonstrates significant capability in decoding real-world images, irrespective of text presence. To demonstrate the broad industry applications enabled by BLIVA, we evaluate the model using a new dataset comprising YouTube thumbnails paired with question-answer sets across 11 diverse categories. Our code and models are freely accessible at <a class="link-external link-https" href="https://github.com/mlpc-ucsd/BLIVA" rel="external noopener nofollow">this https URL</a>.
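
The input assembly described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the toy sizes, the helper names (`linear`, `random_matrix`, `build_llm_inputs`), the random weights, and the concatenation order (query tokens, then patch tokens, then text tokens) are all assumptions chosen for illustration. The only point it shows is that the LLM receives both the fixed-size set of projected query embeddings and the directly projected patch embeddings as soft-prompt tokens.

```python
import random


def linear(vectors, weight, bias):
    """Apply y = W x + b to each vector; weight has out_dim rows of in_dim columns."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]


def random_matrix(rows, cols, rng):
    """Return a rows x cols list-of-lists with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_llm_inputs(query_embeds, patch_embeds, text_embeds, q_proj, p_proj):
    """Project both visual streams to the LLM width and prepend them to the text.

    The ordering (queries, patches, text) is an assumption of this sketch.
    """
    soft_queries = linear(query_embeds, *q_proj)
    soft_patches = linear(patch_embeds, *p_proj)
    return soft_queries + soft_patches + text_embeds


rng = random.Random(0)

# Toy sizes (assumptions; real models use far larger values).
NUM_QUERIES, NUM_PATCHES, NUM_TEXT = 4, 9, 5
QUERY_DIM, PATCH_DIM, LLM_DIM = 6, 8, 10

query_embeds = random_matrix(NUM_QUERIES, QUERY_DIM, rng)  # stand-in for learned query outputs
patch_embeds = random_matrix(NUM_PATCHES, PATCH_DIM, rng)  # stand-in for encoded image patches
text_embeds = random_matrix(NUM_TEXT, LLM_DIM, rng)        # stand-in for embedded question tokens

# One (weight, bias) pair per projection layer.
q_proj = (random_matrix(LLM_DIM, QUERY_DIM, rng), [0.0] * LLM_DIM)
p_proj = (random_matrix(LLM_DIM, PATCH_DIM, rng), [0.0] * LLM_DIM)

llm_inputs = build_llm_inputs(query_embeds, patch_embeds, text_embeds, q_proj, p_proj)
print(len(llm_inputs), len(llm_inputs[0]))  # 18 tokens, each of width 10
```

In this sketch the number of query tokens stays fixed regardless of image content, while the patch stream contributes one token per image patch, which is the extra detail the abstract says the query-only route may miss.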

What problem does this paper attempt to address?

The paper mainly addresses the following issues:

### Main Issues

- **Understanding and processing text-rich images**: Existing Vision-Language Models (VLMs) have limitations when dealing with images containing a large amount of textual information, especially in extracting and understanding text details within the images.

### Solution Overview

- **Proposing the BLIVA model**: This is an enhanced version of the InstructBLIP model, which improves the understanding of text-rich images by combining query embeddings and encoded patch embeddings.

### Specific Contributions

1. **Model Design**: BLIVA utilizes query embeddings from InstructBLIP and introduces an additional visual auxiliary branch that directly projects encoded patch embeddings into a Large Language Model (LLM). This design helps capture complex details that might be missed during the query decoding process.
2. **Experimental Results**: The paper reports significant improvements of BLIVA in handling text-rich image question-answering tasks (such as OCR-VQA benchmarks), with improvements up to 17.76%. Additionally, there are enhancements in general image question-answering tasks (not particularly text-rich) with improvements up to 7.9%. In comprehensive multimodal LLM benchmarks (MME), the overall performance increased by 17.72%.
3. **Industry Application Demonstration**: To demonstrate the broad applicability of BLIVA in real-world scenarios, the researchers evaluated it using a new dataset containing YouTube thumbnails and their related question-answer pairs.

### Method Details

- **Two-Stage Training Scheme**:
  - Pre-training Stage: Achieving initial alignment between visual and language modalities through image-text pair pre-training.
  - Instruction Fine-tuning Stage: Further refining the alignment of visual embeddings with the LLM using instruction fine-tuning data, enabling the model to better understand language-instructed visual inputs.
- **Handling of Patch Embeddings**: The visual auxiliary branch mentioned in the paper differs from BLIP-2 by using a more compact pre-training dataset (0.5M pairs vs. 129M), which helps achieve more efficient alignment between the visual encoder and LLM in the first stage.

### Conclusion

In summary, BLIVA aims to overcome the limitations of existing models in handling text-rich images by integrating query embeddings and encoded patch embeddings. Experiments have shown that this approach exhibits significant advantages in various image question-answering tasks.
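
The two-stage training scheme summarized above can be sketched as a schedule that toggles which components receive gradient updates. This is a hedged illustration only: the summary does not say which components are frozen in each stage, so the module names and the trainable sets below (patch projection only in pre-training; Q-Former plus both projections in instruction fine-tuning; vision encoder and LLM frozen throughout) are assumptions, and `Module`, `ToyModel`, and `STAGES` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Placeholder for a model component with a trainable flag."""
    name: str
    trainable: bool = False


@dataclass
class ToyModel:
    """Hypothetical container; component names are assumptions of this sketch."""
    modules: dict = field(default_factory=lambda: {
        name: Module(name)
        for name in ("vision_encoder", "q_former", "query_projection",
                     "patch_projection", "llm")
    })

    def set_trainable(self, names):
        # Unfreeze exactly the named components and freeze everything else.
        for name, module in self.modules.items():
            module.trainable = name in names

    def trainable_names(self):
        return sorted(n for n, m in self.modules.items() if m.trainable)


# (stage name, data source, components assumed to be trainable in that stage)
STAGES = [
    ("pretraining", "image-text pairs", {"patch_projection"}),
    ("instruction_finetuning", "instruction data",
     {"q_former", "query_projection", "patch_projection"}),
]

model = ToyModel()
schedule = {}
for stage, data, trainable in STAGES:
    model.set_trainable(trainable)
    schedule[stage] = model.trainable_names()
    print(f"{stage}: data={data}, trainable={schedule[stage]}")
```

Keeping the schedule as plain data makes it easy to compare alternatives, for example a variant that also unfreezes the Q-Former during pre-training.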

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

From Images to Textual Prompts: Zero-Shot Visual Question Answering with Frozen Large Language Models

BoViLA: Bootstrapping Video-Language Alignment via LLM-Based Self-Questioning and Answering

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

VIAssist: Adapting Multi-modal Large Language Models for Users with Visual Impairments

InfMLLM: A Unified Framework for Visual-Language Tasks.

mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs

Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective

Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering

LIVE: Learnable In-Context Vector for Visual Question Answering

VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning

Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions

Visually-Augmented Language Modeling

LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs

RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training

B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

B-AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Black-box Adversarial Visual-Instructions

LLMs Meet Long Video: Advancing Long Video Comprehension with an Interactive Visual Adapter in LLMs.