BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu
2023-12-18
Abstract: Vision Language Models (VLMs), which extend Large Language Models (LLMs) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limited to the token count, potentially curtailing the recognition of scenes with text-rich context. To improve upon them, the present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings from InstructBLIP and also directly projects encoded patch embeddings into the LLM, a technique inspired by LLaVA. This approach assists the model to capture intricate details potentially missed during the query decoding process. Empirical evidence demonstrates that our model, BLIVA, significantly enhances performance in processing text-rich VQA benchmarks (up to 17.76% in OCR-VQA benchmark) and in undertaking general (not particularly text-rich) VQA benchmarks (up to 7.9% in Visual Spatial Reasoning benchmark), and achieved 17.72% overall improvement in a comprehensive multimodal LLM benchmark (MME), comparing to our baseline InstructBLIP. BLIVA demonstrates significant capability in decoding real-world images, irrespective of text presence. To demonstrate the broad industry applications enabled by BLIVA, we evaluate the model using a new dataset comprising YouTube thumbnails paired with question-answer sets across 11 diverse categories. Our code and models are freely accessible at https://github.com/mlpc-ucsd/BLIVA.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper mainly addresses the following issues:

### Main Issues

- **Understanding and processing text-rich images**: Existing Vision Language Models (VLMs) struggle with images that contain a large amount of text, especially when they need to extract and interpret fine text details within the image.

### Solution Overview

- **Proposing the BLIVA model**: BLIVA is an enhanced version of InstructBLIP that improves the understanding of text-rich images by combining learned query embeddings with encoded patch embeddings.

### Specific Contributions

1. **Model Design**: BLIVA keeps the query embeddings from InstructBLIP and adds a visual assistant branch that directly projects encoded patch embeddings into the Large Language Model (LLM). This design helps capture intricate details that might be missed during the query decoding process.
2. **Experimental Results**: Compared to the InstructBLIP baseline, the paper reports improvements of up to 17.76% on text-rich VQA (the OCR-VQA benchmark) and up to 7.9% on general, not particularly text-rich, VQA (the Visual Spatial Reasoning benchmark). On the comprehensive multimodal LLM benchmark MME, overall performance increased by 17.72%.
3. **Industry Application Demonstration**: To demonstrate the broad applicability of BLIVA in real-world scenarios, the researchers evaluated it on a new dataset of YouTube thumbnails paired with question-answer sets across 11 diverse categories.

### Method Details

- **Two-Stage Training Scheme**:
  - Pre-training Stage: Achieves an initial alignment between the visual and language modalities by pre-training on image-text pairs.
  - Instruction Fine-tuning Stage: Further refines the alignment of visual embeddings with the LLM using instruction fine-tuning data, enabling the model to better understand language-instructed visual inputs.
- **Handling of Patch Embeddings**: The pre-training of the visual assistant branch differs from BLIP-2 in using a far more compact dataset (0.5M image-text pairs vs. 129M), which helps achieve more efficient alignment between the visual encoder and the LLM in the first stage.

### Conclusion

In summary, BLIVA aims to overcome the limitations of existing models in handling text-rich images by integrating query embeddings and encoded patch embeddings. The experiments show that this approach brings significant gains across a range of image question-answering tasks.
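The model design described above, combining a small fixed set of query embeddings with directly projected patch embeddings, can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names, the toy dimensions, the bias-free linear projections, and the concatenation order (queries, then patches, then text) are all assumptions made for illustration. Real models use far larger dimensions and learned weights.

```python
import random


def random_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small Gaussian values.

    Stand-in for learned weights or encoder outputs (hypothetical data)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def project(tokens, weight):
    """Bias-free linear projection: (n x d_in) @ (d_in x d_out) -> (n x d_out)."""
    columns = list(zip(*weight))
    return [[sum(t * w for t, w in zip(token, col)) for col in columns]
            for token in tokens]


def build_soft_prompt(query_emb, patch_emb, text_emb, w_query, w_patch):
    """Project both visual streams into the LLM embedding space and
    concatenate them in front of the text embeddings.

    The ordering (queries, patches, text) is an assumption of this sketch."""
    projected_queries = project(query_emb, w_query)
    projected_patches = project(patch_emb, w_patch)
    return projected_queries + projected_patches + text_emb


# Toy sizes, chosen only to keep the example fast (all values hypothetical).
rng = random.Random(0)
NUM_QUERIES, QUERY_DIM = 4, 6    # fixed-size set of learned query embeddings
NUM_PATCHES, PATCH_DIM = 9, 8    # one embedding per encoded image patch
NUM_TEXT, LLM_DIM = 3, 5         # instruction tokens in the LLM's space

query_emb = random_matrix(NUM_QUERIES, QUERY_DIM, rng)
patch_emb = random_matrix(NUM_PATCHES, PATCH_DIM, rng)
text_emb = random_matrix(NUM_TEXT, LLM_DIM, rng)
w_query = random_matrix(QUERY_DIM, LLM_DIM, rng)
w_patch = random_matrix(PATCH_DIM, LLM_DIM, rng)

llm_inputs = build_soft_prompt(query_emb, patch_emb, text_emb, w_query, w_patch)

# Sequence length is the sum of the three streams; width is the LLM dimension.
print(len(llm_inputs), len(llm_inputs[0]))
```

The point of the sketch is the shape bookkeeping: the query stream contributes a fixed number of tokens regardless of image content, while the patch stream adds one token per image patch, which is where fine text details can survive.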