Abstract: This paper introduces a method for VizWiz-VQA using a large vision-language model (LVLM) with trainable cross-attention and LoRA fine-tuning. We train the model under three conditions: 1) training with the original images; 2) training with images enhanced using CLIPSeg to highlight or contrast regions of the original image; 3) training with the output features of the Vision Transformer (ViT) integrated with CLIPSeg features of the original images. We then ensemble the results based on Levenshtein distance to improve the prediction of the final answer. In the experiments, we demonstrate and analyze the proposed method's effectiveness.
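
For readers unfamiliar with LoRA, the update it applies to a frozen weight matrix can be written as below. This is the standard LoRA formulation, not something taken from this paper; the rank $r$, the scaling factor $\alpha$, and which layers are adapted are not stated in the abstract, so they should be read as placeholders.

```latex
% Standard LoRA reparameterization (general formulation, not paper-specific).
% W_0 is the frozen pretrained weight; only A and B are trained.
\[
  h = W_0 x + \frac{\alpha}{r}\, B A x,
  \qquad
  W_0 \in \mathbb{R}^{d \times k},\;
  B \in \mathbb{R}^{d \times r},\;
  A \in \mathbb{R}^{r \times k},\;
  r \ll \min(d, k)
\]
% B is typically initialized to zero, so training starts from h = W_0 x.
```

Because only $A$ and $B$ are trained, the number of trainable parameters per adapted matrix drops from $dk$ to $r(d + k)$.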

What problem does this paper attempt to address?

This paper addresses key problems in the Visual Question Answering (VQA) task, in particular how to improve a model's ability to understand the complex interactions between image details and question context. Specifically, the paper focuses on the following points:

1. **Enhancing the effectiveness of VQA models**: Traditional VQA methods usually rely on direct mappings between visual content and question-answer pairs, ignoring the subtle interactions between specific details of the image and the question context. As a result, they perform poorly on complex questions.
2. **Focusing on relevant parts of the image**: To understand the image well, the model needs to focus on the parts of the image related to the question's intent. This is especially important for VizWiz-VQA, a dataset of visual questions asked by blind users, where the model must accurately identify and understand the image regions the question refers to.
3. **Parameter-efficient fine-tuning methods**: To improve performance without significantly increasing the number of trainable parameters, the paper uses Low-Rank Adaptation (LoRA) and a trainable cross-attention mechanism to adapt large vision-language models (LVLMs) more efficiently.
4. **Combining image segmentation techniques**: To further improve visual understanding, the paper uses CLIPSeg, an image segmentation technique driven by text queries. By dynamically isolating relevant objects, CLIPSeg lets the model focus on specific areas of the image.
5. **Integrating prediction results of different models**: To further improve the accuracy of the final prediction, the paper proposes an ensemble method based on Levenshtein distance. The outputs of the differently trained models are compared by edit distance, and the final answer is selected from among them.

In summary, the main goal of this paper is to improve the performance of VQA models in understanding and answering visual questions, especially on the VizWiz-VQA dataset, by introducing a trainable cross-attention mechanism, LoRA fine-tuning, and query-aware image segmentation.
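
The Levenshtein-based ensemble can be illustrated with a minimal sketch. The text here does not give the paper's exact selection rule, so the rule below is an assumption: pick the candidate with the smallest total edit distance to the other candidates. The function names `levenshtein` and `select_answer` and the lowercase/strip normalization are likewise hypothetical.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute; cost 1 each)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def select_answer(candidates: List[str]) -> str:
    """Return the candidate closest, in total edit distance, to all the others.

    ASSUMPTION: this consensus rule is illustrative only; the paper's actual
    ensemble rule may differ. Ties go to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate answer")
    normalized = [c.strip().lower() for c in candidates]

    def total_distance(i: int) -> int:
        return sum(
            levenshtein(normalized[i], other)
            for j, other in enumerate(normalized)
            if j != i
        )

    best = min(range(len(candidates)), key=total_distance)
    return candidates[best]


# Hypothetical outputs from four differently trained models.
print(select_answer(["dog", "dogs", "dog", "cat"]))  # -> dog
```

A consensus rule like this favors answers that several models agree on up to small spelling differences, which suits short free-form answers; a weighted variant using validation accuracy per model would be another reasonable choice.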

Integrating Query-aware Segmentation and Cross-Attention for Robust VQA

Simple and Effective Visual Question Answering in a Single Modality

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images

Multi-source Multi-level Attention Networks for Visual Question Answering

MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models

Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA

Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation

Task-driven Visual Saliency and Attention-based Visual Question Answering

Learning neighbor-enhanced region representations and question-guided visual representations for visual question answering

Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

CLVIN: Complete language-vision interaction network for visual question answering

Language-Aware Vision Transformer for Referring Segmentation

A multimodal attention fusion network with a dynamic vocabulary for TextVQA

Feature Enhancement in Attention for Visual Question Answering

Modular dual-stream visual fusion network for visual question answering

Multiscale Feature Extraction and Fusion of Image and Text in VQA

Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation