Integrating Query-aware Segmentation and Cross-Attention for Robust VQA

Wonjun Choi, Sangbeom Lee, Seungyeon Lee, Heechul Jung, Dong-Gyu Lee
2024-07-09
Abstract: This paper introduces a method for VizWiz-VQA that uses a large vision-language model (LVLM) with trainable cross-attention and LoRA fine-tuning. We train the model under three conditions: 1) training with the original images; 2) training with images enhanced using CLIPSeg to highlight or contrast the original image; 3) training with the output features of the Vision Transformer (ViT) integrated with CLIPSeg features of the original images. We then ensemble the results based on Levenshtein distance to improve the prediction of the final answer. In the experiments, we demonstrate and analyze the effectiveness of the proposed method.
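The abstract does not say how the CLIPSeg output is applied to "highlight" an image. One plausible reading, sketched here purely as an assumption, is to keep query-relevant pixels at full brightness and dim the rest. The function name `highlight`, the `dim` parameter, and the nested-list grayscale image format are hypothetical; the relevance mask is assumed to come from a text-conditioned segmentation model such as CLIPSeg, which is not called here.

```python
def highlight(image, mask, dim=0.3):
    """Dim pixels that a relevance mask marks as unrelated to the query.

    image: rows of grayscale intensities in 0..255 (assumed format).
    mask:  rows of relevance scores in [0, 1], same shape as image
           (assumed to be produced by a segmentation model).
    dim:   brightness factor applied where the mask is 0 (hypothetical knob).
    """
    same_shape = len(image) == len(mask) and all(
        len(row) == len(mask_row) for row, mask_row in zip(image, mask)
    )
    if not same_shape:
        raise ValueError("image and mask must have the same shape")

    highlighted = []
    for row, mask_row in zip(image, mask):
        # Scale factor moves linearly from `dim` (mask 0) to 1.0 (mask 1).
        highlighted.append(
            [round(pixel * (dim + (1.0 - dim) * m)) for pixel, m in zip(row, mask_row)]
        )
    return highlighted


if __name__ == "__main__":
    # Left pixel is relevant to the query, right pixel is not.
    print(highlight([[200, 200]], [[1.0, 0.0]], dim=0.5))
```

A soft blend like this preserves some context around the masked region; a hard crop would discard it entirely.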
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses key problems in the Visual Question Answering (VQA) task, in particular how to improve a model's ability to understand the complex interactions between image details and question context. Specifically, the paper focuses on the following points:
1. **Enhancing the effectiveness of VQA models**: Traditional VQA methods usually rely on direct mappings between visual content and question-answer pairs, ignoring the subtle interactions between specific image details and the question context. As a result, such models perform poorly on complex questions.
2. **Focusing on relevant parts of the image**: To understand an image well, the model needs to focus on the parts related to the intent of the question. This is especially important for VizWiz-VQA, a dataset of visual questions asked by blind users, where the model must accurately identify and understand the image regions the question refers to.
3. **Parameter-efficient fine-tuning methods**: To improve performance without significantly increasing the model's parameter count, the paper uses Low-Rank Adaptation (LoRA) and a trainable cross-attention mechanism to adapt large vision-language models (LVLMs) more efficiently.
4. **Combining image segmentation techniques**: To further improve visual understanding, the paper introduces CLIPSeg, an image segmentation technique driven by text queries. By dynamically isolating relevant objects, CLIPSeg enables the model to focus on specific areas of the image.
5. **Integrating prediction results of different models**: To further improve the accuracy of the final prediction, the paper proposes an ensemble method based on Levenshtein distance, which compares the outputs of the different models and selects the final answer from among them.
In summary, the main goal of this paper is to improve the performance of VQA models in understanding and answering visual questions, especially on the VizWiz-VQA dataset, by introducing a trainable cross-attention mechanism, LoRA fine-tuning, and query-aware image segmentation.
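The Levenshtein-based ensemble can be illustrated with a small sketch. The exact selection rule is not given here, so the code below assumes one common choice: pick the candidate answer with the smallest total edit distance to all other candidates (a consensus, or medoid, vote). The function names `levenshtein` and `select_consensus_answer` and the lowercase normalization step are illustrative assumptions, not the authors' implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def select_consensus_answer(candidates: list[str]) -> str:
    """Return the candidate closest, in total edit distance, to all the others.

    Assumed rule (not confirmed by the source): the answer that the other
    models' answers most nearly agree with is taken as the final prediction.
    Ties are broken in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("at least one candidate answer is required")
    normalized = [c.strip().lower() for c in candidates]
    totals = [sum(levenshtein(c, other) for other in normalized) for c in normalized]
    best_index = min(range(len(normalized)), key=totals.__getitem__)
    return normalized[best_index]


if __name__ == "__main__":
    # Hypothetical outputs from three differently trained models.
    print(select_consensus_answer(["Red", "red", "read"]))
```

Because the rule only compares model outputs with each other, it needs no ground-truth answer at inference time.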