Abstract: In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance in contemporary Vision-Language Models (VLMs). Central to our investigation is the role of question templates in guiding VLMs to generate accurate answers. We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection. Another pivotal aspect of our study is augmenting VLMs with image captions, providing them with additional visual cues alongside direct image features in VQA tasks. Surprisingly, this augmentation significantly improves the VLMs' performance in many cases, even though VLMs "see" the image directly! We explore chain-of-thought (CoT) reasoning and find that while standard CoT reasoning causes drops in performance, advanced methods like self-consistency can help recover it. Furthermore, we find that text-only few-shot examples enhance VLMs' alignment with the task format, particularly benefiting models prone to verbose zero-shot answers. Lastly, to mitigate the challenges of evaluating free-form, open-ended VQA responses with string-matching-based VQA metrics, we introduce a straightforward LLM-guided pre-processing technique that adapts model responses to the expected ground-truth answer distribution. In summary, our research sheds light on the intricacies of prompting strategies in VLMs for VQA, emphasizing the synergistic use of captions, templates, and pre-processing to enhance model efficacy.
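As a rough illustration of two ideas named in the abstract, the sketch below assembles a caption-augmented prompt from a question template and picks a final answer by self-consistency (majority vote over several sampled answers). It is a minimal sketch under stated assumptions, not the authors' implementation: the template string, the "Caption:" prefix, the normalization rule, and the function names `build_prompt`, `normalize`, and `self_consistent_answer` are all hypothetical and do not come from the paper.

```python
from collections import Counter


def build_prompt(question, caption=None,
                 template="Question: {question} Short answer:"):
    """Fill a question template, optionally prepending an image caption.

    The default template and the "Caption:" prefix are illustrative
    assumptions, not templates taken from the paper.
    """
    prompt = template.format(question=question)
    if caption:
        prompt = f"Caption: {caption.strip()} {prompt}"
    return prompt


def normalize(answer):
    """Lowercase, trim whitespace, and drop trailing periods so that
    equivalent short answers compare equal (an assumed, simplified rule)."""
    return answer.strip().lower().rstrip(".")


def self_consistent_answer(sampled_answers):
    """Return the most frequent normalized answer among several samples,
    e.g. the final answers of independently sampled reasoning chains."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in sampled_answers)
    return counts.most_common(1)[0][0]
```

In use, the prompt from `build_prompt` would be sent to a VLM together with the image, several answers would be sampled at a non-zero temperature, and `self_consistent_answer` would reduce them to one prediction. The model call itself is omitted because it depends on the specific VLM.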

Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering

Simple and Effective Visual Question Answering in a Single Modality

Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation

Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering

Overcoming language priors with self-contrastive learning for visual question answering

Task Progressive Curriculum Learning for Robust Visual Question Answering

Cross-Modal Alternating Learning with Task-Aware Representations for Continual Learning

Multitask Learning for Visual Question Answering

Recent Advances of Continual Learning in Computer Vision: An Overview

Learning from Lexical Perturbations for Consistent Visual Question Answering

Task-Attentive Transformer Architecture for Continual Learning of Vision-and-Language Tasks Using Knowledge Distillation

Proper Reuse of Features Extractor for Real-time Continual Learning

Compositional Memory for Visual Question Answering

Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions

Psycholinguistics meets Continual Learning: Measuring Catastrophic Forgetting in Visual Question Answering

Selectively Answering Visual Questions

Continual Pre-Training Mitigates Forgetting in Language and Vision

Right this way: Can VLMs Guide Us to See More to Answer Questions?

Improving Multimodal Large Language Models Using Continual Learning

Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering

CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks