Abstract: Visual abductive reasoning aims to find likely explanations for visual observations. We propose Region Conditioned Adaptation (RCA), a simple yet effective hybrid parameter-efficient fine-tuning method that equips the frozen CLIP with the ability to infer explanations from local visual cues. We encode "local hints" and "global contexts" into visual prompts of the CLIP model separately, at fine-grained and coarse-grained levels. Adapters are commonly used to fine-tune CLIP models for downstream tasks, and we design a new attention adapter that directly steers the focus of the attention map with trainable query and key projections of a frozen CLIP model. Finally, we train our new model with a modified contrastive loss that regresses the visual feature simultaneously toward the features of literal descriptions and plausible explanations. The loss enables CLIP to maintain both perception and reasoning abilities. Experiments on the Sherlock visual abductive reasoning benchmark show that RCA significantly outperforms previous SOTAs, ranking 1st on the leaderboards (e.g., Human Acc: RCA 31.74 vs. CPT-CLIP 29.58, higher = better). We also validate that RCA generalizes to local perception benchmarks such as RefCOCO. We open-source our project at https://github.com/LUNAProject22/RPA.

What problem does this paper attempt to address?

The paper addresses the challenges of Visual Abductive Reasoning (VAR). Specifically, the authors aim to improve the performance of existing visual foundation models (such as CLIP) on VAR tasks by proposing a new method called Region Conditioned Adaptation (RCA). VAR tasks require the model to infer the most likely explanation or hypothesis for a given visual observation (usually a specific region in an image). For example, given a "glass bottle" and the surrounding scene (such as a "restaurant" and a "waiter") in an image, the model needs to infer that "this woman recently ordered a drink."

### Main Contributions:

1. **Region Conditioned Adaptation (RCA)**: RCA is a hybrid parameter-efficient fine-tuning method that enables the frozen CLIP model to reason from local visual cues by adding a small number of trainable adapter parameters.
2. **Fine-Grained Region Prompts**: A new visual prompt is designed to encode fine-grained regional information in the CLIP model, emphasizing local evidence to strengthen visual abductive reasoning.
3. **Enhanced Adapter+Tuning**: A new Map Adapter is introduced, which adjusts the attention map through additional query and key projection weights, further optimizing the adapter module.
4. **Dual-Contrastive Loss**: By jointly optimizing the visual-to-evidence and visual-to-reasoning contrastive losses, the model learns the causal relationship between hypotheses and observations, which improves visual abductive reasoning performance.

### Method Overview:

- **Regional Prompt Generator (RPG)**: Generates three detailed prompts focusing on specific regions, surrounding context, and existing visual prompts.
- **Adapter+Tuning**: Adjusts the model's attention map and feature representation by adding new adapter modules to the frozen CLIP model.
- **Dual-Contrastive Loss**: Jointly optimizes the visual-to-evidence and visual-to-reasoning contrastive losses, so that the model better learns the causal relationship between hypotheses and observations.

### Experimental Results:

- On the Sherlock benchmark, RCA outperforms existing SOTA methods, particularly on the "Human Accuracy" (Human Acc) metric, where it reaches 31.74% compared with 29.58% for CPT-CLIP.
- RCA also performs well on visual reasoning tasks that use fine-grained regional evidence as prompts, which further supports its effectiveness and robustness.

In summary, the paper improves performance on visual abductive reasoning tasks by proposing the RCA method, and it suggests new directions for future research.
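
The Dual-Contrastive Loss described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a CLIP-style symmetric InfoNCE term for each text branch, sums the two terms, and uses a hypothetical `clue_weight` parameter and a temperature of 0.07, none of which are confirmed by the summary. Features are plain Python lists so the example runs with the standard library alone.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _info_nce(queries, keys, temperature):
    """Mean cross-entropy where queries[i] should match keys[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def symmetric_contrastive(visual, text, temperature=0.07):
    """CLIP-style loss: average of visual-to-text and text-to-visual terms."""
    v = [_normalize(x) for x in visual]
    t = [_normalize(x) for x in text]
    return 0.5 * (_info_nce(v, t, temperature) + _info_nce(t, v, temperature))


def dual_contrastive_loss(visual, clue_text, inference_text,
                          clue_weight=1.0, temperature=0.07):
    """Pull each visual feature toward both its literal description (clue)
    and its plausible explanation (inference).

    clue_weight is a hypothetical knob for this sketch, not from the paper.
    """
    reasoning = symmetric_contrastive(visual, inference_text, temperature)
    perception = symmetric_contrastive(visual, clue_text, temperature)
    return reasoning + clue_weight * perception


# Toy batch of two region-conditioned visual features with matching texts.
visual = [[1.0, 0.0], [0.0, 1.0]]
clues = [[1.0, 0.1], [0.1, 1.0]]
inferences = [[0.9, 0.0], [0.0, 0.9]]

aligned = dual_contrastive_loss(visual, clues, inferences)
shuffled = dual_contrastive_loss(visual, clues[::-1], inferences[::-1])
```

In this toy batch, `aligned` is smaller than `shuffled` because each visual feature sits closest to its own clue and inference. Setting `clue_weight=0.0` reduces the loss to the visual-to-reasoning term alone.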

RCA: Region Conditioned Adaptation for Visual Abductive Reasoning

A Cognitively-Inspired Neural Architecture for Visual Abstract Reasoning Using Contrastive Perceptual and Conceptual Processing

RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning

The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

Rethinking Visual Counterfactual Explanations Through Region Constraint

Hierarchical ConViT with Attention-Based Relational Reasoner for Visual Analogical Reasoning

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

Systematic Visual Reasoning through Object-Centric Relational Abstraction

RCAT: Retentive CLIP Adapter Tuning for Improved Video Recognition

Joint Answering and Explanation for Visual Commonsense Reasoning

Region-adaptive Concept Aggregation for Few-shot Visual Recognition

Abstract Visual Reasoning Enabled by Language

EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning

CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions

Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

Abstract Spatial-Temporal Reasoning Via Probabilistic Abduction and Execution

Towards Learning Abductive Reasoning using VSA Distributed Representations

Learning to reason over visual objects

Context-Aware Visual Policy Network for Sequence-Level Image Captioning

RAVEN: A Dataset for Relational and Analogical Visual rEasoNing