RCA: Region Conditioned Adaptation for Visual Abductive Reasoning

Hao Zhang, Yeo Keat Ee, Basura Fernando
2024-08-07
Abstract: Visual abductive reasoning aims to produce likely explanations for visual observations. We propose Region Conditioned Adaptation (RCA), a simple yet effective hybrid parameter-efficient fine-tuning method that equips the frozen CLIP with the ability to infer explanations from local visual cues. We encode "local hints" and "global contexts" into visual prompts of the CLIP model separately at fine-grained and coarse-grained levels. Adapters are used to fine-tune CLIP models for downstream tasks, and we design a new attention adapter that directly steers the focus of the attention map with trainable query and key projections of a frozen CLIP model. Finally, we train our new model with a modified contrastive loss that regresses the visual feature simultaneously toward the features of literal descriptions and plausible explanations. The loss enables CLIP to maintain both perception and reasoning abilities. Experiments on the Sherlock visual abductive reasoning benchmark show that RCA significantly outperforms previous state-of-the-art methods, ranking first on the leaderboards (e.g., Human Acc: RCA 31.74 vs. CPT-CLIP 29.58; higher is better). We also validate that RCA generalizes to local perception benchmarks such as RefCOCO. We open-source our project at https://github.com/LUNAProject22/RPA.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the challenges of Visual Abductive Reasoning (VAR). Specifically, the authors aim to improve the performance of existing visual foundation models (such as CLIP) on VAR tasks by proposing a new method called Region Conditioned Adaptation (RCA). VAR tasks require the model to infer the most likely explanation or hypothesis for a given visual observation (usually a specific region in an image). For example, given a "glass bottle" and the surrounding scene (such as a "restaurant" and a "waiter") in an image, the model needs to infer that "this woman recently ordered a drink."

### Main Contributions:

1. **Region Conditioned Adaptation (RCA)**: RCA is a hybrid parameter-efficient fine-tuning method that enables the frozen CLIP model to reason from local visual cues by adding a small number of trainable adapter parameters.
2. **Fine-Grained Region Prompts**: A new visual prompt is designed to encode fine-grained regional information in the CLIP model, emphasizing local evidence to strengthen visual abductive reasoning.
3. **Enhanced Adapter+Tuning**: A new Map Adapter is introduced, which adjusts the attention map through additional query and key projection weights, further optimizing the adapter module.
4. **Dual-Contrastive Loss**: By jointly optimizing the visual-to-evidence and visual-to-reasoning contrastive losses, the model learns the causal relationship between hypotheses and observations, thereby improving visual abductive reasoning performance.

### Method Overview:

- **Regional Prompt Generator (RPG)**: Generates three detailed prompts focusing on specific regions, surrounding context, and existing visual prompts.
- **Adapter+Tuning**: Adjusts the model's attention map and feature representation by adding new adapter modules to the frozen CLIP model.
- **Dual-Contrastive Loss**: Jointly optimizes the visual-to-evidence and visual-to-reasoning contrastive losses, enabling the model to better learn the causal relationship between hypotheses and observations.

### Experimental Results:

- On the Sherlock benchmark, RCA outperforms existing SOTA methods, notably on the "Human Accuracy" (Human Acc) metric, where it achieves 31.74% compared with 29.58% for CPT-CLIP.
- RCA performs well on visual reasoning tasks that use fine-grained regional evidence as prompts, which supports its effectiveness and robustness.

In summary, the paper improves performance on visual abductive reasoning tasks by proposing the RCA method, and it suggests new directions for future research.
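
The dual-contrastive loss summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a one-directional (image-to-text) InfoNCE term for each text branch, an unweighted sum of the two terms, and a temperature of 0.07. The function names `info_nce` and `dual_contrastive_loss`, and the toy two-dimensional features, are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(image_feats, text_feats, temperature=0.07):
    """Mean image-to-text InfoNCE loss; index i in both lists is a matched pair."""
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(t) for t in text_feats]
    total = 0.0
    for i, img in enumerate(imgs):
        # Temperature-scaled cosine similarity of image i to every text in the batch.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in txts]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Cross-entropy with the matched text as the target class.
        total += log_denom - logits[i]
    return total / len(imgs)


def dual_contrastive_loss(region_feats, clue_feats, inference_feats, temperature=0.07):
    """Assumed form: sum of the visual-to-evidence and visual-to-reasoning terms."""
    evidence_term = info_nce(region_feats, clue_feats, temperature)
    reasoning_term = info_nce(region_feats, inference_feats, temperature)
    return evidence_term + reasoning_term


# Toy features: each region embedding is closest to its own clue and inference.
region = [[1.0, 0.0], [0.0, 1.0]]
clues = [[0.9, 0.1], [0.1, 0.9]]
inferences = [[0.8, 0.2], [0.2, 0.8]]

aligned = dual_contrastive_loss(region, clues, inferences)
shuffled = dual_contrastive_loss(region, clues[::-1], inferences[::-1])
print(aligned, shuffled)
```

With correctly paired texts the loss is small, and it grows when the pairs are shuffled, which is the behavior a contrastive objective is meant to reward. In practice the features would come from the adapted CLIP image encoder and the frozen text encoder rather than hand-written lists, and the relative weighting of the two terms is a design choice this sketch leaves at 1:1.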