Abstract: Current vision-language models (VLMs) still exhibit inferior performance on knowledge-intensive tasks, primarily due to the challenge of accurately encoding all the associations between visual objects and scenes and their corresponding entities and background knowledge. While retrieval augmentation methods offer an efficient way to integrate external knowledge, extending them to the vision-language domain presents unique challenges in (1) precisely retrieving relevant information from external sources due to the inherent discrepancy within the multimodal queries, and (2) being resilient to the irrelevant, extraneous and noisy information contained in the retrieved multimodal knowledge snippets. In this work, we introduce RoRA-VLM, a novel and robust retrieval augmentation framework specifically tailored for VLMs, with two key innovations: (1) a two-stage retrieval process with image-anchored textual-query expansion to synergistically combine the visual and textual information in the query and retrieve the most relevant multimodal knowledge snippets; and (2) a robust retrieval augmentation method that strengthens the resilience of VLMs against irrelevant information in the retrieved multimodal knowledge by injecting adversarial noise into the retrieval-augmented training process, and filters out extraneous visual information, such as unrelated entities present in images, via a query-oriented visual token refinement strategy. We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets. Our results demonstrate that with a minimal number of training instances, RoRA-VLM enables the base model to achieve significant performance improvement and consistently outperform state-of-the-art retrieval-augmented VLMs on all benchmarks while also exhibiting a novel zero-shot domain transfer capability.
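
The two-stage retrieval with image-anchored textual-query expansion named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no implementation details, so the cosine-similarity ranking, the bag-of-words text embedding, the hand-written image vectors, and all names (`two_stage_retrieve`, `image_index`, `text_corpus`, `vocab`) are assumptions made purely for illustration. Stage 1 uses the query image to look up entity names attached to similar images; those names are appended to the question, and stage 2 retrieves text snippets with the expanded query.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, items, k):
    """Return the payloads of the k items most similar to query_vec.

    items is a list of (vector, payload) pairs.
    """
    ranked = sorted(items, key=lambda it: cosine(query_vec, it[0]), reverse=True)
    return [payload for _, payload in ranked[:k]]


def embed_text(text, vocab):
    """Toy bag-of-words embedding (a stand-in for a real text encoder)."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocab]


def two_stage_retrieve(query_image_vec, question, image_index, text_corpus, vocab,
                       k_img=1, k_txt=1):
    """Hypothetical two-stage retrieval with image-anchored query expansion."""
    # Stage 1: use the query image to find entity names of similar indexed images.
    anchors = top_k(query_image_vec, image_index, k_img)
    # Expansion: append the retrieved entity names to the textual question.
    expanded = question + " " + " ".join(anchors)
    # Stage 2: retrieve text snippets using the expanded textual query.
    text_items = [(embed_text(snippet, vocab), snippet) for snippet in text_corpus]
    snippets = top_k(embed_text(expanded, vocab), text_items, k_txt)
    return expanded, snippets


# Toy data: 2-d "image embeddings" paired with entity names (all invented).
image_index = [([1.0, 0.0], "eiffel tower"), ([0.0, 1.0], "golden gate bridge")]
text_corpus = [
    "the eiffel tower was completed in 1889",
    "the golden gate bridge opened in 1937",
]
vocab = ["eiffel", "tower", "golden", "gate", "bridge", "completed", "opened"]

expanded_query, retrieved = two_stage_retrieve(
    [0.9, 0.1], "when was this built", image_index, text_corpus, vocab
)
```

In this toy setup the bare question "when was this built" shares no vocabulary with either snippet, so text-only retrieval cannot distinguish them; only after the image-anchored expansion adds the entity name does the query match the relevant snippet.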

Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models

Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime

Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification

A Lost Opportunity for Vision-Language Models: A Comparative Study of Online Test-time Adaptation for Vision-Language Models

Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval

Adaptively Building a Video-language Model for Video Captioning and Retrieval Without Massive Video Pretraining

Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models

Rethinking Overlooked Aspects in Vision-Language Models

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

A survey of efficient fine-tuning methods for Vision-Language Models — Prompt and Adapter

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey

Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval

A-VL: Adaptive Attention for Large Vision-Language Models

RoRA-VLM: Robust Retrieval-Augmented Vision Language Models

Class Incremental Learning with Pre-trained Vision-Language Models

Towards Difficulty-Agnostic Efficient Transfer Learning for Vision-Language Models