Abstract: Current vision-language models (VLMs) still exhibit inferior performance on knowledge-intensive tasks, primarily due to the challenge of accurately encoding all the associations between visual objects and scenes and their corresponding entities and background knowledge. While retrieval augmentation methods offer an efficient way to integrate external knowledge, extending them to the vision-language domain presents unique challenges in (1) precisely retrieving relevant information from external sources, given the inherent discrepancy within multimodal queries, and (2) remaining resilient to the irrelevant, extraneous, and noisy information contained in the retrieved multimodal knowledge snippets. In this work, we introduce RoRA-VLM, a novel and robust retrieval augmentation framework specifically tailored for VLMs, with two key innovations: (1) a two-stage retrieval process with image-anchored textual-query expansion that synergistically combines the visual and textual information in the query to retrieve the most relevant multimodal knowledge snippets; and (2) a robust retrieval augmentation method that strengthens the resilience of VLMs against irrelevant information in the retrieved multimodal knowledge by injecting adversarial noise into the retrieval-augmented training process, and that filters out extraneous visual information, such as unrelated entities present in images, via a query-oriented visual token refinement strategy. We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets. Our results demonstrate that, with a minimal number of training instances, RoRA-VLM enables the base model to achieve significant performance improvements and consistently outperform state-of-the-art retrieval-augmented VLMs on all benchmarks, while also exhibiting a novel zero-shot domain transfer capability.
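The two-stage retrieval idea can be illustrated with a toy sketch. This is only one possible reading of "image-anchored textual-query expansion", not the authors' implementation: the abstract does not specify the encoders, the indexes, or the scoring functions. Every name below (`stage1_image_anchor`, `expand_query`, `stage2_text_retrieval`, `IMAGE_INDEX`, `SNIPPETS`) is hypothetical, the image vectors are hand-written stand-ins for real image embeddings, and the token-overlap score is a placeholder for a real text retriever.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def stage1_image_anchor(query_image_vec, image_index, k=1):
    """Stage 1 (assumed): find the k indexed images closest to the query image
    and return the entity names attached to them."""
    ranked = sorted(
        image_index,
        key=lambda item: cosine(query_image_vec, item["vec"]),
        reverse=True,
    )
    return [item["entity"] for item in ranked[:k]]


def expand_query(question, entities):
    """Append the entity names from stage 1 to the textual question."""
    return question + " " + " ".join(entities)


def tokenize(text):
    """Lowercase, strip basic punctuation, and return the set of tokens."""
    for ch in "?.,":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def stage2_text_retrieval(query, snippets, k=1):
    """Stage 2 (assumed): rank text snippets by token overlap with the query.
    A real system would use a learned text retriever instead."""
    q = tokenize(query)
    ranked = sorted(snippets, key=lambda s: len(q & tokenize(s)), reverse=True)
    return ranked[:k]


def two_stage_retrieve(question, query_image_vec, image_index, snippets):
    """Run both stages: anchor on the image, expand the question, retrieve text."""
    entities = stage1_image_anchor(query_image_vec, image_index)
    return stage2_text_retrieval(expand_query(question, entities), snippets)


# Toy data: hand-written vectors standing in for image embeddings.
IMAGE_INDEX = [
    {"entity": "Golden Gate Bridge", "vec": [0.1, 0.9, 0.2]},
    {"entity": "Eiffel Tower", "vec": [0.9, 0.1, 0.0]},
]
SNIPPETS = [
    "The Golden Gate Bridge opened in 1937.",
    "The Eiffel Tower opened in 1889.",
]

if __name__ == "__main__":
    question = "When did this landmark open?"
    query_image_vec = [0.8, 0.2, 0.1]  # pretend embedding of a photo of the tower
    print(two_stage_retrieve(question, query_image_vec, IMAGE_INDEX, SNIPPETS))
```

In this toy setup the question alone shares no tokens with either snippet, so the text stage cannot tell them apart. Once the entity name recovered from the image is appended, the matching snippet ranks first. The noise-injection training and visual token refinement described in the abstract are not covered by this sketch.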

Improving Language Estimation with the Paragraph Vector Model for Ad-Hoc Retrieval

Analysis of the Paragraph Vector Model for Information Retrieval

A Neural Passage Model for Ad-hoc Document Retrieval

Recurrent Neural Network Language Model Adaptation Derived Document Vector

Nonparametric Topic Modeling with Neural Inference

LDA-Based Retrieval Framework for Semantic News Video Retrieval

LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding

In-Context Retrieval-Augmented Language Models

More Room for Language: Investigating the Effect of Retrieval on Language Models

Representing Documents and Queries as Sets of Word Embedded Vectors for Information Retrieval

Fully utilize feedbacks: language model based relevance feedback in information retrieval

RoRA-VLM: Robust Retrieval-Augmented Vision Language Models

Efficient information retrieval based on a combination of vector space and probabilistic models

PARM: A Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval

A New Document Retrieval Model Using Dempster-Shafer Theory of Evidence

Understanding the Multi-vector Dense Retrieval Models

Reliable, Adaptable, and Attributable Language Models with Retrieval

EASE-DR: Enhanced Sentence Embeddings for Dense Retrieval

Making Retrieval-Augmented Language Models Robust to Irrelevant Context

Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval

Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval