INQUIRE: A Natural World Text-to-Image Retrieval Benchmark

Edward Vendrow, Omiros Pantazis, Alexander Shepard, Gabriel Brostow, Kate E. Jones, Oisin Mac Aodha, Sara Beery, Grant Van Horn

2024-11-05

Abstract: We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries. INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural world images, along with 250 expert-level retrieval queries. These queries are paired with all relevant images comprehensively labeled within iNat24, comprising 33,000 total matches. Queries span categories such as species identification, context, behavior, and appearance, emphasizing tasks that require nuanced image understanding and domain expertise. Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining top-100 retrievals. Detailed evaluation of a range of recent multimodal models demonstrates that INQUIRE poses a significant challenge, with the best models failing to achieve an mAP@50 above 50%. In addition, we show that reranking with more powerful multimodal models can enhance retrieval performance, yet there remains a significant margin for improvement. By focusing on scientifically-motivated ecological challenges, INQUIRE aims to bridge the gap between AI capabilities and the needs of real-world scientific inquiry, encouraging the development of retrieval systems that can assist with accelerating ecological and biodiversity research. Our dataset and code are available at https://inquire-benchmark.github.io

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Information Retrieval

What problem does this paper attempt to address?

The paper addresses a gap in existing image retrieval benchmarks: they are insufficient for evaluating multimodal vision-language models on expert-level queries, particularly queries over natural world images that require fine-grained recognition and complex reasoning. Existing retrieval datasets mostly contain simple queries about common everyday categories, which no longer challenge state-of-the-art multimodal models. To fill this gap, the paper proposes a new benchmark, **INQUIRE**, with the following features:

1. **Large-scale dataset**: 5 million natural world images sourced from the iNaturalist platform, covering 10,000 species.
2. **Expert-level queries**: 250 expert-level retrieval queries spanning concepts from ecology and biodiversity research, such as species identification, behavior, appearance, and context.
3. **Comprehensive annotations**: Every query is labeled with all of its relevant images in the dataset, for a total of 33,000 query-image matches.
4. **Two core tasks**: **INQUIRE-Fullrank**, a full dataset ranking task that evaluates retrieval over the entire image collection, and **INQUIRE-Rerank**, a reranking task that evaluates how well a model refines an initial set of top-100 retrievals.

Through this design, **INQUIRE** aims to advance multimodal models on complex, fine-grained natural world image retrieval, and thereby to accelerate research in ecology and biodiversity.
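The abstract reports results as mAP@50, so it may help to see how such a score can be computed for a ranking task. The following is a minimal sketch, not the benchmark's official evaluation code: the function names are hypothetical, and normalizing by `min(k, number of relevant images)` is one common convention that is assumed here rather than taken from the paper.

```python
from typing import Hashable, Mapping, Sequence, Set


def average_precision_at_k(
    ranked_ids: Sequence[Hashable],
    relevant_ids: Set[Hashable],
    k: int = 50,
) -> float:
    """AP@k for one query (hypothetical helper, not from the INQUIRE code).

    Sums precision at each rank where a relevant image appears within the
    top k, then divides by min(k, number of relevant images). That
    normalizer is an assumed convention.
    """
    if not relevant_ids or k <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(ranked_ids[:k], start=1):
        if image_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant_ids))


def mean_average_precision_at_k(
    rankings: Mapping[str, Sequence[Hashable]],
    relevance: Mapping[str, Set[Hashable]],
    k: int = 50,
) -> float:
    """Mean of AP@k over all queries present in `rankings`."""
    scores = [
        average_precision_at_k(ranked, relevance.get(query, set()), k)
        for query, ranked in rankings.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy data: image ids and queries are made up for illustration.
    rankings = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
    relevance = {"q1": {"a", "c"}, "q2": {"y"}}
    print(mean_average_precision_at_k(rankings, relevance, k=50))
```

The same function could score a reranking setup as well: a reranker would reorder a fixed top-100 candidate list per query, and the reordered list would be passed as `ranked_ids`.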


Benchmarking Representation Learning for Natural World Image Collections

Retrieve Anyone: A General-purpose Person Re-identification Task with Instructions

SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

Rethinking Benchmarks for Cross-modal Image-text Retrieval

Towards Complex-query Referring Image Segmentation: A Novel Benchmark

Image Classification Benchmark (ICB)

Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification

Investigating the Role of Image Retrieval for Visual Localization -- An exhaustive benchmark

Benchmarking Image Retrieval for Visual Localization

DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design

Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking

IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios

Monocular Image-Based 3-D Model Retrieval: A Benchmark

VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval

Neural Naturalist: Generating Fine-Grained Image Comparisons

Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective

PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models

Insect Identification in the Wild: The AMI Dataset

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models