INQUIRE: A Natural World Text-to-Image Retrieval Benchmark

Edward Vendrow, Omiros Pantazis, Alexander Shepard, Gabriel Brostow, Kate E. Jones, Oisin Mac Aodha, Sara Beery, Grant Van Horn
2024-11-05
Abstract: We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries. INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural world images, along with 250 expert-level retrieval queries. These queries are paired with all relevant images comprehensively labeled within iNat24, comprising 33,000 total matches. Queries span categories such as species identification, context, behavior, and appearance, emphasizing tasks that require nuanced image understanding and domain expertise. Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining top-100 retrievals. Detailed evaluation of a range of recent multimodal models demonstrates that INQUIRE poses a significant challenge, with the best models failing to achieve an mAP@50 above 50%. In addition, we show that reranking with more powerful multimodal models can enhance retrieval performance, yet there remains a significant margin for improvement. By focusing on scientifically-motivated ecological challenges, INQUIRE aims to bridge the gap between AI capabilities and the needs of real-world scientific inquiry, encouraging the development of retrieval systems that can assist with accelerating ecological and biodiversity research. Our dataset and code are available at https://inquire-benchmark.github.io.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Information Retrieval
What problem does this paper attempt to address?
The paper addresses a gap in existing image retrieval benchmarks: they are insufficient for evaluating multimodal vision-language models on expert-level queries, particularly queries that demand fine-grained understanding and complex reasoning about natural world images. Existing retrieval datasets typically contain simple queries about common everyday categories, which no longer challenge state-of-the-art multimodal models. To fill this gap, the paper proposes a new benchmark, **INQUIRE**, with the following features:

1. **Large-scale dataset**: 5 million natural world images sourced from the iNaturalist platform, covering 10,000 species.
2. **Expert-level queries**: 250 expert-level retrieval queries spanning concepts in ecology and biodiversity research, such as species identification, context, behavior, and appearance.
3. **Comprehensive annotations**: Every query is labeled with all of its relevant images in the dataset, for 33,000 matches in total.
4. **Two core tasks**: **INQUIRE-Fullrank**, a full dataset ranking task that evaluates retrieval over the entire image collection, and **INQUIRE-Rerank**, a reranking task that evaluates how well a model refines the top-100 initial retrievals.

Through this design, **INQUIRE** aims to advance multimodal models on complex, fine-grained natural world image retrieval, and thereby to accelerate research in ecology and biodiversity.
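To make the two tasks and the reported metric concrete, here is a minimal Python sketch of how a ranking for one query could be scored with AP@k, averaged into mAP@k, and how a reranking step touches only the top of an existing ranking. The function names, the image IDs, and the choice to normalize AP@k by min(k, number of relevant images) are assumptions for illustration; the summary does not state the benchmark's exact formula, and the official evaluation code at the project page should be treated as authoritative.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Sequence


def average_precision_at_k(
    ranked_ids: Sequence[Hashable],
    relevant_ids: Iterable[Hashable],
    k: int = 50,
) -> float:
    """AP@k for one query.

    Assumption: the sum of precisions at each hit is normalized by
    min(k, number of relevant items), one common convention.
    """
    relevant = set(relevant_ids)
    if not relevant or k <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(
    rankings: Dict[str, Sequence[Hashable]],
    relevant: Dict[str, Iterable[Hashable]],
    k: int = 50,
) -> float:
    """Mean of AP@k over all labeled queries; a missing ranking scores 0."""
    scores = [
        average_precision_at_k(rankings.get(query, []), rel, k)
        for query, rel in relevant.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


def rerank_top_n(
    ranked_ids: Sequence[Hashable],
    score_fn: Callable[[Hashable], float],
    n: int = 100,
) -> List[Hashable]:
    """Reorder only the first n results using a (hypothetical) stronger scorer."""
    head = sorted(ranked_ids[:n], key=score_fn, reverse=True)
    return head + list(ranked_ids[n:])
```

In this sketch, a Fullrank-style evaluation would pass each query's ranking over the whole collection to `mean_average_precision_at_k`, while a Rerank-style evaluation would first apply `rerank_top_n` with `n=100` to a fixed initial ranking and then score the result the same way.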