Abstract: Images are increasingly becoming the currency for documenting biodiversity on the planet, providing novel opportunities for accelerating scientific discoveries in the field of organismal biology, especially with the advent of large vision-language models (VLMs). We ask if pre-trained VLMs can aid scientists in answering a range of biologically relevant questions without any additional fine-tuning. In this paper, we evaluate the effectiveness of 12 state-of-the-art (SOTA) VLMs in the field of organismal biology using a novel dataset, VLM4Bio, consisting of 469K question-answer pairs involving 30K images from three groups of organisms: fishes, birds, and butterflies, covering five biologically relevant tasks. We also explore the effects of applying prompting techniques and tests for reasoning hallucination on the performance of VLMs, shedding new light on the capabilities of current SOTA VLMs in answering biologically relevant questions using images. The code and datasets for running all the analyses reported in this paper can be found at <a class="link-external link-https" href="https://github.com/sammarfy/VLM4Bio" rel="external noopener nofollow">this https URL</a>.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper evaluates whether pre-trained Vision-Language Models (VLMs) can help scientists answer a range of questions related to biological feature discovery from images without additional fine-tuning. It focuses on five tasks:

1. **Species Classification**: Identifying the scientific name of the organism in the image.
2. **Feature Recognition**: Determining whether specific features (such as eyes or fins) of the organism are present in the image.
3. **Feature Localization**: Marking the location of specific features in the image.
4. **Feature Referencing**: Given a region, identifying the names of the features within that region.
5. **Feature Counting**: Counting the number of specific features in the image.

### Main Contributions

1. **Dataset Construction**: Constructed a benchmark dataset named VLM4Bio, containing approximately 469,000 question-answer pairs involving 30,000 images of fishes, birds, and butterflies.
2. **Performance Evaluation**: Evaluated the zero-shot performance of 12 state-of-the-art VLMs on the five tasks above, revealing the strengths and weaknesses of these models in the field of biology.
3. **Prompting Techniques and Reasoning Hallucination Tests**: Investigated how different prompting techniques and tests for reasoning hallucination affect the performance of VLMs, further exploring the reasoning capabilities of these models in the field of biology.

### Background and Motivation

The growing volume of biological images provides new opportunities to accelerate biodiversity research. However, traditional methods of measuring biological features rely on visual inspection by experts, which is laborious and subjective, and this limits the pace of scientific research.

Because pre-trained VLMs can process text and images together, they may play an important role in biological feature discovery. By constructing the VLM4Bio dataset and systematically evaluating VLMs on it, the paper explores the practical potential of these models in the field of biology.
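To make the evaluation setup more concrete, the five tasks listed above can be pictured as question-answer records whose model responses are scored per task. The sketch below is only an illustration under stated assumptions: the `QAPair` record, the task names, and the exact-match scoring rule are hypothetical and are not taken from the VLM4Bio code or dataset schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAPair:
    """One hypothetical question-answer record (not the actual VLM4Bio schema)."""
    task: str      # assumed task label, e.g. "species_classification"
    question: str  # question posed to the model alongside an image
    answer: str    # ground-truth answer as text


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def accuracy_by_task(pairs, predictions):
    """Return exact-match accuracy per task (an assumed scoring rule)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair, pred in zip(pairs, predictions):
        total[pair.task] += 1
        if normalize(pred) == normalize(pair.answer):
            correct[pair.task] += 1
    return {task: correct[task] / total[task] for task in total}
```

Given a list of records and the corresponding model responses, `accuracy_by_task(pairs, predictions)` returns one accuracy value per task, which is the kind of per-task breakdown a zero-shot comparison across models would need.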

VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images
