VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images

M. Maruf, Arka Daw, Kazi Sajeed Mehrab, Harish Babu Manogaran, Abhilash Neog, Medha Sawhney, Mridul Khurana, James P. Balhoff, Yasin Bakis, Bahadir Altintas, Matthew J. Thompson, Elizabeth G. Campolongo, Josef C. Uyeda, Hilmar Lapp, Henry L. Bart, Paula M. Mabee, Yu Su, Wei-Lun Chao, Charles Stewart, Tanya Berger-Wolf, Wasila Dahdul, Anuj Karpatne
2024-08-29
Abstract: Images are increasingly becoming the currency for documenting biodiversity on the planet, providing novel opportunities for accelerating scientific discoveries in the field of organismal biology, especially with the advent of large vision-language models (VLMs). We ask if pre-trained VLMs can aid scientists in answering a range of biologically relevant questions without any additional fine-tuning. In this paper, we evaluate the effectiveness of 12 state-of-the-art (SOTA) VLMs in the field of organismal biology using a novel dataset, VLM4Bio, consisting of 469K question-answer pairs involving 30K images from three groups of organisms: fishes, birds, and butterflies, covering five biologically relevant tasks. We also explore the effects of applying prompting techniques and tests for reasoning hallucination on the performance of VLMs, shedding new light on the capabilities of current SOTA VLMs in answering biologically relevant questions using images. The code and datasets for running all the analyses reported in this paper can be found at https://github.com/sammarfy/VLM4Bio.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper evaluates whether pre-trained Vision-Language Models (VLMs) can help scientists answer questions related to biological feature (trait) discovery from images without any additional fine-tuning. It focuses on five tasks:

1. **Species Classification**: Identifying the scientific name of the organism in the image.
2. **Feature Recognition**: Determining whether specific features (such as eyes or fins) of the organism are present in the image.
3. **Feature Localization**: Marking the location of specific features in the image.
4. **Feature Referencing**: Given a region, identifying the names of the features within that region.
5. **Feature Counting**: Counting the number of specific features in the image.

### Main Contributions

1. **Dataset Construction**: Constructed a benchmark dataset named VLM4Bio, containing approximately 469,000 question-answer pairs involving 30,000 images of fishes, birds, and butterflies.
2. **Performance Evaluation**: Evaluated the zero-shot performance of 12 state-of-the-art VLMs on the five tasks above, revealing the strengths and weaknesses of these models in the field of biology.
3. **Prompting Techniques and Reasoning Hallucination Tests**: Investigated how different prompting techniques and tests for reasoning hallucination affect the performance of VLMs, further probing the reasoning capabilities of these models in the field of biology.

### Background and Motivation

The growing volume of biological images provides new opportunities to accelerate biodiversity research. However, traditional methods of measuring biological features rely on visual inspection by experts, which is laborious and subjective, limiting the pace of scientific research.
Pre-trained VLMs, due to their ability to handle both text and images simultaneously, may play an important role in biological feature discovery. Therefore, by constructing the VLM4Bio dataset and systematically evaluating the performance of VLMs, the paper aims to explore the practical application potential of these models in the field of biology.
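
To make the zero-shot evaluation described above concrete, here is a minimal sketch of how per-task accuracy over question-answer pairs could be computed. Everything in it is an assumption for illustration: the `QAPair` record, the task labels, and the exact-match scoring rule are hypothetical and are not taken from the VLM4Bio repository, whose data format and metrics may differ (localization, for example, would normally be scored by region overlap rather than text match).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """One hypothetical benchmark record (not the actual VLM4Bio schema)."""
    task: str       # assumed label, e.g. "species_classification"
    image_id: str
    question: str
    answer: str     # ground-truth answer as text


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def per_task_accuracy(pairs, predictions):
    """Return {task: fraction of exact-match answers}.

    `predictions` maps (image_id, question) to the model's raw text answer.
    A missing prediction is scored as wrong.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair in pairs:
        total[pair.task] += 1
        predicted = predictions.get((pair.image_id, pair.question), "")
        if normalize(predicted) == normalize(pair.answer):
            correct[pair.task] += 1
    return {task: correct[task] / total[task] for task in total}


if __name__ == "__main__":
    # Toy data, invented for this example only.
    pairs = [
        QAPair("species_classification", "img1", "Scientific name?", "Lepomis cyanellus"),
        QAPair("feature_counting", "img1", "How many fins are visible?", "2"),
    ]
    predictions = {
        ("img1", "Scientific name?"): "lepomis cyanellus",
        ("img1", "How many fins are visible?"): "3",
    }
    print(per_task_accuracy(pairs, predictions))
```

Reporting accuracy per task rather than as one pooled number keeps a model's strength on one task (say, classification) from masking weakness on another (say, counting), which matches the summary's emphasis on task-specific strengths and weaknesses.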