Abstract: Vision language models (VLMs) have recently emerged and gained the spotlight for their ability to comprehend the dual modality of image and textual data. VLMs such as LLaVA, ChatGPT-4, and Gemini have recently shown impressive performance on tasks such as natural image captioning, visual question answering (VQA), and spatial reasoning. Additionally, a universal segmentation model by Meta AI, Segment Anything Model (SAM) shows unprecedented performance at isolating objects from unforeseen images. Since medical experts, biologists, and materials scientists routinely examine microscopy or medical images in conjunction with textual information in the form of captions, literature, or reports, and draw conclusions of great importance and merit, it is indubitably essential to test the performance of VLMs and foundation models such as SAM, on these images. In this study, we charge ChatGPT, LLaVA, Gemini, and SAM with classification, segmentation, counting, and VQA tasks on a variety of microscopy images. We observe that ChatGPT and Gemini are impressively able to comprehend the visual features in microscopy images, while SAM is quite capable at isolating artefacts in a general sense. However, the performance is not close to that of a domain expert - the models are readily encumbered by the introduction of impurities, defects, artefact overlaps and diversity present in the images.

What problem does this paper attempt to address?

### Main Issues Addressed by the Paper

This paper explores the potential and limitations of large Vision Language Models (VLMs) in the analysis of microscopy images. Specifically, the study focuses on the following aspects:

1. **Model Performance Evaluation**: The paper evaluates several state-of-the-art VLMs (ChatGPT-4, Gemini, and LLaVA) as well as Meta AI's universal segmentation model, Segment Anything Model (SAM), on microscopy images. The models are tested on classification, segmentation, counting, and visual question answering (VQA).
2. **Interdisciplinary Applications**: The study emphasizes the potential value of these models in medicine, biology, and materials science, where microscopy images are routinely analyzed in conjunction with textual information such as captions, literature, or reports.
3. **Task Difficulty and Challenges**: Although the models show real capability, they struggle with images that contain impurities, defects, overlapping objects, and high diversity, and their performance falls short of that of domain experts.
4. **Model Advantages and Limitations**: By comparing performance across tasks, the paper identifies strengths and weaknesses of each model, such as the ability of ChatGPT and Gemini to comprehend visual features in microscopy images and SAM's general ability to isolate objects.
5. **Practical Application Scenarios**: The paper tests the models on actual scientific tasks using specific experimental designs and datasets (such as NFFA and BBBC005), and discusses their adaptability and practicality in different scientific fields.

Through this work, the paper aims to provide insights for the further development of VLMs in scientific image analysis, while highlighting the challenges that current models face and pointing out directions for future improvement.
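To make the four evaluated task types more concrete, here is a minimal sketch of how model outputs for classification, counting, and segmentation could be scored against ground truth. The summary above does not state which metrics the paper uses, so the choice of accuracy, mean absolute count error, and intersection-over-union (IoU) is an assumption, and the function names and example labels are hypothetical.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of predicted class labels that match the expected labels."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def mean_absolute_count_error(
    predicted_counts: Sequence[int], true_counts: Sequence[int]
) -> float:
    """Average absolute difference between predicted and true object counts."""
    if not true_counts or len(predicted_counts) != len(true_counts):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) for p, t in zip(predicted_counts, true_counts))
    return total / len(true_counts)


def mask_iou(
    predicted_mask: Sequence[Sequence[int]], true_mask: Sequence[Sequence[int]]
) -> float:
    """Intersection-over-union of two binary masks given as rows of 0/1 values."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(predicted_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly; treat that case as IoU = 1.
    return intersection / union if union else 1.0


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    print(accuracy(["fibres", "particles"], ["fibres", "films"]))
    print(mean_absolute_count_error([10, 14], [12, 14]))
    print(mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```

Free-form VQA answers are left out of this sketch because scoring them usually requires human or rubric-based judgment rather than a single formula.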

Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis
