Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis

Prateek Verma, Minh-Hao Van, Xintao Wu
2024-05-02
Abstract: Vision language models (VLMs) have recently emerged and gained the spotlight for their ability to comprehend the dual modality of image and textual data. VLMs such as LLaVA, ChatGPT-4, and Gemini have recently shown impressive performance on tasks such as natural image captioning, visual question answering (VQA), and spatial reasoning. Additionally, a universal segmentation model by Meta AI, Segment Anything Model (SAM), shows unprecedented performance at isolating objects from unforeseen images. Since medical experts, biologists, and materials scientists routinely examine microscopy or medical images in conjunction with textual information in the form of captions, literature, or reports, and draw conclusions of great importance and merit, it is indubitably essential to test the performance of VLMs and foundation models such as SAM on these images. In this study, we charge ChatGPT, LLaVA, Gemini, and SAM with classification, segmentation, counting, and VQA tasks on a variety of microscopy images. We observe that ChatGPT and Gemini are impressively able to comprehend the visual features in microscopy images, while SAM is quite capable at isolating artefacts in a general sense. However, the performance is not close to that of a domain expert: the models are readily encumbered by the introduction of impurities, defects, artefact overlaps and diversity present in the images.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Main Issues Addressed by the Paper

This paper explores the potential and limitations of large vision language models (VLMs) in the analysis of microscopy images. Specifically, the study focuses on the following aspects:

1. **Model Performance Evaluation**: The paper evaluates several state-of-the-art VLMs (ChatGPT-4, Gemini, and LLaVA) as well as Meta AI's universal segmentation model, Segment Anything Model (SAM), on microscopy images. The models are tested on classification, segmentation, counting, and visual question answering (VQA) tasks.
2. **Interdisciplinary Applications**: The study emphasizes the potential value of these models in medicine, biology, and materials science, where experts routinely analyze microscopy images in conjunction with textual information such as captions, literature, or reports.
3. **Task Difficulty and Challenges**: Although the models show real capability, they struggle with images that contain impurities, defects, overlapping objects, or high diversity, and they perform well below the level of domain experts.
4. **Model Advantages and Limitations**: By comparing the models across tasks, the paper identifies their strengths and weaknesses: ChatGPT and Gemini are able to comprehend the visual features in microscopy images, while SAM is capable of isolating objects in a general sense.
5. **Practical Application Scenarios**: The paper tests the models on actual scientific tasks using specific experimental designs and datasets (such as NFFA and BBBC005), and discusses their adaptability and practicality in different scientific fields.

Through this research, the paper aims to provide insights for the further development of VLMs in scientific image analysis, highlighting the challenges faced by current models and pointing out directions for future improvement.
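To make the counting and segmentation tasks mentioned above more concrete, here is a minimal sketch of how a model's output on such tasks could be scored. The metrics (relative count error and intersection over union) and the function names `count_error` and `mask_iou` are assumptions chosen for illustration. This summary does not state which metrics the paper actually uses.

```python
from typing import Sequence


def count_error(predicted: int, actual: int) -> float:
    """Relative counting error |predicted - actual| / actual.

    Hypothetical helper: assumes the ground-truth count is positive.
    """
    if actual <= 0:
        raise ValueError("actual count must be positive")
    return abs(predicted - actual) / actual


def mask_iou(pred: Sequence[Sequence[int]], truth: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks of the same shape.

    Masks are nested sequences of 0/1 values. Two empty masks are
    treated as a perfect match (IoU = 1.0), which is a convention
    chosen here, not one taken from the paper.
    """
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    return intersection / union if union else 1.0


if __name__ == "__main__":
    # Toy example: a model reports 18 cells where an expert counted 20.
    print(count_error(18, 20))

    # Toy 2x3 masks: a predicted object region vs. the expert annotation.
    predicted_mask = [[1, 1, 0], [0, 1, 0]]
    expert_mask = [[1, 0, 0], [0, 1, 1]]
    print(mask_iou(predicted_mask, expert_mask))
```

Under these assumed metrics, a lower count error and a higher IoU indicate closer agreement with the expert annotation. Overlapping or defective objects, which the paper reports as a difficulty, would typically show up as a larger count error and a lower IoU.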