PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding

Dawei Dai, Yuanhui Zhang, Long Xu, Qianlan Yang, Xiaojing Shen, Shuyin Xia, Guoyin Wang

2024-08-19

Abstract: Previous advances in pathology image understanding primarily involved developing models tailored to specific tasks. Recent studies have demonstrated that large vision-language models can enhance the performance of various downstream tasks in medical image understanding. In this study, we developed a domain-specific large language-vision assistant (PA-LLaVA) for pathology image understanding. Specifically, (1) we first construct a human pathology image-text dataset by cleaning public medical image-text data for domain-specific alignment; (2) using the proposed image-text data, we first train a pathology language-image pretraining (PLIP) model as the specialized visual encoder for pathology images, and then develop a scale-invariant connector to avoid the information loss caused by image scaling; (3) we adopt two-stage learning to train PA-LLaVA: the first stage for domain alignment, and the second stage for the end-to-end visual question answering (VQA) task. In experiments, we evaluate PA-LLaVA on both supervised and zero-shot VQA datasets, and our model achieved the best overall performance among multimodal models of similar scale. The ablation experiments also confirmed the effectiveness of our design. We posit that our PA-LLaVA model and the datasets presented in this work can promote research in the field of computational pathology. All code is available at: https://github.com/ddw2AIGROUP2CQUPT/PA-LLaVA

Artificial Intelligence

What problem does this paper attempt to address?

This paper sets out to develop a specialized large-scale language-vision assistant (PA-LLaVA) for pathology image understanding. Specifically, the paper aims to:

1. **Construct a high-quality human pathology image-text dataset**: By cleaning public medical image-text data, a domain-specific alignment dataset is constructed to train the model to better understand and describe pathology images.
2. **Develop a pathology language-image pretraining (PLIP) model**: The constructed dataset is used to train a specialized visual encoder that improves the representation of pathology image features, and a scale-invariant connector is designed to avoid information loss caused by image scaling.
3. **Adopt a two-stage learning method to train PA-LLaVA**: The first stage performs domain alignment, and the second stage performs end-to-end visual question answering training, enabling the model to answer questions about pathology images more accurately.

Through these methods, the paper aims to improve model performance on supervised and zero-shot visual question answering tasks and to promote research in computational pathology.
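The language-image pretraining step in item 2 can be illustrated with a small sketch. The abstract does not state the training objective, so the code below assumes the symmetric contrastive (InfoNCE) loss commonly used in CLIP-style pretraining; the function names, the toy embeddings, and the temperature of 0.07 are all illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Assumption: image_embs[i] and text_embs[i] describe the same sample,
    so the correct match for row i is column i.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: classify the matching text for each image
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation over the transposed matrix
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch: two image embeddings and two caption embeddings
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is low when each image embedding is closest to its own caption and high when the pairs are mismatched, which is the signal that would pull a visual encoder toward domain-specific image-text alignment under this assumed objective.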

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology

WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image

LLaVA-Endo: a Large Language-and-vision Assistant for Gastrointestinal Endoscopy

STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering

OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue

Power-LLaVA: Large Language and Vision Assistant for Power Transmission Line Inspection

PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology

Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework

SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding

Advancing High Resolution Vision-Language Models in Biomedicine

Multi-modal vision-language model for generalizable annotation-free pathology localization and clinical diagnosis

XLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training

PathAsst: A Generative Foundation AI Assistant towards Artificial General Intelligence of Pathology

R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest

PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration