PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding

Dawei Dai, Yuanhui Zhang, Long Xu, Qianlan Yang, Xiaojing Shen, Shuyin Xia, Guoyin Wang
2024-08-19
Abstract: Previous advances in pathology image understanding primarily involved developing models tailored to specific tasks. Recent studies have demonstrated that large vision-language models can enhance the performance of various downstream tasks in medical image understanding. In this study, we developed a domain-specific large language-vision assistant (PA-LLaVA) for pathology image understanding. Specifically, (1) we first construct a human pathology image-text dataset by cleaning public medical image-text data for domain-specific alignment; (2) using the proposed image-text data, we first train a pathology language-image pretraining (PLIP) model as the specialized visual encoder for pathology images, and then develop a scale-invariant connector to avoid the information loss caused by image scaling; (3) we adopt two-stage learning to train PA-LLaVA: the first stage for domain alignment, and the second stage for the end-to-end visual question answering (VQA) task. In experiments, we evaluate PA-LLaVA on both supervised and zero-shot VQA datasets, where our model achieved the best overall performance among multimodal models of similar scale. The ablation experiments also confirmed the effectiveness of our design. We posit that our PA-LLaVA model and the datasets presented in this work can promote research in the field of computational pathology. All code is available at: https://github.com/ddw2AIGROUP2CQUPT/PA-LLaVA
Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to develop a specialized large language-vision assistant (PA-LLaVA) for pathology image understanding. Specifically, the paper aims to:

1. **Construct a high-quality human pathology image-text dataset**: By cleaning public medical image-text data, the authors build a domain-specific alignment dataset for training the model to better understand and describe pathology images.
2. **Develop a pathology language-image pretraining (PLIP) model**: The constructed dataset is used to train a specialized visual encoder that improves the representation of pathology image features, and a scale-invariant connector is designed to avoid the information loss caused by image scaling.
3. **Adopt a two-stage learning method to train PA-LLaVA**: The first stage performs domain alignment, and the second stage performs end-to-end training on the visual question answering (VQA) task, so that the model can answer questions about pathology images more accurately.

Through these methods, the paper aims to improve model performance on supervised and zero-shot VQA tasks and to promote research in computational pathology.
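
The two-stage schedule described above can be sketched as a small training plan. This is a minimal illustration only, not the authors' implementation: the module names (`visual_encoder`, `connector`, `language_model`) are hypothetical, and the choice of which modules are trainable in each stage is an assumption borrowed from common LLaVA-style recipes. The abstract states only that stage one is for domain alignment and stage two is for end-to-end VQA.

```python
from dataclasses import dataclass

# Hypothetical module names for a LLaVA-style model; not taken from the paper.
MODULES = ("visual_encoder", "connector", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str             # label for the training stage
    data: str             # kind of data the stage consumes
    trainable: frozenset  # modules whose weights are updated (ASSUMED)


# ASSUMPTION: the freezing pattern follows common LLaVA-style practice.
# The abstract does not state which modules are frozen in either stage.
STAGES = (
    Stage("domain_alignment", "pathology image-text pairs",
          frozenset({"connector"})),
    Stage("vqa_finetuning", "pathology VQA pairs",
          frozenset({"connector", "language_model"})),
)


def frozen_modules(stage):
    """Return the modules kept fixed during the given stage, in MODULES order."""
    return [m for m in MODULES if m not in stage.trainable]


def trainable_plan(stage):
    """Map every module name to True if it is updated in this stage."""
    return {m: (m in stage.trainable) for m in MODULES}


if __name__ == "__main__":
    for index, stage in enumerate(STAGES, start=1):
        print(f"Stage {index}: {stage.name} on {stage.data}")
        for module, is_trainable in trainable_plan(stage).items():
            state = "train" if is_trainable else "freeze"
            print(f"  {module}: {state}")
```

In a real training script, a plan like this would typically drive which parameter groups have gradients enabled before each stage starts; the actual settings for PA-LLaVA should be taken from the linked repository rather than from this sketch.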